Almost done with the base review

RichardG867
2025-02-03 13:13:32 -03:00
parent 351d5932e8
commit 1bef1fe60e

@@ -164,7 +164,7 @@ The chip's RAMDAC handles final conversion of the digital image generated by the
### Memory mapping
Before we can discuss any part of how the RIVA 128 works, the memory architecture must be explained, since this is a fundamental requirement to even access the graphics card's registers in the first place. NVIDIA picked a fairly strange memory mapping architecture, at least for cards of that time. The exact setup of the memory mapping changed numerous times as NVIDIA's architecture evolved, so only NV3-based GPUs will be analyzed.
Before we can discuss any part of how the RIVA 128 works, its memory architecture must be explained, since this is a fundamental requirement to even access the graphics card's registers in the first place. NVIDIA picked a fairly strange memory mapping architecture, at least for cards of that time. The exact setup of the memory mapping changed numerous times as NVIDIA's architecture evolved, so only NV3-based GPUs will be analyzed.
The memory mapping is split into three primary components, all exposed via memory-mapped I/O through Base Address Registers (BAR) in PCI configuration space; there is no port I/O support outside of the Weitek core's registers for SVGA compatibility. The RIVA 128 uses two BARs, both 16 MB in size: BAR0 holding the main GPU registers, and BAR1 holding the `DFB` and `RAMIN` areas (which really refer to overlapping areas of memory).
@@ -233,139 +233,132 @@ This MMIO area has numerous functional subsystems of the GPU mapped into it, wit
This area is effectively the last megabyte of VRAM (regardless of VRAM size), but organized as 16-byte blocks which are then stored from the top down. A `RAMIN` address can be converted to a real VRAM address with the formula `ramin_address ^ (vram_size - 16)`. I'm not entirely sure why they did this, but I assume it was for providing a more convenient interface to the user and for general efficiency reasons.
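The address conversion above can be sketched as follows. This is a minimal illustration of the quoted formula only; the function name and the 4 MB VRAM size in the usage note are assumptions made for the example.

```python
def ramin_to_vram(ramin_address: int, vram_size: int) -> int:
    """Convert a RAMIN address to a VRAM address using the formula from the text.

    XORing with (vram_size - 16) flips every address bit above the low four,
    so 16-byte blocks are laid out from the top of VRAM downwards while the
    byte order inside each block is preserved.
    """
    return ramin_address ^ (vram_size - 16)


VRAM_4MB = 4 * 1024 * 1024  # assumed VRAM size, for illustration only

first_block = ramin_to_vram(0x00000, VRAM_4MB)   # top 16-byte block of VRAM
second_block = ramin_to_vram(0x00010, VRAM_4MB)  # the block just below it
```

With 4 MB of VRAM, `RAMIN` address `0x00000` lands at `0x3FFFF0` and `0x00010` at `0x3FFFE0`. The mapping is its own inverse, so the same function converts a VRAM address in that region back to a `RAMIN` address.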
#### Interrupts
### Interrupts
A traditional interrupt system is implemented, supporting interrupts issued by different GPU components. `PMC` contains an interrupt status register and an interrupt enable register, with one bit for each component (including the eventually-removed `PAUDIO`), as well as a software interrupt represented by bit 31; components also have a local status register and enable register, with each bit representing an individual interrupt from that block. If the `PMC` interrupt status and enable bits for a given component are both 1, with some minor exceptions to be explained in later parts, an interrupt is declared to be pending and a PCI IRQ is sent.
Interrupts can be turned off globally (or just component interrupts, or just the software interrupt) via the `PMC_INTR_EN` register at `0x0140`.
Interrupts can be turned off globally (or just component interrupts, or just the software interrupt) via the `PMC_INTR_EN` register.
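The top-level pending check described above can be sketched like this. It is a simplified model under stated assumptions: the function names are made up for the example, bit 4 in the usage note is an arbitrary component bit, and the "minor exceptions" mentioned above are deliberately ignored.

```python
SOFTWARE_INTERRUPT_BIT = 31  # per the text, bit 31 represents the software interrupt


def pending_mask(status: int, enable: int) -> int:
    """Bits that are set in both the status and the enable register."""
    return status & enable & 0xFFFFFFFF


def irq_asserted(status: int, enable: int) -> bool:
    """A PCI IRQ is raised if at least one source is both flagged and enabled."""
    return pending_mask(status, enable) != 0


def pending_sources(status: int, enable: int) -> list[int]:
    """List the bit numbers of all pending interrupt sources."""
    mask = pending_mask(status, enable)
    return [bit for bit in range(32) if (mask >> bit) & 1]


# Hypothetical example: a component on bit 4 and the software interrupt are both
# flagged, but only the software interrupt is enabled.
status = (1 << SOFTWARE_INTERRUPT_BIT) | (1 << 4)
enable = 1 << SOFTWARE_INTERRUPT_BIT
sources = pending_sources(status, enable)
```

An emulator would typically re-evaluate this check whenever either register changes, raising or lowering the IRQ line accordingly.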
#### Programmable interval timer
### Programmable interval timer
Time-sensitive functions are provided by a relatively simple programmable interval timer, `PTIMER`, which fires an interrupt whenever the threshold value in nanoseconds (set via `PTIMER_ALARM`) is exceeded. This is how the drivers internally keep track of many actions that they need to perform, and it is the first functional block you must get right if you ever hope to emulate the RIVA 128.
The least straightforward part of this timer is the counter, a 56-bit value split across two 32-bit registers: the lower 27 bits are stored in bits [31:5] of `PTIMER_TIME0`, and the upper 29 bits are stored in bits [28:0] of `PTIMER_TIME1`.
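The register split described above can be sketched as a pair of helpers. This is a minimal sketch that models only the bit layout given in the text; the function names are illustrative, and the unused bits (bits [4:0] of `PTIMER_TIME0` and bits [31:29] of `PTIMER_TIME1`) are simply treated as zero here.

```python
TIME0_SHIFT = 5                 # the low part occupies bits [31:5] of PTIMER_TIME0
LOW_MASK = (1 << 27) - 1        # 27 low bits of the counter
HIGH_MASK = (1 << 29) - 1       # 29 high bits of the counter


def split_counter(counter: int) -> tuple[int, int]:
    """Split a 56-bit counter into (PTIMER_TIME0, PTIMER_TIME1) register values."""
    time0 = (counter & LOW_MASK) << TIME0_SHIFT
    time1 = (counter >> 27) & HIGH_MASK
    return time0, time1


def join_counter(time0: int, time1: int) -> int:
    """Rebuild the 56-bit counter from the two register values."""
    low = (time0 >> TIME0_SHIFT) & LOW_MASK
    high = time1 & HIGH_MASK
    return (high << 27) | low


time0, time1 = split_counter((1 << 56) - 1)  # largest possible counter value
```

For the largest 56-bit value, this yields `0xFFFFFFE0` for `PTIMER_TIME0` and `0x1FFFFFFF` for `PTIMER_TIME1`.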
#### Graphics commands and DMA engine
### Graphics commands and DMA engine
What may be called *graphics commands* in other GPU architectures are instead called *graphics objects* in the NV3 and all other NVIDIA architectures. Objects are submitted into the GPU core via a custom direct memory access engine with its own translation lookaside buffer and other memory management structures, although programmed I/O can also be used.
What may be called *graphics commands* in other GPU architectures are instead called *graphics objects* in the NV3 and all other NVIDIA architectures. Objects are submitted into the GPU core via a custom direct memory access engine with its own translation lookaside buffer and other memory management structures, although programmed I/O can also be used as a slower alternative.
There are 8 DMA channels, with the default being channel 0, but only one can be used at a time; using other channels requires a *context switch*, which entails writing the current channel ID to PGRAPH registers for every class. All DMA channels use 64 KB of RAMIN memory (to be explained later), further divided into 8 KB subchannels; the meaning of what is in those subchannels depends on the type (or *class* to use NVIDIA terminology) of object submitted into them, with the attributes of each object being called a *method*.
There are 8 DMA channels, with the default being channel 0 (also the only channel accessible through PIO?), but only one can be used at a time; using other channels requires a *context switch*, which entails writing the current channel ID to PGRAPH registers for every class. All DMA channels use 64 KB of RAMIN memory (to be explained later), further divided into 8 KB subchannels; the meaning of what is in those subchannels depends on the type (or *class* to use NVIDIA terminology) of object submitted into them, with the attributes of each object being called a *method*.
All objects have a *context*, consisting of a 32-bit "name" and another 32-bit value storing its class, associated channel and subchannel ID, where it is relative to the start of `RAMIN`, and whether it's a software-injected or hardware graphical rendering object (bit 31). Contexts are stored in an area of RAM called `RAMFC` if the object's channel is not being used; otherwise, they are stored in `RAMHT`, a *hash table* where the hash key is a single byte calculated by XORing each byte of the object's name[^htdriver] as well as the channel ID. Objects are stored in `RAMHT` as structures consisting of their 8-byte context followed by the *methods* mentioned earlier; an object's byte offset in `RAMHT` is its hash multiplied by 16.
[^htdriver]: Object names below 4096 are reserved on NVIDIA's drivers, which also have the duty to prevent the hash table area from getting full with only basic error handling from the hardware itself.
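The hash calculation described above can be sketched as follows. This is a minimal reading of the text rather than a verified model of the hardware: it assumes the four bytes of the 32-bit name and the channel ID are all XORed together into one byte, and the function names are made up for the example.

```python
RAMHT_ENTRY_STRIDE = 16  # per the text, byte offset = hash * 16


def ramht_hash(name: int, channel_id: int) -> int:
    """XOR the four bytes of the 32-bit object name together with the channel ID."""
    result = channel_id & 0xFF
    for shift in (0, 8, 16, 24):
        result ^= (name >> shift) & 0xFF
    return result


def ramht_offset(name: int, channel_id: int) -> int:
    """Byte offset of the object's entry from the start of RAMHT."""
    return ramht_hash(name, channel_id) * RAMHT_ENTRY_STRIDE


offset = ramht_offset(0x12345678, 0)  # hypothetical object name on channel 0
```

For the name `0x12345678` on channel 0, the bytes `0x12`, `0x34`, `0x56` and `0x78` XOR to `0x08`, giving a byte offset of 128. Different names can produce the same hash, which is presumably part of why the drivers have to manage the table carefully.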
The exact methods of every graphics object are incredibly long and often shared between several different types of objects (although the first `0x100` bytes are shared and usually the first bytes after that are shared too) and won't be listed in part 1, but an overall list of graphics objects (note - these are the graphics objects defined by the *hardware*, the *drivers* implement their own, much larger set of graphics objects that do not map exactly to the ones in the GPU; furthermore, as you will see later, due to the large - 8KB - size of each object, *only one object does not mean only one - or even any - single object is drawn!*):
The exact methods of every graphics object are incredibly long and often shared between several different types of objects (the first 256 bytes, and usually a few more after that, are shared), and thus won't be listed in part 1. An overall list of graphics objects can be found in the next section, but note that these are the ones defined by the hardware, while the drivers implement a much larger set of objects that do not map exactly to the ones in the GPU. Furthermore, as you will see later, because each object is quite large at 8 KB, a single submitted object does not necessarily draw exactly one item, or even anything at all. Objects can also be connected together with a special type of object called a "patchcord"; the name is a remnant from the old NV1 quad patching days.
**`0x01` (Beta factor)**: The beta factor used for blending operations (combining an output pixel with another pixel to produce a final image).
Graphics objects are sent via DMA or PIO to one of two caches within the `PFIFO` subsystem: `CACHE0` which holds a single entry (really intended for the notifier engine - more on it later - to be able to inject graphics commands), or `CACHE1` which holds 32 entries on revisions A-B and 64 on revision C onwards. What these critical components actually do will be explored in full in later parts, but they effectively just store object names and contexts as they are waiting to be sent to `RAMIN`; a "pusher" pushes objects in from the bus and a "puller" pulls them out of the bus and sends them where they need to be inside of the VRAM (or to `RAMRO` if they are invalid).
**`0x02` (ROP5 operation)**: The Render OPeration used for blending (e.g. XOR)
Once objects are pulled out, the GPU manipulates the various registers in the `PGRAPH` subsystem in order to draw them. Objects do not "disappear" on frame refresh; instead, they appear to simply be drawn over, and most likely any renderer will clear the entire screen (with a *Rectangle* object, for instance) before resubmitting the graphics objects it needs to render.
**`0x03` (Chroma Key)**: Similar to a color key used in video editing.
Both `RAMFC` and `RAMHT` can have their sizes, and to some extent their location within `RAMIN`, configured by registers within the `PFIFO` block. `RAMHT` can be 4 KB (of questionable usefulness as that cannot fill `CACHE1`), 8 KB, 16 KB, or 32 KB in size, while `RAMFC` is either 512 bytes or 8 KB.
**`0x04` (Plane mask)**: Seems to be implemented similar to chroma key, unsure what it has to do with planes (bitplane? 2d plane?)
#### Object list
**`0x05` (Clipping rectangle)**: A rectangle used for enabling/disabling render operations within a specific region
Any class values not listed here are invalid; in theory, the 5-bit value in the object context allows for 32 classes, but NVIDIA did not implement the full amount, and moved to a different approach (where the classes are somewhat more constructed in software) with the NV4 architecture.
**`0x06` (Pattern)**: Pattern used for bitblit and other blits
**`0x01` (Beta factor):** The beta factor used for blending operations (combining an output pixel with another pixel to produce a final image).
**`0x07` (Rectangle)**: Up to 16 rectangles with size and position represented as a 32-bit value (Y as high 16 bits, X as low 16)
**`0x02` (ROP5 operation):** The raster operation (ROP) used for blending (e.g. XOR).
**`0x08` (Point)**:
**`0x03` (Chroma Key):** Similar to a color key used in video editing.
An arbitrary point on the screen. Depending on the methods used to submit the object, this object can take the form of:
**`0x04` (Plane mask):** Seems to be implemented similarly to *Chroma Key*; I'm not sure what it has to do with planes (bitplane? 2D plane?).
* Up to 32 points, each with a single arbitrary 32-bit colour (probably BGRA format) and 16-bit size and position values.
* Up to 16 points, each with a single arbitrary 32-bit colour (probably BGRA format) and 32-bit size and position values.
**`0x05` (Clipping rectangle):** A rectangle used for enabling/disabling render operations within a specific region.
**`0x06` (Pattern):** Pattern used for bitblit and other blits.
**`0x07` (Rectangle):** Up to 16 rectangles with size and position represented as a 32-bit value (low 16 bits are X, high 16 bits are Y).
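The 32-bit value used by *Rectangle* can be sketched as a simple pack/unpack pair. This is only an illustration of the bit layout given above; the helper names are made up, and the values are treated as unsigned 16-bit quantities, which is an assumption.

```python
def pack_xy(x: int, y: int) -> int:
    """Pack two 16-bit values into one 32-bit word: X in bits 15:0, Y in bits 31:16."""
    return ((y & 0xFFFF) << 16) | (x & 0xFFFF)


def unpack_xy(value: int) -> tuple[int, int]:
    """Recover (x, y) from a packed 32-bit word."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF


packed = pack_xy(640, 480)  # hypothetical position or size
```

Here `pack_xy(640, 480)` produces `0x01E00280`, since 480 is `0x01E0` and 640 is `0x0280`.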
**`0x08` (Point):** An arbitrary point on the screen. Depending on the methods used to submit the object, this object can take the form of:
* Up to 32 points, each with a single arbitrary 32-bit colour (probably BGRA format) and 16-bit size and position values;
* Up to 16 points, each with a single arbitrary 32-bit colour (probably BGRA format) and 32-bit size and position values;
* Up to 16 points, making up a polygon, with an arbitrary 32-bit colour for each polygon line (probably BGRA format) and 16-bit size and position values.
**`0x09` (Line)**:
**`0x09` (Line):** An arbitrary line on the screen. Depending on the methods used to submit the object, this object can take the form of:
An arbitrary line on the screen. Depending on the methods used to submit the object, this object can take the form of:
* Up to 16 lines, each with a single arbitrary 32-bit colour (probably BGRA format) and 16-bit size and position values.
* Up to 8 lines, each with a single arbitrary 32-bit colour (probably BGRA format) and 32-bit size and position values.
* Up to 32 lines, each making up a polygon, with a single arbitrary 32-bit colour (probably BGRA format) and 16-bit size and position values.
* Up to 16 lines, each making up a polygon, with a single arbitrary 32-bit colour (probably BGRA format) and 32-bit size and position values.
* Up to 16 lines, each with a single arbitrary 32-bit colour (probably BGRA format) and 16-bit size and position values;
* Up to 8 lines, each with a single arbitrary 32-bit colour (probably BGRA format) and 32-bit size and position values;
* Up to 32 lines, each making up a polygon, with a single arbitrary 32-bit colour (probably BGRA format) and 16-bit size and position values;
* Up to 16 lines, each making up a polygon, with a single arbitrary 32-bit colour (probably BGRA format) and 32-bit size and position values;
* Up to 16 lines, each making up a polygon, with an arbitrary 32-bit colour for each polygon line (probably BGRA format) and 16-bit size and position values.
**`0x0A` (Lin)**: The exact same as the Line object, but the starting and ending pixels are not drawn for each line.
**`0x0A` (Lin):** Same as *Line*, but the starting and ending pixels are not drawn for each line.
**`0x0B` (Triangle)**:
**`0x0B` (Triangle):** A basic (presumably pre-transformed?) 2D triangle. Depending on the methods used to submit the object, this object can take the form of:
A basic (presumably pre-transformed...?) 2D triangle. Depending on the methods used to submit the object, this object can take the form of:
* A single triangle with a single arbitrary 32-bit colour and three 16-bit position values for each of the triangle's vertexes.
* A single triangle with a single arbitrary 32-bit colour for the entire mesh, and three 16-bit position values for each of the triangle's vertexes.
* A part of a mesh of up to 32 triangles with a single arbitrary 32-bit colour and two 16-bit position values for each of the points on the mesh.
* A part of a mesh of up to 16 triangles with a single arbitrary 32-bit colour and two 32-bit position values for each of the points on the mesh.
* A set of up to 8 triangles with a single arbitrary 32-bit colour for the entire mesh, and three 16-bit position values for each of the triangle's vertexes.
* A single triangle with a single arbitrary 32-bit colour and three 16-bit position values for each of the triangle's vertexes;
* A single triangle with a single arbitrary 32-bit colour for the entire mesh, and three 16-bit position values for each of the triangle's vertexes;
* A part of a mesh of up to 32 triangles with a single arbitrary 32-bit colour and two 16-bit position values for each of the points on the mesh;
* A part of a mesh of up to 16 triangles with a single arbitrary 32-bit colour and two 32-bit position values for each of the points on the mesh;
* A set of up to 8 triangles with a single arbitrary 32-bit colour for the entire mesh, and three 16-bit position values for each of the triangle's vertexes;
* A part of a mesh of up to 16 triangles with a 32-bit colour and two 32-bit position values for each of the points on the mesh.
**`0x0C` (Windows 95 GDI Text Acceleration)**: A piece of hardware functionality intended to accelerate the manner by which Windows 95's GDI (and its DIB Engine?) renders text. This is a very complicated set of clipping logic that won't be covered until Part 3 - it's too long for this part, and I don't fully understand it yet.
**`0x0C` (Windows 95 GDI Text Acceleration):** A specialized hardware accelerator for the manner by which Windows 95's GDI (and its DIB Engine?) renders text. This is a very complicated set of clipping logic that won't be covered until Part 3; it's too long for this part, and I don't fully understand it yet.
**`0x0D` (Memory to memory format)**: Changes the format of a set of pixels in VRAM. Allows changing the line (vertical size) length, count and pitch of the image.
**`0x0D` (Memory to memory format):** Changes the format of a set of pixels in VRAM. Allows for changing the line (vertical size) length, count and pitch of the image.
**`0x0E` (Scaled image from memory)**: Obtain an image from VRAM and scale it before displaying it to the screen. It may be in YUV or RGB format. Performs a bit of differentiation to achieve this; takes an output position and size for the final screen as well as an input position or size.
**`0x0E` (Scaled image from memory):** Obtains an image from VRAM (in YUV or RGB format) and scales it before displaying it to the screen, performing a bit of differentiation to achieve this. Parameters are an output position and size for the final screen as well as an input position or size.
**`0x10` (Blit)**: Blit something (a final image made up of 3D polygons, or a 2D image) between two different parts of the screen. Has an input and output position and a size.
**`0x10` (Blit):** Blits an image (a final one made up of 3D polygons or a 2D one) between two different parts of the screen. Has an input and output position and a size.
**`0x11` (Image from CPU)**: Take an image from "CPU" (main memory?), optionally scale it, and then display it on the screen. Takes an input size, set of 32-bit colour values and output position and size.
**`0x11` (Image from CPU):** Takes an image from "CPU" (main memory?), optionally scales it, and then displays it on the screen. Parameters are an input size, set of 32-bit colour values and output position and size.
**`0x12` (Bitmap)**: Similar to 0x11, but deals with monochrome or two-colour bitmaps instead (possibly as an optimisation).
**`0x12` (Bitmap):** Similar to *Image from CPU*, but deals with monochrome or two-colour bitmaps instead, possibly as an optimisation.
**`0x14` (Transfer to Memory)**: Take an image from the screen (?) and transfer it to memory. Takes a start position offset from VRAM and a pitch, as well as a position and size for the image.
**`0x14` (Transfer to Memory):** Takes an image from the screen (?) and transfers it to memory. Parameters are a start position offset from VRAM and a pitch, as well as a position and size for the image.
**`0x15` (Stretched image from CPU)**: Take an image from "CPU" (main memory?), stretch it (using an optional clip region and a little bit of differentiation) and then use it. Takes an input size and a clip region using the same 16-bit coordinate format used by the basic primitive drawing silicon.
**`0x15` (Stretched image from CPU):** Takes an image from "CPU" (main memory?), stretches it using an optional clip region and a little bit of differentiation, and then uses it. Parameters are an input size and a clip region using the same 16-bit coordinate format used by the basic primitive drawing silicon.
**`0x17` (Direct3D 5.0 accelerated triangle with zeta buffer)**:
Seemingly an attempt to implement the Direct3D 5.0 specification to the letter in silicon.
Allows for up to 128 triangles to be submitted at a time, with six coordinates:
**`0x17` (Direct3D 5.0 accelerated triangle with zeta buffer):** Seemingly an attempt to implement the Direct3D 5.0 specification to the letter in silicon. Allows for up to 128 triangles to be submitted at a time, with six coordinates:
* The traditional X, Y and Z coordinates used for representing vector values in 3D space
* U and V coordinates for textures. Textures may be *uploaded* at sizes up to 2048x2048 (only power of two textures are allowed!), but are scaled down to 256x256 during upload, if they are larger.
* An "M" coordinate, apparently a "measurement dimension" used for more precise measurement of real-world distances
Each triangle may have a 32-bit colour value as well.
**Note:** The RIVA 128 is not a multitexture-capable GPU! You can only apply one texture to each batch of 128 triangles. So the implementation of Direct3D in the drivers should attempt to, as close as possible, send as many triangles with the same texture to the GPU as the GPU can fit, and you should try to have objects with the same texture have close to a multiple of 128 triangles for each texture if you write applications targeting this GPU and the D3D driver implements this optimisation, because this will improve the efficiency of your renderer!
Each triangle may have a 32-bit colour value as well. Note that the RIVA 128 is not a multitexture-capable GPU; you can only apply one texture to each batch of 128 triangles, so the Direct3D implementation in the drivers should attempt to send as many triangles sharing the same texture as the GPU can fit in each batch. If you write applications targeting this GPU, try to ensure that the objects sharing a texture add up to close to a multiple of 128 triangles; if the D3D driver implements this optimisation, that will improve the efficiency of your renderer.
These triangles, as a group, may have the following effects applied to them:
* "Zeta buffer" (may be similar to the Z-buffer used for polygon ordering...or for mipmapping?)
* "Zeta buffer" (may be similar to the Z-buffer used for polygon ordering, or for mipmapping?)
* "Alpha buffer" (probably for alpha blending)
* Specular highlighting
* Vertex fog (of any 32-bit colour)
* Interpolation between vertex positions (using a zero-order hold, "Microsoft" variant of zero-order hold, or full-order hold implementation)
* Back-face culling, clockwise or counterclockwise (discarding triangles, although presumably this would only work in the batch of 128 triangles sent to the hardware for processing)
* Texture UV coordinate wrapping for seamless textures (they can wrap cleanly, be clamped to their "last" pixels or mirror themselves)
* Texture UV coordinate wrapping for seamless textures (coordinates can wrap cleanly, be clamped to their "last" pixels or mirror themselves)
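The batching advice for class `0x17` (one texture per batch of up to 128 triangles) can be sketched from the application or driver side as follows. This is a hypothetical illustration, not how NVIDIA's drivers are known to work: the function name, the `(texture_id, triangle)` input shape and the grouping strategy are all assumptions.

```python
from collections import defaultdict

MAX_TRIANGLES_PER_BATCH = 128  # per the text, one texture applies to each batch of 128


def batch_by_texture(triangles):
    """Group (texture_id, triangle) pairs into per-texture batches of at most 128.

    Returns a list of (texture_id, [triangle, ...]) tuples, with textures kept
    in the order in which they were first seen.
    """
    by_texture = defaultdict(list)
    for texture_id, triangle in triangles:
        by_texture[texture_id].append(triangle)

    batches = []
    for texture_id, group in by_texture.items():
        for start in range(0, len(group), MAX_TRIANGLES_PER_BATCH):
            batches.append((texture_id, group[start:start + MAX_TRIANGLES_PER_BATCH]))
    return batches


# Hypothetical scene: 300 triangles using "brick" and 10 using "grass".
scene = [("brick", i) for i in range(300)] + [("grass", i) for i in range(10)]
batch_sizes = [(tex, len(tris)) for tex, tris in batch_by_texture(scene)]
```

In this example the brick triangles fill two full batches plus a partial one of 44, which shows why triangle counts close to a multiple of 128 per texture waste the least capacity.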
**`0x18` (Point with zeta buffer)**:
The same as `0x08` (point), but the zeta and alpha buffer can be applied to it too.
**`0x18` (Point with zeta buffer):** Similar to *Point*, but the zeta and alpha buffer can be applied to it too.
Any values not listed are invalid. In theory, since there are 5 bits in the FIFO object context reserved for classes, there can be up to 32 classes, but NVIDIA did not implement 32 classes and moved to a different approach (one where the classes are somewhat more constructed in software) with the NV4 architecture.
#### When you screw up: RAMRO
These graphics objects are then sent (via one of two methods - Parallel I/O, which is basically DMA but only using Channel 0(?) and slower, or using the full DMA engine) to one of two caches within the `PFIFO` subsystem, the single-entry `CACHE0` (which is really intended for the aforementioned notifier engine to be able to inject graphics commands) or the multi-entry (32 on revision A or B cards; 64 on revision C or higher) `CACHE1`. These effectively - a full exploration of what these critical components actually do will be later parts of this - just store object names and contexts as they are waiting to be sent to `RAMIN`; a "pusher" pushes them in from the bus and a "puller" pulls them out of the bus and sends them where they need to be inside of the VRAM (or if they are invalid, to `RAMRO`). Once they are pulled out, the GPU will simply manipulate the various registers in the `PGRAPH` subsystem in order to draw the object. Objects do not "disappear" on frame refresh - in fact, it would simply appear that they are simply drawn over. Most likely, any renderer will simply clear the entire screen - e.g. with a Rectangle object, before resubmitting any graphics objects that they need to render.
Aside from the previously-covered `RAMFC` and `RAMAU`, another important structure is stored in `RAMIN`. `RAMRO` saves the day and prevents the GPU from blowing up if a graphics object you submit is invalid, because after all, nothing is perfect and there are always bugs in code.
Objects are connected together with a special type of object called a "patchcord" (a name leftover from the old NV1 quad patching days).
During object submission, if the GPU detects that the cache ran out, was turned off, or any kind of illegal access was performed, the submission is not processed; instead, it is sent to a special area of `RAMIN` known as `RAMRO` (always half the size of `RAMHT`), which stores the object, what went wrong, and whether a read or write operation was involved in the error. Additionally, an interrupt is fired so that any drivers running on the system can catch the error and (hopefully) correct it.
*Both `RAMFC` and `RAMHT` can have their sizes, and to some extent their location within RAMIN, configured by configuration registers within the `PFIFO` block. RAMHT can be 4KB (which is rather useless since it can't fill up PFIFO CACHE1), 8KB, 16KB, or 32KB. RAMFC is either 512 bytes or 8 KB in size.
#### When You Screw Up: RAMRO
We already covered RAMFC and RAMAU. But there is another important structure stored in RAMIN - after all, not every single graphics object you submit is going to be valid. There are always bugs in code, and when you fuck up, RAMRO (Ram RunOut) is here to save the day and prevent the GPU from blowing up.
If the GPU detects either that the cache ran out during submission, that the cache was turned off, or any kind of illegal access that it doesn't like, your graphics object submission will not be processed, but will instead be sent to a special area of `RAMIN` known as `RAMRO` (which is always half the size of `RAMHT`) that will store the object, what went wrong, if you were trying to write or read when it happened, and report an error by firing an interrupt (the `PFIFO_RUNOUT_STATUS` register also holds the current state of the `RAMRO` region, and if any errors occurred) so that any drivers running on the system can catch the error and (hopefully) correct it.
The `PFIFO_RUNOUT_STATUS` register holds the current state of the `RAMRO` region, including whether or not any errors have occurred.
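The size relationship between `RAMHT` and `RAMRO` can be sketched as a small helper. It only encodes what the text states (the four `RAMHT` sizes, and `RAMRO` always being half of `RAMHT`); the function name is illustrative, and how these sizes are actually encoded in the `PFIFO` registers is not modelled.

```python
KB = 1024
VALID_RAMHT_SIZES = (4 * KB, 8 * KB, 16 * KB, 32 * KB)  # sizes listed in the text


def ramro_size(ramht_size: int) -> int:
    """Return the RAMRO size in bytes for a given RAMHT size (always half of it)."""
    if ramht_size not in VALID_RAMHT_SIZES:
        raise ValueError(f"unsupported RAMHT size: {ramht_size} bytes")
    return ramht_size // 2


sizes = {ramht: ramro_size(ramht) for ramht in VALID_RAMHT_SIZES}
```

This gives `RAMRO` sizes of 2 KB, 4 KB, 8 KB and 16 KB for the four `RAMHT` configurations.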
#### RAMAU
Not really sure what this is for but I assume it's a spare area for random stuff.
#### Interrupts 2.0: Notifiers
I'm not really sure what `RAMAU` is for, but I assume it's a spare area for random stuff.
### Interrupts 2.0: Notifiers
However, some people at NVIDIA decided that they were too cool for interrupts. Why have an interrupt that tells the GPU to do something, when *you could have an interrupt that has the GPU tell the drivers to do something!* So they implemented the incredible "notifier" system. It appears to have been implemented to allow the drivers to manage the GPU resources when the silicon could not implement them. Every single subsystem in the GPU has a notifier enable register alongside its interrupt enable register (some have multiple different notifier enable registers for different types of notifiers!). Notifiers appear to be intended to work with the object class system (although they may also exist within GPU subsystems, they mostly exist within `PGRAPH`, `PME` and `PVIDEO`) and are actually different *per class of object*, with each object having a set of "notification parameters" that can be used to trigger a notification and are triggered by the `SetNotify` method at `0x104` within an object when it is stored inside of `RAMHT`. There is also the `SetNotifyCtxDma` method, usually but not always at `0x0`, which is used for the aforementioned context switching. Notifiers appear to be "requested" until the GPU processes them, and `PGRAPH` can take up to 16 software and 1 hardware notifier type.
More research is ongoing. It seems most notifiers are generated by the driver in order to manage hardware resources that they would not otherwise be capable of managing, such as the PFIFO caches.
#### PRAMDAC
### PRAMDAC
`PRAMDAC` is the final part of the GPU: it handles the intricacies of generating a video signal, sets the resolution, and holds a color lookup table for the various modes. I haven't looked into this part as much, so expect more information in an update or in future parts of this series. It's not really super critical to emulate anyway, other than the fact that it actually controls the aforementioned clocks; the actual video-generation part mostly does not apply to emulation, as we don't need to generate an analog video signal.
---