From eebeb8bc62cc8a6fb250101eaa8d075c1380d5f1 Mon Sep 17 00:00:00 2001
From: starfrost013
Date: Thu, 6 Feb 2025 21:17:22 +0000
Subject: [PATCH] various fixes/changes

---
 _posts/2025-01-22-riva128-part-1.md | 34 ++++++++++++++---------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/_posts/2025-01-22-riva128-part-1.md b/_posts/2025-01-22-riva128-part-1.md
index 8b3fe10..d07aad6 100644
--- a/_posts/2025-01-22-riva128-part-1.md
+++ b/_posts/2025-01-22-riva128-part-1.md
@@ -68,23 +68,23 @@ NVIDIA lost $6.4 million in 1995 on a revenue of $1.1 million, and $3 million on

Nevertheless, NVIDIA grew to close to a hundred employees, including sales and marketing teams. The company, and especially its cofounders, remained confident in their architecture and overall prospects of success. They had managed to solidify a business relationship with Sega, to the point where they had initially won the contract to provide the graphics hardware for the successor to the Sega Saturn, at that time codenamed "V08". The GPU was codenamed "Mutara" (after the nebula critical to the plot in *Star Trek II: The Wrath of Khan*) and the overall architecture was the **NV2**. It maintained many of the functional characteristics of the NV1 and was essentially a more powerful successor to that card. According to available sources, this would have been the only NVIDIA chip manufactured by the then-just founded Helios Semiconductor.

-However, problems started to emerge almost immediately. Game developers, especially Sega's internal teams, were not happy with having to use a GPU with such a heterodox design; for example, porting games to or from the PC, which Sega did do at the time, would be made far harder. This position was especially championed by Yu Suzuki, head of one of Sega's most prestigious internal development teams Sega-AM2, responsible for the *Daytona USA*, *Virtua Racing*, *Virtua Fighter*, and *Shenmue* series among others, who sent his best graphics programmer to interface with NVIDIA and push for triangles. At this point, the story diverges: some tellings claim that NVIDIA simply refused to accede to Sega's request and this severely damaged their relationship; others that the NV2 wasn't killed until it failed to produce any video during a demonstration, and Sega still paid NVIDIA for developing it to prevent bankruptcy, with one engineer apparently getting it working for the sole purpose of receiving a milestone payment.
+However, problems started to emerge almost immediately. Game developers, especially Sega's internal teams, were not happy with having to use a GPU with such a heterodox design; for example, porting games to or from the PC, which Sega did do at the time, would be made far harder. This position was especially championed by Yu Suzuki, head of one of Sega's most prestigious internal development teams, Sega-AM2, responsible for the *Daytona USA*, *Virtua Racing*, *Virtua Fighter*, and *Shenmue* series among others, who sent his best graphics programmer to interface with NVIDIA and push for a change to a more traditional triangle-based rendering approach.
At this point, the story diverges: some tellings claim that NVIDIA simply refused to accede to Sega's request and this damaged their relationship irreparably, leading to the NV2's cancellation; others that the NV2 wasn't killed until it failed to produce any video during a demonstration, and Sega still paid NVIDIA for developing it to prevent bankruptcy, with a single engineer apparently assigned to (and succeeding at) getting the card working for the sole purpose of receiving a milestone payment.

-At some point, Sega, as a traditional Japanese company, couldn't simply kill the deal, so the NV2 was officially relegated to be used in the successor to the educational toddler-aimed Sega Pico, while in reality, Sega of America had already been told to "not worry" about NVIDIA anymore. NVIDIA got the hint, and the NV2 was cancelled. With NV1 and NV2 out of the picture, NVIDIA had no sales, no customers, and barely any money; at some point in late 1996, the company had $3 million and was burning through $330,000 a month, and most of the NV2 team had been redeployed to the next-generation NV3. No venture capital funding was going to be forthcoming due to the failure to actually create any products people wanted to buy, at least not without extremely unfavourable terms on things like ownership. The company was effectively almost a complete failure and a waste of years of the employees' time.
+At some point, Sega, as a traditional Japanese company, couldn't simply kill the deal, so the NV2 was officially relegated to be used in the successor to the educational toddler-aimed Sega Pico, while in reality, Sega of America had already been told to "not worry" about NVIDIA anymore. NVIDIA got the hint, and the NV2 was cancelled. With both NV1 and NV2 out of the picture, NVIDIA had no sales, no customers, and barely any money; at some point in late 1996, the company had $3 million and was burning through $330,000 a month, and most of the NV2 team had been redeployed to the next-generation NV3. No venture capital funding was going to be forthcoming due to the failure to actually create any products people wanted to buy, at least not without extremely unfavourable terms on things like ownership. The company was effectively almost a complete failure and a waste of years of the employees' time.

### Near destruction of the company

-By the end of 1996, things had gotten infinitely worse, with the competition heating up; despite NV1 being the first texture-mapped consumer GPU ever released, they had been fundamentally outclassed by their competition. It was a one-two punch: initially, Rendition - founded around the same time as NVIDIA in 1993 - released its V1000 chip based on a custom RISC architecture, and while not particularly fast, it was, for a few months, the only card that could run Quake (the hottest game of 1996) in hardware accelerated mode. The V1000 was an early market leader, alongside S3's laughably bad ViRGE (Video and Rendering Graphics Engine) which was infamously slower than software rendering on high-end CPUs at launch, and was reserved for high-volume OEM bargain-bin disaster machines.
+By the end of 1996, things had gotten infinitely worse, with the competition heating up extraordinarily fast; despite the NV1 being the first texture-mapped consumer GPU ever released, NVIDIA had been fundamentally outclassed by their competitors.
It was a one-two punch: initially, Rendition - founded around the same time as NVIDIA in 1993 - released its V1000 chip based on a custom RISC architecture, and while not particularly fast, it was, for a few months, the only card that could run Quake (the hottest game of 1996) in hardware-accelerated mode. The V1000 was an early market leader, alongside S3's laughably bad ViRGE (Video and Rendering Graphics Engine) which was infamously slower than software rendering on high-end CPUs at launch, and was reserved for high-volume OEM bargain-bin disaster machines.

-However, this was nothing compared to the body blow about to hit the entire industry, NVIDIA included. At a conference in early 1996, an $80,000 machine from SiliconGraphics, then the world leader in accelerated graphics, crashed during a demo by the then-CEO Ed McCracken. If accounts of the event are to be believed, while the machine rebooted, people who had heard rumors left the room and headed downstairs to another demo by a then-tiny company made up of ex-SGI employes calling itself "3D/fx" (later shortened to 3dfx), claiming comparable graphics quality for $250... with demos to prove it. As with many cases of supposed "wonder innovations" in the tech industry, it was too good to be true, but when their card, the "Voodoo Graphics" was first released in the form of the "Righteous 3D" by Orchid in October 1996, it turned out to be true. Despite the fact that it was a 3D-only card and required a 2D card to be installed, and the fact it could not accelerate graphics in a window (which almost all other cards could do), performance was so high relative to other products (including the NV1) that it not only had rave reviews on its own but also kicked off a revolution in consumer 3D graphics, which especially caught fire when GLQuake was released in January 1997.
+However, this was nothing compared to the body blow about to hit the entire industry, NVIDIA included. At a conference in early 1996, an $80,000 machine from SiliconGraphics, then the world leader in accelerated graphics, crashed during a demo by the then-CEO Ed McCracken. If accounts of the event are to be believed, while the machine rebooted, people who had heard rumors left the room and headed downstairs to another demo by a then-tiny company made up of ex-SGI employees calling themselves "3D/fx" (later shortened to 3dfx), claiming comparable graphics quality for $250... with demos to prove it. As with many cases of supposed "wonder innovations" in the tech industry, it sounded too good to be true, but when their card, the "Voodoo Graphics", was first released in the form of the "Righteous 3D" by Orchid in October 1996, it turned out to be true. Despite the fact that it was a 3D-only card and required a 2D card to be installed, and the fact it could not accelerate graphics in a window (which almost all other cards could do), performance was so high relative to other products (including the NV1) that it not only had rave reviews on its own but also kicked off a revolution in consumer 3D graphics, which especially caught fire when GLQuake was released in January 1997.

-The reasons for 3dfx being able to design such an effective GPU when all others failed were numerous.
The price of RAM plummeted by 80% through 1996, allowing the Voodoo's estimated retail price to be cut from $1000 to $300; many of their staff members came from SiliconGraphics, perhaps the most respected and certainly the largest company in the graphics industry of that time[^sgi]; and while 3dfx used the proprietary Glide API, it also supported OpenGL and Direct3D. Glide was designed to be very similar to OpenGL while allowing for 3dfx to approximate standard graphical techniques, which, as well as their driver design - the Voodoo only accelerates edge interpolation[^edge], texture mapping and blending, span interpolation[^span], and final presentation of the rendered 3D scene - the rest was all done in software. All of these factors were key in what proved to be an exceptionally low price for what was considered to be an exceptionally high quality for the time of the card.
+The reasons for 3dfx being able to design such an effective GPU when all others failed were numerous. The price of RAM plummeted by 80% throughout 1996, allowing the Voodoo's estimated retail price to be cut from $1000 to $300; many of their staff members came from SiliconGraphics, perhaps the most respected and certainly the largest company in the graphics industry of that time[^sgi]; and while 3dfx used the proprietary Glide API, it also supported OpenGL and Direct3D. Glide was designed to be very similar to OpenGL while allowing 3dfx to approximate standard graphical techniques, and it was complemented by their driver design - the Voodoo only accelerates edge interpolation[^edge], texture mapping and blending, span interpolation[^span], and final presentation of the rendered 3D scene, while the rest was all done in software. All of these factors were key to what proved to be an exceptionally low price for a card of exceptionally high quality for its time.

-[^sgi]: By 1997, SGI had over 15 years of experience in developing graphical hardware, while also suffering from rampant mismanagement in what would prove to be their terminal decline.
+[^sgi]: By 1997, SGI had over 15 years of experience in developing graphical hardware, while also suffering from rampant mismanagement and experiencing the start of what would later prove to be their terminal decline.

[^edge]: Where a triangle is converted into "spans" of horizontal lines, and the positions of nearby vertexes are used to determine the span's start and end positions.

-[^span]: To simplify a complex topic, in a GPU of this era, span interpolation generally involves Z-buffering (also known as depth buffering), sorting polygons back to front, and color buffering, storing the color of each pixel sent to the screen in a buffer which allows for blending and alpha transparency.
+[^span]: To simplify a complex topic, in a GPU of this era, span interpolation generally involves Z-buffering (also known as depth buffering), sorting polygons back to front, and color buffering, storing the color of each pixel sent to the screen in a buffer which allows for blending and alpha transparency. Some GPUs do not implement a Z-buffer, with examples including the NV1, the original ATI 3D Rage, and the PS1 Geometry Transformation Engine, so sorting of polygons has to be handled by the programmer.

Effectively, NVIDIA had to design a graphics architecture that could at the very least get close to 3dfx's performance, on a shoestring budget and with very little resources, as 60% of their staff (including the entire sales and marketing teams) had been laid off to preserve money.
They could not do a complete redesign of the NV1 from scratch if they felt the need to, as it would take two years (time they simply didn't have) and any design that came out of this effort would be immediately obsoleted by competitors, such as 3dfx's Voodoo series, and ATI's Rage which was initially rather pointless but rapidly advancing in performance and driver stability. The chip would also have to work reasonably well on the first tapeout, as there was no capital to produce more revisions of the chip. The fact NVIDIA were able to achieve a successful design in the form of the NV3 under such conditions was a testament to the intelligence, skill and luck of their designers; we will explore how they managed to achieve this later on this write-up.

@@ -92,7 +92,7 @@ Effectively, NVIDIA had to design a graphics architecture that could at the very

It was with these financial, competitive and time constraints in mind that design on the NV3 began in 1996. This chip would eventually be commercialised as the RIVA 128, standing for "Real-time Interactive Video and Animation accelerator" followed by a nod to its 128-bit internal bus width which was very large at the time. NVIDIA retained SGS-Thomson (soon to be STMicroelectronics) as their manufacturing partner, in exchange for SGS-Thomson cancelling their competing STG-3001 GPU. In a similar vein to the NV1, NVIDIA was to sell the chip as "NV3" and SGS-Thomson was to white-label it as STG-3000, once again separated by audio functionality; however, NVIDIA convinced SGS-Thomson to cancel their own part and stick to manufacturing the NV3 instead, which would prove to be a terrible decision when NVIDIA dropped them in favor of TSMC for manufacturing of the RIVA 128 ZX due to both yield issues and pressure from venture capital funders. STMicro went on to manufacture PowerVR chips for a few more years, before dropping out of the market entirely by 2001.

-After the NV2 disaster, the company made several calls on the NV3's design that turned out to be very good decisions. First, they acquiesced to Sega's advice (which they might have already done to save the Mutara V08/NV2 but it was too late) and moved to an inverse texture mapping triangle-based model, although some remnants of the original quad patching design remain. The unused DRM functionality was also remove, which may have been assisted by David Kirk[^dkirk] taking over from Curtis Priem as chief designer, as Priem insisted on including the DRM functionality with the NV1, citing piracy issues with the game he had written as a demo of the Malachowsky-designed GX GPU back when he worked at Sun.
+After the NV2 disaster, the company made several calls on the NV3's design that turned out to be very good decisions. First, they acquiesced to Sega's advice (which they might have already done to save the Mutara V08/NV2, but it was too late) and moved to an inverse texture mapping triangle-based model, although some remnants of the original quad patching design remain. The unused DRM functionality was also removed, which may have been assisted by David Kirk[^dkirk] taking over from Curtis Priem as chief designer, as Priem insisted on including the DRM functionality with the NV1, citing piracy issues with the game he had written as a demo of the Malachowsky-designed GX GPU back when he worked at Sun.

[^dkirk]: David Kirk is perhaps notable as a "Special Thanks" credit on *Gex* and the producer of the truly unparalleled *3D Baseball* on the Sega Saturn during his time at Crystal Dynamics.

@@ -126,7 +126,7 @@ After all of this history and exposition, we are finally ready to actually explo

NV3 is the third-generation NV architecture designed by NVIDIA in 1997, commercialised as the RIVA 128 family. It implements a fixed-function 2D and 3D render path primarily aimed at desktop software and video games, with hardware acceleration best described as partial by modern standards, but one of the more complete, fully-featured solutions for 1997. It can be attached through the legacy PCI 2.1 bus or AGP 1X (2X on the RIVA 128 ZX), a higher-speed superset of PCI designed for graphics which was brand new at the time but ultimately proved successful.

-The primary goals of this architecture were low manufacturing cost, short development time (due to NVIDIA's dire financial condition at the time), and beating the 3dfx Voodoo1 in raw pixel pushing performance. It generally achieved these goals with caveats, with a bulk cost of $15 per chip, a design period of around 9 months (excluding Revision B), and performance generally better than that of the Voodoo, in spite of 3dfx's more integrated Glide API, and NVIDIA's smaller performance advantage with large triangles.
+The primary goals of this architecture were low manufacturing cost, short development time (due to NVIDIA's dire financial condition at the time), and beating the 3dfx Voodoo1 in raw pixel pushing performance. It generally achieved these goals with caveats, with a bulk cost of $15 per chip, a design period of around 9 months (excluding Revision B), and performance generally better than that of the Voodoo, in spite of 3dfx's more integrated Glide API and the fact that NVIDIA's performance advantage was smaller with large triangles than with small ones.

While the focus of study has been the Revision B card, efforts have been made to understand the A and C revisions as well. Each revision has different values for the GPU ID in the framebuffer boot configuration register in MMIO space (at offset `0x100000`) and the PCI configuration space Revision ID register:

@@ -229,7 +229,7 @@ This MMIO area has numerous functional subsystems of the GPU mapped into it, wit

#### RAMIN

-`RAMIN` is also located in BAR1. It's a somewhat complicated area, but also the most important one to understand when it comes to actual operation of the GPU, as it's the part of video RAM where graphics objects and structures containing references to them are stored.
+`RAMIN` is also located in BAR1. It's a somewhat complicated area, but also the most important one to understand when it comes to the actual operation of the GPU, as it's the part of video RAM where graphics objects and structures containing references to them are stored.

This area is effectively the last megabyte of VRAM (regardless of VRAM size), but organized as 16-byte blocks which are then stored from the top down. A `RAMIN` address can be converted to a real VRAM address with the formula `ramin_address ^ (vram_size - 16)`. I'm not entirely sure why they did this, but I assume it was for providing a more convenient interface to the user and for general efficiency reasons.

@@ -243,23 +243,23 @@ Interrupts can be turned off globally (or just component interrupts, or just the

Time-sensitive functions are provided by a relatively simple programmable interval timer `PTIMER` that fires an interrupt whenever the threshold value (set by the `PTIMER_ALARM`) is exceeded in nanoseconds. This is how the drivers internally keep track of many actions that they need to perform, and is the first functional block which must be done right if you ever hope to emulate the RIVA 128.

-The least straightforward part of this timer is the counter, a 56-bit value split across two 32-bit registers: the lower 27 bits are stored in bits [31:5] of `PTIMER_TIME0`, and the upper 29 bits are stored in bits [28:0] of `PTIMER_TIME1`.
+The least straightforward part of this timer is the counter itself, a 56-bit value split across two 32-bit registers: the lower 27 bits are stored in bits [31:5] of `PTIMER_TIME0`, and the upper 29 bits are stored in bits [28:0] of `PTIMER_TIME1`.

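To make that bit layout concrete, here is a minimal C sketch of how an emulator or driver might reassemble the counter from the two register values; the function name is mine, and only the bit positions come from the description above:

```c
#include <stdint.h>

/* Reassemble the 56-bit PTIMER counter from its two halves:
   PTIMER_TIME0 holds the low 27 bits in bits [31:5], and
   PTIMER_TIME1 holds the high 29 bits in bits [28:0]. */
static uint64_t nv3_ptimer_counter(uint32_t time0, uint32_t time1)
{
    uint64_t low27  = (time0 >> 5) & 0x07FFFFFF;  /* 27 bits */
    uint64_t high29 = time1 & 0x1FFFFFFF;         /* 29 bits */

    return (high29 << 27) | low27;
}
```
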
### Graphics commands and DMA engine

-What may be called *graphics commands* in other GPU architectures are instead called *graphics objects* in the NV3 and all other NVIDIA architectures. Objects are submitted into the GPU core via a custom direct memory access engine with its own translation lookaside buffer and other memory management structures, although programmed I/O can also be used as a slower alternative.
+What may be called *graphics commands* in other GPU architectures are instead called *graphics objects* in the NV3 and all other NVIDIA architectures. Objects are submitted into the GPU core by writing into the `NV_USER` section of the MMIO BAR0 region using programmed I/O. Despite the fact that a custom memory access engine with its own translation lookaside buffer and other memory management structures was implemented for the types of graphics objects that perform memory transfers, it does not seem to be used for graphics object submission until the NV4 architecture. Existing documentation is contradictory as to whether this exists on the NV3, but drivers do not seem to use DMA to submit graphics objects; if a DMA submission method exists, it certainly works very differently to later versions of the architecture.

-There are 8 DMA channels, with the default being channel 0 (also the only channel accessible through PIO?), but only one can be used at a time; using other channels requires a *context switch*, which entails writing the current channel ID to to PGRAPH registers for every class. All DMA channels use 64 KB of RAMIN memory (to be explained later), further divided into 8 KB subchannels; the meaning of what is in those subchannels depends on the type (or *class* to use NVIDIA terminology) of object submitted into them, with the attributes of each object being called a *method*.
+There are 8 DMA channels, with the default being channel 0 (also the only channel accessible through PIO?), but only one can be used at a time; using other channels requires a *context switch*, which entails writing the current channel ID to PGRAPH registers for every class. All DMA channels use 64 KB of RAMIN memory (to be explained later), further divided into 8 KB (`0x2000`) subchannels, each effectively representing one object; the meaning of what is in those subchannels depends on the type (or *class*, to use NVIDIA terminology) of the object submitted into them, with the attributes of each object being called a *method*.

A simple way to program the GPU is to simply create subchannels for specific objects (such as one for text, one for rectangle, etc...) and change their data and methods as the program runs in order to create a graphical effect.

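As a rough illustration of what such a method write looks like from the host's point of view, here is a hedged C sketch. The `NV_USER_BASE` value and the assumption that the window is laid out as 64 KB per channel and 8 KB (`0x2000`) per subchannel are mine for illustration only; the one thing taken from the text above is that submission happens through PIO writes into `NV_USER`:

```c
#include <stdint.h>

/* Placeholder base offset and MMIO helper - neither is taken from the
   article, so treat them as illustrative only. */
#define NV_USER_BASE       0x800000u  /* assumed location of NV_USER in BAR0 */
#define NV_CHANNEL_STRIDE  0x10000u   /* assuming 64 KB per channel          */
#define NV_SUBCHAN_STRIDE  0x2000u    /* assuming 8 KB per subchannel/object */

extern void nv3_mmio_write32(uint32_t offset, uint32_t value);

/* Write one method of the object bound to (channel, subchannel) using PIO. */
static void nv3_submit_method(uint32_t channel, uint32_t subchannel,
                              uint32_t method, uint32_t value)
{
    uint32_t offset = NV_USER_BASE
                    + channel    * NV_CHANNEL_STRIDE
                    + subchannel * NV_SUBCHAN_STRIDE
                    + method;

    nv3_mmio_write32(offset, value);
}
```

In the simple model from the paragraph above, a renderer might bind, say, a rectangle object to one subchannel and a text object to another, then keep rewriting their methods every frame.
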
However, this is a severely limited way of programming the GPU (although Nvidia did successfully deploy it for simpler projects, such as the NT 3.x miniport driver and early versions of the NT 4.0 miniport driver, before their full Resource Manager was able to be ported), and you are intended to use context switches between DMA channels, as well as additional classes defined in the drivers, to program the card to its full potential.

All objects have a *context*, consisting of a 32-bit "name" and another 32-bit value storing its class, associated channel and subchannel ID, where it is relative to the start of `RAMIN`, and whether it's a software-injected or hardware graphical rendering object (bit 31). Contexts are stored in an area of RAM called `RAMFC` if the object's channel is not being used; otherwise, they are stored in `RAMHT`, a *hash table* where the hash key is a single byte calculated by XORing each byte of the object's name[^htdriver] as well as the channel ID. Objects are stored in `RAMHT` as structures consisting of their 8-byte context followed by the *methods* mentioned earlier; an object's byte offset in `RAMHT` is its hash multiplied by 16.

[^htdriver]: Object names below 4096 are reserved on NVIDIA's drivers, which also have the duty to prevent the hash table area from getting full with only basic error handling from the hardware itself.

-The exact methods of every graphics object are incredibly long and often shared between several different types of objects (although the first 256 bytes and usually a few more after that are shared), and thus won't be listed in part 1. An overall list of graphics objects can be found in the next section, but note that these are the ones defined by the hardware, while the drivers implement a much larger set of objects that do not map exactly to the ones in the GPU; furthermore, as you will see later, as each object is quite large at 8 KB, only one object does not mean only one (or even any) single object is drawn. Objects can also be connected together with a special type of object called a "patchcord"; the name is a remnant from the old NV1 quad patching days.
+The exact set of methods of every graphics object in the architecture is incredibly long and often shared between several different types of objects (although the first 256 bytes and usually a few more after that are shared), and thus won't be listed in part 1. An overall list of graphics objects can be found in the next section, but note that these are the ones defined by the hardware, while the drivers implement a much larger set of objects that do not map exactly to the ones in the GPU; furthermore, as you will see later, although each object is quite large at 8 KB, submitting a single object does not mean that only one graphics object - or even any at all, as some are used to represent DMA objects, for example - is drawn once the object is processed. Objects can also be connected together with a special type of object called a "patchcord" constructed by the Resource Manager; the name is a remnant from the old NV1 quad patching days.

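Before moving on to how objects actually flow through the FIFO, here is a small C sketch of the `RAMHT` hashing scheme described a few paragraphs above - a one-byte hash formed by XORing the bytes of the object's name with the channel ID, with each entry 16 bytes wide. The function names are mine, and collision handling (which the drivers deal with) is left out:

```c
#include <stdint.h>

/* One-byte RAMHT hash: XOR together each byte of the 32-bit object name,
   then XOR in the channel ID. */
static uint8_t nv3_ramht_hash(uint32_t name, uint8_t channel)
{
    uint8_t hash = name & 0xFF;

    hash ^= (name >> 8)  & 0xFF;
    hash ^= (name >> 16) & 0xFF;
    hash ^= (name >> 24) & 0xFF;
    hash ^= channel;

    return hash;
}

/* An object's byte offset from the start of RAMHT is its hash times 16. */
static uint32_t nv3_ramht_offset(uint32_t name, uint8_t channel)
{
    return (uint32_t)nv3_ramht_hash(name, channel) * 16;
}
```
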
-Graphics objects are sent via DMA or PIO to one of two caches within the `PFIFO` subsystem: `CACHE0` which holds a single entry (really intended for the notifier engine - more on it later - to be able to inject graphics commands), or `CACHE1` which holds 32 entries on revisions A-B and 64 on revision C onwards. What these critical components actually do will be explored in full in later parts, but they effectively just store object names and contexts as they are waiting to be sent to `RAMIN`; a "pusher" pushes objects in from the bus and a "puller" pulls them out of the bus and sends them where they need to be inside of the VRAM (or to `RAMRO` if they are invalid).
+Graphics objects, after they are written to `NV_USER`, are sent to one of two caches within the `PFIFO` subsystem: `CACHE0` which holds a single entry (really intended for the notifier engine - more on it later - to be able to inject graphics commands from software), or `CACHE1` which holds 32 entries on revisions A-B and 64 on revision C onwards. What these critical components actually do will be explored in full in later parts, but they effectively just store object names and contexts as they are waiting to be sent to `RAMIN`; a "pusher" pushes objects in from the bus as they are written into `NV_USER`, and a "puller" pulls them back out and sends them where they need to be inside of the VRAM (or to `RAMRO` if they are invalid).

-Once objects are pulled out, the GPU will simply manipulate the various registers in the `PGRAPH` subsystem in order to draw them. Objects do not "disappear" on frame refresh; instead, it would simply appear that they are simply drawn over, and most likely, any renderer will simply clear the entire screen (with a *Rectangle* object for instance) before resubmitting any graphics objects they need to render.
+Once objects are pulled out, the GPU will manipulate the various registers in the `PGRAPH` subsystem in order to draw the object (if the object is actually rendered), and/or perform any DMA operations the graphics object may require using the DMA engine. Objects do not appear to "disappear" on frame refresh; instead, they simply appear to be drawn over, and most likely, any renderer will clear the entire screen (with a *Rectangle* object, for instance) before resubmitting any graphics objects it needs to render.

Both `RAMFC` and `RAMHT` can have their sizes, and to some extent their location within RAMIN, configured by registers within the `PFIFO` block. `RAMHT` can be 4 KB (of questionable usefulness as that cannot fill `CACHE1`), 8 KB, 16 KB, or 32 KB in size, while RAMFC is either 512 bytes or 8 KB.

@@ -351,7 +351,7 @@ The `PFIFO_RUNOUT_STATUS` register holds the current state of the `RAMRO` region

#### RAMAU

-`RAMAU` was an area used on NV1 cards and revision A NV3 cards for storing audio data being streamed into the CPU.
+`RAMAU` was an area used on NV1 cards and revision A NV3 cards for storing audio data being streamed into the CPU. On Revision B and later cards, the area is still mapped to MMIO space, but its functionality has been entirely removed and it is dummied out.

### Interrupts 2.0: Notifiers