Alice 4 FPGA Rasterizer
The Alice 4 rasterizer is broken into two main parts:
- A software library linked with the C application program. The library implements the IrisGL API. It performs transformations, lighting, clipping, and retained mode (display lists). Its output is a list of commands (clear screen, draw triangle, swap buffers, etc.) that it writes to memory outside Linux's range.
- An FPGA configuration that reads these commands from SDRAM and executes them. For triangles it interpolates color (RGB) and depth (Z). The FPGA also scans out the image buffer to the LCD and performs other minor tasks.
There are three image buffers in shared memory:
- A front color buffer, which is displayed to the user via the LCD. The scanning is continuous because the LCD has no memory of its own. The timing and wire format are similar to VGA.
- A back color buffer, which is being rasterized into.
- A depth buffer, used along with the back buffer during rasterization.
The first two buffers are only virtually “front” and “back”. Those two labels switch every frame as the back buffer becomes the new front buffer and is shown to the user.
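The label swap can be pictured as exchanging two base addresses, with no pixel data copied. The following is a minimal C sketch of that idea only; the struct, the function name, and the addresses used with it are hypothetical, and the real design may track the roles differently inside the FPGA.

```c
#include <stdint.h>

/* Hypothetical bookkeeping: two color buffers whose roles alternate. */
typedef struct {
    uint32_t front_addr;  /* buffer currently scanned out to the LCD */
    uint32_t back_addr;   /* buffer currently being rasterized into */
} frame_buffers_t;

/* Exchange the two roles; only the addresses change hands. */
static void swap_buffers(frame_buffers_t *fb)
{
    uint32_t tmp = fb->front_addr;
    fb->front_addr = fb->back_addr;
    fb->back_addr = tmp;
}
```

Calling `swap_buffers` once per frame makes the buffer that was just drawn visible and frees the previously visible one for drawing.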
The Verilog code is broken into about a dozen modules:
- Main: Integrates all the other modules.
- soc_instance: Module generated by Qsys to interface with SDRAM and I2C.
- LCD_control: Generates LCD control signals (H-Sync, V-Sync, pixel enable, next frame signal, X coordinate, Y coordinate). This runs at 25 MHz (about 50 FPS). It could run faster (up to 40 MHz) but the board clock is 50 MHz, so it was convenient to divide that by two and avoid having asynchronous clocks.
- LCD_text: Given screen pixel X and Y, returns character X and Y (text column and row) and sub-character X and Y.
- LCD_debug: Given character X and Y and three 32-bit debug values, returns which character to draw.
- Binary_to_hex: Converts a nybble to a hex ASCII value. Used for each nybble of the debug values to generate hex output.
- Frame_buffer: Sends the front frame buffer to the LCD.
- LCD_font: Given ASCII character and sub-character X and Y, returns whether the pixel is on or off.
- Font_ROM: The font pixels.
- Command_reader: Pre-fetches drawing commands from SDRAM into a FIFO.
- Rasterizer: Reads drawing commands and executes them.
- Read_FIFO: Queues rasterized pixels, waiting for Z read to complete.
- Write_FIFO: Queues rasterized and Z-compared pixels, waiting for pixel writes to complete. See the “Write FIFO” section below for more details.
- Prescaler: Generates a tick every N clocks.
- PWM: Generates a PWM signal for LCD backlight brightness.
- Debouncer: Debounces the Home button.
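To make the debug-text path more concrete, here is a rough C model of what LCD_text and Binary_to_hex compute. It is a sketch under assumptions: the 8x16 glyph size, the uppercase hex digits, the left-to-right digit order, and all function names are illustrative and are not taken from the Verilog source.

```c
#include <stdint.h>

/* Assumed glyph size; the real font dimensions are not stated here. */
#define GLYPH_W 8
#define GLYPH_H 16

typedef struct {
    unsigned char_x, char_y;  /* text column and row */
    unsigned sub_x, sub_y;    /* pixel position inside the glyph */
} text_pos_t;

/* Model of LCD_text: split a screen pixel coordinate into cell and offset. */
static text_pos_t lcd_text(unsigned x, unsigned y)
{
    text_pos_t p = { x / GLYPH_W, y / GLYPH_H, x % GLYPH_W, y % GLYPH_H };
    return p;
}

/* Model of Binary_to_hex: map a 4-bit value to an ASCII hex digit. */
static char binary_to_hex(unsigned nybble)
{
    nybble &= 0xFu;
    return (char)(nybble < 10 ? '0' + nybble : 'A' + (nybble - 10));
}

/* Hex digit shown at a given column (0 = leftmost) of a 32-bit debug value. */
static char debug_digit(uint32_t value, unsigned column)
{
    unsigned shift = (7u - (column & 7u)) * 4u;
    return binary_to_hex((value >> shift) & 0xFu);
}
```

With power-of-two glyph dimensions the divisions and remainders reduce to bit selections, which is cheap in hardware.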
The Altera Cyclone V SoC has a wonderful memory controller for accessing the synchronous dynamic RAM (SDRAM). It has a port for the ARM and six ports for the FPGA. Each FPGA port can be configured for input or output, and their relative priorities (including the ARM port) can be set. The priorities were critical for making sure the front-buffer scan-out was never starved of pixels. The SDRAM itself ran so fast (400 MHz DDR) that all the ports could be active and not stall too often. The ports were set up as follows:
- Reading the front-buffer for LCD scan-out. This was configured to be the highest priority.
- Reading the graphics command buffer.
- Reading the depth buffer for the pixel being rasterized.
- Writing the color of the pixel just rasterized.
- Writing the depth of the pixel just rasterized.
All five ports were hooked up to FIFOs to minimize the effects of memory latency.
Rasterization uses the edge-equation technique. The idea is to test every pixel to see whether it's inside the triangle. “Inside” is defined as “on the same side of every edge”. Only pixels in the bounding box of the triangle are tested. This technique wastes at least 50% of its time on pixels outside the triangle, but it's simpler to implement than edge-walkers.
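The inside test can be sketched in C as follows. This is a software illustration of the general edge-equation technique, not a transcription of Rasterizer.v: the integer pixel coordinates, the inclusive treatment of pixels lying exactly on an edge, and the names are assumptions.

```c
typedef struct { int x, y; } point_t;

/* Twice the signed area of triangle (a, b, p). The sign tells which side
   of the edge a->b the point p lies on. */
static int edge(point_t a, point_t b, point_t p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }
static int max3(int a, int b, int c) { int m = a > b ? a : b; return m > c ? m : c; }

/* Test every pixel of the bounding box and count those inside the triangle. */
static int count_covered_pixels(point_t v0, point_t v1, point_t v2)
{
    int min_x = min3(v0.x, v1.x, v2.x), max_x = max3(v0.x, v1.x, v2.x);
    int min_y = min3(v0.y, v1.y, v2.y), max_y = max3(v0.y, v1.y, v2.y);
    int inside = 0;

    for (int y = min_y; y <= max_y; y++) {
        for (int x = min_x; x <= max_x; x++) {
            point_t p = { x, y };
            int w0 = edge(v1, v2, p);
            int w1 = edge(v2, v0, p);
            int w2 = edge(v0, v1, p);
            /* Same side of every edge, for either winding order. */
            if ((w0 >= 0 && w1 >= 0 && w2 >= 0) ||
                (w0 <= 0 && w1 <= 0 && w2 <= 0)) {
                inside++;
            }
        }
    }
    return inside;
}
```

Each edge value changes by a constant when stepping one pixel in X or Y, so hardware can update the three values with additions rather than recomputing the products.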
The state machine in Rasterizer.v reads commands from SDRAM (indirectly through the FIFO) and executes them. Because the SDRAM interface is (logically) 64 bits wide, and each pixel takes 32 bits (8 bits each of red, green, and blue, with 8 bits wasted), we always rasterize two pixels at a time. At 50 MHz, that's 100 million pixels per second, but with (at least) half of them wasted, that's at most 50 million drawn pixels per second.
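The arithmetic behind these figures, and the idea of carrying two pixels per memory word, can be written out in C. The byte order within a pixel and the placement of the left pixel in the low half of the word are assumptions made for illustration; the source does not state the actual layout.

```c
#include <stdint.h>

/* Assumed layout: red in the low byte, top byte unused. */
static uint32_t pack_pixel(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint32_t)r | ((uint32_t)g << 8) | ((uint32_t)b << 16);
}

/* Assumed layout: left pixel in the low 32 bits of the 64-bit word. */
static uint64_t pack_pair(uint32_t left, uint32_t right)
{
    return (uint64_t)left | ((uint64_t)right << 32);
}

/* Peak tested-pixel rate: pixels per bus word times words per second. */
static uint64_t peak_pixels_per_second(uint64_t clock_hz,
                                       unsigned bus_bits,
                                       unsigned pixel_bits)
{
    return clock_hz * (bus_bits / pixel_bits);
}
```

For a 50 MHz clock, a 64-bit bus, and 32-bit pixels, `peak_pixels_per_second` gives 100,000,000 tested pixels per second, matching the figure above.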
To minimize SDRAM latency stalls, we use three FIFOs in the rasterization process:
- The Command FIFO queues the drawing commands so that the state machine need not block too long.
- The Read FIFO queues drawn pixels while we wait for the depth read to return. After determining that a pixel is inside the triangle, we initiate a read of its corresponding depth value (if depth-comparison is enabled for this triangle). This can take some time (tens of clocks) and we don't want to block the rasterizer. Instead, we queue up all the information we have about this pixel (depth memory address, depth value, color memory address, and color) and move on to the next pixel. Another module (Read_FIFO.v) waits for the SDRAM read to return. Since SDRAM reads return in the order they were made, the module then gets the next item in our FIFO, compares the depth values, and if the new pixel is closer to the camera than the existing pixel, enqueues the same pixel information into the Write FIFO. (All of this is done two pixels at a time.)
- The Write FIFO queues pixels that must be written back to SDRAM. The pixel's color must be written, and optionally the depth value must be written (if enabled for this triangle). This is a complicated module because we must take into account whether there are pixels in the FIFO to write, whether the depth memory controller is ready to accept another write, and whether the color memory controller is ready to accept another write. See the section “Write FIFO” below for details.
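The hand-off from the Read FIFO to the Write FIFO can be modeled in software as below. This is a simplified sketch handling one pixel at a time rather than two; the FIFO capacity, the struct layout, the function names, and the convention that a smaller Z value is closer to the camera are all assumptions.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t depth_addr, color_addr;  /* where the pixel lives in memory */
    uint32_t depth, color;            /* newly rasterized values */
} pixel_t;

#define FIFO_CAP 16  /* hypothetical depth */

typedef struct {
    pixel_t items[FIFO_CAP];
    size_t head, count;
} fifo_t;

static int fifo_push(fifo_t *f, pixel_t p)
{
    if (f->count == FIFO_CAP) return 0;  /* full: caller must stall */
    f->items[(f->head + f->count) % FIFO_CAP] = p;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f, pixel_t *out)
{
    if (f->count == 0) return 0;
    *out = f->items[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/* Called when one depth read returns. Reads return in issue order, so the
   oldest queued pixel is the one this stored depth belongs to.
   Returns 1 if the pixel was forwarded to the write FIFO. */
static int on_depth_read(fifo_t *read_fifo, fifo_t *write_fifo,
                         uint32_t stored_depth)
{
    pixel_t p;
    if (!fifo_pop(read_fifo, &p)) return 0;
    if (p.depth < stored_depth) {        /* assumed: smaller Z is closer */
        return fifo_push(write_fifo, p);
    }
    return 0;                            /* hidden pixel is dropped */
}
```

Because the rasterizer only enqueues and moves on, the latency of the depth read is hidden as long as the Read FIFO does not fill up.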
There's very little stalling in this pipeline, so we end up with a rasterization rate of about 50 million Gouraud (color-interpolated) Z-buffered pixels per second. The triangle overhead lets us do almost 2 million (empty) triangles per second. It's hard to compare these numbers to real SGI machines, but we seem to be matching the performance of machines built in the early 1990s.
For each triangle, the rasterizer computes its on-screen area, then takes the reciprocal of the area. This is necessary for the normalization of the barycentric coordinates used to interpolate color and depth. Scratchapixel has a great explanation of how this works; scroll down to the “Barycentric Coordinates” section.
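As an illustration of why the reciprocal is needed, here is a floating-point C sketch of barycentric interpolation. The hardware works in fixed point, so this shows only the math; the names are hypothetical and degenerate (zero-area) triangles are not handled.

```c
typedef struct { float x, y; } vec2_t;

/* Twice the signed area of triangle (a, b, p). */
static float edge_fn(vec2_t a, vec2_t b, vec2_t p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* Interpolate a per-vertex attribute (a color channel or Z) at point p.
   The triangle must have nonzero area. */
static float interpolate(vec2_t v0, vec2_t v1, vec2_t v2,
                         float a0, float a1, float a2, vec2_t p)
{
    float area = edge_fn(v0, v1, v2);  /* same scale as the edge values */
    float recip = 1.0f / area;         /* computed once per triangle */

    /* Normalized weights: each is the sub-triangle area opposite a vertex
       divided by the whole area, so they sum to 1 inside the triangle. */
    float w0 = edge_fn(v1, v2, p) * recip;
    float w1 = edge_fn(v2, v0, p) * recip;
    float w2 = edge_fn(v0, v1, p) * recip;

    return w0 * a0 + w1 * a1 + w2 * a2;
}
```

Multiplying by a reciprocal computed once per triangle replaces what would otherwise be a division per pixel per attribute.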
To compute the reciprocal we use the built-in lpm_divide module:

    lpm_divide #(
        .LPM_WIDTHN(32),
        .LPM_WIDTHD(32),
        .LPM_NREPRESENTATION("UNSIGNED"),
        .LPM_DREPRESENTATION("SIGNED"),
        .LPM_PIPELINE(6)
    ) area_divider(
        .clock(clock),
        .clken(area_reciprocal_enabled),
        .numer(32'h7FFF_FFFF),
        .denom(tri_area),
        .quotient(tri_area_recip_result)
    );

The module is configured to have six pipeline stages, which means that the result will come out six clocks after the denominator was put in. We don't pipeline (overlap) our reciprocals (we only need one per triangle), but our state machine must wait six clocks for this result. We found the number 6 by trying various values until the compiler stopped complaining about timing violations.
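A software model of this fixed-point reciprocal might look like the C sketch below. It treats both operands as signed and truncates toward zero, which is an assumption about the divider's behavior, and it does not model the six-clock latency. The `weight_8bit` helper and its shift amount are hypothetical, included only to show how such a reciprocal could replace a per-pixel divide.

```c
#include <stdint.h>

/* Model of the divider: 0x7FFFFFFF / area, truncated toward zero.
   tri_area must be nonzero. */
static int32_t area_reciprocal(int32_t tri_area)
{
    return (int32_t)(INT64_C(0x7FFFFFFF) / tri_area);
}

/* Hypothetical use: scale an edge value to a weight in [0, 255].
   edge_value * recip is roughly (edge_value / area) * 2^31, so shifting
   right by 23 leaves (edge_value / area) * 2^8. Intended for pixels inside
   the triangle, where edge_value and the area have the same sign. */
static int32_t weight_8bit(int32_t edge_value, int32_t recip)
{
    return (int32_t)(((int64_t)edge_value * recip) >> 23);
}
```

Because the numerator is just under 2^31, a pixel at a vertex gets a weight of 255 rather than 256, so the weight fits in eight bits.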
The Write FIFO, which writes pixel data to the back color buffer and to the depth buffer, was one of the most difficult modules to write in this project. Conceptually the state machine should perform these steps in a loop:
- Wait for a new pixel to be available in the Write FIFO.
- Write it to the back color buffer and to the depth buffer.
- Wait for both SDRAM controller ports to acknowledge that they have accepted the writes.
Remember that wherever we talk about “a pixel” here, we mean two side-by-side pixels that are handled in parallel. The FIFOs include two bits to specify which of the two pixels (or both) are valid, since either (but not both) could be outside the triangle.
This sequential version is much too slow. It would introduce several wait states, destroying our throughput. It is implemented in the !PIPELINED sections of the Write_FIFO.v module, but this code is disabled in favor of the PIPELINED code described below.
There are several difficulties involved in doing the above steps concurrently:
- We can initiate a read of the FIFO, but we don't know until the next cycle whether the FIFO had anything to give us, and we don't get the data until the cycle after that.
- We can write to a memory port, but it's not until the next cycle that we're told whether the port can accept our write. If not, we must hold our write until the port accepts it.
- It is therefore not until three clocks after initiating a FIFO read that we know whether we're stalled by the SDRAM! In that time we must keep reading from the FIFO (to maintain our bandwidth). Any FIFO reads we had initiated in the meantime will complete and the results must be stored somewhere: a tiny two-entry FIFO made of two registers (called “slots”).
- We're writing to two memory ports (color and depth) and either or both could stall.
To solve all these problems we put the current state of the system into a five-bit value and switch on this value to determine what to do. The value has the following bits:
- Bit 4: Whether we currently have data back from the FIFO read. This is the logical “AND” of whether we asked for a read two clocks ago and were told one clock ago that the FIFO wasn't empty.
- Bit 3: Whether the color memory port is asking us to wait. This is the logical “AND” of whether we asked for a write last clock and are told this clock to wait.
- Bit 2: Whether the depth memory port is asking us to wait.
- Bit 1: Whether the first slot is full. We have a bit to keep track of this.
- Bit 0: Whether the second slot is full.
There are 32 combinations of these five bits, but they map to only 11 different behaviors, of which two are error cases (e.g., data in slot 2 but no data in slot 1). A few examples:
- For the value 10000, we have new data from the FIFO, the memory ports aren't blocked, and there's nothing in the slots. We can write the FIFO's data to memory immediately. This is the most common case.
- For values 11100, 10100, and 11000, we have new data from the FIFO but one or both of the memory ports are blocked, and neither slot is in use. Put the FIFO data into slot 1.
- For value 00011, we have no new FIFO data and the memory is not blocked, but both slots are full. Write slot 1 to memory and move slot 2 to slot 1.
See the casez statement in the Write_FIFO.v file for all cases.
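The bit layout and the three example cases above can be modeled in C as follows. This sketch covers only the cases described in this section plus the one error case mentioned; every other combination is reported as not modeled rather than guessed at. The enum and function names are hypothetical and do not come from Write_FIFO.v.

```c
typedef enum {
    ACT_WRITE_FIFO_DATA,    /* 10000: write the FIFO data straight to memory */
    ACT_FIFO_TO_SLOT1,      /* 11100, 10100, 11000: park FIFO data in slot 1 */
    ACT_WRITE_SLOT1_SHIFT,  /* 00011: write slot 1, move slot 2 to slot 1 */
    ACT_ERROR,              /* slot 2 full while slot 1 is empty */
    ACT_NOT_MODELED         /* remaining cases: see the casez in Write_FIFO.v */
} action_t;

/* Assemble the five-bit state value from its individual flags. */
static unsigned make_state(int have_fifo_data, int color_wait, int depth_wait,
                           int slot1_full, int slot2_full)
{
    return ((have_fifo_data ? 1u : 0u) << 4) |
           ((color_wait     ? 1u : 0u) << 3) |
           ((depth_wait     ? 1u : 0u) << 2) |
           ((slot1_full     ? 1u : 0u) << 1) |
            (slot2_full     ? 1u : 0u);
}

/* Decode the state into one of the behaviors described in the text. */
static action_t decode(unsigned state)
{
    /* Bits 1..0 == 01 means data in slot 2 but none in slot 1. */
    if ((state & 0x3u) == 0x1u) return ACT_ERROR;

    switch (state & 0x1Fu) {
    case 0x10:                       /* 10000 */
        return ACT_WRITE_FIFO_DATA;
    case 0x1C: case 0x14: case 0x18: /* 11100, 10100, 11000 */
        return ACT_FIFO_TO_SLOT1;
    case 0x03:                       /* 00011 */
        return ACT_WRITE_SLOT1_SHIFT;
    default:
        return ACT_NOT_MODELED;
    }
}
```

Collapsing the flags into a single value keeps the decision logic in one place, which is what makes a single switch (or a casez in Verilog) practical.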
We use the amazing WaveDrom package to illustrate the common cases. Only the color memory port is shown, but the same logic applies to depth. First, a single fetch from the FIFO, written to memory:
At clock 1 we initiate the FIFO read, at clock 2 we find that it succeeded (the FIFO is not empty), and at clock 3 we can read the data and simultaneously write it to the SDRAM. In the next example we write two (pairs of) pixels sequentially, and neither blocks:
For the next example, one cycle after our memory write the controller tells us to wait. We must hold the data and the write signal indefinitely (though in this case only one extra cycle):
We can now combine our previous two examples. We read three pixels from the FIFO, but the first is stalled one cycle when writing to memory. We must write the second pixel to slot 1, then the next cycle simultaneously write slot 1 to memory and replace it with the third pixel:
This last example uses both slots because the first pixel stalls for two cycles. This is the worst-case scenario because at cycle 4 we realize that the first pixel's write has stalled and we stop fetching from the FIFO. Even if the stall lasts longer than two cycles, we don't have any more than two pixels to write to the slots: