Alice 4 FPGA Rasterizer

Overview

The Alice 4 rasterizer is broken into two main parts:

Buffers

There are three image buffers in shared memory:

The first two buffers are only virtually “front” and “back”. Those two labels switch every frame as the back buffer becomes the new front buffer and is shown to the user.

Modules

The Verilog code is broken into about a dozen modules:

Memory controller

The Altera Cyclone V SoC has a wonderful memory controller for accessing the synchronous dynamic RAM (SDRAM). It has a port for the ARM and six ports for the FPGA. Each FPGA port can be configured for input or output, and their relative priorities (including the ARM port) can be set. The priorities were critical for making sure the front-buffer scan-out was never starved of pixels. The SDRAM itself ran so fast (400 MHz DDR) that all the ports could be active and not stall too often. The ports were set up as follows:

All five ports were hooked up to FIFOs to minimize the effects of memory latency.

Rasterization

Rasterization uses the edge-equation technique. The idea is to test every pixel in the triangle's bounding box to see whether it's inside the triangle, where “inside” is defined as “on the same side of every edge”. Because a triangle covers at most half the area of its bounding box, this technique wastes at least 50% of its time on pixels outside the triangle, but it's simpler to implement than edge-walkers.
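As a rough software illustration (Python here, not the Verilog in Rasterizer.v), the edge test over a bounding box could be sketched as follows. The function names are made up for this example, and counting pixels that lie exactly on an edge as inside is an assumption, not necessarily the rule the hardware uses.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed edge function: its sign says which side of line a->b point p is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2):
    """Test every pixel in the bounding box of an integer-vertex triangle.

    Returns (covered_pixels, tested_count).
    """
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    covered = []
    tested = 0
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            tested += 1
            w0 = edge(*v1, *v2, x, y)
            w1 = edge(*v2, *v0, x, y)
            w2 = edge(*v0, *v1, x, y)
            # "Inside" means all three edge values share a sign.
            # Assumption: a value of zero (exactly on an edge) counts as inside.
            all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
            all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
            if all_non_negative or all_non_positive:
                covered.append((x, y))
    return covered, tested
```

Checking both sign patterns lets the sketch accept triangles of either winding order, which keeps the example short at the cost of also accepting back-facing triangles.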

The state machine in Rasterizer.v reads commands from SDRAM (indirectly through the FIFO) and executes them. Because the SDRAM interface is (logically) 64 bits wide, and each pixel takes 32 bits (8 bits each of red, green, and blue, with 8 bits wasted), we always rasterize two pixels at a time. At 50 MHz, that's 100 million pixels per second, but with (at least) half of them wasted, that's at most 50 million drawn pixels per second.
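The arithmetic in that paragraph can be written out as a tiny Python sketch. The constant and function names are only for illustration; the numbers themselves are the ones quoted above.

```python
BUS_BITS = 64           # logical width of the SDRAM interface
PIXEL_BITS = 32         # 8 bits each of red, green, blue, plus 8 unused bits
CLOCK_HZ = 50_000_000   # rasterizer clock


def throughput():
    """Return (pixels per clock, pixels tested per second, upper bound on drawn pixels per second)."""
    pixels_per_clock = BUS_BITS // PIXEL_BITS
    tested_per_second = pixels_per_clock * CLOCK_HZ
    # At least half of the tested pixels fall outside the triangle.
    drawn_upper_bound = tested_per_second // 2
    return pixels_per_clock, tested_per_second, drawn_upper_bound
```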

To minimize SDRAM latency stalls, we use three FIFOs in the rasterization process:

There's very little stalling in this pipeline, so we end up with a rasterization rate of about 50 million Gouraud (color-interpolated) Z-buffered pixels per second. The per-triangle overhead is low enough that we can do almost 2 million (empty) triangles per second. It's hard to compare these numbers to real SGI machines, but we seem to be matching the performance of machines built in the early 1990s.

Reciprocal

For each triangle, the rasterizer computes its on-screen area, then takes the reciprocal of the area. This is necessary for the normalization of the barycentric coordinates used to interpolate color and depth. Scratchapixel has a great explanation of how this works; scroll down to the “Barycentric Coordinates” section.
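To make the role of the reciprocal concrete, here is a small floating-point sketch in Python of barycentric interpolation. It is not the fixed-point arithmetic the FPGA uses, the function names are hypothetical, and the "area" here is really twice the signed triangle area, which is what the edge function naturally produces (the factor of two cancels in the normalization).

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def interpolate(v0, v1, v2, c0, c1, c2, p):
    """Interpolate per-vertex values c0, c1, c2 at point p."""
    area = edge(v0, v1, v2)
    recip = 1.0 / area          # computed once per triangle
    # Each weight is the sub-triangle opposite a vertex, normalized by the area.
    w0 = edge(v1, v2, p) * recip
    w1 = edge(v2, v0, p) * recip
    w2 = edge(v0, v1, p) * recip
    return w0 * c0 + w1 * c1 + w2 * c2
```

For example, `interpolate((0, 0), (4, 0), (0, 4), 0, 100, 200, (1, 1))` weights the three vertices by 0.5, 0.25, and 0.25. The same weights would apply to each color channel and to depth.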

To compute the reciprocal we use the built-in lpm_divide module:

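// Divider used as a reciprocal unit: quotient = 0x7FFF_FFFF / tri_area.
lpm_divide
    #(.LPM_WIDTHN(32),                    // numerator width in bits
      .LPM_WIDTHD(32),                    // denominator width in bits
      .LPM_NREPRESENTATION("UNSIGNED"),
      .LPM_DREPRESENTATION("SIGNED"),
      .LPM_PIPELINE(6)) area_divider(     // six pipeline stages of latency
        .clock(clock),
        .clken(area_reciprocal_enabled),
        .numer(32'h7FFF_FFFF),            // constant numerator (2^31 - 1)
        .denom(tri_area),                 // triangle area
        .quotient(tri_area_recip_result)
    );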

The module is configured to have six pipeline stages, which means that the result will come out six clocks after the denominator was put in. We don't pipeline (overlap) our reciprocals (we only need one per triangle), but our state machine must wait six clocks for this result. We found the number 6 by trying various values until the compiler stopped complaining about timing violations.
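The six-clock wait can be modeled in software with a simple delay line. This Python sketch is only a behavioral stand-in for the divider: the class name is invented, it handles positive areas only, and it says nothing about how the real module is implemented internally.

```python
from collections import deque

NUMERATOR = 0x7FFF_FFFF   # same constant as the .numer port
PIPELINE_STAGES = 6       # same value as LPM_PIPELINE


class PipelinedReciprocal:
    """Delay-line model: a result appears six clocks after its denominator."""

    def __init__(self, stages=PIPELINE_STAGES):
        self.pipe = deque([None] * stages)

    def clock(self, denom=None):
        """Advance one clock. Pass a denominator to start a divide, or None to idle.

        Returns the value leaving the pipeline this clock (None if nothing is ready).
        """
        # Assumption: only positive areas are modeled; zero or None starts nothing.
        result = NUMERATOR // denom if denom else None
        self.pipe.append(result)
        return self.pipe.popleft()


divider = PipelinedReciprocal()
outputs = [divider.clock(1000)] + [divider.clock() for _ in range(6)]
```

In this run the first six entries of `outputs` are `None` and the seventh holds the quotient, which mirrors the state machine idling for six clocks before it reads the result.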

Write FIFO

The Write FIFO, which writes pixel data to the back color buffer and to the depth buffer, was one of the most difficult modules to write in this project. Conceptually the state machine should perform these steps in a loop:

  1. Wait for a new pixel to be available in the Write FIFO.
  2. Write it to the back color buffer and to the depth buffer.
  3. Wait for both SDRAM controller ports to acknowledge that they have accepted the writes.

Remember that wherever we talk about “a pixel” here, we mean two side-by-side pixels that are handled in parallel. The FIFOs include two bits to specify which of the two pixels (or both) are valid, since either (but not both) could be outside the triangle.
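One way to picture the two valid bits is the Python sketch below, which walks one row of inside/outside results in pairs and builds FIFO entries. The bit assignment (bit 1 for the left pixel, bit 0 for the right) and the entry layout are assumptions for the example, not the layout used in the Verilog.

```python
def row_to_fifo_entries(inside_flags):
    """Group one row of per-pixel inside flags into (x, valid_mask) pairs.

    inside_flags must have even length; x is the position of the left pixel.
    """
    entries = []
    for x in range(0, len(inside_flags), 2):
        left, right = inside_flags[x], inside_flags[x + 1]
        # Assumed encoding: bit 1 = left pixel valid, bit 0 = right pixel valid.
        mask = (int(left) << 1) | int(right)
        if mask:  # a pair with both pixels outside is never queued
            entries.append((x, mask))
    return entries
```

A row such as `[False, True, True, True, True, False, False, False]` would yield three entries: a right-only pair, a full pair, and a left-only pair.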

This sequential version is much too slow. It would introduce several wait states, destroying our throughput. It is implemented in the !PIPELINED sections of the Write_FIFO.v module, but this code is disabled in favor of the PIPELINED code described below.

There are several difficulties involved in doing the above steps concurrently:

To solve all these problems we put the current state of the system into a five-bit value and switch on this value to determine what to do. The value has the following bits:

There are 32 combinations of these five bits, but they map to only 11 different behaviors, of which two are error cases (e.g., data in slot 2 but no data in slot 1). A few examples:

See the casez statement in the Write_FIFO.v file for all cases.

The amazing WaveDrom package will be used to illustrate the common cases. Only the color memory port is shown, but the same logic would apply to depth. First, a single fetch from the FIFO, written to memory:

At clock 1 we initiate the FIFO read, at clock 2 we find that it succeeded (the FIFO is not empty), and at clock 3 we can read the data and simultaneously write it to the SDRAM. In the next example we write two (pairs of) pixels sequentially, and neither blocks:

For the next example, one cycle after our memory write the controller tells us to wait. We must hold the data and the write signal indefinitely (though in this case only one extra cycle):

We can now combine our previous two examples. We read three pixels from the FIFO, but the first is stalled one cycle when writing to memory. We must write the second pixel to slot 1, then the next cycle simultaneously write slot 1 to memory and replace it with the third pixel:

This last example uses both slots because the first pixel stalls for two cycles. This is the worst-case scenario because at cycle 4 we realize that the first pixel's write has stalled and we stop fetching from the FIFO. Even if the stall lasts longer than two cycles, we don't have any more than two pixels to write to the slots:
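The reason two slots are enough can also be checked with a small cycle-level simulation. The Python sketch below is a simplified model, not a transcription of Write_FIFO.v: it assumes FIFO data arrives two clocks after a read is issued, models a single memory port, and uses an invented rule that new reads are issued only when the port is not stalled and no data is waiting in a slot.

```python
from collections import deque

READ_LATENCY = 2  # assumption: data arrives two clocks after the read is issued


def simulate_write_path(pixels, stall_cycles):
    """Return (write order, most slots ever occupied, clocks used).

    stall_cycles is the set of clock numbers on which memory asks us to wait.
    """
    fifo = deque(pixels)
    in_flight = deque([None] * READ_LATENCY)  # reads issued but not yet arrived
    slots = deque()                           # holding slots for arrived data
    port = None                               # word currently presented to memory
    written = []
    max_slots_used = 0
    cycle = 0
    while len(written) < len(pixels):
        stalled = cycle in stall_cycles
        # Memory accepts the presented word unless it is asserting wait.
        if port is not None and not stalled:
            written.append(port)
            port = None
        # Data from a read issued READ_LATENCY clocks ago arrives now.
        arriving = in_flight.popleft()
        if arriving is not None:
            slots.append(arriving)
        # The oldest waiting word moves to the port as soon as the port is free.
        if port is None and slots:
            port = slots.popleft()
        max_slots_used = max(max_slots_used, len(slots))
        # Stop fetching as soon as a stall is seen or data is backed up.
        if fifo and not stalled and not slots:
            in_flight.append(fifo.popleft())
        else:
            in_flight.append(None)
        cycle += 1
    return written, max_slots_used, cycle
```

With three pixels and no stalls the model never occupies a slot; a one-cycle stall on the first write occupies one slot; a stall of two or more cycles occupies both slots and never more, because by then the reads already in flight have all arrived and no new ones are issued.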
