In the second set of tests, where the application is run under BRAM, we see more
variation between the arbitration policies, as shown in Figure 11. In this test,
as instructions are being fetched from memory, each processor can have up to two
outstanding read requests at a time (one from the instruction side and one from the
data side). In addition, as stores are non-blocking, a processor may have any number
of outstanding store operations in progress at any given time. However, as the
application spends the majority of this test looping over a large number (1,000) of
consecutive load instructions, the impact of additional stores during the loop overhead
should be small.
Again, the arbitration mechanism has the biggest impact when the memory frequency
is lowest and closest to the processor frequency, and memory latencies are therefore
highest. Here, though, there is a greater spread in the latencies than when
the processors were issuing only one request at a time. In this case, the
oldest-request-first arbitration mechanism is able to reduce both the spread and the
average of the latencies in some cases, although the impact is still small (around
1–2% improvement in average latency). We expect the impact would be greater if the
memory latency were higher still,
but for an FPGA-based system where DDR memory is often operating at a higher
frequency than the FPGA logic, the benefits of more complex arbitration are not as
significant as they would be in a typical workstation processor system. Because the
additional complexity of the oldest-first arbitration lowers the maximum operating
frequency of the system, we recommend its use only in cases where the system is
very sensitive to latency.
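
To make the difference between the policies concrete, the following is a minimal sketch of oldest-request-first arbitration compared against a simple round-robin arbiter. It is a toy model and not the arbiter evaluated above: the function names, the two request ports, the arrival cycles, and the fixed service time of 4 cycles are all illustrative assumptions, and latency is taken to be the completion cycle minus the issue cycle.

```python
from collections import deque


def round_robin_pick(queues, last):
    """Return the index of the next non-empty queue after `last`, or None."""
    n = len(queues)
    for step in range(1, n + 1):
        idx = (last + step) % n
        if queues[idx]:
            return idx
    return None


def oldest_first_pick(queues, last):
    """Return the index of the queue whose head request was issued earliest.

    `last` is unused; it is accepted so both policies share one signature.
    Ties are broken in favour of the lower port index.
    """
    candidates = [i for i, q in enumerate(queues) if q]
    if not candidates:
        return None
    return min(candidates, key=lambda i: (queues[i][0], i))


def simulate(arrivals, pick, service_cycles):
    """Serve requests one at a time and return their latencies in grant order.

    arrivals: one sorted list of issue cycles per port (hypothetical traffic).
    service_cycles: fixed cycles the memory is busy per request (assumption).
    """
    pending = [deque(a) for a in arrivals]   # not yet issued
    queues = [deque() for _ in arrivals]     # issued, waiting for a grant
    latencies = []
    cycle, busy_until, last = 0, 0, len(arrivals) - 1
    while any(pending) or any(queues):
        # Move requests whose issue cycle has been reached into the queues.
        for p, q in zip(pending, queues):
            while p and p[0] <= cycle:
                q.append(p.popleft())
        # Grant a new request only when the memory is idle.
        if cycle >= busy_until:
            idx = pick(queues, last)
            if idx is not None:
                issued = queues[idx].popleft()
                busy_until = cycle + service_cycles
                latencies.append(busy_until - issued)
                last = idx
        cycle += 1
    return latencies


# Hypothetical traffic: port 0 issues at cycles 0 and 6, port 1 at cycles 1 and 2.
arrivals = [[0, 6], [1, 2]]
rr = simulate(arrivals, round_robin_pick, service_cycles=4)
of = simulate(arrivals, oldest_first_pick, service_cycles=4)
print("round-robin :", rr, "max", max(rr))
print("oldest-first:", of, "max", max(of))
```

In this open-loop model the total latency is the same under both policies, because the memory is never left idle while a request is waiting and every request takes the same time to serve; only the spread changes, with oldest-first lowering the worst-case latency. The sketch therefore illustrates the reduction in spread only, and does not capture whatever produces the small improvement in average latency reported above.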