OS runs in DDR and not BRAM, we have rerun the earlier bandwidth test in DDR
as well to allow for a more direct comparison; the results are presented in
Figure 9. In this test, the results for systems B and C are combined as there was no
appreciable difference between the two configurations. We can immediately note that
the maximum achievable application bandwidth has dropped significantly and is no
longer saturated, even with eight cores. Single-core bandwidth has been cut in half,
and the bandwidth increases at a lower rate as cores are added than when the
application is run from BRAM. Comparing the stand-alone results with those obtained
under an OS, we see a further reduction in achievable bandwidth when running with the
OS. While some additional overhead is expected under an OS, we suspect the impact is
magnified here because there are no caches in the system. In future work, we would like
to measure the impact again on a system with level-one caches to see if the overhead of
the OS remains as high.
In addition to investigating system bandwidth, we captured the latencies of all memory
read requests in the system while conducting the bandwidth tests. As the arbiter
supports two different arbitration methods (round-robin and oldest-request-first), we
ran the tests with each method. The results for the three systems when running
from BRAM are presented in Figure 10. When running from BRAM, each core will have, at
most, one read request queued at a time, as read requests block in the processor until
they return. As such, the maximum number of requests the arbiter can be servicing at
any time is equal to the number of cores being tested in the system. The boxplot format
presents the min and max values for a given test as the lower and upper stems,
respectively; the lower edge of the filled box represents the 1st quartile and the upper
edge represents the 3rd quartile, with the bar within the box indicating the average.
Presented this way, we can readily see the impact on maximum latency, average, and
the spread as we increase the number of cores or change arbitration policies.
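To make the latency measurement concrete, the following is a minimal sketch of a toy
model, not the arbiter implementation evaluated here. It assumes a single memory port
with a fixed service time, cores that block on each read, and a fixed gap between
reads; the names simulate and summarize and the parameters service_cycles,
think_cycles, and total_cycles are hypothetical. It reports the same five values the
boxplots show (min, 1st quartile, average, 3rd quartile, max).

```python
import statistics


def simulate(num_cores, policy, service_cycles=4, think_cycles=2, total_cycles=2000):
    """Toy model: each core has at most one blocking read outstanding."""
    ready_at = [0] * num_cores  # cycle at which each core may issue its next read
    pending = {}                # core -> cycle at which its outstanding read was issued
    latencies = []
    rr_next = 0                 # next core to consider under round-robin
    in_service = None           # core whose read currently occupies the memory port
    busy_until = 0              # cycle at which the memory port becomes free

    for cycle in range(total_cycles):
        # Complete the read in service, unblocking the issuing core.
        if in_service is not None and cycle >= busy_until:
            latencies.append(cycle - pending.pop(in_service))
            ready_at[in_service] = cycle + think_cycles
            in_service = None

        # Cores with no outstanding read issue a new one when ready.
        for core in range(num_cores):
            if core not in pending and cycle >= ready_at[core]:
                pending[core] = cycle

        # Arbitrate among waiting requests when the port is free.
        if in_service is None and pending:
            if policy == "round_robin":
                order = [(rr_next + i) % num_cores for i in range(num_cores)]
                chosen = next(c for c in order if c in pending)
                rr_next = (chosen + 1) % num_cores
            elif policy == "oldest_first":
                chosen = min(pending, key=lambda c: (pending[c], c))
            else:
                raise ValueError(f"unknown policy: {policy}")
            in_service = chosen
            busy_until = cycle + service_cycles

    return latencies


def summarize(latencies):
    """Return the five values plotted per test: min, Q1, average, Q3, max."""
    q1, _, q3 = statistics.quantiles(latencies, n=4, method="inclusive")
    return {
        "min": min(latencies),
        "q1": q1,
        "mean": statistics.mean(latencies),
        "q3": q3,
        "max": max(latencies),
    }


for policy in ("round_robin", "oldest_first"):
    for cores in (1, 2, 4, 8):
        print(policy, cores, summarize(simulate(cores, policy)))
```

Because pending holds at most one entry per core, the number of requests in flight can
never exceed the number of cores, matching the blocking-read behavior described above.
The numbers this model prints are illustrative only and are not the measured results.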
Across all three configurations we see that the average latency does not increase
significantly until the bandwidth of the system has been saturated. This demonstrates
that the arbiter scales effectively and is not introducing any significant additional