In the FCUDA flow, each hardware core has
private on-chip memory and computation logic, and multiple
cores are instantiated to improve throughput and reduce overall latency.
This throughput-oriented synthesis allows fine-grained scaling
of the parallelism but also places stress on on-chip communication
and external memory bandwidth. When many cores are
instantiated, they must share access to the external
memory ports. Furthermore, the cores may process overlapping
data; thus, sharing that data on-chip can reduce
off-chip bandwidth pressure. For example, with
cores accelerating matrix multiplication (Fig. 1), independent
blocks process overlapping input data that can be shared
on-chip.