Typically, the latency of the
CNU in [12] is much longer. In addition, that architecture carries
out the forward process first; the backward and merging processes
are then done simultaneously. The complexity of the CNU using the
architecture in [12] for processing one layer at a time is also listed
in Table II. Even if the best-case latency is assumed for the
architecture in [12], the CNU proposed in this paper can still achieve
seven times the speed with 25% of the area.
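
As a rough illustration of what these two ratios imply together, assume that efficiency is measured as speed per unit area. This figure of merit is an assumption made here for illustration and is not part of the comparison above. Writing $S$ for the speed ratio and $A$ for the area ratio of the proposed CNU relative to the CNU of [12]:

```latex
\[
S = 7, \qquad A = 0.25, \qquad \frac{S}{A} = \frac{7}{0.25} = 28 .
\]
```

Under that assumption, the proposed CNU would deliver about 28 times the speed per unit area, even when the best-case latency is taken for the architecture in [12].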