Abstract—Silicon technology scaling is continuously enabling denser
integration capabilities. However, this comes at the expense of higher
variability and susceptibility to wear-out. With an escalating number of
on-chip components expected to be defective in near-future chips, modern
parallel systems, such as Chip Multi-Processors (CMPs), become especially
vulnerable to these faults. Just a single link failure in the underlying
Network-on-Chip (NoC) may cause inter-tile communication to halt and
even deadlock, rendering the chip useless. While fault-tolerant routing
schemes do exist, they can only handle a limited number of link faults.
In this paper, we address permanent wire failures which can occur in
on-chip parallel links at manufacture-time or while in operation. Instead
of marking the entire link as faulty, we present a methodology where the
Partially Faulty Link (PFL) can still be used to transfer data between NoC
routers, thus maintaining network connectivity, improving the yield and
extending the lifetime of the chip, and allowing for graceful performance
degradation.
To achieve this, we devise architectural augmentations to both the router
and link micro-architectures, along with link fault detection, diagnosis,
and re-configuration at wire granularity. Statistical link-level
fault models demonstrate the usability of PFLs, and load-balancing
routing algorithms and low-cost re-transmission mechanisms are
presented and coupled with the proposed architecture. Hardware synthesis
demonstrates the feasibility of the proposed extensions to the base NoC
architecture. Results obtained from full-system simulations show that
high-performance NoCs are realizable in the presence of PFLs.