The preceding sequence explains how data are read from a remote memory
using hardware mechanisms that make the transaction transparent to the processor.
On top of these mechanisms, some form of cache coherence protocol is needed. Various systems differ on exactly how this is done. We make only a few general remarks
here. First, as part of the preceding sequence, node 1’s directory keeps a record that
some remote cache has a copy of the line containing location 798. Then, there needs
to be a cooperative protocol to take care of modifications. For example, if a modification is done in a cache, this fact can be broadcast to other nodes. Each node’s directory that receives such a broadcast can then determine if any local cache has that
line and, if so, cause it to be purged. If the actual memory location is at the node receiving the broadcast notification, then that node’s directory needs to maintain an
entry indicating that that line of memory is invalid and remains so until a write back
occurs. If another processor (local or remote) requests the invalid line, then the local
directory must force a write back to update memory before providing the data.
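The directory bookkeeping just described can be sketched as follows. This is an illustrative model only, not any particular vendor's protocol: the class and method names are invented, and a real directory would be implemented in hardware, not software. The home node's directory tracks which nodes hold a copy of each line, broadcasts invalidations when a copy is modified, and marks the memory copy invalid until a write back occurs.

```python
# Hypothetical sketch of the directory behavior described above.
# A real CC-NUMA directory is a hardware structure; names here are invented.

class DirectoryEntry:
    def __init__(self):
        self.sharers = set()       # nodes currently caching this line
        self.owner = None          # node holding a modified copy, if any
        self.memory_valid = True   # False while memory is stale

class HomeDirectory:
    def __init__(self):
        self.entries = {}          # line address -> DirectoryEntry

    def entry(self, line):
        return self.entries.setdefault(line, DirectoryEntry())

    def record_read(self, line, node):
        """A cache (e.g., node 2 reading the line containing 798) fetches a copy."""
        e = self.entry(line)
        if not e.memory_valid:
            self.force_write_back(line)   # update memory before supplying data
        e.sharers.add(node)

    def record_write(self, line, node):
        """A cache modifies the line: purge all other cached copies."""
        e = self.entry(line)
        for other in e.sharers - {node}:
            pass                          # send invalidate (purge) to 'other'
        e.sharers = {node}
        e.owner = node
        e.memory_valid = False            # memory stale until write back

    def force_write_back(self, line):
        e = self.entry(line)
        e.memory_valid = True             # owner's data written back to memory
        e.owner = None

d = HomeDirectory()
d.record_read(0x798, node=2)    # node 2 caches the line
d.record_write(0x798, node=2)   # node 2 modifies it; memory copy is now stale
d.record_read(0x798, node=3)    # node 3's read first forces a write back
```

After the final read, the directory again shows memory as valid and records both nodes 2 and 3 as sharers, matching the sequence described in the text.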
NUMA Pros and Cons
The main advantage of a CC-NUMA system is that it can deliver effective performance at higher levels of parallelism than SMP, without requiring major software
changes. With multiple NUMA nodes, the bus traffic on any individual node is limited to a demand that the bus can handle. However, if many of the memory accesses
are to remote nodes, performance begins to break down. There is reason to believe
that this performance breakdown can be avoided. First, the use of L1 and L2 caches
is designed to minimize all memory accesses, including remote ones. If much of the
software has good temporal locality, then remote memory accesses should not be
excessive. Second, if the software has good spatial locality, and if virtual memory is
in use, then the data needed for an application will reside on a limited number of
frequently used pages that can be initially loaded into the memory local to the running application. The Sequent designers report that such spatial locality does appear
in representative applications [LOVE96]. Finally, the virtual memory scheme can be
enhanced by including in the operating system a page migration mechanism that
will move a virtual memory page to a node that is frequently using it; the Silicon
Graphics designers report success with this approach [WHIT97].
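A page-migration policy of this kind can be sketched as follows. This is a simplified illustration, not the Silicon Graphics implementation: the counter, the threshold value, and all names are invented for the example. The idea is simply to count remote accesses to a page and move the page to the node that uses it most once some threshold is crossed.

```python
# Illustrative sketch of OS page migration; threshold and names are assumptions.
from collections import Counter

MIGRATE_THRESHOLD = 64   # assumed: remote accesses before migrating a page

class PageMigrator:
    def __init__(self, home_node):
        self.home = home_node
        self.remote_hits = Counter()   # remote accesses, per node

    def access(self, node):
        """Record one access to this page by 'node'; migrate if warranted."""
        if node == self.home:
            return
        self.remote_hits[node] += 1
        heavy, count = self.remote_hits.most_common(1)[0]
        if count >= MIGRATE_THRESHOLD:
            # Copy the page to 'heavy' and update page tables (not shown);
            # that node becomes the new home, so its accesses are now local.
            self.home = heavy
            self.remote_hits.clear()

page = PageMigrator(home_node=0)
for _ in range(64):
    page.access(node=3)   # node 3 hammers the page; it migrates to node 3
```

The design point is the trade-off the text implies: migration is only worthwhile when a page is used by a remote node often enough that the one-time copy cost is repaid by many subsequent local accesses.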
Even if the performance breakdown due to remote access is addressed, there
are other disadvantages to the CC-NUMA approach. Two in particular are discussed in detail in [PFIS98]. First, a CC-NUMA does not transparently look like an
SMP; software changes will be required to move an operating system and applications from an SMP to a CC-NUMA system. These include page allocation, already
mentioned, process allocation, and load balancing by the operating system. A
second concern is that of availability. This is a rather complex issue and depends
on the exact implementation of the CC-NUMA system; the interested reader is
referred to [PFIS98].