4.5 Scaling Geographically
The leader and followers configuration allows TAO to
scale to handle a high workload, since read throughput
scales with the total number of follower servers in all
tiers. Implicit in the design, however, is the assumption
that the network latencies from follower to leader and
leader to database are low. This assumption is reasonable
if clients are restricted to a single data center, or even to
a set of data centers in close proximity. It is not true,
however, in our production environment.
As our social networking application’s computing and
network requirements have grown, we have had to expand
beyond a single geographical location: today, follower
tiers can be thousands of miles apart. In this configuration,
network round trip times can quickly become
the bottleneck of the overall architecture. Since read
misses by followers are 25 times as frequent as writes in
our workloads, we chose a master/slave architecture that
requires writes to be sent to the master, but that allows
read misses to be serviced locally. As with the leader/follower design, we propagate update notifications asynchronously to maximize performance and availability, at the expense of data freshness.
The social graph is tightly interconnected; it is not possible
to group users so that cross-partition requests are
rare. This means that each TAO follower must be local
to a tier of databases holding a complete multi-petabyte
copy of the social graph. It would be prohibitively expensive to provide full replicas in every data center.
Our solution to this problem is to choose data center
locations that are clustered into only a few regions, where
the intra-region latency is small (typically less than 1 millisecond).
It is then sufficient to store one complete copy
of the social graph per region. Figure 2 shows the overall
architecture of the master/slave TAO system.
Followers behave identically in all regions, forwarding
read misses and writes to the local region’s leader tier.
Leaders query the local region’s database regardless of
whether it is the master or slave. Writes, however, are
forwarded by the local leader to the leader that is in the
region with the master database. This means that read
latency is independent of inter-region latency.
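This forwarding rule can be captured in a short sketch. The Python below is purely illustrative; the class names, the in-memory database stand-in, and the read/write methods are our own assumptions rather than TAO's interfaces.

    # Illustrative per-region routing sketch (all names are hypothetical).
    class InMemoryDB:
        """Stand-in for a regional database replica (master or slave)."""
        def __init__(self):
            self.rows = {}
        def get(self, key):
            return self.rows.get(key)
        def put(self, key, value):
            self.rows[key] = value

    class Leader:
        def __init__(self, local_db, master_leader=None):
            self.local_db = local_db            # this region's replica
            self.master_leader = master_leader  # None if this region holds the master

        def read(self, key):
            # Leaders query the local region's database, master or slave.
            return self.local_db.get(key)

        def write(self, key, value):
            # Writes are forwarded to the leader in the master region.
            if self.master_leader is not None:
                return self.master_leader.write(key, value)
            return self.local_db.put(key, value)

    class Follower:
        def __init__(self, local_leader):
            self.local_leader = local_leader
            self.cache = {}

        def read(self, key):
            # Read misses stay inside the region, so read latency does not
            # depend on inter-region round trips.
            if key not in self.cache:
                self.cache[key] = self.local_leader.read(key)
            return self.cache[key]

        def write(self, key, value):
            return self.local_leader.write(key, value)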
The master region is controlled separately for each
shard, and is automatically switched to recover from the
failure of a database. Writes that fail during the switch
are reported to the client as failed, and are not retried.
Note that since each cache hosts multiple shards, a server
may be both a master and a slave at the same time. We
prefer to locate all of the master databases in a single region. When an inverse association is mastered in a different region, TAO must traverse an extra inter-region link to forward the inverse write.
TAO embeds invalidation and refill messages in the
database replication stream. These messages are delivered in a region immediately after a transaction has been replicated to a slave database. Delivering such messages earlier would create cache inconsistencies, as reading from the local database would provide stale data. At Facebook, TAO and memcache use the same pipeline for delivery of invalidations and refills [21].
If a forwarded write is successful then the local leader
will update its cache with the fresh value, even though
the local slave database probably has not yet been updated by the asynchronous replication stream. In this
case followers will receive two invalidates or refills from
the write, one that is sent when the write succeeds and
one that is sent when the write’s transaction is replicated
to the local slave database.
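A short sketch of why a follower observes two messages for one forwarded write is given below; the method names and the replication callback are assumptions made for illustration.

    # Hypothetical sketch: one forwarded write produces two refills downstream.
    class SlaveRegionLeader:
        def __init__(self, master_leader, followers, cache):
            self.master_leader = master_leader
            self.followers = followers
            self.cache = cache

        def write(self, key, value):
            self.master_leader.write(key, value)  # forward to the master region
            self.cache[key] = value               # fresh value, ahead of the slave DB
            for f in self.followers:
                f.refill(key, value)              # message 1: sent when the write succeeds

        def on_replicated(self, key, value):
            # Invoked when the write's transaction reaches the local slave database
            # through the replication stream carrying invalidation/refill messages.
            for f in self.followers:
                f.refill(key, value)              # message 2: sent after local replication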
TAO’s master/slave design ensures that all reads can
be satisfied within a single region, at the expense of potentially returning stale data to clients. As long as a user
consistently queries the same follower tier, the user will
typically have a consistent view of TAO state. We discuss
exceptions to this in the next section.
5 Implementation
Previous sections describe how TAO servers are aggregated to handle large volumes of data and query rates. This section details important optimizations for performance
and storage efficiency.
5.1 Caching Servers
TAO’s caching layer serves as an intermediary between
clients and the databases. It aggressively caches objects
and associations to provide good read performance.
TAO’s memory management is based on Facebook’s
customized memcached, as described by Nishtala et
al. [21]. TAO has a slab allocator that manages slabs of
equal size items, a thread-safe hash table, LRU eviction
among items of equal size, and a dynamic slab rebalancer
that keeps the LRU eviction ages similar across all types
of slabs. A slab item can hold one node or one edge list.
To provide better isolation, TAO partitions the available
RAM into arenas, selecting the arena by the object
or association type. This allows us to extend the cache
lifetime of important types, or to prevent poor cache citizens from evicting the data of better-behaved types. So
far we have only manually configured arenas to address
specific problems, but it should be possible to automatically
size arenas to improve TAO’s overall hit rate.
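As a concrete illustration, arena selection might look like the sketch below; the type names and arena names are invented for the example.

    # Type-based arena selection (illustrative names only).
    ARENA_BY_TYPE = {
        'user': 'long_lived',         # important type whose cache lifetime we extend
        'ephemeral_like': 'isolated'  # poor cache citizen kept away from other types
    }
    DEFAULT_ARENA = 'general'

    def arena_for(item_type):
        # Each arena has its own LRU state, so evictions caused by one type
        # cannot push out data belonging to types in a different arena.
        return ARENA_BY_TYPE.get(item_type, DEFAULT_ARENA)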
For small fixed-size items, such as association counts,
the memory overhead of the pointers for bucket items in
the main hash table becomes significant. We store these
items separately, using direct-mapped 8-way associative
caches that require no pointers. LRU order within each bucket is tracked by simply sliding the entries down. We
achieve additional memory efficiency by adding a table
that maps each active atype to a 16-bit value. This
lets us map (id1, atype) to a 32-bit count in 14 bytes; a
negative entry, which records the absence of any id2 for
an (id1, atype), takes only 10 bytes. This optimization
allows us to hold about 20% more items in cache for a
given system configuration.
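The byte counts above follow from a fixed-width encoding along the lines of the sketch below (a 64-bit id1, a 16-bit atype index, and a 32-bit count); the exact in-memory layout, including the 8-way associative buckets, is not reproduced here.

    import struct

    # One encoding consistent with the quoted sizes (not the actual layout):
    # 8-byte id1 + 2-byte atype index + 4-byte count = 14 bytes per entry,
    # and 8 + 2 = 10 bytes for a negative entry with no count.
    POSITIVE_ENTRY = struct.Struct('<QHI')  # .size == 14
    NEGATIVE_ENTRY = struct.Struct('<QH')   # .size == 10

    atype_index = {}  # maps each active atype to a 16-bit value

    def encode_count(id1, atype, count):
        idx = atype_index.setdefault(atype, len(atype_index))
        return POSITIVE_ENTRY.pack(id1, idx, count)

    def encode_negative(id1, atype):
        idx = atype_index.setdefault(atype, len(atype_index))
        return NEGATIVE_ENTRY.pack(id1, idx)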
5.2 MySQL Mapping
Recall that we divide the space of objects and associations
into shards. Each shard is assigned to a logical
MySQL database that has a table for objects and a table
for associations. All of the fields of an object are serialized
into a single 'data' column. This approach allows us to store objects of different types within the same table. Objects that benefit from separate data management policies are stored in separate custom tables.
Associations are stored similarly to objects, but to support
range queries, their tables have an additional index
based on id1, atype, and time. To avoid potentially expensive
SELECT COUNT queries, association counts
are stored in a separate table.
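The resulting per-shard layout can be summarized as follows; the column names and the JSON encoding are assumptions for illustration, not TAO's actual schema.

    import json

    # Illustrative per-shard layout (column names are assumptions).
    OBJECT_COLUMNS = ('id', 'otype', 'data')                  # object fields serialized into 'data'
    ASSOC_COLUMNS  = ('id1', 'atype', 'id2', 'time', 'data')  # indexed on (id1, atype, time)
    COUNT_COLUMNS  = ('id1', 'atype', 'count')                # separate table avoids SELECT COUNT

    def object_row(obj_id, otype, fields):
        # Objects of different types can share one table because their
        # fields are packed into a single serialized column.
        return {'id': obj_id, 'otype': otype, 'data': json.dumps(fields)}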
5.3 Cache Sharding and Hot Spots
Shards are mapped onto cache servers within a tier using
consistent hashing [15]. This simplifies tier expansions
and request routing. However, this semi-random assignment
of shards to cache servers can lead to load imbalance:
some followers will shoulder a larger portion of
the request load than others. TAO rebalances load among
followers with shard cloning, in which reads to a shard
are served by multiple followers in a tier. Consistency
management messages for a cloned shard are sent to all
followers hosting that shard.
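The sketch below illustrates the general idea of consistent hashing with shard cloning; the ring construction (e.g., use of virtual nodes) and all names are assumptions, not TAO's routing code.

    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    class FollowerRing:
        def __init__(self, servers):
            # Place each follower at a point on a hash ring.
            self.points = sorted((_hash(s), s) for s in servers)

        def owners(self, shard, clones=1):
            # A shard's primary owner is the first server clockwise from its
            # hash; a cloned shard is also served by the next (clones - 1)
            # servers, and consistency messages go to every owner.
            i = bisect.bisect(self.points, (_hash(shard),))
            n = len(self.points)
            return [self.points[(i + k) % n][1] for k in range(clones)]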
In our workloads, it is not uncommon for a popular
object to be queried orders of magnitude more often than
other objects. Cloning can distribute this load across
many followers, but the high hit rate for these objects
makes it worthwhile to place them in a small client-side
cache. When a follower responds to a query for a hot
item, it includes the object or association’s access rate.
If the access rate exceeds a certain threshold, the TAO
client caches the data and version. By including the version
number in subsequent queries, the follower can omit
the data in replies if the data has not changed since the
previous version. The access rate can also be used to
throttle client requests for very hot objects.
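A client-side sketch of this hot-item cache is shown below; the threshold value, the reply fields, and the query signature are assumptions for illustration.

    # Hypothetical client-side cache for very hot items.
    HOT_THRESHOLD = 10000  # access rate above which the client starts caching

    class TaoClient:
        def __init__(self, follower):
            self.follower = follower
            self.hot_cache = {}  # key -> (version, data)

        def get(self, key):
            cached = self.hot_cache.get(key)
            version = cached[0] if cached else None
            # The follower reports the item's access rate and may omit the
            # data when the supplied version is still current.
            reply = self.follower.query(key, version=version)
            if reply['unchanged']:
                return cached[1]
            if reply['access_rate'] > HOT_THRESHOLD:
                self.hot_cache[key] = (reply['version'], reply['data'])
            return reply['data']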
5.4 High-Degree Objects
Many objects have more than 6,000 associations with the
same atype emanating from them, so TAO does not cache
the complete association list. It is also common for assoc_get queries to have an empty result
(no edge exists between the specified id1 and id2). Unfortunately, for high-degree objects these queries will always go to the database, because the queried id2 could
be in the uncached tail of the association list.
We have addressed this inefficiency in the cache implementation by modifying client code that is observed
to issue problematic queries. One solution to this problem
is to use assoc_count to choose the query direction,
since checking for the inverse edge is equivalent. In
some cases where both ends of an edge are high-degree
nodes, we can also leverage application-domain knowledge to improve cacheability. Many associations set the time field to their creation time, and many objects include their creation time as a field. Since an edge to a
node can only be created after the node has been created,
we can limit the id2 search to associations whose time
is ≥ the object's creation time. So long as an edge older than the object is present in the cache, this query
can be answered directly by a TAO follower.
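The two tactics can be sketched as follows; assoc_count and assoc_get are TAO's query operations, while the tao handle, the cached-list representation, and the helper names are assumptions for illustration.

    # Tactic 1: query in the direction with the smaller association count,
    # since checking for the inverse edge is equivalent.
    def edge_exists(tao, id1, atype, id2, inverse_atype):
        if tao.assoc_count(id2, inverse_atype) < tao.assoc_count(id1, atype):
            id1, id2, atype = id2, id1, inverse_atype
        return tao.assoc_get(id1, atype, [id2]) != []

    # Tactic 2: an edge to id2 can only be newer than id2's creation time, so
    # if the cached prefix of id1's association list reaches back past that
    # time, the follower can answer without consulting the database.
    def edge_in_cached_prefix(cached_assocs, oldest_cached_time, id2, id2_created):
        if oldest_cached_time <= id2_created:
            return any(a['id2'] == id2 for a in cached_assocs)
        return None  # undecidable from the cache alone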
6 Consistency and Fault Tolerance
Two of the most important requirements for TAO are
availability and performance. When failures occur we
would like to continue to render Facebook, even if the
data is stale. In this section, we describe the consistency
model of TAO under normal operation, and how TAO
sacrifices consistency under failure modes.
6.1 Consistency
Under normal operation, objects and associations in TAO
are eventually consistent [33, 35]; after a write, TAO
guarantees the eventual delivery of an invalidation or refill
to all tiers. Given a sufficient period of time during
which external inputs have quiesced, all copies of data
in TAO will be consistent and reflect all successful write
operations to all objects and associations. Replication
lag is usually less than one second.
In