6 Consistency and Fault Tolerance
Two of the most important requirements for TAO are
availability and performance. When failures occur, we
would like to continue to render Facebook, even if the
data is stale. In this section, we describe the consistency
model of TAO under normal operation, and how TAO
sacrifices consistency under failure modes.
6.1 Consistency
Under normal operation, objects and associations in TAO
are eventually consistent [33, 35]; after a write, TAO
guarantees the eventual delivery of an invalidation or refill
to all tiers. Given a sufficient period of time during
which external inputs have quiesced, all copies of data
in TAO will be consistent and reflect all successful write
operations to all objects and associations. Replication
lag is usually less than one second.
In normal operation (at most one failure encountered
by a request) TAO provides read-after-write consistency
within a single tier. TAO synchronously updates the
cache with locally written values by having the master
leader return a changeset when the write is successful.
This changeset is propagated through the slave leader (if
any) to the follower tier that originated the write query.
If an inverse type is configured for an association, then
writes to associations of that type may affect both the
id1’s and the id2’s shards. In these cases, the changeset
returned by the master leader contains both updates, and
the slave leader (if any) and the follower that forwarded
the write must each send the changeset to the id2’s shard
in their respective tiers, before returning to the caller.
The changeset cannot always be safely applied to the
follower’s cache contents, because the follower’s cache
may be stale if the refill or invalidate from a second follower’s
update has not yet been delivered. We resolve
this race condition in most cases with a version number
that is present in the persistent store and the cache. The
version number is incremented during each update, so
the follower can safely invalidate its local copy of the
data if the changeset indicates that its pre-update value
was stale. Version numbers are not exposed to the TAO
clients.
In slave regions, this scheme is vulnerable to
a rare race condition between cache eviction and storage
server update propagation. The slave storage server
may hold an older version of a piece of data than what
is cached by the caching server, so if the post-changeset
entry is evicted from cache and then reloaded from the
database, a client may observe a value go back in time
in a single follower tier. Such a situation can only occur
if it takes longer for the slave region’s storage server
to receive an update than it does for a cached item to be
evicted from cache, which is rare in practice.
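The following Python sketch illustrates the changeset-versioning logic from the preceding paragraphs. The class and field names are hypothetical and TAO is not implemented in Python; the sketch only shows how a follower might decide between applying a changeset in place and invalidating its cached copy.

    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class CacheEntry:
        version: int      # version number also kept in the persistent store
        value: object

    @dataclass
    class Changeset:
        key: str
        pre_version: int  # version the write was applied on top of
        new_version: int  # version after the write
        new_value: object

    class FollowerCache:
        """Hypothetical follower-side cache applying changesets from its leader."""

        def __init__(self):
            self.entries: Dict[str, CacheEntry] = {}

        def apply_changeset(self, cs: Changeset) -> None:
            cached: Optional[CacheEntry] = self.entries.get(cs.key)
            if cached is None:
                return  # nothing cached; a later miss will refill from the leader
            if cached.version == cs.pre_version:
                # Our copy matches the value the write was based on, so the
                # changeset can be applied in place.
                self.entries[cs.key] = CacheEntry(cs.new_version, cs.new_value)
            elif cached.version < cs.pre_version:
                # The changeset indicates our pre-update value was stale (a
                # refill or invalidate from another follower's update has not
                # arrived yet), so it is only safe to invalidate the local copy.
                del self.entries[cs.key]
            # Otherwise our copy is already at least as new; keep it.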
Although TAO does not provide strong consistency for
its clients, it writes to MySQL synchronously, so the
master database is a consistent source of truth. This
allows us to provide stronger consistency for the small
subset of requests that need it. TAO reads may be marked
critical, in which case they will be proxied to the master
region. We could use critical reads during an authentication
process, for example, so that replication lag doesn’t
allow use of stale credentials.
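As an illustration, a client-side wrapper for such a read might look like the sketch below; the function name, the critical flag, and the client fields are hypothetical rather than TAO's actual API.

    # Hypothetical routing of a critical read.  Ordinary reads go to the
    # local region's follower tier; critical reads are proxied to the
    # master region, whose database is the consistent source of truth.
    def assoc_get(client, id1, atype, id2s, critical=False):
        tier = client.master_region_tier if critical else client.local_tier
        return tier.assoc_get(id1, atype, id2s)

    # e.g. during authentication, so that replication lag cannot expose
    # stale credentials:
    #   assoc_get(tao, user_id, "AUTHENTICATED", [session_id], critical=True)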
6.2 Failure Detection and Handling
TAO scales to thousands of machines over multiple geographical locations, so transient and permanent failures
are commonplace. Therefore, it is important that
TAO detect potential failures and route around them.
TAO servers employ aggressive network timeouts so as
not to continue waiting on responses that may never arrive.
Each TAO server maintains per-destination timeouts,
marking hosts as down if there are several consecutive
timeouts, and remembering downed hosts so that
subsequent requests can be proactively aborted. This
simple failure detector works well, although it does not
always preserve full capacity in a brown-out scenario,
such as bursty packet drops that limit TCP throughput.
Upon detection of a failed server, TAO routes around the
failures in a best-effort fashion to preserve availability and performance at the cost of consistency. We
actively probe failed machines to discover when (if) they
recover.
Database failures: Databases are marked down in a global configuration if they crash, if they are taken offline
for maintenance, or if they are replicating from a
master database and they get too far behind. When a master database is down, one of its slaves is automatically
promoted to be the new master.
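A minimal sketch of the per-destination timeout bookkeeping described at the start of this subsection is shown below; the thresholds and names are illustrative, not TAO's actual values.

    import time

    CONSECUTIVE_TIMEOUT_LIMIT = 3   # illustrative threshold
    DOWN_HOLDOFF_SECONDS = 10.0     # how long to proactively abort requests

    class DestinationHealth:
        """Hypothetical per-destination state kept by each TAO server."""

        def __init__(self):
            self.consecutive_timeouts = 0
            self.down_until = 0.0

        def is_down(self, now=None):
            # Requests to a host recently marked down are aborted up front.
            now = time.time() if now is None else now
            return now < self.down_until

        def record_success(self):
            self.consecutive_timeouts = 0
            self.down_until = 0.0

        def record_timeout(self, now=None):
            now = time.time() if now is None else now
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= CONSECUTIVE_TIMEOUT_LIMIT:
                # Several consecutive timeouts: mark the host down; background
                # probes discover when (if) it recovers.
                self.down_until = now + DOWN_HOLDOFF_SECONDS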
When a region’s slave database is down, cache misses
are redirected to the TAO leaders in the region hosting the
database master. Since cache consistency messages are
embedded in the database’s replication stream, however,
they can’t be delivered by the primary mechanism. During
the time that a slave database is down, an additional
binlog tailer is run on the master database, and the refills
and invalidates are delivered inter-regionally. When
the slave database comes back up, invalidation and refill
messages from the outage will be delivered again.
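A rough sketch of this miss-handling decision, with hypothetical function and object names:

    # Hypothetical handling of a cache miss in a slave region.  While the
    # local slave database is marked down, misses go to the leader tier in
    # the region hosting the master database; refills and invalidates for
    # that period come from an extra binlog tailer on the master.
    def fetch_on_miss(shard, slave_db, master_region_leaders):
        if slave_db.is_marked_down():
            return master_region_leaders.fetch(shard)
        return slave_db.fetch(shard)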
Leader failures: When a leader cache server fails,
followers automatically route read and write requests
around it. Followers reroute read misses directly to
the database. Writes to a failed leader, in contrast,
are rerouted to a random member of the leader’s tier.
This replacement leader performs the write and associated
actions, such as modifying the inverse association
and sending invalidations to followers. The replacement
leader also enqueues an asynchronous invalidation to the
original leader that will restore its consistency. These
asynchronous invalidates are both recorded on the coordinating node and inserted into the replication stream,
where they are spooled until the leader becomes available.
If the failing leader is partially available, followers
may see a stale value until the leader’s consistency
is restored.
Refill and invalidation failures: Leaders send refills and invalidations asynchronously. If a follower is unreachable, the leader queues the message to disk to be
delivered at a later time. Note that a follower may be
left with stale data if these messages are lost due to permanent leader failure. This problem is solved by a bulk
invalidation operation that invalidates all objects and associations from a shard id. After a failed leader box is
replaced, all of the shards that map to it must be invalidated in the followers, to restore consistency.
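The sketch below illustrates, with hypothetical names, how a leader might spool undeliverable refills and invalidates to disk, and how a bulk invalidation could restore follower consistency after a failed leader box is replaced.

    # Hypothetical spooling of asynchronous cache-maintenance messages.
    def send_invalidate(follower, key, disk_queue):
        try:
            follower.invalidate(key)
        except ConnectionError:
            # Follower unreachable: queue the message on disk for later
            # delivery.  If the leader box itself is lost for good, this
            # queue is lost with it.
            disk_queue.append((follower.tier_id, key))

    # After a failed leader box is replaced, consistency is restored by
    # invalidating, in every follower tier, all objects and associations
    # belonging to the shards that mapped to the failed leader.
    def bulk_invalidate(follower_tiers, shard_ids):
        for tier in follower_tiers:
            for shard in shard_ids:
                tier.invalidate_shard(shard)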
Follower failures: In the event that a TAO follower
fails, followers in other tiers share the responsibility of
serving the failed host’s shards. We configure each TAO
client with a primary and backup follower tier. In normal
operation, requests are sent only to the primary. If
the server that hosts the shard for a particular request has
been marked down due to timeouts, then the request is
sent instead to that shard’s server in the backup tier. Because failover requests still go to a server that hosts the
corresponding shard, they are fully cacheable and do not
require extra consistency work. Read and write requests
from the client are failed over in the same way. Note that
failing over between different tiers may cause read-after-write consistency to be violated if the read reaches the
failover target before the write’s refill or invalidate.
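A minimal sketch of this client-side routing, assuming hypothetical tier and host abstractions:

    # Hypothetical shard routing with a primary and a backup follower tier.
    # Failover stays on a server that hosts the same shard, so requests
    # remain fully cacheable, but read-after-write consistency can be lost
    # if the read arrives before the write's refill or invalidate.
    def choose_follower(client, shard):
        primary = client.primary_tier.host_for_shard(shard)
        if not primary.marked_down:
            return primary
        return client.backup_tier.host_for_shard(shard)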