The classical view of a distributed DBMS is that it should behave just like a centralized DBMS from the point of view of a user; issues arising from distribution of data should be transparent to the user, although, of course, they must be addressed at the implementation level.
With respect to queries, this view of a distributed DBMS means that users should be able to ask queries without worrying about how and where relations are stored; we have already seen the implications of this requirement on query evaluation.
With respect to updates, this view means that transactions should continue to be atomic actions, regardless of data fragmentation and replication. In particular, all copies of a modified relation must be updated before the modifying transaction commits. This form of replication is known as synchronous replication.
An alternative approach to replication, called asynchronous replication, has come to be widely used in commercial distributed DBMSs. Copies of a modified relation are updated only periodically in this approach, and a transaction that reads different copies of the same relation may see different values. Thus, asynchronous replication compromises distributed data independence, but it can be more efficiently implemented than synchronous replication.
Synchronous Replication
There are two basic techniques for ensuring that transactions see the same value regardless of which copy of an object they access. In the first technique, called voting, a transaction must write a majority of copies in order to modify an object and read at least enough copies to make sure that one of the copies is current. For example, if there are [...] transactions, then at least 4 copies must be read. Each copy has a version number, and the copy with the highest version number is current. This technique is not attractive in most situations because reading an object requires reading multiple copies; in most applications, objects are read much more frequently than they are updated, and efficient performance on reads is very important.
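The voting rule can be made concrete with a small sketch. The class and method names below (`VotingObject`, `Copy`, `read_quorum`) are hypothetical and not taken from any DBMS, and the 5-copy configuration in the usage note is an illustrative assumption. The sketch assumes a write reaches `w` of `n` copies with `w` a majority, so a read of `n - w + 1` copies must overlap the most recent write.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """One replica of an object, tagged with a version number."""
    value: object = None
    version: int = 0


class VotingObject:
    """Replicated object using voting (hypothetical illustration class)."""

    def __init__(self, n_copies: int, write_quorum: int):
        # A write must reach a majority of the copies.
        if not n_copies // 2 < write_quorum <= n_copies:
            raise ValueError("write quorum must be a majority of the copies")
        self.copies = [Copy() for _ in range(n_copies)]
        self.write_quorum = write_quorum
        # Smallest read set guaranteed to overlap every write set.
        self.read_quorum = n_copies - write_quorum + 1

    def read(self, sites):
        """Read the copies at `sites`; return (value, version) of the newest."""
        chosen = set(sites)
        if len(chosen) < self.read_quorum:
            raise ValueError("too few copies read to be sure one is current")
        newest = max((self.copies[i] for i in chosen), key=lambda c: c.version)
        return newest.value, newest.version

    def write(self, value, sites):
        """Write `value` to the copies at `sites` with a fresh version number."""
        chosen = set(sites)
        if len(chosen) < self.write_quorum:
            raise ValueError("too few copies written")
        # A majority write set is also a valid read set, so it reveals
        # the current version number.
        _, current = self.read(chosen)
        for i in chosen:
            self.copies[i] = Copy(value, current + 1)
```

With 5 copies and 3 written per update, `VotingObject(5, 3)` requires reads of at least 3 copies: after writing to sites 0 to 2 and then to sites 2 to 4, a read of sites 0 to 2 still sees the newer value, because site 2 carries the highest version number. The cost noted above is visible in `read`, which always touches several copies.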
In the second technique, called read-any write-all, to read an object, a transaction can read any one copy, but to write an object, it must write all copies. Reads are fast, especially if we have a local copy, but writes are slower, relative to the first technique. This technique is attractive when reads are much more frequent than writes, and it is usually adopted for implementing synchronous replication.
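For contrast with voting, read-any write-all can be sketched in a few lines. The class name `ReadAnyWriteAll` and the `available` flags are hypothetical stand-ins for real sites and failure detection; locking and the commit protocol are deliberately left out.

```python
class ReadAnyWriteAll:
    """Replicated object under read-any write-all (hypothetical sketch)."""

    def __init__(self, n_copies: int):
        self.copies = [None] * n_copies
        # Stand-in for site/link availability; a real system would detect this.
        self.available = [True] * n_copies

    def read(self, site: int):
        """A read touches exactly one copy, ideally the local one."""
        return self.copies[site]

    def write(self, value):
        """A write must reach every copy, so one unreachable site blocks it."""
        if not all(self.available):
            raise RuntimeError("cannot write: some copy is unreachable")
        self.copies = [value] * len(self.copies)
```

Because every copy is current after a successful write, `read` never needs version numbers. The price is that `write` fails as soon as any single copy is unreachable.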
Asynchronous Replication
Synchronous replication comes at a significant cost. Before an update transaction can commit, it must obtain exclusive locks on all copies of modified data (assuming that the read-any write-all technique is used). The transaction may have to send lock requests to remote sites, and wait for the locks to be granted, and during this potentially long period, it continues to hold all its other locks. If sites or communication links fail, the transaction cannot commit until all sites at which it has modified data recover and are reachable. Finally, even if locks are obtained readily and there are no failures, committing a transaction requires several additional messages to be sent as part of a commit protocol.
Primary Site versus Peer-to-Peer Replication
Asynchronous replication comes in two flavors. In primary site asynchronous replication, one copy of a relation is designated as the primary or master copy. Replicas of the entire relation or of fragments of the relation can be created at other sites; these are secondary copies, and, unlike the primary copy, they cannot be updated. A common mechanism for setting up primary and secondary copies is that users first register or publish the relation at the primary site and subsequently subscribe to a fragment of a registered relation from another (secondary) site.
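The publish and subscribe mechanism, together with periodic propagation, can be illustrated with a toy model. All names here (`PrimarySite`, `SecondarySite`, `publish`, `subscribe`, `propagate`) and the sample `Sailors` rows are assumptions made for the sketch, not the interface of any particular product; a fragment is modeled as a row predicate.

```python
class SecondarySite:
    """Holds read-only secondary copies; no update method is offered."""

    def __init__(self):
        self.fragments = {}  # relation name -> list of row dicts

    def read(self, name):
        return list(self.fragments.get(name, []))


class PrimarySite:
    """Holds the primary copies and pushes changes out periodically."""

    def __init__(self):
        self.relations = {}      # published relation name -> list of row dicts
        self.subscriptions = []  # (secondary, relation name, row predicate)

    def publish(self, name, rows):
        self.relations[name] = [dict(r) for r in rows]

    def subscribe(self, secondary, name, predicate):
        if name not in self.relations:
            raise KeyError(f"{name} has not been published")
        self.subscriptions.append((secondary, name, predicate))
        self._refresh(secondary, name, predicate)  # initial copy

    def insert(self, name, row):
        # Updates go to the primary copy only; secondaries are not touched.
        self.relations[name].append(dict(row))

    def propagate(self):
        # Called periodically, not as part of the updating transaction.
        for secondary, name, predicate in self.subscriptions:
            self._refresh(secondary, name, predicate)

    def _refresh(self, secondary, name, predicate):
        rows = self.relations[name]
        secondary.fragments[name] = [dict(r) for r in rows if predicate(r)]


primary = PrimarySite()
primary.publish("Sailors", [{"sid": 1, "rating": 9}, {"sid": 2, "rating": 3}])
branch = SecondarySite()
primary.subscribe(branch, "Sailors", lambda r: r["rating"] > 5)
primary.insert("Sailors", {"sid": 3, "rating": 8})
stale = branch.read("Sailors")   # secondary copy has not yet seen sid 3
primary.propagate()
fresh = branch.read("Sailors")   # now includes sid 3
```

Between `insert` and `propagate`, a reader at the secondary site sees an older state than a reader at the primary site, which is exactly the loss of distributed data independence that asynchronous replication accepts.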
In peer-to-peer asynchronous replication, more than one copy (although perhaps not all) can be designated as being updatable, that is, a master copy. In addition to propagating changes, a conflict resolution strategy must be used to deal with conflicting changes made at different sites.
For example, the same value may be changed to 35 at one site and to 38 at another. Which value is 'correct'? Many more subtle kinds of conflicts can arise in peer-to-peer replication, and in general peer-to-peer replication must resolve them with some conflict resolution strategy. Nonetheless, situations in which peer-to-peer replication does not lead to conflicts arise quite often, and it is in such situations that peer-to-peer replication is best utilized. For example:
Each master is allowed to update only a fragment (typically a horizontal fragment) of the relation, and any two fragments updatable by different masters are disjoint.
Updating rights are held by only one master at a time. For example, one site is designated as a backup to another site. Changes at the master site are propagated to other sites and updates are not allowed at other sites (including the backup). But if the master site fails, the backup site takes over and updates are now permitted at (only) the backup site.
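Both conflict-free arrangements come down to checking who may update what. The sketch below is a minimal illustration under assumed names (`FragmentedMasters`, `SingleWriter`, and the site and key values are all hypothetical); change propagation between sites is omitted so that only the update-rights check is shown.

```python
class FragmentedMasters:
    """Each master may update only the rows in its own horizontal fragment."""

    def __init__(self, owner_of):
        self.owner_of = owner_of  # function mapping a row key to its master
        self.rows = {}

    def update(self, master, key, value):
        if self.owner_of(key) != master:
            raise PermissionError(f"{master} does not own the fragment for {key!r}")
        # Fragments are disjoint, so no other master can change this row.
        self.rows[key] = value


class SingleWriter:
    """Updating rights are held by one site at a time, with a backup site."""

    def __init__(self, master, backup):
        self.writer = master
        self.backup = backup
        self.value = None

    def update(self, site, value):
        if site != self.writer:
            raise PermissionError(f"updates are not allowed at {site}")
        self.value = value

    def fail_over(self):
        # The master has failed: updates are now permitted only at the backup.
        self.writer = self.backup
```

In the first class, two masters can never produce conflicting changes because every row has exactly one possible updater. In the second, `fail_over` moves the updating rights as a whole, so there is still only one updater at any moment.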
Data Warehousing: An Example of Replication
Complex decision support queries that look at data from multiple sites are becoming very important. The paradigm of executing queries that span multiple sites is simply inadequate for performance reasons. One way to provide such complex query support over data from multiple sources is to create a copy of all the data at some one location and to use the copy rather than going to the individual sources. Such a copied collection of data is called a data warehouse. Specialized systems for building, maintaining, and querying data warehouses have become important tools in the marketplace.
Data warehouses can be seen as one instance of asynchronous replication, in which copies are updated relatively infrequently. When we talk of replication, we typically mean copies maintained under the control of a single DBMS, whereas with data warehousing, the original data may be on different software platforms (including database systems and OS file systems) and even belong to different organizations. This distinction, however, is likely to become blurred as vendors adopt more 'open' strategies to replication. For example, some products already support the maintenance of replicas of relations stored in one vendor's DBMS in another vendor's DBMS.