
Data Synchronization: A Complete Theoretical Solution for Filesystems 3 of 22
discussed in related academic works” (page 65). By presenting, for the first time, a complete
analysis of the synchronization process for two replicas of a theoretical filesystem model, we
fill these gaps.
1.1. Our contribution
This paper is a complete analysis of the synchronization process of two replicas of a
theoretical filesystem. Our filesystem model together with the commands considered – such as
create,rmdir, or edit – are discussed in Secion 3. The changes between the original replica and
the locally updated versions are captured by a special command set as defined in Section 4,
which also defines algorithms generating this update information.
The important question of which filesystems can, in general, be considered to be a synchro-
nized state of two divergent replicas is tackled in Section 5. Our definition captures this notion
in its full generality without prescribing how the synchronized state can be reached. Providing
such a declaration-based definition of the synchronized state is one of our main contributions.
Section 6presents the generic synchronization algorithm based on conflict resolution. By using
different strategies to resolve conflicts, any of the synchronized states allowed by our definition,
and only those, can be the result of the algorithm. Finally, Section 7summarizes the results and
lists open problems and directions for future research.
2. Methodology
Our filesystem model, defined in Section 3, is arguably simple, but it retains all important
structural properties of real filesystems. It is this simplicity that allows us to exploit a rich and
intriguing algebraic structure [
3
,
4
] which eventually leads to the claimed theoretical results. In
Section 7we discuss some possible extensions of the filesystem model.
Depending on the data communicated by the replicas, synchronizers are categorized as
either state-based or operation-based [
13
,
16
]. In state-based synchronization replicas send their
current versions of the filesystem, or merely the differences (called diff s) between the current
states and the last known synchronized state [
1
]. Operation-based synchronizers transmit the
complete log (or trace) of all operations performed by the user [
7
]. The synchronization method
of this paper is reminiscent of operation-based one, but can also be considered, with similar
overhead, to be state-based. A set of virtual filesystem operations (commands) will be defined
in Section 3. Each of these commands have a clear and intuitive operational meaning, but are
not necessarily available to the end-user. The current state of the filesystem is described by
a special – called canonical – sequence of virtual commands, which transforms the original
filesystem to the actual replica. The update detector can generate this canonical sequence from
the operations performed by the user on the replica (operation-based) as well as from the
differences between the original and final state (state-based). On Figure 1these sequences
correspond to the information in αand βpassed to the reconciler.
Filesystem synchronization must resolve all conflicts between the replicas. Conflict resolu-
tion, however, should be “intuitively correct,” i.e. discarding all changes made by both replicas
is not a viable alternative. The majority of commercially or theoretically available synchroniz-
ers do not present a rationale to explain their concrete conflict resolution approach [
16
]. Two
notable exceptions are [
20
] and [
10
] which describe high-level consistency philosophies. In
[
20
] the main principles are 1) no lost update: preserve all updates on all replicas because these
updates are equally valid; and 2) no side effects: such as a merge where objects unexpectedly
disappear. While these principles intuitively make sense, it is easy to see that neither could
possibly be upheld for every conflict; even the authors provide counterexamples. In [
10
] the
relevant consistency requirements are worded as follows:
R1
intention-confined effect: operations applied to the replicas by the synchronizer must be
based on operations generated by the end-user; and