ACID 2.0 - Associative, Commutative, Idempotent, Distributed

ACID 2.0 - Associative, Commutative, Idempotent, Distributed

In 2011, Shapiro et al formally introduced the term “Conflict-free Replicated Data Types” (CRDT). Back then, they described two CRDT varieties: Convergent RDT (CvRDT, state-based) and Commutative RDT (CmRDT, op-based). These days, we have a variety of other CRDT flavours: delta-enabled, pure op-based, log-structured (RON) etc. While working with an eventually-consistent data store, we had to mix those different types, still obeying their properties and requirements.

The solution was to mark each type according to its properties. So, we adopted a formal model.

Either it is a CRDTs, CvRDT, CmRDT, a RON frame, or any other piece of data, be it serializable, eventually consistent, causally consistent or anything else, we may see it as a chain of operations: s=oabc... where s is the resulting state, o is the zero/empty/default state and a, b, c is a chain of updates. Using the verbose notation, the same renders as: s = (((o*a)*b)*c) It does not really matter whether “updates” themselves are ops, state snapshots or deltas/patches. These are pieces of data, operands. And the data type is the * operation.

Algebraically, an operation may have properties, such as:

  • Associative
  • Commutative
  • Idempotent
  • Distributed

Some may wonder about “distributed” and whether it was picked for the nice acronym. Originally, “ACID 2.0” was a tongue-in-cheek term by Pat Helland in a distributed systems article named “Building on Quicksand”. But, eventually I needed an algebraic property that was not in the textbook, so it got that name. Let’s call our system ACID’.

ACID’ properties

So, let’s discuss our four properties:

Associativity abc=(ab)c=a(bc) Alone, associativity gives us the ability to aggregate (compact) random spans of data. Suppose, we have a chain of updates bc to some data a we don’t have. Still, we can compact bc to apply the result later, because a(bc)=abc. Practically that means we can do patches.

Commutativity ab=ba Alone, algebraic commutativity is a rather weak feature. People’s intuition about algebraic commutativity is often off-mark cause most textbook examples of commutative operations are also associative. In fact, given (ab)c we can only flip operands in both of the operations. Either a with b or ab with c. That gives us four equal permutations: abc=bca=cba=acb Still, there are 3! six permutations total. The remaining two permutations bac and cab may not be equal to the other four. Only with associativity, commutativity gives us equality of all permutations.

Idempotency aa=a Again, idempotency alone does not give us universal tolerance of repeated updates. But, in combination with other features, it does. Obviously, it is difficult to patch up data loss, but at least data duplication can’t mess things up.

Distributedness (ab)c = (ac)b for concurrent c, b. This feature indicates whether an operation is tolerant to partial (causal) order. For example, in some datatype a “delete” operation must address a symbol introduced by an “insert” operation earlier. These operations are causally related. We might not be able to apply the delete before the insert. On the other hand, if two inserts happened at different replicas concurrently, they are not causally related. If we can apply those inserts in any order, then the data type is OK with partial causal ordering. Essentialy, this feature separates “single-writer” from “distributed”.

Who is who

At this point, we may look at the zoo of available replicated datatypes to see who is who. We have 4 binary qualities, hence 16 combinations: 0, D, I, IDACID. Full commutativity AC also causes partial order tolerance D, so ACI=ACID, AC=ACD. We must be able to sort all the replicated data types into 14 bins.

The more features an RDT has, the more careless the sync system can be. Full-ACID datatypes need no care whatsoever: as long as all the necessary information reaches us, we’re good.

For example, Last-Write-Wins is ACID, so Cassandra can be quite anarchic on the inside. That also applies to Cassandra’s advanced data types, such as maps and sets. I am not sure about lists though.

Op-based CRDTs (CmRDT) are required to be D. Although, some particular CmRDTs may be ID orAID. Anything that runs on top of a causal broadcast has to end in D.

State-based CRDTs (CvRDT) are full ACID (like those Riak used).

Although full ACID may look like our Holy Grail, there is always the other side of the coin. The general rule is that full-featured RDTs are more expensive, while less-featured types have higher requirements to the delivery system.

If we consider MySQL or virtually any classic SQL database, they have none of the ACID’ properties. Hence, their replication system needs to deliver all the changes carefully, exactly once and in the same exact order. On the good side, MySQL needs very thin metadata, compared to Cassandra or anything CRDT-based.

While classic databases score 0 on the ACID’ scale, I should probably reserve the value of -1 for one data sync system based on Operational Transforms. Because of the way its OT mechanics worked, even minor glitches messed up the entire database through offset corruption. That was probably the worst case I observed in the wild. Some may build on quicksand, others need solid bedrock… but that system needed a diamond plate to stay still.

Another interesting case is git, the source code management system. Technically it is ID if only we consider manual (human) interference as a valid merge algorithm. Otherwise, it is probably I. If we count both human interference AND plain patches as legal means then git could be AI/AID. Except, plain patches forfeit some essential metadata.

Conclusion

This ACID’ classification gives us some practical means of labeling Replicated Data Types for use in distributed environments.