June 4, 2014

On Google Spanner

Some time ago, when I heard about the new Spanner database from Google, I read some of their official paper and watched the OSDI presentation. But somehow it did not answer my burning questions: What problems does it solve? Where is it better than other databases? Some time later I found another presentation, and that one cleared things up for me quite a bit, since it was much more about why things are done and less about how they are done.

(No)SQL

First there was SQL and it was good. Then data grew and things got complicated. Classical SQL is often what you want as a programmer. It gives you ACID transactions, and with them strong reliability guarantees for your data. But these become difficult to uphold when data grows beyond what a single database instance can handle. When you scale out, there is suddenly a slow network between instances. And with many instances comes an increased probability that parts of the system will fail, so you want your data replicated.

So you relax some of the strong SQL guarantees to make it easier to scale. This is by far not the only change in NoSQL, but a particularly interesting one in my opinion. In classical SQL, when someone wants to write some data, that data is locked for everyone else, other writers and readers alike. If these locks span data distributed over multiple instances and the replication is synchronous, everything gets very slow. In NoSQL, one approach is to simply drop synchronous replication: write to one replica, and over time the write propagates to the others until it is eventually consistent across all replicas.
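
To make that trade-off concrete, here is a minimal sketch of asynchronous replication in Python. It is my own toy illustration, not the design of any particular NoSQL system: a write is acknowledged once a single replica has it, a background thread ships it to the others, and until that happens reads from the other replicas return stale data.

    import queue
    import threading
    import time

    # Three in-memory replicas; writes land on the first one and are shipped
    # to the others asynchronously through a replication log.
    replicas = [{}, {}, {}]
    replication_log = queue.Queue()

    def write(key, value):
        replicas[0][key] = value            # acknowledged once one replica has it
        replication_log.put((key, value))   # the rest catch up eventually

    def replicate_forever():
        while True:
            key, value = replication_log.get()
            time.sleep(0.1)                 # simulated replication lag
            for replica in replicas[1:]:
                replica[key] = value

    threading.Thread(target=replicate_forever, daemon=True).start()

    write("x", 1)
    print(replicas[0].get("x"), replicas[2].get("x"))  # 1 None -> stale read
    time.sleep(0.2)
    print(replicas[2].get("x"))                        # 1      -> eventually consistent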

But SQL transactions are still a beloved thing, and it would be nice to somehow combine them with the great scalability of NoSQL. So how does Spanner do it?

Versioning

One key idea is that writers do not block readers, because readers can still read an old version of the data. When the write is complete, replicated and consistent, it becomes the newest version of the data. (This reminds me of ZFS' copy-on-write and Clojure's immutable data structures.) That wins back a good part of the performance, and blocking writes in a synchronously replicated system seems unavoidable anyway.
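
Here is a minimal sketch of that multi-versioning idea, using a simple in-memory key-value store of my own invention (nothing here is Spanner's actual API): every write appends a new version with a timestamp, and a read picks the latest version at or before its chosen timestamp, so it never has to wait for a writer.

    import threading
    import time

    class VersionedStore:
        def __init__(self):
            self._versions = {}                  # key -> list of (timestamp, value)
            self._write_lock = threading.Lock()  # writers block other writers only

        def write(self, key, value):
            with self._write_lock:
                ts = time.time()
                self._versions.setdefault(key, []).append((ts, value))
            return ts

        def read(self, key, ts=None):
            ts = ts if ts is not None else time.time()
            # Readers never take the write lock: they just pick the newest
            # version at or before their timestamp.
            versions = [(t, v) for (t, v) in self._versions.get(key, []) if t <= ts]
            return max(versions)[1] if versions else None

    store = VersionedStore()
    t1 = store.write("balance", 100)
    store.write("balance", 80)
    print(store.read("balance", ts=t1))  # 100 -- reading an old snapshot
    print(store.read("balance"))         # 80  -- reading the newest version

In a real system the new version would only become visible once the write is fully replicated and committed; the point here is just that readers select a version by timestamp instead of waiting on locks.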

So you end up with a multi-versioned database: all data has a timestamp associated with it. How much does reading an old version of the data in parallel to a new one being written relax classical SQL semantics? I don't know. Not much, I think. The next tricky thing is how to guarantee external consistency with distributed, versioned data. If I make transaction T1 on one server and a related transaction T2 on another, a snapshot of the data at any point in time should never contain only T2. From an external point of view T1 came first and T2 depends on it, so whenever T2 is in the snapshot, T1 must also be there. This is not so easy to ensure. It might be easy to guarantee that T1 is written before T2, but how do you know which one came first when reading them later on?
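
A toy example of my own (not taken from the paper) shows how ordinary server clocks break this. Suppose server B's clock runs 5 ms behind server A's: T2 can commit after T1 in real time and still get the earlier timestamp, and a snapshot read in between then contains T2 without T1.

    # Server A reports true time, server B is 5 ms behind.
    clock_a = lambda real_time: real_time
    clock_b = lambda real_time: real_time - 0.005

    t1_commit = clock_a(100.000)   # T1 commits on A at timestamp 100.000
    t2_commit = clock_b(100.002)   # 2 ms later T2 (which saw T1) commits on B,
                                   # but gets timestamp 99.997

    snapshot_ts = 99.999
    print(t2_commit <= snapshot_ts)  # True  -> T2 is in the snapshot
    print(t1_commit <= snapshot_ts)  # False -> T1 is not, although T2 depends on it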

Distributed Snapshots

There are two obvious approaches to guarantee consistent distributed snapshots: either explicitly order the transactions, or use timestamps. The former would mean somehow connecting distributed transactions, possibly across multiple servers and clients, with something like a counter that is passed along. This gets complicated and error-prone. Timestamps are easy to implement since each server already has a clock, but no two servers have exactly the same time; clocks are always off by some amount.
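
The "counter that is passed along" idea is essentially a Lamport clock. Here is a generic sketch of it (my own illustration, not a mechanism from the Spanner paper), which also shows why it is awkward: the counter has to be threaded through every server and client interaction.

    class LamportClock:
        def __init__(self):
            self.counter = 0

        def tick(self):
            # A local event, e.g. committing a transaction.
            self.counter += 1
            return self.counter

        def receive(self, remote_counter):
            # Merge in a counter that was passed along from another participant.
            self.counter = max(self.counter, remote_counter) + 1
            return self.counter

    server_a, server_b = LamportClock(), LamportClock()
    t1 = server_a.tick()        # T1 commits on A
    t2 = server_b.receive(t1)   # the client carries A's counter to B, so T2 > T1
    print(t1, t2)               # 1 2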

The novelty of Google Spanner is to make time very accurate between servers and to explicitly account for the remaining uncertainty of the clocks. The smaller the differences between the clocks, the faster the system gets.

Time

A mechanism called TrueTime guarantees very accurate clocks across distributed servers in different data centers, with very low deviation between all clocks. This is important because when some data is written in a transaction, one can calculate how long to wait before releasing the locks, so that any following transaction on any other server will definitely get a later timestamp. That in turn guarantees external consistency and consistent snapshots in time.
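
Here is a minimal sketch of that waiting idea, modeled on the interval notion from the paper but with names and a constant that are my own simplifications, not the real API: the clock call returns an uncertainty interval instead of a single instant, the commit timestamp is taken from the late edge of that interval, and the locks are held until the timestamp is guaranteed to be in the past on every clock.

    import time
    from collections import namedtuple

    TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

    EPSILON = 0.004  # assumed clock uncertainty bound in seconds (placeholder value)

    def tt_now():
        # TrueTime-style call: an interval guaranteed to contain the true time.
        t = time.time()
        return TTInterval(t - EPSILON, t + EPSILON)

    def commit(apply_write, release_locks):
        commit_ts = tt_now().latest            # a timestamp at or after the true time
        apply_write(commit_ts)
        while tt_now().earliest <= commit_ts:  # the "commit wait"
            time.sleep(0.001)
        release_locks()                        # commit_ts is now definitely in the past,
        return commit_ts                       # so any later transaction gets a later one

The wait is roughly twice the uncertainty bound, which is why the system gets faster the tighter the clocks are synchronized.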

So the whole thing really seems to stand and fall with accurate time. I wonder what happens if some day a clock is completely off. Will you notice it? Can you repair the data?