June 4, 2014

On Google Spanner

Some time ago, when I heard about the new Spanner database from Google, I read parts of their official paper and watched the OSDI presentation. But somehow it did not answer my burning questions: What problems does it solve? Where is it better than other databases? Some time later I found another presentation, and that one cleared things up for me quite a bit, since it was a lot more about why things are done and less about how they are done.

(No)SQL

First there was SQL and it was good. Then data grew and things got complicated. Classical SQL is often what you want as a programmer. It gives you ACID transactions, and with them strong reliability guarantees for your data. But these guarantees get difficult to uphold when data grows beyond the limits of what one running database instance can handle. And when you scale out, the slow network between instances comes into play. And with many instances comes an increased probability that parts of the system will fail, so you want to have your data replicated.

So you relax some of the strong SQL guarantees to make it easier to scale. That is by far not the only change in NoSQL, but a particularly interesting one in my opinion. In classical SQL, when someone wants to write some data, that data is locked for everyone else: other writers and also readers. If these locks are on data distributed over multiple instances and the replication is synchronous, everything gets very slow. In NoSQL one approach is to just forget about synchronous replication. Just write to one replica, and in time it will synchronize to the others and eventually be consistent across all replicas.

But SQL transactions are still a beloved thing, and it would be nice to somehow combine them with the great scalability of NoSQL. So how does Spanner do it?

Versioning

One key idea is that writers do not block readers, because readers can still read an old version of the data. When the write is complete, replicated and consistent, it becomes the newest version of the data. (This reminds me of ZFS' copy-on-write and Clojure's immutable data structures.) That gives back a good part of the performance. And blocking writes in a synchronously replicated system seems unavoidable.
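
To make that concrete, here is a minimal sketch of the multi-versioning idea in Python (a toy, not how Spanner actually stores data): every write adds a new timestamped version, and a reader at a given snapshot timestamp just picks the newest version at or below it, so it never has to wait for an in-flight write.

    import bisect
    import time

    class VersionedStore:
        """Toy multi-version key-value store: writes never block reads."""

        def __init__(self):
            # key -> ([timestamps], [values]), both kept sorted by timestamp
            self._versions = {}

        def write(self, key, value, ts=None):
            ts = time.time() if ts is None else ts
            stamps, values = self._versions.setdefault(key, ([], []))
            i = bisect.bisect_right(stamps, ts)
            stamps.insert(i, ts)
            values.insert(i, value)
            return ts

        def read(self, key, snapshot_ts):
            """Return the newest value with a timestamp <= snapshot_ts."""
            stamps, values = self._versions.get(key, ([], []))
            i = bisect.bisect_right(stamps, snapshot_ts)
            return values[i - 1] if i else None

    store = VersionedStore()
    store.write("balance", 100)
    snapshot = time.time()        # a reader pins its snapshot here
    store.write("balance", 200)   # a later write does not disturb the reader
    print(store.read("balance", snapshot))  # -> 100, the old version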

So eventually you get a multi-versioned database; all data has a timestamp associated with it. How much does reading an old version of the data in parallel to a new one being written relax classical SQL semantics? I don't know; not really, I think. The next tricky thing is how to guarantee external consistency with distributed, versioned data. If I make transaction T1 on one server and the related transaction T2 on another one, a snapshot of the data at any time should never contain only T2. From an external view T1 was first and T2 depends on it, so whenever T2 is in the snapshot, T1 must also be there. This is not so easy to ensure. It might be easy to guarantee that T1 is written before T2, but how do you know which one was first when reading them later on?
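
A tiny illustration (with made-up numbers) of how skewed local clocks break this: T1 really happens before T2, but the server committing T1 has a clock running fast, and a snapshot read by timestamp then contains T2 without T1.

    # Made-up scenario: T1 happens first in real time, but the server
    # that commits it has a clock running 5 time units fast.
    real_time_t1, skew_a = 100, +5
    real_time_t2, skew_b = 101, 0

    ts_t1 = real_time_t1 + skew_a   # 105, the timestamp server A assigns
    ts_t2 = real_time_t2 + skew_b   # 101, the timestamp server B assigns

    snapshot_ts = 103
    snapshot = [name for name, ts in [("T1", ts_t1), ("T2", ts_t2)]
                if ts <= snapshot_ts]
    print(snapshot)  # ['T2'], T2 without T1: external consistency is broken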

Distributed Snapshots

There are two straightforward approaches to guarantee consistent distributed snapshots: either explicitly order the transactions or use timestamps. The former would mean somehow connecting distributed transactions, possibly across multiple servers and clients, with something like a counter that you pass along. This gets complicated and error-prone. Timestamps are easy to implement: each server has a clock. But not all servers have the same time; clocks are always off by some amount.
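
The pass-a-counter idea is essentially a Lamport clock: every participant keeps a counter, sends it with each message, and on receipt jumps its own counter past whatever it sees, so causally related transactions come out ordered without any wall clock. A minimal sketch:

    class LamportClock:
        """Logical counter that travels with every message/transaction."""

        def __init__(self):
            self.counter = 0

        def local_event(self):
            self.counter += 1
            return self.counter

        def receive(self, remote_counter):
            # jump past anything already seen elsewhere
            self.counter = max(self.counter, remote_counter) + 1
            return self.counter

    server_a, server_b = LamportClock(), LamportClock()
    t1 = server_a.local_event()   # T1 on server A -> counter 1
    t2 = server_b.receive(t1)     # T2 on server B, carrying T1's counter -> 2
    assert t1 < t2                # causal order preserved without real clocks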

The novelty of Google Spanner is to make time very accurate between servers and to explicitly account for the uncertainty of the clocks. The smaller the differences between the clocks, the faster the system gets.

Time

A mechanism called TrueTime guarantees very accurate clocks across distributed servers in different data centers, with very low deviation between all clocks. This is important because when some data is written in a transaction, one can calculate the amount of time to wait before the locks are released, ensuring that any following transaction on any other server will definitely get a later timestamp. That again guarantees external consistency and consistent snapshots in time.
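
As far as I understand it, the core of that waiting trick is roughly this: TrueTime returns not a single timestamp but an interval [earliest, latest] guaranteed to contain the real time. The transaction takes its commit timestamp from the latest edge and then holds its locks until the earliest edge of a fresh interval has passed that timestamp. Here is a sketch of the idea; truetime_now() and the uncertainty value are stand-ins I made up, not the real API:

    import time

    CLOCK_UNCERTAINTY = 0.007  # assumed worst-case clock error in seconds

    def truetime_now():
        """Stand-in for a TrueTime-like call: returns an interval
        (earliest, latest) guaranteed to contain the real current time."""
        t = time.time()
        return (t - CLOCK_UNCERTAINTY, t + CLOCK_UNCERTAINTY)

    def commit(apply_writes, release_locks):
        commit_ts = truetime_now()[1]   # pessimistic: latest possible time
        apply_writes(commit_ts)
        # Commit wait: keep holding the locks until even the most skewed
        # clock (within the uncertainty bound) must show a later time, so
        # every subsequent transaction anywhere gets a later timestamp.
        while truetime_now()[0] <= commit_ts:
            time.sleep(0.001)
        release_locks()
        return commit_ts

    commit(lambda ts: print("applied at", ts), lambda: print("locks released"))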

So the whole thing really seems to stand or fall with accurate time. I wonder what happens if some day a clock is completely off. Will you notice it? Can you repair the data?

April 9, 2014

Storage Statics

In a physical server, storage such as RAM and persistent block storage is treated in a rather static way by the operating system. This makes sense, since new storage is usually not added on the fly but involves a reboot while you open the chassis and plug the new hardware in. There are some expensive machines that can do it while running, but I have never seen anybody buy one. For a virtual server this does not change a lot. I still spend a lot of time scheduling a maintenance window for a server just to shut it down, bump its virtual memory and power it up again. Or trying to find out if and how I can resize that disk on the fly through the endless layers of storage protocols, block devices, volume managers, partition tables and filesystems.

Virtual Storage

So for virtual servers this static view of storage resources makes much less sense. There is no physical task involved when adding storage, and it is shared among multiple guests on the same host. But it is only natural if you know that the implementation comes from the design for physical servers. The main goals I see are:

  1. Provide additional storage to a guest when it needs it.
  2. Give back unneeded storage so other guests in need can use it.

There are a few hacks that make this easier in the virtualized world:

  • Lazily initialize storage. So you can give a guest much more than it needs now without really occupying the space, and later on it will not be necessary to reboot it for more storage. Of course, once the kernel sees all that free RAM it will gladly use it as I/O cache and fill it up. And your not-so-lazily-initializing filesystem will take a very long time writing all that metadata at mkfs time for the mega-sized disk. To be fair, I/O caching is tunable and there are lazily initializing filesystems. But once the storage is written it is initialized, and there is no more chance to free it. There is no giving back of free RAM or unused blocks on disk, and the host system cannot know from the outside which blocks are actually free. There is something like memory ballooning, but that looks a bit hacky to me. So this somehow works around goal 1 and does nothing for goal 2. (See the sparse-file sketch after this list.)
  • Hot add/grow storage. Seems to work fairly well these days. For memory this works fine. For block storage you still have to kick the OS a bit so that it sees the new blocks and resizes the volumes and filesystems. Goal 1 achieved. But giving back storage is still a no-go; I have never seen a hot remove.
  • Deduplication. Does not really achieve either of the two goals. But it helps: you can overcommit storage on the host and give the guests more resources in the first place, so you do not have to worry about adding more later on.
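
To illustrate the lazy-initialization point from the first bullet, here is a small Python sketch using a sparse file on a Unix filesystem that supports them: the image claims its full size immediately but only occupies blocks on the backing storage once data is actually written, which is roughly what thin-provisioned disk images do.

    import os

    path = "disk.img"
    with open(path, "wb") as f:
        f.truncate(10 * 1024**3)   # claim 10 GiB without writing anything

    def usage(p):
        st = os.stat(p)
        return st.st_size, st.st_blocks * 512  # apparent size vs. allocated bytes

    print(usage(path))             # roughly (10737418240, 0): nothing allocated yet

    with open(path, "r+b") as f:
        f.seek(4096)
        f.write(b"x" * 4096)       # touching a block forces real allocation

    print(usage(path))             # allocated bytes grow only by what was written
    os.remove(path)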

The malloc/free for system resources

In the same way resources are shared between processes within one system, why can one not share resources between virtual guests on a host? Reserve and guarantee certain amounts to a guest, request additional resources from the host and release them back to the host if not needed anymore. I wonder what it would take to implement this ...
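
Just to make that concrete, here is what I imagine the contract between a guest and its host could look like; everything here is made up, a hypothetical interface rather than any existing API:

    from abc import ABC, abstractmethod

    class GuestResourceAllocator(ABC):
        """Hypothetical malloc/free-style contract between guest and host."""

        @abstractmethod
        def reserve(self, resource: str, amount: int) -> None:
            """Guarantee a minimum amount (e.g. 'ram' in bytes) to this guest."""

        @abstractmethod
        def request(self, resource: str, amount: int) -> int:
            """Ask the host for extra resources; returns how much was granted."""

        @abstractmethod
        def release(self, resource: str, amount: int) -> None:
            """Hand resources that are no longer needed back to the host."""

    # A guest would then grow and shrink dynamically, e.g.:
    #   granted = allocator.request("ram", 2 * 1024**3)
    #   ... use it ...
    #   allocator.release("ram", granted)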