September 24, 2010

ZFS Wishlist

Create a new data set from existing data online

I often start with one data set per pool and create more data sets as the system evolves. To split an existing directory off into its own data set, I first mount the new data set on a temporary path, move all the data over, remove the old directory, and then remount the new data set at its designated path. The downside is that I cannot really do this online, since moving files generally does not work while they are open. The data also has to be physically copied, which can take some time. At first thought, it would be nice to have a single command that converts an existing folder directly into a new data set. On second thought, why does ZFS need to copy anything at all? Could it not simply relink all znodes to the new data set? I guess it also has something to do with the concept of mounting: how can you move an open file from one mount point to another ...
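The manual procedure described above can be written down as a short command plan. The following Python sketch only builds the commands as strings and does not run anything. The data set name, the paths, and the function name are hypothetical, and using rsync for the copy step is my assumption; any tool that copies the whole tree would do.

```python
import shlex


def dataset_split_plan(new_dataset, directory, tmp_mount):
    """Return the shell commands for the manual split, in order.

    Nothing is executed here; the caller decides what to do with the plan.
    All names passed in are placeholders chosen by the caller.
    """
    q = shlex.quote
    directory = directory.rstrip("/")
    tmp_mount = tmp_mount.rstrip("/")
    return [
        # 1. Create the new data set, mounted on a temporary path.
        f"zfs create -o mountpoint={q(tmp_mount)} {q(new_dataset)}",
        # 2. Copy the data (assumption: rsync is available on the system).
        f"rsync -a {q(directory + '/')} {q(tmp_mount + '/')}",
        # 3. Remove the old directory. Files still open keep their old inode.
        f"rm -rf {q(directory)}",
        # 4. Remount the new data set at its designated path.
        f"zfs set mountpoint={q(directory)} {q(new_dataset)}",
    ]


if __name__ == "__main__":
    for command in dataset_split_plan("tank/db", "/tank/db", "/mnt/tmp_db"):
        print(command)
```

Steps 2 and 3 are exactly the part that cannot be done online: between the copy and the remount, any process that still has files open in the old directory keeps working on the old copy.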

Control over data allocation on vdevs

This is something I always liked in LVM, although it also makes LVM more complicated and perhaps more error-prone than ZFS. In LVM you can specify exactly which physical extents are allocated to which logical volume. In ZFS you can, for example, replace single disks via a RAID rebuild or add vdevs to a pool, but space is allocated to data sets automatically and this cannot be changed manually. And why should you have to create a new pool just to control which data lies on which vdev? Having more than one pool always costs you some flexibility, and moving or splitting data between pools always causes problems. A related topic is changing a RAID layout dynamically. One not so far-fetched example is extending a mirror with additional disks into a RAIDZ for more capacity. Another is shrinking a pool. A further interesting idea is the use of different tiers within a pool. There could, for example, be one vdev with large but slow SATA disks and a second with small but fast SSDs. The database should live on the fast tier, while the software repository mirror is better suited to the large-capacity tier.

Integration of additional layers

One feature that has always been present is the ability to create a block device (a volume) instead of a data set. It is basically a data set without the ZFS filesystem layer. The volume can then be formatted with an arbitrary filesystem type. This is mostly done for swap. But there is no need to stop there. There are two examples where work is already underway. One is the integration of a database layer, so that the database server uses ZFS directly. After all, is it really ideal to create a file in the filesystem (which is itself a simple form of database) and then create a MySQL database inside this file, effectively building a database within a database? The other example is networking. In Lustre 3.0 it should be possible to use ZFS as a backend. Put more plainly, the data you put on the network mount is distributed by Lustre to multiple servers, which store it directly in their ZFS pools.

Update data on properties change

Like a defragmentation or RAID rebuild that runs in the background, it would be nice to have the option to apply property changes to existing data. I am thinking of the case where a pool runs out of space: you enable compression for a data set, let the conversion run in the background, and over time some space becomes free again.
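As it stands, a changed property such as compression only affects blocks written after the change, so existing data has to be rewritten by hand. A minimal sketch of such a manual rewrite in Python follows, assuming compression was already enabled with something like `zfs set compression=on tank/data` (a placeholder data set name). The function names are my own, and this is an illustration of the workaround, not a ZFS feature.

```python
import os
import shutil
import tempfile


def rewrite_file(path):
    """Rewrite one file by copying it and renaming the copy over the original.

    The copy is written as new blocks, so the current data set properties
    apply to it. Ownership and hard links are NOT preserved by this sketch.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        shutil.copy2(path, tmp_path)  # data plus mode and timestamps
        os.replace(tmp_path, path)    # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def rewrite_tree(root):
    """Rewrite every regular file below root and return how many were done."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files
            rewrite_file(full)
            count += 1
    return count
```

This shows why a built-in option would be better. A process that has a file open keeps using the old copy, the rewrite temporarily needs extra space on a pool that is already full, and blocks referenced by snapshots are not freed until those snapshots are destroyed.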
