June 24, 2010

A Storage Nightmare

Recently I came across the following storage setup for a virtual server, and it instantly gave me a headache:

  1. An enterprise storage system abstracts away almost everything and presents a virtual LUN as a simple SCSI device on the SAN.
  2. VMWare binds the LUN and formats it with VMFS, its proprietary cluster filesystem.
  3. Inside the VMFS, a single VMDK file is created that uses all the space.
  4. The VMDK file is attached as a virtual disk to the guest system.
  5. Inside Linux, the disk is partitioned into one primary partition, which is formatted as an LVM physical volume. It belongs to one volume group, from which multiple logical volumes are created.
  6. The logical volumes are formatted with ext3.

To summarize: data from the application passes through two filesystem layers (ext3, VMFS) and three virtualization layers (LVM, VMWare, storage system) before it finally reaches a disk.
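For steps 5 and 6, the Linux side of such a stack is typically built with something along these lines (device names, volume names and sizes here are illustrative, not taken from the actual system):

    # step 5: turn the partition on the virtual disk into an LVM physical volume
    pvcreate /dev/sda1
    vgcreate vg_data /dev/sda1
    lvcreate -n lv_app -L 50G vg_data
    # step 6: format the logical volume with ext3 and mount it
    mkfs.ext3 /dev/vg_data/lv_app
    mount /dev/vg_data/lv_app /srv/app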

As if this wasn't enough to cause a headache, I further discovered some special setups:

  • Multiple LUNs are aggregated into one device in VMWare to work around the maximum volume size limit of the storage system.
  • The Linux LVM layer stripes a logical volume over multiple LUNs that VMWare passes through directly from the SAN. Apparently the intention was to gain performance by using LUNs from different RAID groups in the storage system (a sketch of such a striped volume follows this list).
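The striping itself is a one-liner in LVM; a minimal sketch, assuming the passed-through LUNs show up as /dev/sdb and /dev/sdc in the guest:

    # the two raw LUNs passed through by VMWare (hypothetical device names)
    pvcreate /dev/sdb /dev/sdc
    vgcreate vg_fast /dev/sdb /dev/sdc
    # -i 2 stripes over two physical volumes, -I 64 sets a 64 KiB stripe size
    lvcreate -n lv_db -L 100G -i 2 -I 64 vg_fast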

Consider a simple question like "This filesystem is too slow, which volume in the storage system do we have to move to a faster tier?". To answer it you are in for quite a bit of typing: look up which device the filesystem's mount point sits on, map that logical volume to its volume group, find the physical volume behind it, note its LUN, then in VMWare map that LUN to the right virtual disk, identify the VMFS datastore holding the disk's VMDK file, and finally look up the LUN backing that VMFS. Oh my!
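At least the Linux half of that chase can be done with a few standard commands; a rough sketch, assuming the filesystem in question is mounted at /srv/app (all names hypothetical):

    # mount point -> logical volume
    df /srv/app
    # logical volume -> volume group and underlying physical volumes
    lvs -o lv_name,vg_name,devices /dev/vg_data/lv_app
    # physical volume -> SCSI device, i.e. the LUN presented by VMWare
    pvs -o pv_name,vg_name
    lsscsi    # if installed, maps /dev/sdX back to a SCSI host/target/LUN
    # from here on it is clicking through VMWare: virtual disk -> VMDK -> VMFS -> backing LUN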

The different layers in this setup surely have their reasons. The storage system hides the fact that you actually have a lot of small disks instead of one huge pool, and it gives you reliability and flexibility when migrating and expanding your data volumes. Without the VMWare layer you lose snapshot functionality. But then again, snapshots are also available in the LVM layer, and the same goes for logical volume management itself, which is present in both the storage system and LVM. This redundancy between layers, and the missing integration between them, makes the setup really hard to understand and probably hurts performance as well, though I would not want to be the one debugging the latter.

Another fundamental problem in this respect is the block layer. There are features like hot-adding a SCSI device to a running system, but all dynamic functions beyond that are implemented in layers on top of it. Wouldn't it be nice to implement, for example, dynamic resizing in the block layer? You would just resize the volume in the SAN, and the system would instantly see the additional space and add it to the filesystem. No manual work would be needed, like adding a new disk in VMWare, extending the volume group with a new physical volume, and growing the logical volume.
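For comparison, this is roughly the manual dance that such an integrated block layer would make obsolete, after the LUN has been grown on the SAN (assuming the physical volume sits directly on /dev/sdb, with hypothetical names and sizes):

    # tell the kernel that the SCSI device has grown
    echo 1 > /sys/block/sdb/device/rescan
    # grow the physical volume, the logical volume and the filesystem
    pvresize /dev/sdb
    lvextend -L +20G /dev/vg_data/lv_app
    resize2fs /dev/vg_data/lv_app    # ext3 can be grown while mounted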

ZFS got the integration of layers right. Inside a pool there is no static partitioning that pre-allocates blocks to a filesystem. Resizing volumes is merely a matter of setting quotas and reservations, which are completely dynamic. But besides working in the direction of applications, integrating network filesystems like Lustre and databases into ZFS, it is also interesting to go in the other direction and integrate with, or enhance, the block layer. One way would be better integration with existing SCSI and Fibre Channel protocols, to cooperate with existing SAN solutions. A more interesting way, though, would be to eliminate the classic SAN and let ZFS do all the work. What ZFS currently lacks is fine-grained control over how data is arranged on disks: for example, defining tiers with different speeds and moving filesystems dynamically between them. And then there are all the pretty things that come with storage networks: multipathing, replication, server-based RAIDs, clustering, et cetera, et cetera.
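With ZFS, by contrast, "resizing" a filesystem inside a pool is just a property change; a minimal sketch with made-up pool, disk and dataset names:

    # create a pool over two mirrored disks and carve out a filesystem
    zpool create tank mirror c1t0d0 c1t1d0
    zfs create tank/app
    # no static sizes: the limits are dynamic quotas and reservations
    zfs set quota=200G tank/app
    zfs set reservation=50G tank/app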
