ZFS may be Yet Another File System. To me, ZFS is Ze File System brought by Sun Solaris. It is remarkable because it comes with numerous interesting features. To name a few: redundancy, checksum, compression, deduplication, snapshots.
This article is a kind of cheat sheet on what ZFS is, how it is organized and how to build a consistent storage area using this particular file system. It is mostly an attempt to clarify and write down what I read about ZFS and how it has to be used to achieve various storage goals.
About storage and file system
The most common way of storing data is to get some storage hardware, present it to some operating system and organize the data from there.
Most people have a single internal disk drive with Windows installed on their PC, running over NTFS. Sometimes, they plug removable USB storage in and out, still managing their data with NTFS.
On the industry side, you often see several disks plugged into a customized PC, named a server, through a dedicated piece of hardware that enables redundancy: a RAID controller. In some rare cases, the RAID is done on the software side. Anyway, at some point, the storage is seen by the operating system as a global protected area in which the data is formatted in a dedicated manner: the file system. On a Windows Server, you’ll get an NTFS partition. On Linux, you’ll get an ext3 or ext4 partition. On OpenBSD, you’ll get a UFS slice. Etc.
There are cases when the storage zone is not attached inside the server. Sometimes, a huge set of disks is presented to a large number of servers in an independent manner, either unformatted (Fibre Channel, iSCSI…) through a storage network (SAN) or externally directly attached (DAS). Sometimes, it is presented through the network in an already organized manner (NAS) via a network filesystem (NFS, CIFS…).
ZFS is a kind of mix. It provides redundancy, which RAID provides, and ensures disk failure doesn’t corrupt the data. It provides compression, which some third-party software provides, in a transparent on-the-fly manner so that applications using the storage don’t have to deal with it. It provides deduplication so that redundant stored data is organized to use less disk space. It provides snapshots and replication, which SAN arrays provide, so that data states can be kept and stored on some external data space. There is no NAS feature in ZFS. But since your operating system can publish the ZFS volumes over the network, you get that feature too.
The ZFS way of managing data
First of all, you’ll need an operating system that knows about ZFS. You can check Solaris, FreeBSD or (probably) most Linux distributions. In my example, I’m going to use Solaris 11 64-bit in trial mode. The OS is booted in VMware Fusion using 3 extra virtual disks of 20GB each.
Should you wonder which ZFS features your system supports, you can run:
    # zpool upgrade -v
    This system is currently running ZFS pool version 31.
    (...)
    For more information on a particular version, including supported releases, see the ZFS Administration Guide.
Regarding the file system, ZFS is like onions… Onions have layers.
ZFS Storage Pool
The pool is the first layer of ZFS. It groups the most basic components of storage: disks. Whether they are real disks, slices or files, a pool will be defined on top of a group of such objects to create a redundant storage space.
If you know about RAID systems, think about pools as RAID groups.
Pool creation is just a matter of selecting a bunch of disks and grouping them in your preferred manner. First of all, get the list of the available disks:
    # format
    Searching for disks...done
    AVAILABLE DISK SELECTIONS:
           0. c8t0d0
              /pci@0,0/pci15ad,1976@10/sd@0,0
           1. c8t1d0
              /pci@0,0/pci15ad,1976@10/sd@1,0
           2. c8t2d0
              /pci@0,0/pci15ad,1976@10/sd@2,0
           3. c8t3d0
              /pci@0,0/pci15ad,1976@10/sd@3,0
           4. c8t4d0
              /pci@0,0/pci15ad,1976@10/sd@4,0
           5. c8t5d0
              /pci@0,0/pci15ad,1976@10/sd@5,0
    Specify disk (enter its number): ^D
You can concatenate every disk into a single pool (RAID-0 like):
    # sudo zpool create pool0 c8t1d0 c8t2d0 c8t3d0
    # zpool list
    NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
    pool0  59,6G    91K  59,6G   0%  1.00x  ONLINE  -
    (...)
    # zpool status pool0
      pool: pool0
     state: ONLINE
      scan: none requested
    config:
            NAME      STATE     READ WRITE CKSUM
            pool0     ONLINE       0     0     0
              c8t1d0  ONLINE       0     0     0
              c8t2d0  ONLINE       0     0     0
              c8t3d0  ONLINE       0     0     0
    errors: No known data errors
    # zpool destroy pool0
You can secure the pool using mirrored disks at the cost of “losing” half the disk set space (RAID-1 or RAID-10 like):
    # zpool create pool1 mirror c8t1d0 c8t2d0
    # zpool list
    NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
    pool1  19,9G    97K  19,9G   0%  1.00x  ONLINE  -
    (...)
    # zpool status pool1
      pool: pool1
     state: ONLINE
      scan: none requested
    config:
            NAME        STATE     READ WRITE CKSUM
            pool1       ONLINE       0     0     0
              mirror-0  ONLINE       0     0     0
                c8t1d0  ONLINE       0     0     0
                c8t2d0  ONLINE       0     0     0
    errors: No known data errors
    # zpool destroy pool1
You can finally get a big secured storage array using RAID-Z. Basically RAID-Z is an improved RAID-5 system that solves the “write hole” issue. RAID-Z2 and RAID-Z3 add a second and a third parity disk, so the pool survives two or three disk failures at the cost of usable space:
    # zpool create puddle raidz c8t1d0 c8t2d0 c8t3d0
    # zpool list
    NAME     SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
    puddle  59,5G   174K  59,5G   0%  1.00x  ONLINE  -
    (...)
    # zpool status puddle
      pool: puddle
     state: ONLINE
      scan: none requested
    config:
            NAME        STATE     READ WRITE CKSUM
            puddle      ONLINE       0     0     0
              raidz1-0  ONLINE       0     0     0
                c8t1d0  ONLINE       0     0     0
                c8t2d0  ONLINE       0     0     0
                c8t3d0  ONLINE       0     0     0
    errors: No known data errors
    # zpool destroy puddle
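To get a feeling for what parity costs, here is a rough rule of thumb as a small shell sketch: usable space is about the number of disks minus the number of parity disks, times the disk size. The function name is mine (it is not a ZFS tool) and the result ignores metadata and padding overhead, so take it as an estimate only.

```shell
# Hypothetical helper (not part of ZFS): rough usable capacity of one RAID-Z group.
# usage: raidz_usable_gb <number_of_disks> <disk_size_gb> <parity: 1, 2 or 3>
raidz_usable_gb() {
    disks=$1; size_gb=$2; parity=$3
    # A RAID-Z group needs more disks than parity devices
    if [ "$disks" -le "$parity" ]; then
        echo "need more than $parity disks" >&2
        return 1
    fi
    # Parity consumes the equivalent of <parity> disks
    echo $(( (disks - parity) * size_gb ))
}

raidz_usable_gb 3 20 1   # the 3 x 20GB raidz pool of this article
raidz_usable_gb 6 20 2   # a hypothetical 6-disk RAID-Z2 group
```

Note that "zpool list" reports the raw size, parity included, which is why it shows 59,5G for the three 20GB disks; the space actually available to file systems is closer to the estimate.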
There are many options for pools. You can improve a pool’s reliability with spare disks and its performance with log or cache disks. A nice thing to do is to get really fast storage, like NVRAM or SSD, for log and caching. I’m not sure how much is required to get a decent performance boost. But if you look at WD Enterprise-class disks, they have 32MB of cache for 2TB of data. So I guess getting an extra 16GB of RAM or a 64GB SSD coupled with 10Krpm disks, rather than “only” 15Krpm disks, would be a nice deal.
The cache is used to accelerate read operations:
    # zpool add puddle cache c8t4d0
    # zpool status puddle
      pool: puddle
     state: ONLINE
      scan: none requested
    config:
            NAME        STATE     READ WRITE CKSUM
            puddle      ONLINE       0     0     0
              raidz1-0  ONLINE       0     0     0
                c8t1d0  ONLINE       0     0     0
                c8t2d0  ONLINE       0     0     0
                c8t3d0  ONLINE       0     0     0
            cache
              c8t4d0    ONLINE       0     0     0
    errors: No known data errors
The log is used to acknowledge write operations quickly. This enables faster synchronous write operations:
    # zpool add puddle log c8t5d0
    # zpool status puddle
      pool: puddle
     state: ONLINE
      scan: none requested
    config:
            NAME        STATE     READ WRITE CKSUM
            puddle      ONLINE       0     0     0
              raidz1-0  ONLINE       0     0     0
                c8t1d0  ONLINE       0     0     0
                c8t2d0  ONLINE       0     0     0
                c8t3d0  ONLINE       0     0     0
            logs
              c8t5d0    ONLINE       0     0     0
            cache
              c8t4d0    ONLINE       0     0     0
    errors: No known data errors
Should you require a lot of space for log operations, you may use a mirror of cheap disks rather than a single expensive SSD. Or use protected SSD stripes for massive performance needs.
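A mirrored log could be added along these lines. This is only a sketch: it assumes no log device has been added to the pool yet, and "c8t6d0" is a hypothetical extra disk that my test VM doesn’t have. The "run" wrapper is mine; it prints the commands unless you set DRY_RUN=0.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# c8t6d0 is a hypothetical disk; the article's VM only has c8t0d0 to c8t5d0.
run zpool add puddle log mirror c8t5d0 c8t6d0
run zpool status puddle
```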
Finally, let’s have a look at all the available options of the pool:
    # zpool get all puddle
    NAME    PROPERTY       VALUE                 SOURCE
    puddle  size           59,5G                 -
    puddle  capacity       0%                    -
    puddle  altroot        -                     default
    puddle  health         ONLINE                -
    puddle  guid           10123729866998312523  default
    puddle  version        31                    default
    puddle  bootfs         -                     default
    puddle  delegation     on                    default
    puddle  autoreplace    off                   default
    puddle  cachefile      -                     default
    puddle  failmode       wait                  default
    puddle  listsnapshots  off                   default
    puddle  autoexpand     off                   default
    puddle  dedupditto     0                     default
    puddle  dedupratio     1.00x                 -
    puddle  free           59,5G                 -
    puddle  allocated      184K                  -
    puddle  readonly       off                   -
A pool also comes with a root ZFS file system, usable as-is. So you can get file system properties too:
    # zfs get all puddle
    NAME    PROPERTY       VALUE                    SOURCE
    puddle  type           filesystem               -
    puddle  creation       lun. oct. 10 11:12 2011  -
    puddle  used           123K                     -
    puddle  available      39,0G                    -
    puddle  referenced     34,6K                    -
    puddle  compressratio  1.00x                    -
    puddle  mounted        yes                      -
    puddle  quota          none                     default
    puddle  reservation    none                     default
    puddle  recordsize     128K                     default
    puddle  mountpoint     /puddle                  default
    puddle  sharenfs       off                      default
    puddle  checksum       on                       default
    (...)
In the ZFS literature, you’ll often read about “datasets”; and it was quite an opaque layer to me.
I didn’t get what this would exactly refer to. In Sun’s documentation (not page 3 ;-), you can read “A generic name for the following ZFS entities: clones, file systems, snapshots, or volumes”. I ended up thinking of datasets as abstract ZFS storage objects: either a pool, a file system, a clone or a snapshot. To me, “a dataset” would mean “a ZFS object”.
In ZFS, a filesystem is a subsystem that inherits properties from its pool and can override some particular ones. A ZFS pool can contain multiple filesystems. Each filesystem is independent from the others and can share or use different options. A filesystem can deal with quotas, compression, deduplication, mount points… and can be shared over the network. By default, a filesystem is automatically mounted under the mount point of its pool.
Creation of a filesystem is quite straightforward:
    # zfs create puddle/mudd
    # zfs list -r puddle
    NAME          USED  AVAIL  REFER  MOUNTPOINT
    puddle        170K  39,0G  36,0K  /puddle
    puddle/mudd  34,6K  39,0G  34,6K  /puddle/mudd
    # zfs destroy puddle/mudd
Configuring features is a matter of (un)setting parameters. Some parameters, like compression and deduplication, don’t apply to already existing data. For example, if you copy some data to “puddle/mudd”, then activate compression, only new data will be compressed; not the data that was present before compression was activated.
    # zfs create puddle/basic
    # zfs create -o compress=gzip puddle/compress
    # zfs create puddle/dedup
    # zfs set dedup=on puddle/dedup
    # zfs list
    NAME              USED  AVAIL  REFER  MOUNTPOINT
    puddle            265K  39,0G  38,6K  /puddle
    puddle/basic     34,6K  39,0G  34,6K  /puddle/basic
    puddle/compress  34,6K  39,0G  34,6K  /puddle/compress
    puddle/dedup     34,6K  39,0G  34,6K  /puddle/dedup
    # zfs list -o name,used,dedup,compression,dedup
    NAME              USED  DEDUP  COMPRESS  DEDUP
    puddle            265K    off       off    off
    puddle/basic     34,6K    off       off    off
    puddle/compress  34,6K    off      gzip    off
    puddle/dedup     34,6K     on       off     on
    # df -h
    Filesystem       Size  Used  Avail  Use%  Mounted on
    (...)
    puddle/basic      40G   35K    40G    1%  /puddle/basic
    puddle/compress   40G   35K    40G    1%  /puddle/compress
    puddle/dedup      40G   35K    40G    1%  /puddle/dedup
By default, a filesystem will require minimum storage and will grow up to the pool size. This means that filling a filesystem with data can prevent another filesystem on the same pool from being filled. Should you want to carefully manage your disk space, you may use the “quota” and “reservation” features of ZFS:
- “quota” ensures that a dataset won’t grow more in size than the applied quota ;
- “reservation” ensures that a dataset will get at least that particular storage size.
Note that a quota or a reservation set on a dataset also covers its sub-datasets.
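Setting those two properties could look like the following sketch. The sizes are arbitrary examples and the "run" wrapper is mine: it only prints the commands unless you set DRY_RUN=0.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Sizes below are arbitrary examples.
run zfs set quota=10G puddle/basic        # puddle/basic can never use more than 10GB
run zfs set reservation=5G puddle/dedup   # 5GB of the pool is always kept for puddle/dedup
run zfs get quota,reservation puddle/basic puddle/dedup
```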
A snapshot is a (read-only) image of a filesystem at a particular time. It happens at the filesystem layer. At any time, provided that you have enough disk space in the pool, you can keep a consistent copy of the filesystem and apply modifications to it that can be rolled back.
Create a snapshot from a dataset:
    # zfs list puddle/data
    NAME          USED  AVAIL  REFER  MOUNTPOINT
    puddle/data  19,7M  12,8G  19,7M  /puddle/data
    # zfs snapshot puddle/data@1632
    # zfs list -t snapshot
    NAME              USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1632     0      -  19,7M  -
From here, the snapshot uses no additional data since no modification has been done.
If you add data to the filesystem, you can see that both datasets are changed:
    # zfs list puddle/data
    NAME          USED  AVAIL  REFER  MOUNTPOINT
    puddle/data  20,3M  12,8G  20,3M  /puddle/data
    root@solaris:~# zfs list -t snapshot
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1632  20,6K      -  19,7M  -
The small amount of space now used by the snapshot dataset is (AFAIK) made of blocks that the live filesystem has replaced since the snapshot was taken and that only the snapshot still references; this is what allows “puddle/data” to be rolled back.
If you delete your data from your file system, you’ll see that:
    # zfs list puddle/data
    NAME          USED  AVAIL  REFER  MOUNTPOINT
    puddle/data  19,7M  12,8G  34,6K  /puddle/data
    root@solaris:~# zfs list -t snapshot
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1632  19,7M      -  19,7M  -
As you can see, the filesystem dataset still uses the space but refers to nearly nothing whereas the snapshot refers to the initial amount of data of the filesystem.
If you copy some more data to the filesystem, you’ll get:
    # zfs list puddle/data
    NAME         USED  AVAIL  REFER  MOUNTPOINT
    puddle/data  129M  12,7G   109M  /puddle/data
    # zfs list -t snapshot
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1632  19,7M      -  19,7M  -
You can see that I just added 109MB of data to puddle/data, which is what the dataset refers to. You can also see that it uses a bit more than this (129MB): the old data, still referenced by the snapshot, plus the new data. Of course, if I copy data from “/puddle/data”, only the “actual” data will be copied.
Should you want to restore a single file that exists in the snapshot, you may browse to “/puddle/data/.zfs/snapshot/1632/”, a directory named after the snapshot.
If you plan to create a nightly snapshot, you can also refer to the previous one through a generic name:
    # zfs list -t snapshot
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1632  19,7M      -  19,7M  -
    # zfs snapshot puddle/data@1656
    # zfs list -t snapshot
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1632  19,7M      -  19,7M  -
    puddle/data@1656      0      -   110M  -
    # zfs rename puddle/data@1632 puddle/data@yesterday
    # zfs list -t snapshot
    NAME                    USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@yesterday  19,7M      -  19,7M  -
    puddle/data@1656       21,3K      -   110M  -
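A nightly job built on that renaming trick could be sketched as follows. The helper and the “@today” name are my own assumptions, not something ZFS provides: the function only prints the commands, so you can review them and pipe them to "sh" when happy. On the very first nights, the destroy or rename steps will simply fail because the snapshots don’t exist yet.

```shell
# Hypothetical helper: print the commands of a nightly rotation that keeps
# two snapshots, @today and @yesterday, on the given dataset.
# Pipe the output to "sh" to actually run it.
nightly_rotation_plan() {
    dataset=$1
    echo "zfs destroy $dataset@yesterday"               # drop the oldest snapshot
    echo "zfs rename $dataset@today $dataset@yesterday" # age the current one
    echo "zfs snapshot $dataset@today"                  # take the new one
}

nightly_rotation_plan puddle/data
```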
When you have enough snapshots to go back to, you can delete old ones using:
# zfs destroy puddle/data@yesterday
Finally, to restore the whole content of the filesystem at the time of the snapshot (AKA rolling-back):
    # zfs list puddle/data
    NAME         USED  AVAIL  REFER  MOUNTPOINT
    puddle/data  129M  12,7G   129M  /puddle/data
    # zfs list -t snapshot
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1701  28,0K      -   110M  -
    # zfs rollback puddle/data@1701
    # zfs list -t snapshot
    NAME               USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@1701  1,33K      -   110M  -
    # zfs list puddle/data
    NAME         USED  AVAIL  REFER  MOUNTPOINT
    puddle/data  110M  12,7G   110M  /puddle/data
The data have been restored and the snapshot is kept. Should you no longer need it, you have to destroy it manually.
A clone is a (read-write) image of a filesystem at a particular time. Based on a snapshot, a clone is a filesystem that can be used as a “normal” filesystem. Its initial data are shared with its parent snapshot and can evolve independently. A clone may be used either to apply modifications that differ from the snapshotted filesystem or to access the data from usual software, like backups. A clone can be promoted so that it replaces the initial filesystem from which the snapshot was taken.
    # zfs snapshot puddle/basic@now
    # zfs clone puddle/basic@now puddle/clone
    # zfs list -r puddle
    NAME           USED  AVAIL  REFER  MOUNTPOINT
    puddle        26,3G  12,7G  40,0K  /puddle
    puddle/basic  14,0G  12,7G  14,0G  /puddle/basic
    puddle/clone  1,33K  12,7G  14,0G  /puddle/clone
    (...)
From here, “/puddle/basic” and “/puddle/clone” content is the same but may evolve independently. One could back up “puddle/clone” on tape or low-price storage to keep an image of “puddle/basic” at the time of the snapshot. One could also attach “puddle/clone” to some development system to run some tests on production data.
Cleaning happens as usual:
    # zfs destroy puddle/clone
    # zfs destroy puddle/basic@now
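If, instead of cleaning up, you want the clone to replace the original filesystem, promotion could go like the sketch below. It follows the pattern I remember from the ZFS administration guide; “puddle/basic_old” is a hypothetical temporary name, and the "run" wrapper of mine only prints the commands unless DRY_RUN=0 is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run zfs promote puddle/clone                   # the clone now owns the @now snapshot
run zfs rename puddle/basic puddle/basic_old   # move the original out of the way
run zfs rename puddle/clone puddle/basic       # the clone takes over the original name
run zfs destroy puddle/basic_old               # optional, once the old data is not needed
```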
Data transfer is usually done using standard UNIX tools and network protocols. But ZFS can also replicate its data as streams. The send/receive operations are based on snapshots and can be done either locally or through the network. The main differences with a clone are that:
- the replicated data is not linked to the snapshot anymore ;
- the replicated data can be located on some other pool.
One can send the data to some local hardware, for example a tape system:
    # zfs snapshot puddle/data@now
    # zfs send puddle/data@now > /dev/tape0
One can restore the data from the local hardware onto a non-existent dataset:
    # zfs receive puddle/restored < /dev/tape0
    # zfs list puddle/data@now puddle/restored puddle/restored@now
    NAME                 USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@now         0      -   110M  -
    puddle/restored      110M  12,5G   110M  /puddle/restored
    puddle/restored@now     0      -   110M  -
Note that the restore process created a snapshot instance that you may wish to delete.
If the destination already exists, the “-F” flag can be used to force the update.
One can duplicate or move a dataset to another local pool:
    # zfs send puddle/data@now | zfs recv rpool/data_exported
    # zfs list puddle/data puddle/data@now rpool/data_exported rpool/data_exported@now
    NAME                     USED  AVAIL  REFER  MOUNTPOINT
    puddle/data              110M  12,5G   110M  /puddle/data
    puddle/data@now             0      -   110M  -
    rpool/data_exported      110M  7,74G   110M  /rpool/data_exported
    rpool/data_exported@now     0      -   110M  -
One can send the data, through the network, to another ZFS system, no matter what the storage hardware is, as long as both ZFS versions are compatible. This can be used to secure the data at another location:
# zfs send puddle/data@now | ssh remotehost zfs recv pool/data
The previous command lets you send the whole content of the snapshot. This would initiate the disaster recovery data copy. But as time passes and modifications occur on the initial storage zone, you may want to update the remote dataset. To optimize the data transfer, you would only send the modifications made between the previous synchronization (the previous snapshot) and now (the current snapshot):
    # zfs snapshot puddle/data@monday
    # zfs send puddle/data@monday | zfs recv puddle/disaster
    (... modifications happen on /puddle/data ...)
    # zfs snapshot puddle/data@tuesday
    (... modifications happen on /puddle/data ...)
    # zfs send -i puddle/data@monday puddle/data@tuesday | zfs recv puddle/disaster
    # zfs list -t snapshot
    NAME                     USED  AVAIL  REFER  MOUNTPOINT
    puddle/data@now             0      -   110M  -
    puddle/data@monday          0      -   110M  -
    puddle/data@tuesday         0      -   787M  -
    puddle/disaster@monday   290K      -   110M  -
    puddle/disaster@tuesday     0      -   787M  -
    # zfs list puddle/data puddle/disaster
    NAME             USED  AVAIL  REFER  MOUNTPOINT
    puddle/data      787M  11,1G   787M  /puddle/data
    puddle/disaster  787M  11,1G   787M  /puddle/disaster
Send the incremental data over the network and you can replicate the data daily to another datacenter, achieving disaster data protection, provided you have enough bandwidth to transfer the data between two snapshots.
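Combining the incremental send with the ssh transport shown earlier could be sketched like this. The helper is mine, and “remotehost” and “pool/data” are placeholders for your own remote system; the function only prints the pipeline so that you can check it before piping it to "sh".

```shell
# Hypothetical helper: print the pipeline that sends the changes between two
# snapshots of a dataset to a remote ZFS system.
# usage: incremental_send_cmd <dataset> <previous_snap> <current_snap> <remote_host> <remote_dataset>
incremental_send_cmd() {
    dataset=$1; previous=$2; current=$3; remote=$4; target=$5
    echo "zfs send -i $dataset@$previous $dataset@$current | ssh $remote zfs recv $target"
}

# "remotehost" and "pool/data" are placeholders.
incremental_send_cmd puddle/data monday tuesday remotehost pool/data
```

The remote dataset must already hold the previous snapshot (here “@monday”), which is what the initial full send provides.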
The ZFS solutions
Now that we have seen how ZFS works, let’s try to match I.T. issues with ZFS solutions.
Securing storage data
When you have storage for sensitive data, you need to ensure that hardware failure won’t trash your data. The easy way to do that is to use RAID-Z. Depending on your needs, you may use the mirroring or the striping method. When using striping, remember that RAID-Z improves on the RAID-5 write penalty. When using mirroring, don’t forget that trashy data will be faithfully copied to the mirrored disk too.
To ensure losing a disk won’t impact your service and data, use RAID-Z.
Provide storage to applications
ZFS and Solaris will allow you to remotely provide storage as iSCSI, CIFS or NFS. Depending on the application that needs to access the data, you have to choose the best option. On the server itself, applications will access the data as a ZFS filesystem.
To provide storage to VMware ESX servers, export ZFS filesystems using iSCSI or NFS.
To provide storage to Windows servers, export ZFS filesystems using iSCSI or CIFS.
To provide storage to Windows workstations, export ZFS filesystems using CIFS.
To provide storage to UNIX or Linux hosts, export ZFS filesystems using iSCSI or NFS.
Freeze data state
Before upgrading software (patches, updates…) or applying modifications to data (version upgrade…), one may wish to keep the data safe. The usual way is to back up the data, apply the modifications and revert the data or delete the backup (depending on the operation’s result). With ZFS, you should make use of the snapshot feature. Provided that you have the storage available and that you can tell your application to dump a stable state of your data (to ensure the snapshot is coherent), just take a snapshot of the dataset, apply the modifications, check the results and commit or roll back.
To keep data state to be able to roll-back in a fast way, use the ZFS snapshot feature.
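As a sketch, the whole “snapshot, modify, commit or roll back” cycle could look like this. The snapshot name and the helper are my own; “commit” here simply means destroying the snapshot. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

SNAP=puddle/data@before-upgrade   # hypothetical snapshot name

run zfs snapshot "$SNAP"
# ... apply the patch or upgrade here, then check the result ...

# Call with "ok" to commit (drop the snapshot), anything else to roll back.
finish_upgrade() {
    if [ "$1" = "ok" ]; then
        run zfs destroy "$SNAP"
    else
        run zfs rollback "$SNAP"
    fi
}

finish_upgrade failed
```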
Provide independent working data sets
Either for testing or development purposes, you may have to provide access to the same data sets to various people or teams while ensuring their modifications won’t overlap each other. The usual way of doing it is to give access to the data repository that users will copy locally to apply their modifications on, hence using (too) much storage. Using ZFS, and the clone feature, you will be able to freeze a data set content and provide it to various teams so that they use it without duplicating too much storage. On the application level, you may provide the clone filesystem so that a development instance of your application runs with real production data in an isolated environment.
To provide data copy without duplicating the storage, use the ZFS clone feature.
Minimize backup impact
There are times when a backup requires stopping an application for the whole duration of the backup. Should you want to minimize the offline duration, you may want to use the snapshot and clone features of ZFS. Put your application in backup mode, take a snapshot, put the application back into business; this should take less than a minute. Then clone the snapshot and provide the clone to the backup system. Your application will continue to work while the backup system deals with the data. When the backup is done, delete the clone and the snapshot.
To minimize the impact of backup on application’s availability, combine ZFS snapshot and clone features.
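The sequence could be scripted roughly as below. Beware that “app_backup_mode” and “backup_tool” are hypothetical placeholders for your application’s own quiesce command and for your backup software; only the zfs commands are real. As written, the sketch just prints the steps (set DRY_RUN=0 to run them once the placeholders are replaced).

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# usage: backup_with_clone <dataset> <clone_name>
backup_with_clone() {
    dataset=$1                  # e.g. puddle/data
    clone=$2                    # e.g. puddle/backup
    run app_backup_mode begin   # hypothetical: make the application data coherent
    run zfs snapshot "$dataset@backup"
    run app_backup_mode end     # the application is back in business here
    run zfs clone "$dataset@backup" "$clone"
    run backup_tool "/$clone"   # hypothetical tool; assumes the default mount point
    run zfs destroy "$clone"
    run zfs destroy "$dataset@backup"
}

backup_with_clone puddle/data puddle/backup
```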
Have the data remotely secured
In the “old” times, critical data were backed up on tapes; those tapes were sent to some other location to ensure that a major disaster in the datacenter wouldn’t impact the backups. Nowadays, with a remote connected site, a secondary ZFS storage system and a well-sized network connection, you can automatically send your data from one site to the other using the ZFS replication feature. In the same manner you would run backups, make sure your data are coherent, take a snapshot of their storage dataset and replicate it to the remote ZFS secondary system. When done, delete the snapshot. There are two important things to keep in mind: the overall volume and the modification rate. The initial data replication will represent the initial total volume of data; every later replication will depend on the modification rate since the previous replication step.
Keep in mind that transferring 100MB of data over a 1Mbps network link takes about 15 minutes, whereas transferring 50GB takes more than 4 days; transferring 1TB of data takes about 2 hours on a 1Gbps link, whereas it takes about three months on a 1Mbps network line.
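Those orders of magnitude are easy to recompute for your own volumes and links. Here is a minimal sketch; the function name is mine, it uses decimal units (1MB = 8Mbit) and ignores any protocol overhead, so real transfers will be somewhat slower.

```shell
# usage: transfer_seconds <size_in_MB> <link_speed_in_Mbps>
# Decimal units (1 MB = 8 Mbit), no protocol overhead taken into account.
transfer_seconds() {
    echo $(( $1 * 8 / $2 ))
}

transfer_seconds 100 1          # 100MB over 1Mbps -> 800
transfer_seconds 50000 1        # 50GB over 1Mbps  -> 400000
transfer_seconds 1000000 1000   # 1TB over 1Gbps   -> 8000
transfer_seconds 1000000 1      # 1TB over 1Mbps   -> 8000000
```

That is roughly 13 minutes, 4.6 days, 2.2 hours and 93 days respectively.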
To remotely secure your data, use the ZFS snapshot and replication features.
Optimize data storage
Depending on the data type you’re storing and the server’s CPU, you may wish to compress and/or deduplicate the data on-line. Using those ZFS features, your mileage may vary. To store flat text files, you may use compression to achieve massive storage savings. To store virtual machine disk images, you should use deduplication. An efficient way to set up a file server for Windows or UNIX workstations is to configure a ZFS dataset with the compression option set. To provide storage to your virtualization environment, it is recommended to configure your dataset with the deduplication option set. There are no magical spells; only testing will tell you what the best options are.
To optimize storage for flat data, use the ZFS on-line compression feature.
To optimize storage for virtualized environment, use the ZFS on-line deduplication feature.
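One cheap way to test before enabling compression is to run gzip on a sample of your own files. The helper below is my own sketch, not a ZFS feature, and it is only an estimate: ZFS compresses record by record, so the real figure (the “compressratio” property of the dataset) will differ.

```shell
# Hypothetical helper: percentage of space saved by gzip on one sample file.
# usage: gzip_saving_percent <file>
gzip_saving_percent() {
    original=$(( $(wc -c < "$1") ))
    # An empty file has nothing to save
    if [ "$original" -eq 0 ]; then echo 0; return 0; fi
    compressed=$(( $(gzip -c "$1" | wc -c) ))
    echo $(( (original - compressed) * 100 / original ))
}

# Demo on a highly repetitive text file, which compresses very well.
sample=$(mktemp)
awk 'BEGIN { for (i = 0; i < 10000; i++) print "some flat text line" }' > "$sample"
gzip_saving_percent "$sample"
rm -f "$sample"
```

Already compressed data (archives, videos, most images) will show a saving close to zero, in which case compression only costs CPU.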
ZFS is quite a complete filesystem. It is really feature-rich and can be used in various situations. I hope this little ZFS tour will spark your interest in the filesystem.