How the Lustre Developer Community is Advancing ZFS as a Lustre Back-end File System

Intel


Jun 15, 2017

CPOL


Several open source projects that are being integrated into open source Lustre are designed to improve reliability, flexibility, and performance, align the enterprise-grade features built into ZFS with Lustre, and enhance functionality that eases Lustre deployments on ZFS.


Lustre, the high-performance parallel file system in nine out of ten of the world’s fastest supercomputers, has been gaining traction in enterprise IT. Lustre is popular not only as a storage foundation for scientific and technical computing, but also as the core of businesses’ converged technical and office infrastructures (see Bank of Italy). Continuing work in the open source developer community on Lustre with the Zettabyte File System (ZFS) is strengthening Lustre’s position as an enterprise solution.

Lustre has supported ZFS for a long time. However, several open source projects that are being integrated into open source Lustre are designed to improve reliability, flexibility, and performance, align the enterprise-grade features built into ZFS with Lustre, and enhance functionality that eases Lustre deployments on ZFS.

80X Faster RAIDZ3 Reconstruction

RAIDZ is the ZFS software RAID implementation. It has three levels of recovery: RAIDZ1 can recover up to a single hard drive failure; RAIDZ2 can recover up to two hard drive failures; and RAIDZ3 can recover up to three hard drive failures. RAIDZ1 and RAIDZ2 are similar to RAID5 and RAID6.

The RAIDZ level determines the processing demand of parity generation and drive reconstruction, from RAIDZ1, the simplest, to RAIDZ3, the most complex. RAIDZ uses three equations called P, Q, and R. P is used for RAIDZ1; P and Q are used in RAIDZ2; and P, Q, and R are used in RAIDZ3.[1] Reconstruction is the most compute-demanding operation, because the amount of processing depends on how many drives failed. For RAIDZ3, three drive failures require seven calculations of different combinations of P, Q, and R for each stored block in order to rebuild the data. These are equations over Galois fields. Galois field operations must be emulated in software, which makes the algorithms prime territory for vectorization.
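To make "emulated" concrete, here is a minimal Python sketch of arithmetic in a Galois field with 256 elements. Addition is a plain XOR, while multiplication has to be built from shifts and XORs. The reduction polynomial `0x11D` and the function name `gf_mul` are assumptions chosen for illustration, not taken from the ZFS patch.

```python
# Assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1.
POLY = 0x11D


def gf_add(a, b):
    """Addition in GF(2^8) is carryless, so it is a single XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiply two field elements (0..255) by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:            # lowest bit of b set: add the current multiple of a
            result ^= a
        a <<= 1              # multiply a by x
        if a & 0x100:        # overflowed 8 bits: reduce by the field polynomial
            a ^= POLY
        b >>= 1
    return result


if __name__ == "__main__":
    print(hex(gf_mul(2, 0x80)))  # doubling a byte whose top bit is set wraps around
```

The loop runs once per bit of `b`, which is why a byte-at-a-time software multiply is slow and why doing many bytes per instruction pays off.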

Gvozden Nešković, working with the Gesellschaft fuer Schwerionenforschung (GSI) and Facility for Antiproton and Ion Research (FAIR), has committed a patch to the ZFS master branch that adds a framework for vectorizing RAIDZ parity generation and reconstruction for all three RAIDZ levels. "Codes are based on the vector extensions of the binary Galois field," stated Nešković. "This is finite field mathematics with 256 elements." He had to first rewrite and optimize the equations where necessary to vectorize them.

Original equations for P, Q, and R
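The figure is not reproduced here. As a hedged reconstruction, assuming the definitions commonly given for RAIDZ (data columns D_0 through D_{n-1}, with generators 2 and 4 over the 256-element Galois field, where "+" is XOR), the three parity equations can be written as:

```latex
\begin{aligned}
P &= D_0 + D_1 + \cdots + D_{n-1} \\
Q &= 2^{n-1} D_0 + 2^{n-2} D_1 + \cdots + 2^{0} D_{n-1} \\
R &= 4^{n-1} D_0 + 4^{n-2} D_1 + \cdots + 4^{0} D_{n-1}
\end{aligned}
```

This is an assumed form based on those common definitions and may differ in notation from the original figure.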

While the RAIDZ1 equation did not need rewriting because it uses carryless addition, which is done using vector XOR operations, his rewriting of the RAIDZ2 and RAIDZ3 equations allows the computations to be done using vector operations. "Modern multi-core CPUs do not have built-in instructions for Galois Field multiplication," said Mr. Nešković, "so I had to revise and optimize the operations to take advantage of the vector instruction sets in today’s CPUs."

Modified Q and R for vectorization
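The figure is not reproduced here. One well-known rewriting that suits vector units, shown as an illustration and not necessarily the exact form Mr. Nešković used, is Horner's rule: Q becomes a chain of "multiply by 2, then XOR in the next column". Multiplying by 2 needs only a shift, a mask, and an XOR, so many bytes can be processed at once. The sketch below packs eight bytes into one 64-bit integer to stand in for a vector register. The names `mul2_packed` and `q_parity_packed` and the polynomial constant `0x1D` are assumptions.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def mul2_packed(x):
    """Multiply eight GF(2^8) bytes, packed in one 64-bit integer, by 2."""
    high = x & 0x8080808080808080                 # top bit of every byte
    # Expand each set top bit into 0xFF covering that whole byte.
    mask = ((high << 1) - (high >> 7)) & MASK64
    shifted = (x << 1) & 0xFEFEFEFEFEFEFEFE       # per-byte shift, no bit crosses bytes
    return shifted ^ (mask & 0x1D1D1D1D1D1D1D1D)  # reduce only the bytes that overflowed


def q_parity_packed(words):
    """Horner's rule: Q = (((D_0 * 2) ^ D_1) * 2 ^ D_2) ... over packed words."""
    q = 0
    for word in words:
        q = mul2_packed(q) ^ word
    return q


if __name__ == "__main__":
    print(hex(mul2_packed(0x0180)))
```

The same three operations map directly onto SSE or AVX2 registers, where each instruction handles 16 or 32 bytes instead of the eight simulated here.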

These equations can be solved directly using the compute-intensive matrix inversion method, which was done in the original implementation of RAIDZ3 reconstruction, or they can be optimized using syndromes, which is how RAIDZ2 reconstruction was done in the original implementation. Mr. Nešković used syndromes for both RAIDZ2 and RAIDZ3 in order to speed up the calculations.
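To show what a syndrome-based rebuild looks like, the following Python sketch recovers two lost data columns from P and Q, which is the RAIDZ2 case. It is a simplified illustration under assumed conventions (generator 2, polynomial `0x11D`, coefficient 2^(n-1-i) for column i), not the ZFS implementation. All function names are hypothetical.

```python
POLY = 0x11D  # assumed reduction polynomial


def gf_mul(a, b):
    """Shift-and-XOR multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result


def gf_inv(a):
    """a^254 is the multiplicative inverse in a field with 256 elements."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def coeff(i, n):
    """Assumed Q coefficient of column i out of n: 2^(n-1-i)."""
    c = 1
    for _ in range(n - 1 - i):
        c = gf_mul(c, 2)
    return c


def make_pq(cols):
    """Generate P and Q parity for equal-length data columns."""
    n, size = len(cols), len(cols[0])
    p, q = bytearray(size), bytearray(size)
    for i, col in enumerate(cols):
        g = coeff(i, n)
        for j, byte in enumerate(col):
            p[j] ^= byte
            q[j] ^= gf_mul(g, byte)
    return bytes(p), bytes(q)


def rebuild_two(cols, x, y, p, q):
    """Rebuild missing columns x and y (entries are None) from P and Q."""
    n, size = len(cols), len(p)
    p_syn, q_syn = bytearray(p), bytearray(q)
    # Syndromes: remove the surviving columns' contribution from P and Q.
    for i, col in enumerate(cols):
        if i in (x, y):
            continue
        g = coeff(i, n)
        for j, byte in enumerate(col):
            p_syn[j] ^= byte
            q_syn[j] ^= gf_mul(g, byte)
    gx, gy = coeff(x, n), coeff(y, n)
    inv = gf_inv(gx ^ gy)
    # From p_syn = Dx + Dy and q_syn = gx*Dx + gy*Dy: Dx = (q_syn + gy*p_syn) / (gx + gy)
    dx = bytes(gf_mul(inv, q_syn[j] ^ gf_mul(gy, p_syn[j])) for j in range(size))
    dy = bytes(p_syn[j] ^ dx[j] for j in range(size))
    return dx, dy
```

The syndromes reduce the problem to two equations in two unknowns, so one field inversion per rebuild replaces a general matrix inversion.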

In benchmarks, his new code returned more than an 80X speedup on RAIDZ3 reconstruction of three hard drives using the Intel® AVX2 instruction set on Intel® processors, such as the Intel® Xeon® processor E5 v4 family. Mr. Nešković’s work covered both optimizing all the equations and implementing new constructs using syndromes. The table shows speedups relative to the original, unoptimized code for his optimized scalar version and for the vectorized versions running on SSE- and Intel AVX2-enabled processors.

Used with permission.

The reduction in processing time for RAIDZ3 parity generation (or PQR parity) is easy to see in a Flame Graph (below). Using the processing time of the Fletcher4 checksum algorithm as a reference (it takes the same amount of time in the original and improved code), one can see the dramatic improvement Mr. Nešković’s work brings to RAIDZ calculation times.

"I contributed a general framework for RAIDZ parity operations, which permits easy incorporation of vector instructions. I used that as a base to implement SSE and AVX2 variants for x86. The same framework is being used as the base for other platforms," he commented.

"What accelerating RAIDZ means to GSI is that we can dedicate fewer cores to parity and reconstruction processing. If we can use only a single core on a large server for RAIDZ processing, we free up cores for other work and reduce the need for more servers. That saves costs of building and running Lustre clusters with a ZFS back end file system using RAIDZ for high availability and reliability," added Nešković. His presentation on his work at the Lustre User Group (LUG) can be found here; the video of it is here.

Creating Lustre Snapshots

A valuable feature of ZFS is the ability to create mountable, read-only snapshots of the entire file system captured at a moment in time. A snapshot gives a global view of the Lustre file system, useful for a number of purposes, from recovering files to creating backups to letting users see how the data differs over time. Until now, access to ZFS snapshots was not built into Lustre. Fan Yong of Intel’s High Performance Data Division (Intel HPDD) is adding ZFS snapshot capability to Lustre with ZFS back end file systems. The project has created several new commands accessible from a Lustre console, which in turn execute routines in both the Lustre and ZFS environments.


The process of actually creating the snapshot is very fast, because it takes advantage of a copy-on-write file system, rather than copying the entire file system’s data. Core to the snapshot process is creating a global write barrier that waits for existing transactions to complete and blocks new transactions from starting while the snapshot is created. "Creating the snapshot is fast," stated Fan Yong. "Setting up the system to do that is time consuming, because we have to create a global write barrier across the pool."


The barrier freezes the file system by locking out transactions on each metadata target (MDT), one by one, inhibiting any changes to metadata during the snapshot. "In theory, this sounds simple," stated Yong. "Efficiently implementing it is not."

The user gives the snapshot its own unique name when creating it. The write barrier itself is set from the command line by naming the file system and, optionally, a timeout in seconds:

lctl barrier_freeze <fsname> [timeout (seconds)]

Once created, the user can mount the snapshot as a read-only file system on a client. Following the snapshot, separate commands ‘thaw’ the barrier and provide a status of the snapshot. New commands were also created to fork and erase configuration logs.
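The freeze and thaw behavior described above can be sketched in a few lines of Python. This is a single-process toy model, assuming one shared counter of in-flight transactions. The class `WriteBarrier` and its methods are hypothetical and do not correspond to Lustre internals, where the barrier has to be coordinated across every MDT.

```python
import threading


class WriteBarrier:
    """Toy write barrier: drain in-flight transactions, block new ones."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0       # transactions currently in flight
        self._frozen = False   # True while the barrier is up

    def begin_tx(self, timeout=None):
        """Start a transaction; returns False if the barrier stayed up too long."""
        with self._cond:
            allowed = self._cond.wait_for(lambda: not self._frozen, timeout)
            if allowed:
                self._active += 1
            return allowed

    def end_tx(self):
        """Finish a transaction and wake anyone waiting for the drain."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def freeze(self, timeout=None):
        """Block new transactions, then wait for existing ones to complete."""
        with self._cond:
            self._frozen = True
            drained = self._cond.wait_for(lambda: self._active == 0, timeout)
            if not drained:            # give up and let writers continue
                self._frozen = False
                self._cond.notify_all()
            return drained

    def thaw(self):
        """Lift the barrier so blocked transactions can proceed."""
        with self._cond:
            self._frozen = False
            self._cond.notify_all()
```

A snapshot would be taken between a successful `freeze()` and the matching `thaw()`. The timeout mirrors the optional timeout on the command line, so a stuck transaction cannot freeze the file system indefinitely.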

Yong’s research and benchmarks show that while the write barrier can take several seconds to complete, it is highly scalable across their test system, with little change as MDTs are added. The impact on MDT IO performance is minimal, so the wait is well worth the value of having a complete picture of the file system at a given moment in time.


Yong’s work is multi-phase and ongoing. Phase 1 with snapshotting capability is expected to land in Lustre 2.10. Future work will integrate additional capabilities, such as auto-mounting the snapshot, and other features. Yong’s presentation at LUG 2016 can be found here, and the video of it is here.

Making Metadata Performance Scalable and Fast

Metadata server performance is an area that has been a challenge for Lustre on ZFS deployments. Historically, Lustre’s ldiskfs has been an order of magnitude faster than ZFS, much of it due to the Metadata server performance. To mitigate this, some integrators mix back end file systems with Metadata servers based on ldiskfs and object servers based on ZFS. But, Metadata server performance has improved significantly with the work being done by Alexey Zhuravlev of Intel HPDD.

From a survey Mr. Zhuravlev performed on ZFS-based MDSs, he found that "there are several areas where bottlenecks occur. Some are due to Lustre code and some due to ZFS." Declaration time is a significant one: ldiskfs takes a mere fraction of the time that ZFS takes for all declaration operations. "The more specific the declaration, the more expensive it is on ZFS," added Zhuravlev. Lustre declarations are very specific, and it turns out that this level of specificity is not required. Lustre patch LU-7898, which landed in Lustre 2.9, brings ZFS declaration times in line with ldiskfs.


ZFS dnodes also presented a performance challenge to Lustre due to their small size. ZFS 0.6.5 dnodes are 512 bytes, which is not nearly large enough to accommodate Lustre’s extended attributes. If there is not enough space in the dnode, ZFS will create an extra block called a spill block, and use it to store extended attributes. But accessing the spill block results in an extra disk seek. "It took more than eight gigabytes of raw writes to create one million files," commented Zhuravlev. A patch for a variable dnode will land in the ZFS master 0.7 release that will allow dnodes to be sized from 512 bytes to a few KB, eliminating the need to look at the spill block, which is expected to reduce the disk seeks by half. "With large dnode support in ZFS and Lustre, one million files now take only one gigabyte of raw writes," added Zhuravlev. Additionally, ZFS didn’t support dnode accounting, and Lustre implements its own primitive support for file quotas, which is very expensive and doesn’t scale well. Patch LU-2435, which is nearly ready for landing, will add native dnode accounting to ZFS. "We estimate performance on dnode accounting will be both highly scalable and higher performance," stated Zhuravlev.
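The quoted figures are easier to compare per file. The short Python sketch below does that arithmetic and adds a deliberately simplified test for when a spill block is needed. The `overhead` default of 192 bytes is a hypothetical placeholder for the part of a dnode not available to attributes, and both function names are illustrative.

```python
def raw_bytes_per_file(total_bytes, file_count):
    """Average raw write volume attributed to each created file."""
    return total_bytes / file_count


def needs_spill_block(dnode_size, xattr_bytes, overhead=192):
    """Simplified model: attributes spill when they exceed the space left in the dnode.

    `overhead` is an assumed, illustrative value, not a figure from the article.
    """
    return xattr_bytes > dnode_size - overhead


if __name__ == "__main__":
    before = raw_bytes_per_file(8 * 10**9, 10**6)  # "more than eight gigabytes"
    after = raw_bytes_per_file(1 * 10**9, 10**6)   # "only one gigabyte"
    print(before, after, before / after)
```

Taking a gigabyte as 10^9 bytes, the quotes work out to roughly 8 KB of raw writes per file before large dnodes and about 1 KB after, an eightfold reduction.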

Fixes to declaration times (Lustre 2.9), dnode size increase (ZFS 0.7), and dnode accounting (Lustre 2.10) are showing significant improvements to MDS performance on ZFS, delivering both high scalability and performance that is expected to surpass that of ldiskfs.

"We continue to work on MDS performance on ZFS," added Zhuravlev. "We are actively looking at other optimizations and working with the upstream ZFS community." He presented his work at the European Open File System’s (EOFS) 2016 Lustre Administrator and Developer Workshop. His presentation is here.

Large IO Streaming Improvements

Write performance of Lustre on ZFS runs close to its theoretical maximum, but block allocation is not file-aware, so it tends to scatter blocks around the disks, which means more disk seeks when the file is read back. This keeps read performance lagging behind Lustre’s write performance on ZFS, as presented at LUG 2016 by Jinshan Xiong of Intel’s HPDD. Many patches landed during 2015 that have helped performance on ZFS, including Xiong’s vectorization, using Intel AVX instructions, of the Fletcher4 algorithm used to compute checksums.
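For readers unfamiliar with the checksum, here is a minimal scalar sketch of a Fletcher4-style computation in Python: four running 64-bit sums over the data taken as 32-bit words. The little-endian word order and the function name are assumptions for illustration. This is the plain scalar form, not Xiong's vectorized code.

```python
import struct

MASK64 = (1 << 64) - 1


def fletcher4(data):
    """Return the four running sums (a, b, c, d) over 32-bit little-endian words."""
    if len(data) % 4:
        raise ValueError("data length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64   # sum of words
        b = (b + a) & MASK64      # sum of the running a values
        c = (c + b) & MASK64      # sum of the running b values
        d = (d + c) & MASK64      # sum of the running c values
    return a, b, c, d


if __name__ == "__main__":
    print(fletcher4(struct.pack("<2I", 1, 2)))
```

Each sum depends on the previous iteration, so a vectorized version has to process several interleaved word streams in parallel lanes and combine the lane results at the end.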


Additional IO performance work is being done and upstreamed to the ZFS on Linux project, and is expected to land in Lustre 2.11. This work includes:

Increasing support on Lustre for a 16 MB block size—already supported by ZFS—which will increase the size of data blocks written to each disk. A larger block size will reduce disk seeks and boost read performance. This, in turn, will require supporting a dynamic OSD-ZFS block size to prevent an increase in read/modify/write operations.
Implementing a dRAID mechanism instead of RAIDZ to boost performance when a drive fails. With RAIDZ, throughput of a disk group is limited by the spare disk’s bandwidth. dRAID will use a mechanism that distributes data to spare blocks among the remaining disks. Throughput is expected to improve even when the group is degraded because of a failed drive.
Creating a separate Metadata allocation class to allow a dedicated high throughput VDEV for storing Metadata. Since ZFS Metadata is smaller, but fundamental, reading it faster will result in enhanced IO performance. The VDEV should be an SSD or NVRAM, and it can be mirrored for redundancy.
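The first item in the list above notes that a larger block size calls for a dynamic block size to avoid extra read/modify/write work. The Python sketch below illustrates why, using a deliberately simplified model: a write that starts on a block boundary and does not fill its last block forces that whole block to be read and rewritten. The function name and the model are assumptions, not a description of the OSD-ZFS code.

```python
def rmw_traffic(write_bytes, block_bytes):
    """Return (bytes_read, bytes_written) for one block-aligned update.

    Simplified model: any partially covered block is read in full,
    modified in memory, and written back in full.
    """
    full_blocks, partial = divmod(write_bytes, block_bytes)
    bytes_read = block_bytes if partial else 0
    blocks_written = full_blocks + (1 if partial else 0)
    return bytes_read, blocks_written * block_bytes


if __name__ == "__main__":
    mib = 2**20
    print(rmw_traffic(4096, 16 * mib))  # small update inside a 16 MB block
    print(rmw_traffic(4096, 4096))      # same update with a matching block size
```

Under this model a 4 KB update inside a 16 MB block moves 16 MB in each direction, while the same update against a 4 KB block needs no read at all. That gap is the cost a dynamically chosen block size is meant to avoid.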

Mr. Xiong’s presentation from LUG 2016 can be seen here. The video of it is here.