As we continue to evolve our data infrastructure, we’re constantly looking for ways to maximize the utility and efficiency of our systems. One technology we’ve deployed is HDFS RAID, an implementation of Erasure Codes in HDFS to reduce the replication factor of data in HDFS. We finished putting this into production last year and wanted to share the lessons we learned along the way and how we increased capacity by tens of petabytes.
The default replication of a file in HDFS is three, which can lead to a lot of space overhead. HDFS RAID reduces this space overhead by reducing the effective replication of data. The replication factor of the original file is reduced, but data safety guarantees are maintained by creating parity data.
There are two Erasure Codes implemented in HDFS RAID: XOR and Reed-Solomon. The XOR implementation sets the replication factor of the file to two. It maintains data safety by creating one parity block for every 10 blocks of source data. The parity blocks are stored in a parity file, also at replication two. Thus for 10 blocks of the source data, we have 20 replicas for the source file and two replicas for the parity block, which accounts for a total of 22 replicas. The effective replication of the XOR scheme is 22/10 = 2.2.
The Reed-Solomon implementation sets the replication factor of the file to one. It maintains data safety by creating four parity blocks for every 10 blocks of source data. The parity blocks are stored in a parity file at replication one. Thus for 10 blocks of the source data, we have 10 replicas for the source file and four replicas for the parity blocks, which accounts for a total of 14 replicas. The effective replication of Reed-Solomon scheme is 14/10 = 1.4
Theoretically we should be able to get 2.2x replication factor for XOR RAID files and 1.4X replication factor for Reed Solomon RAID files.
However, when we deployed RAID to production, we saved much less space than expected. After investigating, we found a significant issue in RAID, which we christened the “small file problem.”
The following chart shows how many physical blocks we could save by doing XOR and RS RAID for files with different numbers of logical blocks. As you can see, files with 10 logical blocks provide the best opportunity for space-saving. The fewer blocks a file has, the smaller the capacity for conservation. If a file has less than three blocks, RAID can’t save any space. We consider such un-RAID-able files “small files.” More analysis showed that in our production cluster, more than 50% of the files were small files. This meant that half of the cluster was completely un-RAID-able and was still subject to the 3x replication factor.
We developed directory RAID to address the small file problem based on one simple observation: In Hive, files under the leaf directory are rarely changed after creation. So, why not treat the whole leaf directory as a file with many blocks and then RAID it? In directory RAID, all the file blocks under the leaf directory are grouped together in alphabetical order, and only one parity file is generated for the whole directory. As shown below, with directory RAID, we are able to RAID file2 and file3, therefore saving more physical blocks than file-level RAID. In this way, as long as a leaf directory has more than two blocks, we are able to reclaim spaces from it, even if it’s full of small files.
A RAID node in an HDFS cluster is responsible for parity files generation and block fixing. It scans RAID policy files and generates a corresponding parity file for each source file or source directories by submitting a map/reduce job. The RAID node also periodically asks HDFS for corrupt files. For each corrupt file, it finds its corresponding parity file and fixes its missing blocks by decoding source blocks and its parity blocks. If a RAID file with missing blocks gets read, the content of the missing data will be reconstructed on the fly, thus avoiding the chance of data unavailability. We also enhance HDFS with a RAID block placement policy, placing 10 blocks of a stripe in different racks. This improves data reliability in case of rack failure.
Deploying RAID in large HDFS clusters with hundreds of petabytes of data presents a lot of challenges. Here are just a few:
We use different RAID policies for our Hive data stored in HDFS. New data is entered in HDFS with a replication factor of three, and most of it gets queried for analysis and reports intensively in its first day of life. After one day, it gets directory XOR RAIDed, with a reduced replication factor but still offering two copies of each byte for reading. When data becomes one month old, files under a Hive partition are compacted into RAID-friendly larger files. We choose the block size of compacted files so its number of blocks is close to a multiple of 10 blocks. They are then arrayed in a Reed-Solomon RAID. That compacted cold data ends up with a replication factor of close to 1.4.
With RAID fully rolled out to our warehouse HDFS clusters last year, the cluster’s overall replication factor was reduced enough to represent tens of petabytes of capacity savings by the end of 2013.
HDFS RAID is currently implemented as a separate layer on top of HDFS. RAID node is a standalone daemon responsible for file/directory RAID and block fixing. This solution presents a few problems. First, a file has to be created and then read again when RAID occurs, thus wasting network bandwidth and disk I/Os. Secondly, when files are replicated across clusters, sometimes parity files are not moved together with their source files. This often leads to data loss. Lastly, block fixing is often delayed by a delay of querying missing blocks from HDFS and the limited map/reduce slots allocated for RAID.
We are currently working on building RAID to be natively supported in HDFS. A file could be RAIDed when it is first created in HDFS, saving disk I/Os. Once deployed, the NameNode will keep the file RAID information and schedule block fixing when RAID blocks become missing, while the DataNode will be responsible for block reconstruction. This removes the HDFS RAID dependency on map/reduce.
A number of people on the Facebook Data Infrastructure team contributed to the development of Directory RAID, including Dikang Gu, Peter Knowles, and Guoqiang Jerry Chen.