Facebook's MySQL databases are spread across our global data centers, and we need to be able to recover from an outage in any of these locations, at any given point in time. In such a disaster event, not only do we have to recover the service as quickly and reliably as possible, but we also need to ensure we don't lose data in the process. To that end, we've built a system that continuously tests our ability to restore our databases from backups.
Our restore system consists of two main components:

- CRT, which tracks restore jobs and collects data about each job's progress.
- ORC, which performs the restores themselves and verifies backup integrity.
The data CRT collects about each restore job's progress helps us understand our resource requirements for database restores, and ORC helps us verify backup integrity. This post will focus on ORC's internals — specifically the internal peon state machine and some of the challenges we had to overcome when coming up with the restore process for a single database.
Before building a continuous restore pipeline, we first need to understand the nature of backups available to us. Currently, we take three kinds of backups, all of which are stored in HDFS:

- Full backups
- Differential (diff) backups
- Binary log (binlog) backups
Both full and diff backups pass the --single-transaction option to mysqldump so that we get a consistent snapshot of the database, and both can be taken from slave as well as master instances. I'll refer to diff and full backups as dumps for the remainder of this post.
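As a rough illustration, a dump invocation along these lines yields a consistent snapshot. The helper function, host, and user names are hypothetical; only the --single-transaction flag comes from our setup.

```python
# Illustrative only: assembling a dump command with a consistent snapshot.
# The helper, host, and user are made up; --single-transaction is the flag
# our dumps actually pass to mysqldump.
def build_dump_command(host, database, user="backup"):
    """Build a mysqldump argv that takes a consistent, non-locking snapshot."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot via a single transaction
        "--host", host,
        "--user", user,
        database,
    ]

cmd = build_dump_command("db1234.example.net", "shard_0001")
print(" ".join(cmd))
```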
Since dumps are only taken once a day, having binlog backups means we get full coverage of every transaction executed against the database being backed up. This then enables us to perform point-in-time restores by replaying transactions from binlogs on top of a restored dump to bring the database’s state up to a certain point in time. All of our database servers also use global transaction IDs (GTIDs), which gives us another layer of control when replaying transactions from binlog backups.
In addition to storing backups in HDFS, we also write them to an off-site location. Shlomo Priymak’s talk at Code as Craft covers our backup architecture in much greater detail.
ORC has three components:
Peons contain all relevant logic for retrieving backups from HDFS, loading them into their local MySQL instance, and rolling them forward to a certain point in time by replaying binlogs. Each restore job a peon works on goes through five stages; among them:

- LOAD: load the mysqldump output into the peon's local MySQL instance, irrespective of the backup type.
- Binlog replay: use the mysqlbinlog program to filter out binlog events from other colocated databases and empty transactions, and then replay the required transactions on the same MySQL instance.
Each stage also has a corresponding failure state, so if a job fails at DOWNLOAD, it moves to the DOWNLOAD_FAILED state and doesn’t progress to LOAD.
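The stage/failure-state pattern can be sketched as a small state machine. This is a simplified model rather than ORC's actual implementation, and the stage list here is an illustrative subset of the five stages.

```python
# Hedged sketch of the per-job state machine: each stage has a paired
# *_FAILED terminal state. Only DOWNLOAD and LOAD are named in the post;
# REPLAY here is illustrative, and the real job has five stages.
class RestoreJob:
    STAGES = ["DOWNLOAD", "LOAD", "REPLAY"]  # illustrative subset

    def __init__(self):
        self.state = self.STAGES[0]

    def advance(self, success):
        """Move to the next stage on success, or to <stage>_FAILED on failure."""
        if not success:
            self.state += "_FAILED"  # terminal: job does not progress further
        else:
            idx = self.STAGES.index(self.state)
            self.state = self.STAGES[idx + 1] if idx + 1 < len(self.STAGES) else "DONE"
        return self.state

job = RestoreJob()
job.advance(True)          # DOWNLOAD -> LOAD
print(job.advance(False))  # LOAD fails -> LOAD_FAILED
```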
Perhaps the most challenging portion of the restore process is determining which binlogs to download and replay. Full and diff backups can be taken from the master or a slave; however, we only take binlog backups from the master. This means simple timestamp comparisons cannot be used to determine which binlogs need to be replayed.
We deployed GTIDs across our fleet in 2014, which gives each transaction a globally unique identifier. In addition, each running MySQL server maintains a gtid_executed (GTID set) variable, which acts as a counter of the transactions executed by that instance so far.
With GTIDs in place, a transaction replicated from master to slave keeps its GTID, which means we know definitively whether or not it is included in a GTID set. We can also perform superset/subset comparisons against a server's gtid_executed value, since it is a monotonically increasing counter and functions as a mathematical set.
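To sketch why gtid_executed supports these comparisons, here is a simplified model of GTID sets as per-server sets of transaction IDs. The parser handles only the simple uuid:lo-hi form, not the full GTID set syntax, and the values below are made up.

```python
# Simplified model of GTID sets: map server UUID -> set of transaction IDs.
# Real gtid_executed values look like
# "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5"; this parser only handles the
# basic "uuid:lo-hi" form for illustration.
def parse_gtid_set(s):
    result = {}
    for part in s.split(","):
        uuid, rng = part.strip().rsplit(":", 1)
        lo, _, hi = rng.partition("-")
        result.setdefault(uuid, set()).update(range(int(lo), int(hi or lo) + 1))
    return result

def is_subset(a, b):
    """True if every transaction in GTID set a is also present in b."""
    return all(txns <= b.get(uuid, set()) for uuid, txns in a.items())

server = "3E11FA47-71CA-11E1-9E33-C80AA9429562"
dump_gtids = parse_gtid_set(server + ":1-100")    # recorded when the dump ran
binlog_gtids = parse_gtid_set(server + ":1-150")  # recorded for a binlog file
print(is_subset(dump_gtids, binlog_gtids))  # the binlog covers the dump
```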
Putting these together, we can record the gtid_executed value from the server when the dump is being taken and also record the GTID set contained in each binlog file, then perform a consistent comparison to determine which transactions in a binlog need to be replayed. Further, once the first transaction to be replayed is identified, we know every subsequent transaction also needs to be replayed and can avoid any additional GTID comparisons. We also use the --stop-datetime option to mysqlbinlog to determine where the binlog stream should be stopped.
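Putting the comparison to work, a simplified selection routine might look like the following. The file names are made up, and GTID sets are modeled as flat sets of integers for brevity.

```python
# Illustrative binlog selection: keep everything from the first binlog that
# contains a transaction the dump has not already executed.
def binlogs_to_replay(dump_gtids, binlogs):
    """binlogs: ordered list of (name, gtid_set) pairs. Returns names to replay."""
    for i, (_, gtids) in enumerate(binlogs):
        if gtids - dump_gtids:  # this file has transactions missing from the dump
            # Once the first such file is found, every later file is needed too,
            # so no further GTID comparisons are necessary.
            return [name for name, _ in binlogs[i:]]
    return []

dump_gtids = set(range(1, 101))  # dump contains transactions 1-100
binlogs = [
    ("binlog.000001", set(range(1, 61))),    # fully covered by the dump
    ("binlog.000002", set(range(61, 121))),  # first file with new transactions
    ("binlog.000003", set(range(121, 181))),
]
print(binlogs_to_replay(dump_gtids, binlogs))
```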
Backups are the last line of defense in the event of a disaster; having them gives us confidence in our ability to recover from outages. However, just taking backups isn't enough. ORC helps us continuously test our backups to verify their integrity and also gives us an understanding of the resources we'd require to successfully restore from them.
Making a system like ORC operate at Facebook scale requires adding a lot of alerting, monitoring, and automated failure detection and remediation. All of these are implemented in CRT and will be covered in a follow-up post outlining how we scale our restore process to tens of thousands of databases.