Improving the Linux kernel with upstream contributions

The Linux kernel is constantly changing, growing roughly 1.4 million lines of code in the last year alone. The kernel community is constantly adding new features, supporting new hardware, refining interfaces, and fixing bugs. In order to take advantage of these changes, production Linux deployments need to take on the difficult task of validating new kernels across all their workloads.

Scaling over hundreds of thousands of machines makes Facebook unique in many ways, and over the years, our kernel team has developed specialized functionality to help support the fleet. When moving to new kernels, we often need to choose between slowing the rate of change to reduce the testing required and increasing the rate of change to pull in the code we need to support our constantly evolving data centers.

We believe the best way to make sure new kernels work well for our services is to actively participate in the kernel community. Upstream participation ensures that our changes are part of the core kernel that is validated with each release, and contributing our patches upstream results in reviews that help improve code quality and maintainability. By focusing on upstream before deploying code in production, we are able to reduce internal churn, create fewer regressions, and improve the Linux kernel for everyone. We refer to this as our “upstream-first” philosophy.

The Linux kernel is a community effort, both inside and outside of Facebook. The work the team does is made possible through the efforts of thousands of developers around the world. We’d like to update the community on our contributions in the last few months.

Development

As v3.10 of the Linux kernel went into maintenance mode earlier this year, our internal development of this older kernel predictably slowed and we turned our focus to v4.0. Our v3.10-fb branch saw 80 commits from 10 internal contributors. Our v4.0-fb branch saw 101 commits from 12 internal contributors. Interestingly, the upstream (mainline) Linux branch saw 238 commits from nine internal contributors. This is our upstream-first philosophy at work. We upstream our development efforts for community vetting in most cases before the code hits production. This is particularly useful for making sure that complex changes get an appropriate number of eyes on them from within the community.

Blk-mq

If you’re not familiar with blk-mq, it stands for “block multi-queue” and is a newish framework for block layer queuing in the kernel. Historically, the block layer design employed a single request queue, which was an obvious bottleneck for storage performance. Blk-mq introduces two types of queues: per-CPU core software queues and hardware queues, which map to the underlying hardware driver submission queue. This design shows reduced system utilization, lower latency, and higher performance.

Blk-mq has performed so well that it is now enabled in production for several solid state storage systems inside of Facebook. We’re continuing to develop new features on top of the core blk-mq functionality to support the latest in high-speed storage and make it more useful across different types of hardware.

Btrfs

We have been working toward deploying Btrfs slowly throughout the fleet, and we have been using large gluster storage clusters to help stabilize Btrfs. The gluster workloads are extremely demanding, and this half we gained a lot more confidence running Btrfs in production. More than 50 changes went into the stabilization effort, and Btrfs was able to protect production data from hardware bugs other filesystems would have missed.

New Vegas congestion control

NV is a modern congestion-control algorithm, part of a series of planned data center–focused improvements to TCP. You can read more about this work below:

Writeback support for cgroups

We use cgroups around Facebook for resource limiting in systems across our fleet. Cgroups can limit resources like CPU and memory, but until recently it has been missing useful support for IO limiting. With the changes described below, we’ll be able to provide resource limits for most of the major subsystems — block, net, memory, and CPU. This fine-grained resource control will eventually let us do things like smartly collocate container jobs, ensuring that they can’t starve each other of resources.

Network virtualization and reliability

We’ve continued work on GUE, IPv6 flow label support, and ILA. These are all related to how we might be able to provide a pure virtualized network environment, which would allow us to do things like have uniquely addressable containers. IPv6 flow labels also provide a mechanism for rewiring network routes on demand. When the IPv6 flow label changes, the network hardware can rehash the flow and take an alternate path.

IPv6 performance improvements

Facebook has been actively migrating our internal and external traffic to IPv6 for some time, and this migration has found a number of bottlenecks in the kernel IPv6 support. We sent our IPv6 performance improvements upstream, and our patches to scale the routing tree have allowed us to remove a number of Facebook-specific hacks from production kernels.

Linux IPv6 improvement: Routing cache on demand

RAID 5/6 caching

We implemented a caching layer for MD RAID 5/6. This series of patches solves two problems: plugging the RAID 5/6 write hole to keep parity consistent after power failures and greatly improving performance for synchronous workloads. This change set is currently being reviewed upstream, and we are continuing to adapt it based on the feedback we are receiving.

MD: a caching layer for raid5/6

Release engineering

We have been making steady progress to ensure that we are able to quickly deploy new kernels internally. The 4.0 kernel is now deployed on a significant percentage of our fleet, and the ability to quickly canary new changes in production shortens the development cycle dramatically.

Moving forward

Our broad goal as a kernel team is to maintain a rapid pace of integration between the upstream kernel and our own production environments. We’re focused on improving the Linux storage, networking, and container stacks. We’re quickly taking advantage of the newest technology in Linux and pushing the lessons we learn back into the community.

Keep an eye out for more case studies and patch submissions as we make progress.