Python in production engineering

Romain Komorn

Python aficionados are often surprised to learn that Python has long been the language most commonly used by production engineers at Facebook and is the third most popular language at Facebook, behind Hack (our in-house dialect of PHP) and C++. Our engineers build and maintain thousands of Python libraries and binaries deployed across our entire infrastructure.

Every day, dozens of Facebook engineers commit code to Python utilities and services with a wide variety of purposes including binary distribution, hardware imaging, operational automation, and infrastructure management.

Python at Facebook by the numbers

  • 21 percent of Facebook Infrastructure's codebase
  • Millions of lines of code, thousands of libraries and binaries
  • 2016 to date: average 5,000 commits per month, 1,000+ committers
  • 5 percent Py3 (as of May 2016)

Python in production engineering

Python is heavily used by the Facebook infrastructure teams and is ubiquitous in production engineering. Teams typically maintain Python client libraries (generally built on Thrift) for their services, giving any other team that wants to interact with them a simple and reliable interface.

Having access to these libraries reduces the amount of code that production engineers have to write, test, and maintain. That lets them move faster as they integrate their services into Facebook's infrastructure, and it helps that infrastructure scale reliably and efficiently.

Infrastructure management

Production engineering owns much of the software used to manage our infrastructure. Virtually all of it is written in Python, and it covers the life cycle of our hardware, from the time that it arrives in one of our data centers to the time that it is decommissioned.

Python is the language driving the services involved in:

  • Network switch setup and imaging (TORconfig)
  • Whitebox switch CLIs (FBOSS)
  • Core services (DNS, Chef, etc.) via Kobold, a pluggable system for service turn-up
  • Auto-remediation of server hardware faults and service failures, using FBAR
  • Scheduling and automating execution of maintenance work using Dapper (see our SRECon16 presentation)
  • Server imaging, burn-in testing, and repair management by Cyborg
  • Fault detection and diagnosis using machinechecker (a CLI utility to verify the health of servers in Facebook's fleet)

Platform services

As our infrastructure has scaled up, some services that used to be monolithic were split into multiple components, giving rise to a variety of general-purpose Python services.

Today, many of our infrastructure management tools are built on top of a common platform, made up of:

  • Job Engine — a scalable job scheduling and execution framework that any team can extend for its own purpose, currently running millions of jobs each month
  • fbpkg — a BLOB distribution service that's based on BitTorrent and is used to transfer large files and software packages (including the facebook.com binary)
  • FBTFTP — our high-performance IPv6-friendly TFTP implementation, used every time a server is imaged
  • Osmosis — a generic workflow definition and execution tool used by a broad set of teams, supporting uses that range from office and data center buildouts to kernel and OS upgrades

Service configuration management

Our host-level configuration management is done using Chef. Our service-level configuration management, however, is handled via a Facebook-authored project called Configerator. Engineers write Python that is executed to produce configuration objects, which are then stored as JSON files that can be consumed by any service. Validators, also written in Python, ensure that these configuration objects are defined correctly. Python is also used as the configuration language for Tupperware, Facebook's container deployment system.

Using Python enables us to write code that dynamically generates configuration objects without creating, maintaining, or learning to use complex templating systems that are typically required for this use case.
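
The generate, validate, and serialize flow described above can be sketched in a few lines. This is a minimal illustration only, not Configerator's actual API: the `CacheTierConfig` class, its fields, the region names, and the validation rules are all hypothetical and chosen for the example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CacheTierConfig:
    """Hypothetical configuration object for one cache tier."""
    name: str
    region: str
    replicas: int
    timeout_ms: int


def generate_configs(regions, base_replicas=3):
    """Build one config per region in plain Python instead of a template."""
    configs = []
    for region in regions:
        # Hypothetical rule: the "primary" region gets double capacity.
        replicas = base_replicas * 2 if region == "primary" else base_replicas
        configs.append(
            CacheTierConfig(
                name=f"cache.{region}",
                region=region,
                replicas=replicas,
                timeout_ms=50,
            )
        )
    return configs


def validate(config):
    """Return a list of problems; an empty list means the config is valid."""
    errors = []
    if config.replicas < 1:
        errors.append(f"{config.name}: replicas must be >= 1")
    if not 0 < config.timeout_ms <= 1000:
        errors.append(f"{config.name}: timeout_ms must be in (0, 1000]")
    return errors


def render(configs):
    """Validate every config, then serialize the whole set to a JSON string."""
    errors = [e for c in configs for e in validate(c)]
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps([asdict(c) for c in configs], indent=2, sort_keys=True)
```

Calling `render(generate_configs(["primary", "secondary"]))` returns a JSON document that any service could read, and a loop or conditional in `generate_configs` takes the place of template logic.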

Operational efficiency

Many teams have built on top of existing libraries and systems to improve their own operations or to address general operational needs at Facebook.

  • Our MySQL infrastructure team created the MySQL Pool Scanner, a service that automatically keeps our database infrastructure healthy the way a database administrator typically would.
  • Our widely distributed binaries (daemons, agents, or CLI utilities that are rolled out to all our servers) are safely pushed using slowroll orchestrator, a Python tool built on top of Job Engine that allows for phased rollouts with automatic safety checks.
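
As a rough illustration of a phased rollout with automatic safety checks, the sketch below deploys to progressively larger fractions of a host list and stops as soon as a health check fails. It is not how slowroll orchestrator or Job Engine is implemented: the function name, the `deploy` and `is_healthy` callbacks, and the phase fractions are assumptions made for the example.

```python
def phased_rollout(hosts, deploy, is_healthy,
                   phase_fractions=(0.01, 0.1, 0.5, 1.0)):
    """Deploy to growing fractions of hosts, halting on a failed check.

    Returns (deployed_hosts, completed). `completed` is False when a
    safety check failed and the rollout stopped early.
    """
    deployed = []
    done = 0
    for fraction in phase_fractions:
        # Each phase covers at least one host and never exceeds the fleet.
        target = min(len(hosts), max(1, int(len(hosts) * fraction)))
        for host in hosts[done:target]:
            deploy(host)
            deployed.append(host)
        done = max(done, target)
        # Safety check: every host updated so far must still be healthy.
        if not all(is_healthy(h) for h in deployed):
            return deployed, False
    return deployed, True
```

With ten hosts and fractions `(0.1, 0.5, 1.0)`, the phases cover one, five, and then all ten hosts. A failure detected after the second phase leaves the remaining five hosts untouched.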

Python 3 deployments

Facebook's scale pushes Python's performance to its limits. Our codebase features various concurrency models and libraries (Twisted, Gevent, futures, AsyncIO, and many others). All ports and new projects use Python 3 unless Python 2 support is absolutely necessary. Currently, 5 percent of our Python services in production are running Python 3.

The following Python 3-compatible projects have already been open-sourced:

  • FBOSS CLI — a Python 3.5 CLI that calls Thrift APIs on Facebook's in-house switch agent
  • Facebook Python Ads API — compatible with Python 3
  • FBTFTP — a dynamic TFTP server framework written in Python 3
  • PYAIB — Python Async IrcBot framework

There is a lot of exciting work to be done in expanding our Python 3 codebase. We are increasingly relying on AsyncIO, which was introduced in Python 3.4, and seeing huge performance gains as we move codebases away from Python 2. We hope to contribute more performance-enhancing fixes and features back to the Python community.
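
To show the kind of workload where AsyncIO helps, here is a minimal sketch that issues several I/O-bound calls concurrently on a single thread. The service names and delays are made up, and `asyncio.sleep` stands in for a real network call. The example uses the `async`/`await` syntax (Python 3.5+) and `asyncio.run` (Python 3.7+), both of which arrived after the Python 3.4 release mentioned above.

```python
import asyncio
import time


async def query_service(name, delay):
    """Stand-in for a network call; asyncio.sleep simulates the I/O wait."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def query_all(services):
    """Start all calls concurrently and return results in input order."""
    return await asyncio.gather(
        *(query_service(name, delay) for name, delay in services)
    )


def main():
    # Hypothetical services, each taking 50 ms to respond.
    services = [("dns", 0.05), ("chef", 0.05), ("tftp", 0.05)]
    start = time.perf_counter()
    results = asyncio.run(query_all(services))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    print(main())
```

Because the three waits overlap, the total elapsed time should be close to the longest single delay instead of the sum of all three.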
