Engineering Blog

Blog posts tagged 'Production Engineering'

Peter Knowles, Engineering

How production engineers support global events on Facebook

Posted about 4 months ago
blog post · Infra · Production Engineering · Backend · Testing

The Production Engineering team at Facebook carefully plans and builds infrastructure to ensure service uptime and reliability. Read more...

Junyi Luke Lu, Engineering at Facebook

OnlineSchemaChange rebuilt in Python

Posted about a year ago
blog post · Production Engineering · Open Source · MySQL · Python · Tooling · Testing

The new version of OnlineSchemaChange is written in Python and has a more flexible architecture. Read more...

Glenn Rivkees, Engineering

Legacy support on IPv6-only infra

Posted about a year ago

A series of changes to Facebook's load balancers allows IPv4 traffic to be served on its IPv6 data center infrastructure. Read more...

Divij Rajkumar, Production Engineer at Facebook

Continuous MySQL backup validation: Restoring backups

Posted about 2 years ago

Our system continuously tests our ability to restore our databases from backups, ensuring that we can quickly and reliably recover from an outage. Read more...

Angelo Failla, Engineering

DHCPLB: An open source load balancer

Posted about 2 years ago

From hackathon prototype to internship project, the new load balancer is now deployed across Facebook's server fleet to manage DHCP traffic. Read more...

Marlon Dutra, Engineering

Scalable and secure access with SSH

Posted about 2 years ago

Facebook leverages signed certificates with principals for scalable, reliable security access. Read more...

Antoine Reversat, Engineering

The mobile device lab at the Prineville data center

Posted about 2 years ago

The custom-built rack lets engineers run tests on thousands of phones to understand the performance implications of a code change. Read more...

Romain Komorn, Engineering

Making Facebook self-healing: Automating proactive rack maintenance

Posted about 2 years ago

Aggregate Maintenance Handlers provide a way to safely automate maintenance on multiple servers at once. Read more...

Romain Komorn, Engineering

Python in production engineering

Posted about 2 years ago
blog post · Production Engineering · Python

Production engineers at Facebook use Python for a variety of purposes, including binary distribution, hardware imaging, operational automation, and infrastructure management. Read more...

Phil Dibowitz, Production Engineer at Facebook

Facebook Chef cookbooks

Posted about 2 years ago
blog post · Infra · Data · Open Source · Production Engineering

This suite of cookbooks — along with a sample 'init' cookbook — will allow anyone who wants to use our model of Chef in their own environment to get started easily and quickly. Read more...

Erin Green, Engineering

Using ISC Kea DHCP in our data centers

Posted about 3 years ago

Inside Facebook's transition to ISC Kea.

Under the hood: Facebook’s cold storage system

Posted about 3 years ago

Finding a place for images to live so they can be instantly available is a recurring scale challenge for Facebook. Read more...

Ryan Albrecht, Engineering

Web performance: Cache efficiency exercise

Posted about 3 years ago
blog post · Web · Production Engineering · Caching · Front End

My team was discussing the parts of the site that are currently uncached, and the question came up: What is the efficiency of the cache since, at Facebook, we release new code twice a day? We were worried that our release process might be negatively impacting our cache performance. Read more...

Zheng Mi, Engineering Manager at Facebook

Mobile performance: Tooling infrastructure at Facebook

Posted about 3 years ago

We built a performance monitoring and prediction platform to help us understand the performance implications of a code change and ultimately decrease the number of regressions engineers have to deal with. Read more...

Augmented Traffic Control: A tool to simulate network conditions

Posted about 3 years ago

Today we are open-sourcing our design for Augmented Traffic Control on GitHub. Read more...

Yufei Zhu, Engineering at Facebook

Serving Facebook Multifeed: Efficiency, performance gains through redesign

Posted about 3 years ago
blog post · Infra · Production Engineering

We applied the disaggregation concept to redesign Multifeed, achieving a 40% efficiency improvement in total memory and CPU consumption for parts of the Multifeed infrastructure. Read more...

Yuval Bachar, Engineering

Introducing “6-pack”: the first open hardware modular switch

Posted about 3 years ago

With “6-pack,” we have created an architecture that enables us to build any size switch using a simple set of common building blocks. Read more...

Alexey Andreyev, Engineering

Introducing data center fabric, the next-generation Facebook data center network

Posted about 4 years ago

The more than 1.35 billion people who use Facebook on an ongoing basis rely on seamless, “always on” site performance. On the back end, we have many advanced sub-systems and infrastructures in place that make such a real-time experience possible, and our scalable, high-performance network is one of them. Read more...

Nathan Bronson, Software engineer at Facebook

Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale

Posted about 4 years ago
blog post · Infra · Production Engineering

As we’re building and running systems at Facebook, sometimes we encounter metastable failure states. These are problems that create conditions that prevent their own solutions. In gridlocked traffic, for example, cars that are blocking an intersection keep traffic from moving, but they can’t exit the intersection because they are stuck in traffic. This kind of failure ends only when there is an external intervention like a reduction in load or a complete reboot. Read more...

Audience Insights query engine: In-memory integer store for social analytics

Posted about 4 years ago
blog post · Web · Data · Infra · Production Engineering · Analytics · Data Science

A query engine with a hybrid integer store that organizes data in memory and on flash disks so that a query can process terabytes of data in real time. Read more...

Introducing Proxygen, Facebook's C++ HTTP framework

Posted about 4 years ago

We are excited to announce the release of Proxygen, a collection of C++ HTTP libraries, including an easy-to-use HTTP server. In addition to HTTP/1.1, Proxygen (rhymes with “oxygen”) supports SPDY/3 and SPDY/3.1. We are also iterating and developing support for HTTP/2. Read more...

Mike Arpaia, Engineering

Introducing osquery

Posted about 4 years ago

Maintaining real-time insight into the current state of your infrastructure is important. At Facebook, we've been working on a framework called osquery which attempts to approach the concept of low-level operating system monitoring a little differently. Read more...

Building Mobile-First Infrastructure for Messenger

Posted about 4 years ago
blog post · Mobile · Infra · Messages · Production Engineering · Backend · Storage

Messages have been part of Facebook for many years, beginning as direct messaging similar to email (available in your inbox the next time you visited the site) and then eventually evolving into a real-time messaging platform that provides access to your messages on a number of mobile apps or in a browser. But until recently the back-end systems hadn't evolved much from early iterations, and Messenger's performance and data usage started to lag behind — especially on networks with costly data plans and limited bandwidth. To fix this, we needed to completely re-imagine how data is synchronized to the device and change how data is processed in the back end to support our new synchronization protocol. Read more...

Phil Dibowitz, Production Engineer at Facebook

Facebook, configuration management, community, and open source

Posted about 4 years ago
blog post · Infra · Data · Open Source · Production Engineering

Last year we began speaking at conferences around the world about our approach to managing hundreds of thousands of servers. We had outgrown our existing system and needed something new. We wanted a system that would let any engineer make any change they needed to any systems they owned via simple data-driven APIs while also scaling to Facebook's huge infrastructure, and while also minimizing the size of the team that would have to own the system. We designed a new paradigm and built a framework to bring it to life. At the core of that framework is Chef — but the way we ended up using Chef is pretty unique. We wanted to share how and why we made those choices and the benefits they brought us. Read more...

Lessons from Deploying MySQL GTID at Scale

Posted about 4 years ago
blog post · Data · MySQL · Production Engineering · Open Source

Global Transaction ID (GTID) is one of the most compelling new features of MySQL 5.6. It provides major benefits in failover, point-in-time backup recovery, and hierarchical replication, and it's a prerequisite for crash-safe multi-threaded replication. Over the course of the last few months, we enabled GTID on every production MySQL instance at Facebook. In the process, we learned a great deal about deployment and operational use of the feature. We plan to open source many of our server-side fixes via WebScaleSQL, as we believe others in the scale community can learn from this and benefit from the work we've done. Read more...

Introducing mcrouter: A memcached protocol router for scaling memcached deployments

Posted about 4 years ago

Most web-based services begin as a collection of front-end application servers paired with databases used to manage data storage. As they grow, the databases are augmented with caches to store frequently-read pieces of data and improve site performance. Often, the ability to quickly access data moves from being an optimization to a requirement for a site. This evolution of cache from neat optimization to necessity is a common path that has been followed by many large web scale companies, including Facebook, Twitter[1], Instagram, Reddit, and many others. Read more...

Qiang Wu, Infrastructure Software Engineer at Facebook

Making Facebook’s software infrastructure more energy efficient with Autoscale

Posted about 4 years ago

Improving energy efficiency and reducing environmental impact as we scale is a top priority for our data center teams. We’ve talked a lot about our progress on energy-efficient hardware and data center design through the Open Compute Project, but we’ve also started looking at how we could improve the energy efficiency of our software. We explored multiple avenues, including power modeling and profiling, peak power management, and energy-proportional computing. One particularly exciting piece of technology that we developed is a system for power-efficient load balancing called Autoscale. Autoscale has been rolled out to production clusters and has already demonstrated significant energy savings. Read more...

Facebook © 2018