Engineering Blog

Blog posts tagged 'Production Engineering'

Glenn Rivkees, Production Engineer at Facebook

Legacy support on IPv6-only infra

Posted about 2 months ago

A series of changes to Facebook's load balancers allows IPv4 traffic to be served on its IPv6 data center infrastructure. Read more...

Divij Rajkumar, Production Engineer at Facebook

Continuous MySQL backup validation: Restoring backups

Posted about 5 months ago

Our system continuously tests our ability to restore our databases from backups, ensuring that we can quickly and reliably recover from an outage. Read more...

Angelo Failla, Engineering

DHCPLB: An open source load balancer

Posted about 6 months ago

From hackathon prototype to internship project, the new load balancer is now deployed across Facebook's server fleet to manage DHCP traffic. Read more...

Marlon Dutra, Engineering

Scalable and secure access with SSH

Posted about 6 months ago

Facebook leverages signed certificates with principals for scalable, reliable security access. Read more...

Antoine Reversat, Engineering at Facebook

The mobile device lab at the Prineville data center

Posted about 9 months ago

The custom-built rack lets engineers run tests on thousands of phones to understand the performance implications of a code change. Read more...

Romain Komorn, Engineering

Making Facebook self-healing: Automating proactive rack maintenance

Posted about 9 months ago

Aggregate Maintenance Handlers provide a way to safely automate maintenance on multiple servers at once. Read more...

Romain Komorn, Engineering

Python in production engineering

Posted about 10 months ago
blog post · Production Engineering · Python

Production engineers at Facebook use Python for a variety of purposes, including binary distribution, hardware imaging, operational automation, and infrastructure management. Read more...

Phil Dibowitz, Production Engineer at Facebook

Facebook Chef cookbooks

Posted about 11 months ago
blog post · Infra · Data · Open Source · Production Engineering

This suite of cookbooks — along with a sample 'init' cookbook — will allow anyone who wants to use our model of Chef in their own environment to get started easily and quickly. Read more...

Erin Green, Engineering

Using ISC Kea DHCP in our data centers

Posted about 2 years ago

Inside Facebook's transition to ISC Kea. Read more...

Under the hood: Facebook’s cold storage system

Posted about 2 years ago

Finding a place for images to live so they can be instantly available is a recurring scale challenge for Facebook. Read more...

Ryan Albrecht, Engineering

Web performance: Cache efficiency exercise

Posted about 2 years ago
blog post · Web · Production Engineering · Caching · Front End

My team was discussing the parts of facebook.com that are currently uncached, and the question came up: What is the efficiency of the cache since, at Facebook, we release new code twice a day? We were worried that our release process might be negatively impacting our cache performance. Read more...

Zheng Mi, Engineering Manager at Facebook

Mobile performance: Tooling infrastructure at Facebook

Posted about 2 years ago

We built a performance monitoring and prediction platform to help us understand the performance implications of a code change and ultimately decrease the number of regressions engineers have to deal with. Read more...

Augmented Traffic Control: A tool to simulate network conditions

Posted about 2 years ago

Today we are open-sourcing our design for Augmented Traffic Control on GitHub. Read more...

Yufei Zhu, Engineering at Facebook

Serving Facebook Multifeed: Efficiency, performance gains through redesign

Posted about 2 years ago
blog post · Infra · Production Engineering

We used the disaggregation concept to redesign Multifeed, achieving a 40% efficiency improvement in total memory and CPU consumption for parts of the Multifeed infrastructure. Read more...

Yuval Bachar, Engineering

Introducing “6-pack”: the first open hardware modular switch

Posted about 2 years ago

With “6-pack,” we have created an architecture that enables us to build any size switch using a simple set of common building blocks. Read more...

Nathan Bronson, Software Engineer at Facebook

Solving the Mystery of Link Imbalance: A Metastable Failure State at Scale

Posted about 2 years ago
blog post · Infra · Production Engineering

As we’re building and running systems at Facebook, sometimes we encounter metastable failure states. These are problems that create conditions that prevent their own solutions. In gridlocked traffic, for example, cars that are blocking an intersection keep traffic from moving, but they can’t exit the intersection because they are stuck in traffic. This kind of failure ends only when there is an external intervention like a reduction in load or a complete reboot. Read more...

Alexey Andreyev, Engineering

Introducing data center fabric, the next-generation Facebook data center network

Posted about 2 years ago

The more than 1.35 billion people who use Facebook on an ongoing basis rely on seamless, “always on” site performance. On the back end, we have many advanced sub-systems and infrastructures in place that make such a real-time experience possible, and our scalable, high-performance network is one of them. Read more...

Audience Insights query engine: In-memory integer store for social analytics

Posted about 2 years ago
blog post · Web · Data · Infra · Production Engineering · Analytics · Data Science

A query engine with a hybrid integer store that organizes data in memory and on flash disks so that a query can process terabytes of data in real time. Read more...

Introducing Proxygen, Facebook's C++ HTTP framework

Posted about 2 years ago

We are excited to announce the release of Proxygen, a collection of C++ HTTP libraries, including an easy-to-use HTTP server. In addition to HTTP/1.1, Proxygen (rhymes with “oxygen”) supports SPDY/3 and SPDY/3.1. We are also iterating and developing support for HTTP/2. Read more...

Mike Arpaia, Engineering

Introducing osquery

Posted about 2 years ago

Maintaining real-time insight into the current state of your infrastructure is important. At Facebook, we've been working on a framework called osquery which attempts to approach the concept of low-level operating system monitoring a little differently. Read more...

Building Mobile-First Infrastructure for Messenger

Posted about 2 years ago
blog post · Mobile · Infra · Messages · Production Engineering · Backend · Storage

Messages have been part of Facebook for many years, beginning as direct messaging similar to email (available in your inbox the next time you visited the site) and then eventually evolving into a real-time messaging platform that provides access to your messages on a number of mobile apps or in a browser. But until recently the back-end systems hadn't evolved much from early iterations, and Messenger's performance and data usage started to lag behind — especially on networks with costly data plans and limited bandwidth. To fix this, we needed to completely re-imagine how data is synchronized to the device and change how data is processed in the back end to support our new synchronization protocol. Read more...

Phil Dibowitz, Production Engineer at Facebook

Facebook, configuration management, community, and open source

Posted about 2 years ago
blog post · Infra · Data · Open Source · Production Engineering

Last year we began speaking at conferences around the world about our approach to managing hundreds of thousands of servers. We had outgrown our existing system and needed something new. We wanted a system that would let any engineer make any change they needed to any systems they owned via simple data-driven APIs while also scaling to Facebook's huge infrastructure, and while also minimizing the size of the team that would have to own the system. We designed a new paradigm and built a framework to bring it to life. At the core of that framework is Chef — but the way we ended up using Chef is pretty unique. We wanted to share how and why we made those choices and the benefits they brought us. Read more...

Lessons from Deploying MySQL GTID at Scale

Posted about 3 years ago
blog post · Data · MySQL · Production Engineering · Open Source

Global Transaction ID (GTID) is one of the most compelling new features of MySQL 5.6. It provides major benefits in failover, point-in-time backup recovery, and hierarchical replication, and it's a prerequisite for crash-safe multi-threaded replication. Over the course of the last few months, we enabled GTID on every production MySQL instance at Facebook. In the process, we learned a great deal about deployment and operational use of the feature. We plan to open source many of our server-side fixes via WebScaleSQL, as we believe others in the scale community can learn from this and benefit from the work we've done. Read more...

Introducing mcrouter: A memcached protocol router for scaling memcached deployments

Posted about 3 years ago

Most web-based services begin as a collection of front-end application servers paired with databases used to manage data storage. As they grow, the databases are augmented with caches to store frequently-read pieces of data and improve site performance. Often, the ability to quickly access data moves from being an optimization to a requirement for a site. This evolution of cache from neat optimization to necessity is a common path that has been followed by many large web scale companies, including Facebook, Twitter[1], Instagram, Reddit, and many others. Read more...

Qiang Wu, Infrastructure Software Engineer at Facebook

Making Facebook’s software infrastructure more energy efficient with Autoscale

Posted about 3 years ago

Improving energy efficiency and reducing environmental impact as we scale is a top priority for our data center teams. We’ve talked a lot about our progress on energy-efficient hardware and data center design through the Open Compute Project, but we’ve also started looking at how we could improve the energy efficiency of our software. We explored multiple avenues, including power modeling and profiling, peak power management, and energy-proportional computing. One particularly exciting piece of technology that we developed is a system for power-efficient load balancing called Autoscale. Autoscale has been rolled out to production clusters and has already demonstrated significant energy savings. Read more...

Saving capacity with HDFS RAID

Posted about 3 years ago
blog post · Data · Infra · Production Engineering

As we continue to evolve our data infrastructure, we’re constantly looking for ways to maximize the utility and efficiency of our systems. One technology we’ve deployed is HDFS RAID, an implementation of Erasure Codes in HDFS to reduce the replication factor of data in HDFS. We finished putting this into production last year and wanted to share the lessons we learned along the way and how we increased capacity by tens of petabytes. Read more...
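To make the idea of trading replication for erasure coding concrete, here is a minimal sketch of the storage arithmetic. The stripe layout (data and parity block counts), the replication factor, and the dataset size used in the example are illustrative assumptions, not figures from the post, and the function names are hypothetical.

```python
def storage_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte for one erasure-coded stripe.

    A stripe of `data_blocks` data blocks is protected by `parity_blocks`
    parity blocks, so the overhead is (data + parity) / data.
    """
    if data_blocks <= 0 or parity_blocks < 0:
        raise ValueError("data_blocks must be positive and parity_blocks non-negative")
    return (data_blocks + parity_blocks) / data_blocks


def capacity_saved(logical_pb: float, replication_factor: float,
                   data_blocks: int, parity_blocks: int) -> float:
    """Raw capacity (in PB) freed by erasure coding instead of plain replication."""
    replicated = logical_pb * replication_factor
    encoded = logical_pb * storage_overhead(data_blocks, parity_blocks)
    return replicated - encoded


if __name__ == "__main__":
    # Hypothetical inputs: 100 PB of logical data, 3 replicas, 10 data + 4 parity blocks.
    overhead = storage_overhead(10, 4)
    saved = capacity_saved(100, 3, 10, 4)
    print(f"erasure-coded overhead: {overhead:.2f}x")
    print(f"raw capacity saved: {saved:.1f} PB")
```

The overhead ratio plays the role of an effective replication factor: the fewer parity blocks per data block, the less raw capacity each logical byte consumes, at the cost of more work to reconstruct a lost block.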

Scaling the Facebook data warehouse to 300 PB

Posted about 3 years ago
blog post · Data · Infra · Production Engineering

At Facebook, we have unique storage scalability challenges when it comes to our data warehouse. Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB. In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure. Read more...

Facebook © 2017