February 8, 2010

Site Reliability Engineering at Facebook

Mark Schonbach

I recently returned from visiting family and friends in Delaware, and everyone asked me, "What do you do at Facebook?" The best answer I could give them without launching into a 45-minute technical discussion was, "I'm responsible for making sure that Facebook is up at all times and performing at its peak." Obviously, I don't run one of the largest sites in the world by myself; I'm part of a small team of Site Reliability Engineers (SREs) that works day and night to ensure that you and the other 400+ million users around the world are able to access Facebook, that the site loads quickly, and that all of the features are working.

Our Site Reliability Engineering team currently consists of teams in Palo Alto, London, and our brand-new office in Dublin, Ireland. At Facebook, we are very proud of our level of engineering impact, with over 1.2 million users per engineer, but we are even more proud of the fact that we keep Facebook up and running with one SRE for every 18 million users. That level of impact is unrivaled compared to other technology companies.

The work that the Site Reliability Engineering team does can best be summarized this way:

Site - Does it work? Facebook's SRE team is tasked with making sure the site is up and running 24 hours per day, 365 days per year. To support a global user base, we keep a watchful eye on our various internal and external monitoring tools and systems so you're able to connect and share with friends and family regardless of whether it is noon in New York or midnight in Manila. We manage our high traffic load by balancing the user experience against our available worldwide capacity. The SRE team is empowered with the knowledge and responsibility to fix just about any operational issue we may encounter, to solve problems with other Technical Operations and Engineering teams as appropriate, and to follow any issue through to its completion.

Reliability - Does it work well? Facebook's dirty little secret is that behind the scenes, our infrastructure is extraordinarily complex. While it is extremely rare that the entire site is offline, it is more common that one feature is temporarily unavailable. The SRE team works tirelessly to ensure that not only is the core of Facebook up and running, but also that you can use all of the features of the site, e.g. photo uploads, chat, and Facebook Connect. Even though we work directly with key partners and developers to make sure their applications are working well, we don't get any special perks for our farms and mafias. We also work with the Release Engineering team to coordinate scheduled and emergency code updates and understand what is being changed and how it could affect the site.

Engineering - Could it work better? We always have one eye looking towards the future. We regularly hack tools on the fly that help us manage and perform complex maintenance procedures on one of the largest, if not the largest, memcached footprints in the world. We develop automated tools to provision new servers, reallocate existing ones, and detect and repair applications or servers that are misbehaving. We are only able to maintain such a high user-to-server ratio due to a knowledgeable and experienced set of engineers. We also track performance issues and look at long-term trends to correct problems and find ways to make Facebook run even faster and more efficiently.

After I attempt to explain what I do, the next question I am usually asked is, "What do you like most about your job?" Aside from the awesome food every day and the amazingly talented people I work with, the thing I like most about being an SRE is that I never know what I am going to encounter when I arrive at work in the morning. One day could involve spending hours troubleshooting a complicated networking issue, and the next could be spent writing a tool to verify that all of our servers are responding efficiently. It brings a smile to my face every time I get a friend request from an old friend I'd previously lost touch with, because I know that my hard work means something to millions and millions of people around the world. Facebook is truly a fast-paced, dynamic environment, yet it offers the freedom to operate and do what is necessary to make things better. This is best summed up by example: at the end of my first week as an SRE, I had already investigated and corrected a troublesome issue that had been plaguing the team. It was gratifying to have an impact in such a short span of time.

As Facebook continues to grow, we are always looking to expand our team with talented, motivated people who believe in what we do and who are eager to jump in and help us face our future challenges. If this sounds like you, take a look at our SRE job description; we would love for you to join our team!

Mark Schonbach is balancing traffic between datacenters while sitting in traffic on Interstate 280 on the Facebook shuttle.
