Building data science teams to have an impact at scale

Rajiv Krishnamurthy, Ashish Kelkar

Facebook's vast scale and range of engineering initiatives — in particular in the context of infrastructure engineering — present a tremendous opportunity for data science to tackle interesting, important challenges and to affect products and services used by billions of people every day. Data scientists on the Facebook infrastructure team work on problems that range from designing machine learning algorithms for content delivery networks (CDNs), to building statistical models that forecast data center demand, to designing tools that can detect regressions hidden among billions of time series data points. Because data science plays this important role, Facebook has designed a distinctive model for building unified teams with both infrastructure engineers and data scientists. Using this integrated approach, Facebook's data scientists help build more efficient tools and systems, improve performance, and devise long-term scaling strategies across our infrastructure stack.

How data science contributes to infrastructure engineering work

Close collaboration between data scientists and infra engineers is an important part of optimizing storage on Facebook's CDN, which hosts photos, video, and other content for quick access by our apps. These global server networks are a finite resource, and it isn't feasible to store every piece of content on the CDN indefinitely. The typical industry practice is to store recently requested information and then remove it when there are no further requests. With the volume of data served to people who use Facebook every day, the CDN ends up temporarily storing a lot of information that is never requested again, which is an inefficient use of CDN capacity. To address this, our data scientists collaborated with the network engineering team to build machine learning models that predict whether a given piece of content will be requested again in the near future. This technique has helped the team improve cache efficiency, and our data scientists are now working with engineers to implement it at production scale.
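To illustrate why the admission decision matters, here is a minimal Python sketch that replays a synthetic request trace through a small LRU cache twice: once admitting every missed item, and once admitting only items that a predictor expects to be requested again. Everything in it is an assumption made for illustration: the `LRUCache` class, the `make_trace` generator, the cache size, and especially the predictor, which is a hand-written rule standing in for a trained model and is not the approach described above.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        """Return True on a hit and mark the key as recently used."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        return False

    def put(self, key):
        """Insert a key, evicting the oldest entry if the cache is full."""
        self.items[key] = True
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)


def hit_rate(trace, capacity, will_be_requested_again):
    """Replay a trace; on a miss, admit the item only if the predictor says so."""
    cache = LRUCache(capacity)
    hits = 0
    for key in trace:
        if cache.get(key):
            hits += 1
        elif will_be_requested_again(key):
            cache.put(key)
    return hits / len(trace)


def make_trace(rounds=50, popular=3, one_offs_per_round=5):
    """Synthetic trace: a few popular items mixed with items requested only once."""
    trace = []
    next_one_off = 0
    for _ in range(rounds):
        for p in range(popular):
            trace.append(f"popular-{p}")
        for _ in range(one_offs_per_round):
            trace.append(f"one-off-{next_one_off}")
            next_one_off += 1
    return trace


trace = make_trace()

# Baseline: admit every missed item.
admit_all = hit_rate(trace, capacity=4, will_be_requested_again=lambda key: True)

# Stand-in predictor (hypothetical): a real system would use a trained model here.
predicted = hit_rate(
    trace, capacity=4, will_be_requested_again=lambda key: key.startswith("popular")
)

print(f"admit everything: {admit_all:.2%}")
print(f"admit predicted re-requests only: {predicted:.2%}")
```

On this trace the admit-everything policy keeps letting one-off items push the popular ones out of the cache, while the selective policy keeps the popular items resident. The size of the gap depends entirely on the synthetic trace and says nothing about real CDN traffic.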

Our infrastructure data scientists also teamed up with engineers to build a suite of tools to detect performance degradation in Facebook's family of apps. These issues are often hard to catch because, with billions of people using our apps on billions of devices, it is very hard to tease out real performance issues from changes that are instead linked to underlying trends in device usage or demographics. To solve this problem, infrastructure data scientists collaborated with mobile engineers to develop tools that isolate such effects and provide “controlled” comparisons. With this approach, mobile engineers are able to detect performance issues that wouldn't have surfaced if we were simply tracking overall metrics such as averages or percentiles.
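A small worked example shows how a shift in device mix can hide a real regression in an overall average. The Python sketch below uses invented numbers and hypothetical helper names (`overall_mean`, `mix_adjusted_mean`); it illustrates one simple form of controlled comparison, reweighting to a fixed device mix, and is not a description of the team's tools.

```python
def overall_mean(segments):
    """Session-weighted mean latency across all device segments.

    segments maps a segment name to (number_of_sessions, mean_latency_ms).
    """
    total = sum(n for n, _ in segments.values())
    return sum(n * latency for n, latency in segments.values()) / total


def mix_adjusted_mean(segments, reference):
    """Mean latency of `segments`, reweighted to the device mix of `reference`."""
    total = sum(n for n, _ in reference.values())
    return sum(reference[name][0] * segments[name][1] for name in reference) / total


# Invented data: every segment gets 10% slower, but usage shifts to faster devices.
before = {"high_end": (600, 200.0), "low_end": (400, 500.0)}
after = {"high_end": (800, 220.0), "low_end": (200, 550.0)}

raw_before = overall_mean(before)
raw_after = overall_mean(after)
controlled_after = mix_adjusted_mean(after, reference=before)

print(f"overall average:   {raw_before:.0f} ms -> {raw_after:.0f} ms")
print(f"fixed device mix:  {raw_before:.0f} ms -> {controlled_after:.0f} ms")
```

The overall average appears to improve even though both segments regressed, because more sessions now come from faster devices. Holding the mix fixed at its earlier proportions exposes the slowdown.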

In addition to developing systems that detect client-side performance issues (commonly referred to as performance regressions), the team created analogous server-side tools. Detecting server-side performance regressions is challenging because our data sets have trend, seasonality, and noise components that make common threshold-based detection strategies ineffective. The data scientists worked with performance engineers to develop custom algorithms that can account for these confounding components in order to flag real regressions. Having effective regression detection systems is important because inefficient code compounds over time to adversely impact our scaling efforts and quality of user experience. Through these projects and others, our data scientists have directly contributed toward making our apps faster and more engaging for users.

Figure 1: Regression detection in performance data sets using a signal processing technique called a low-pass filter.
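As a rough illustration of the idea, the Python sketch below applies a moving average, one of the simplest low-pass filters, to a synthetic series with trend, seasonality, noise, and an injected step change. The window is set to one seasonal period so the seasonal component averages out, and a regression is flagged when the smoothed level rises by more than a threshold from one period to the next. The filter choice, the threshold, the function names, and the synthetic data are all assumptions; the custom algorithms described above are not specified in this post.

```python
import math
import random


def moving_average(values, window):
    """Trailing moving average, a simple low-pass filter.

    Output element j is the mean of values[j : j + window].
    """
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        out.append(running / window)
    return out


def detect_shifts(values, period, threshold):
    """Return indices where the smoothed level rose by more than `threshold`
    compared with the preceding, non-overlapping window of one period."""
    smooth = moving_average(values, period)
    flagged = []
    for j in range(period, len(smooth)):
        if smooth[j] - smooth[j - period] > threshold:
            flagged.append(j + period - 1)  # index of the newest sample in the window
    return flagged


# Synthetic metric (invented): slow upward trend, daily cycle, noise, and a
# step regression of +5 units injected at CHANGE_AT.
PERIOD = 24
CHANGE_AT = 300
rng = random.Random(0)
series = [
    100
    + 0.01 * t
    + 10 * math.sin(2 * math.pi * t / PERIOD)
    + rng.gauss(0, 1)
    + (5 if t >= CHANGE_AT else 0)
    for t in range(PERIOD * 20)
]

flagged = detect_shifts(series, PERIOD, threshold=3.0)

# For contrast: a static threshold on the raw series trips on the seasonal peak.
naive_first = next(t for t, v in enumerate(series) if v > 108)

print("first sample above static threshold:", naive_first)
print("first sample flagged after filtering:", flagged[0])
```

The static threshold fires on an ordinary seasonal peak long before the injected change, while the filtered comparison only reacts once the underlying level has moved. A trailing window also introduces some detection delay, which is one of the trade-offs a production system would have to tune.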

Data scientists also play an important role in determining our data center strategy. Adding or expanding these centers is a complex, important long-term decision, and it always includes input from the data science team. Our data sets have some unique properties, such as complex seasonal fluctuations and usage trends that vary greatly by device type and location, that make long-term forecasting quite challenging. Conventional time series forecasting techniques such as ARIMA don't work well in this context, so the team developed a custom suite of forecasting models, including Bayesian time series models and deep learning models, to generate more accurate predictions. The resulting forecasts have proved accurate, and several teams rely on them for strategic planning. In addition to providing long-term forecasts, the team is refining techniques to predict the impact of infrequent events on our infrastructure. This is especially important because different events affect our systems in different ways. Halloween, for example, spurs a high volume of photo uploads, while the end of a major football game often prompts a huge surge in status updates. These are completely different types of traffic to our data centers. Understanding the impact of these events and predicting their scale is critical to ensuring a smooth, reliable experience for our users.

Figure 2: Activity for a subset of users displaying clear trend and seasonal effects.
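To make the trend and seasonal components concrete, here is a toy additive forecast in Python. It estimates the slope from same-phase differences one period apart, averages the detrended values for each phase of the cycle, and extrapolates both. This is a deliberately simple sketch on noise-free invented data, with hypothetical function names; it is not one of the Bayesian or deep learning models mentioned above, and real usage data is far less regular.

```python
def fit_trend_seasonal(history, period):
    """Fit y[t] ~ slope * t + level[t % period].

    Assumes len(history) is at least two full periods.
    """
    # Same-phase differences one period apart cancel the seasonal term,
    # leaving slope * period.
    diffs = [history[t + period] - history[t] for t in range(len(history) - period)]
    slope = sum(diffs) / len(diffs) / period

    # Average the detrended values separately for each phase of the cycle.
    sums = [0.0] * period
    counts = [0] * period
    for t, y in enumerate(history):
        sums[t % period] += y - slope * t
        counts[t % period] += 1
    levels = [s / c for s, c in zip(sums, counts)]
    return slope, levels


def forecast(slope, levels, start, horizon):
    """Extrapolate the fitted trend and seasonal levels for `horizon` steps."""
    period = len(levels)
    return [slope * t + levels[t % period] for t in range(start, start + horizon)]


# Invented weekly pattern on top of steady growth.
WEEKDAY_EFFECT = [0.0, 2.0, 3.0, 3.0, 2.0, -4.0, -6.0]


def true_activity(t):
    return 50.0 + 0.5 * t + WEEKDAY_EFFECT[t % 7]


history = [true_activity(t) for t in range(56)]  # eight weeks of daily data
slope, levels = fit_trend_seasonal(history, period=7)
predicted = forecast(slope, levels, start=56, horizon=14)
actual = [true_activity(t) for t in range(56, 70)]

max_error = max(abs(p - a) for p, a in zip(predicted, actual))
print(f"estimated slope per day: {slope}")
print(f"largest forecast error over two weeks: {max_error}")
```

Because the synthetic series follows the additive model exactly, the forecast recovers it. Seasonal patterns that change over time, differ by device type and location, or are interrupted by one-off events break these assumptions, which is the kind of difficulty that motivates the more sophisticated models described above.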

Setting the stage for impactful data science

Effective collaboration requires deep mutual understanding and a foundation of shared experience among all team members. For data scientists, this process starts when new employees join Facebook. The company onboards infra data scientists and engineers through the Bootcamp program, which provides broad exposure to our engineering systems in a supportive learning environment. We also encourage our partner engineering teams to identify mentors to guide new data scientists as they ramp up on their first projects. New data scientists also receive mentoring on the most effective ways to communicate the results of their complex analyses.

Organizationally, data science teams are structured to support broad infrastructure teams such as Network Engineering, Storage Infra, and Release Management. Some data scientists prefer to develop deep domain expertise in their application area and spend several years embedded with their partner team, while others prefer to move across partner teams every 12 to 18 months in order to develop a broad understanding of how modern cloud systems are designed and run. We encourage both approaches.

Adding skills and building for the future

Because data science is a rapidly evolving discipline, we make sure our data scientists have opportunities to learn and master state-of-the-art skills. In addition to internal training sessions and chalk talks, we bring in external speakers to cover important developments in the field. Many of our data scientists are closely connected to the academic community and attend and present at major conferences such as INFORMS, KDD, and NIPS, which has helped produce a range of innovative collaborations.

As fields such as artificial intelligence and virtual and augmented reality become increasingly important to Facebook's services, there will be new opportunities to leverage data science. Our model of close, extended collaboration between data scientists and infra teams will ensure that Facebook is well-positioned to address these and other challenges to come.


Facebook © 2018