Introducing Big Basin: Our next-generation AI hardware

At Facebook, we use artificial intelligence (AI) to power services like speech and text translations, photo classifiers, and real-time video classification. We are committed to advancing the field of AI and its disciplines, and we believe that closer integration of software and hardware will help us do so. To accelerate our progress as we train larger and deeper neural networks, we created Big Basin, our next-generation GPU server, and we're open-sourcing the design through the Open Compute Project.

Big Basin is the successor to our Big Sur GPU server, which we announced in late 2015. Big Sur was the first widely deployed, high-performance AI compute platform in our data centers, and it continues to be used for training research. With Big Basin, we can train machine learning models that are 30 percent larger because of the availability of greater arithmetic throughput and a memory increase from 12 GB to 16 GB. This enables our researchers and engineers to move more quickly in developing increasingly complex AI models that aim to help Facebook further understand text, photos, and videos on our platforms and make better predictions based on this content.

Modular, more scalable design

Since the deployment of Big Sur in our data centers, we have gathered valuable feedback from our Applied Machine Learning (AML), Facebook AI Research (FAIR), and infrastructure teams around serviceability, reliability, performance, and cluster management. We have incorporated these learnings into the Big Basin design.

Modularity was a major design focus when developing Big Basin. We designed the system to allow for the disaggregation of the CPU compute from the GPUs, which enables us to leverage and connect existing OCP components and integrate new technology when necessary. For the Big Basin deployment, we are connecting our Facebook-designed, third-generation compute server as a separate building block from the Big Basin unit, enabling us to scale each component independently.

Our new Tioga Pass server platform will also be compatible with Big Basin. Pairing the two will increase PCIe bandwidth and performance between the GPUs and the CPU head node. The GPU tray in the Big Basin system can be swapped out for future upgrades and changes to the accelerators and interconnects. The connection between the head node and GPUs is managed using external PCIe cables, similar to Facebook's Lightning NVMe storage hardware. In this case, Big Basin behaves like a JBOD or, as we call it, a JBOG: “just a bunch of GPUs.”

With a disaggregated design, we are able to further improve the serviceability and thermal efficiency of the system. Big Basin is split into three main sections: the accelerator tray, the inner chassis, and the outer chassis. With a fully toolless design, the GPUs can be serviced in-rack using integrated sliding rails. By utilizing existing Facebook infrastructure tooling and hardware components, our technicians are able to further reduce the operational complexity and downtime during repairs. Furthermore, due to the disaggregated design, the GPUs are now positioned directly in front of the cool air being drawn into the system, removing preheat from other components and improving the overall thermal efficiency of Big Basin.

Higher performance and more-capable GPUs

Built in collaboration with our ODM partner QCT (Quanta Cloud Technology), the current Big Basin system features eight NVIDIA Tesla P100 GPU accelerators. These GPUs are connected using NVIDIA NVLink to form an eight-GPU hybrid cube mesh, similar to the architecture used by NVIDIA's DGX-1 system. Combined with the NVIDIA Deep Learning SDK, this architecture and its interconnects improve deep learning training across all GPUs.
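To make the hybrid cube mesh idea more concrete, here is a minimal Python sketch of one such topology. The link layout is an assumption modeled on the publicly described P100-based DGX-1 arrangement (two fully connected groups of four GPUs, with each GPU also linked to its counterpart in the other group); this post does not list Big Basin's actual links, and the GPU numbering and function names are purely illustrative.

```python
from collections import deque
from itertools import combinations

NUM_GPUS = 8


def hybrid_cube_mesh_links():
    """Return undirected GPU-to-GPU links for an ASSUMED hybrid cube mesh.

    Assumption (not from the Big Basin spec): GPUs 0-3 and 4-7 each form
    a fully connected quad, and GPU i is also linked to GPU i + 4.
    """
    links = set()
    for quad in (range(0, 4), range(4, 8)):
        links.update(combinations(quad, 2))  # 6 links inside each quad
    for gpu in range(4):
        links.add((gpu, gpu + 4))  # 4 links bridging the two quads
    return links


def build_adjacency(links):
    """Turn the link set into a neighbor map keyed by GPU index."""
    adjacency = {gpu: set() for gpu in range(NUM_GPUS)}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def max_hops(adjacency):
    """Breadth-first search from every GPU; return the worst-case hop count."""
    worst = 0
    for start in adjacency:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        worst = max(worst, max(distance.values()))
    return worst


LINKS = hybrid_cube_mesh_links()
ADJACENCY = build_adjacency(LINKS)

if __name__ == "__main__":
    print("total links:", len(LINKS))
    print("links per GPU:", sorted(len(n) for n in ADJACENCY.values()))
    print("worst-case hops between GPUs:", max_hops(ADJACENCY))
```

Under this assumed layout, every GPU uses four links and any GPU can reach any other in at most two hops, which is the property that makes this kind of mesh attractive for exchanging gradients across all eight accelerators.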

Compared with Big Sur, Big Basin delivers a substantial gain in performance per watt, driven by single-precision floating-point throughput per GPU rising from 7 teraflops to 10.6 teraflops. The new architecture also introduces half-precision arithmetic to further improve throughput.

Larger models and faster training times

As mentioned above, Big Basin can train models that are 30 percent larger because of the availability of greater arithmetic throughput and a memory increase from 12 GB to 16 GB. In tests with the popular image classification model ResNet-50, we were able to reach almost 100 percent improvement in throughput compared with Big Sur, allowing us to experiment faster and work with more complex models than before.
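As a quick sanity check on the figures quoted in this post, the short Python sketch below computes the relative increases in single-precision throughput and memory. The dictionary keys and the helper name are illustrative; only the four input numbers come from the text, and the calculation says nothing about how the 30 percent model-size figure was actually measured.

```python
# Figures quoted in this post; the key names are illustrative.
BIG_SUR = {"fp32_tflops": 7.0, "memory_gb": 12}
BIG_BASIN = {"fp32_tflops": 10.6, "memory_gb": 16}


def percent_increase(old, new):
    """Relative increase from old to new, expressed as a percentage."""
    return (new - old) / old * 100.0


fp32_gain = percent_increase(BIG_SUR["fp32_tflops"], BIG_BASIN["fp32_tflops"])
memory_gain = percent_increase(BIG_SUR["memory_gb"], BIG_BASIN["memory_gb"])

if __name__ == "__main__":
    print(f"single-precision throughput: +{fp32_gain:.1f}%")
    print(f"memory: +{memory_gain:.1f}%")
```

The memory increase works out to roughly a third, which is in the same range as the 30 percent larger models mentioned above, while the peak single-precision figure rises by about half.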

We believe open collaboration helps foster innovation for future designs, putting us all one step closer to building complex AI systems that will ultimately help us build a more open and connected world. The design specifications for Big Basin are publicly available through the Open Compute Project, and a comprehensive set of hardware design files will be released shortly.