Highlights:
- Meta says it will use the additional clusters to train newer, more powerful AI systems, including Llama 3, the anticipated successor to Llama 2, and to improve its existing models.
- The two clusters have distinct network designs, despite each connecting the same number of GPUs through 400 gigabit-per-second (Gbps) endpoints.
Meta Platforms Inc., commonly known as Meta, has launched two clusters of enormously powerful graphics processing units (GPUs) that it says will help train the next generation of generative artificial intelligence models, including the forthcoming Llama 3.
According to Meta engineers Kevin Lee, Adi Gangidi and Mathew Oldham, the two 24,576-GPU data center-scale clusters were built to accommodate far larger and more sophisticated generative AI models than those previously released, such as the well-known open-source model Llama 2, which competes with OpenAI’s ChatGPT and Google LLC’s Gemini. The engineers said the clusters will also support ongoing AI research and development.
Each cluster is packed with thousands of Nvidia Corp.’s most powerful H100 GPUs, making it significantly larger than the company’s previous large clusters, which held about 16,000 of Nvidia’s older A100 GPUs.
The company has reportedly been busy buying up thousands of Nvidia’s latest chips, and a recent report claimed it has become one of the chipmaker’s biggest customers.
Meta says it will use the additional clusters to train newer, more powerful AI systems, including Llama 3, the anticipated successor to Llama 2, and to improve its existing models. Although it was widely assumed that Meta was working on Llama 3, the blog post is the first time the company has officially confirmed it. The engineers said Llama 3 is still under development but gave no indication of when it might be announced.
Longer term, Meta wants to develop artificial general intelligence (AGI) systems that would be far more human-like in their creativity than current generative AI models, and the blog post said the new clusters will help scale toward those ambitions. Meta also disclosed that it is evolving its PyTorch AI framework to support much higher GPU counts.
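The post did not detail those PyTorch changes, but the entry point for running training across many GPUs is PyTorch’s built-in distributed package. The sketch below is a minimal, hypothetical example rather than Meta’s training code: it shows how each process in a multi-GPU job joins an NCCL-backed process group, reading the RANK, WORLD_SIZE and LOCAL_RANK variables that standard launchers such as torchrun provide.

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> None:
    """Join the job-wide process group. RANK, WORLD_SIZE and LOCAL_RANK
    are set for every process by the launcher (e.g. torchrun), which
    also sets MASTER_ADDR/MASTER_PORT for the rendezvous."""
    rank = int(os.environ["RANK"])              # global index of this process
    world_size = int(os.environ["WORLD_SIZE"])  # total processes (one per GPU)
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this host

    torch.cuda.set_device(local_rank)  # bind this process to one GPU
    # "nccl" selects Nvidia's collective library as the backend.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    if rank == 0:
        print(f"process group ready across {world_size} GPUs")


if __name__ == "__main__":
    init_distributed()
    dist.destroy_process_group()
```

Launched with something like `torchrun --nproc_per_node=8` on each host, the same script runs unchanged on one machine or on thousands; the fabric underneath is invisible at this layer.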
Beneath the Surface
The two clusters have distinct network designs, despite each connecting the same number of GPUs through 400 gigabit-per-second (Gbps) endpoints. One provides remote direct memory access (RDMA) over a converged Ethernet fabric, built on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches. The other is built with Nvidia’s proprietary Quantum2 InfiniBand network fabric technology.
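A practical consequence of running two different fabrics is that the choice is largely invisible to training code: NCCL picks a transport when a job starts. The snippet below is an illustrative sketch, not Meta’s configuration; the environment variables are real NCCL knobs, but the values shown (and the interface name) are assumptions for the example.

```python
import os

# NCCL reads these at initialization, so they must be set before the
# first collective call. Values here are illustrative only.
os.environ["NCCL_DEBUG"] = "INFO"              # log transport selection at startup
os.environ.setdefault("NCCL_IB_DISABLE", "0")  # "1" would disable the verbs
                                               # transport used for both
                                               # InfiniBand and RoCE
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # hypothetical NIC for
                                                     # bootstrap traffic

# With NCCL_DEBUG=INFO, NCCL reports at startup whether it chose a
# verbs-based network (InfiniBand/RoCE) or fell back to plain sockets,
# a quick way to confirm which fabric a job is actually using.
```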
Both clusters were built with Grand Teton, Meta’s open GPU hardware platform designed for demanding AI workloads. Grand Teton is said to offer twice the compute and data network bandwidth, twice the power envelope and four times the host-to-GPU bandwidth of its predecessor, the Zion-EX platform.
According to Meta, the clusters use its latest Open Rack power and rack infrastructure architecture, which is intended to give data center designers more freedom. The engineers say Open Rack v3 allows power shelves to be mounted anywhere inside the rack rather than bolted to the busbar, permitting more flexible configurations.
The number of servers per rack can also be adjusted, allowing throughput capacity to be distributed more effectively across the servers. Meta says this has let it reduce the total rack count somewhat.
For storage, the clusters use the Linux Filesystem in Userspace (FUSE) application programming interface, backed by Meta’s distributed storage system, Tectonic. Meta also partnered with the startup Hammerspace Inc. to develop a new parallel network file system for the clusters.
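FUSE is what makes that arrangement practical: the file system logic runs as an ordinary user-space program rather than kernel code, so a distributed store such as Tectonic can sit behind a standard POSIX file interface. Below is a minimal, hypothetical read-only example using the third-party fusepy bindings; it serves one in-memory file and illustrates the general FUSE pattern, not Meta’s Tectonic client.

```python
import errno
import stat
import sys

from fuse import FUSE, FuseOSError, Operations  # fusepy bindings

# Stand-in for a remote backing store such as a distributed file system.
FILES = {"/hello.txt": b"hello from user space\n"}


class TinyFS(Operations):
    """A read-only file system serving a single in-memory file."""

    def getattr(self, path, fh=None):
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
        if path in FILES:
            return {"st_mode": stat.S_IFREG | 0o444,
                    "st_nlink": 1, "st_size": len(FILES[path])}
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", ".."] + [name.lstrip("/") for name in FILES]

    def read(self, path, size, offset, fh):
        # In a distributed design, this is where the network fetch would go.
        return FILES[path][offset:offset + size]


if __name__ == "__main__":
    # Usage: python tinyfs.py /some/mount/point
    FUSE(TinyFS(), sys.argv[1], foreground=True, ro=True)
```

Once mounted, ordinary tools such as ls and cat work against the user-space code unchanged, which is the property that makes a FUSE front end attractive for a distributed backend.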
Finally, the engineers said the clusters are built on the YV3 Sierra Point server platform with the company’s most advanced high-capacity E1.S solid-state drives. The team also reported tuning the clusters’ network topology and routing architecture, and deploying Nvidia’s Collective Communications Library (NCCL), a set of communication routines optimized for the company’s GPUs.
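Those NCCL routines are what synchronize work across GPUs during training. The fragment below is a toy sketch, not anything from Meta’s stack: it performs the single most common collective, an all-reduce that sums a tensor across every GPU in an already-initialized NCCL process group (such as the one set up in the earlier sketch).

```python
import torch
import torch.distributed as dist


def allreduce_demo() -> None:
    """Sum one tensor across all ranks; afterwards every GPU holds the
    same result. Assumes init_process_group("nccl") has already run."""
    rank = dist.get_rank()
    device = torch.device("cuda", torch.cuda.current_device())

    # Each rank contributes its own value...
    t = torch.full((4,), float(rank), device=device)
    # ...and NCCL combines them in place over the cluster fabric.
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # With N ranks, every element now equals 0 + 1 + ... + (N - 1).
    print(f"rank {rank}: {t.tolist()}")
```

In real training runs this primitive is applied to multi-gigabyte gradient buffers thousands of times per job, which is why the 400 Gbps endpoints and the topology and routing changes described above matter.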
More GPUs on the Way
Meta said it remains fully committed to open innovation in its AI hardware stack, noting that it is a member of the recently formed AI Alliance, which aims to create an open ecosystem that improves transparency and trust in AI research and ensures everyone can benefit from its advances.
The engineers wrote: “As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow’s needs. That’s why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond.”
Additionally, Meta disclosed that it plans to acquire more than 350,000 H100 GPUs from Nvidia by the end of the year, and that it will keep adding to its inventory. It will use them to continue expanding its AI infrastructure, so even more powerful GPU clusters should arrive in the not-too-distant future.