Argonne Aurora A21: All’s well that ends well, better



With many of the high-performance computing systems we have seen over the decades, we like to say that the hardware is the easy part. That isn’t universally true, and it certainly hasn’t been true for the “Aurora” supercomputer at Argonne National Laboratory, the second exascale system in the United States to actually be installed.

Last Thursday, after heaven only knows how many days of work unpacking the blade servers – each consisting of a pair of “Sapphire Rapids” Max Series CPUs with HBM2e main memory and six “Ponte Vecchio” Max Series GPU compute engines – all 10,624 of the blades going into the 2 exaflops Aurora A21 system were finally and completely installed. That was an even tougher job than it sounds. (Our backs hurt just thinking about putting 10,624 blades, each weighing over 70 pounds, into the racks.... And that is also a mountain of cardboard to crush and recycle.)
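For a sense of scale, here is the simple tally implied by those blade counts. The component counts come straight from the paragraph above; the per-blade weight is the rough 70-pound figure we cited, so treat the total tonnage as a loose estimate rather than a spec.

```python
# Rough tally of Aurora A21 compute parts, using the counts cited in this story.
blades = 10_624
cpus_per_blade = 2   # "Sapphire Rapids" Max Series CPUs with HBM2e per blade
gpus_per_blade = 6   # "Ponte Vecchio" Max Series GPUs per blade

print(blades * cpus_per_blade)   # 21,248 CPUs
print(blades * gpus_per_blade)   # 63,744 GPUs

# Assuming the rough 70 pounds per blade mentioned above:
print(blades * 70 / 2_000)       # ~372 US tons of blades hoisted into racks
```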

Aurora final blade installation

The Aurora saga began even before Argonne announced the plan for the original 180 petaflops pre-exascale machine, which we named Aurora A18, back in April 2015. That machine was to be based on Intel “Knights Hill” many-core processors and InfiniBand-like Omni-Path 200 interconnects, scalable across as many as 50,000+ compute nodes. There were many issues, particularly with the Intel 10 nanometer processes used to make the compute engines, but also with managing that level of concurrency across so many nodes and with supporting emerging AI workloads on such a machine. And so the original A18 machine architecture was killed in March 2018 and the Knights Hill processor in July 2018. Instead of getting a pre-exascale machine in 2018, Argonne was promised a machine with more than 1 exaflops of sustained double precision floating point compute in 2021, which was called Aurora A21, and it was not at all clear that it would be a hybrid CPU-GPU architecture because Intel had not yet announced its intention to enter the datacenter GPU compute arena and take on Nvidia and AMD. That plan was unveiled in March 2019, with Intel as prime contractor and Cray as system builder.

The Aurora A21 machine, finally installed here in 2023 after delays with the Ponte Vecchio GPUs and Sapphire Rapids CPUs, is a much better machine than Argonne was originally going to get, and about as good a machine as could be expected for this year given the state of CPUs and GPUs here in 2023. It may have taken a lot more nodes to break through that 2 exaflops peak performance barrier than Intel or Argonne expected – we revealed last month that the Ponte Vecchio GPUs are geared down to 31.5 teraflops of peak FP64 performance, significantly below the device’s claimed peak of 52 teraflops – but the resulting Aurora machine is a compute and bandwidth beast all the same. And it looks like Argonne got a decent deal on it, too – which was entirely justified given all the delays. Had it been possible to run the Ponte Vecchio GPUs at full speed, Aurora A21 would have been rated at a peak of over 3.3 exaflops, which probably would have made it the undisputed champion of the supercomputing arena for a few years.
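To see where the 2 exaflops and 3.3+ exaflops figures come from, here is the back-of-the-envelope math, counting only the GPU contribution (the Sapphire Rapids CPUs add a bit more on top, which we are ignoring here):

```python
# Peak FP64 math for Aurora A21, using the GPU counts and per-GPU rates cited in this story.
gpus = 10_624 * 6        # 63,744 Ponte Vecchio GPUs across all blades
tuned_tflops = 31.5      # per-GPU peak FP64 as geared down in Aurora
claimed_tflops = 52.0    # per-GPU peak FP64 as originally claimed by Intel

print(f"{gpus * tuned_tflops / 1e6:.3f} exaflops as built")         # ~2.008 exaflops
print(f"{gpus * claimed_tflops / 1e6:.2f} exaflops at full speed")  # ~3.31 exaflops
```

That GPU-only figure lands just a hair over 2 exaflops, which squares with the 2.007 exaflops bar for El Capitan mentioned below.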

Take a bow, but probably a little slowly after all of that heavy lifting.

As it stands, it looks like the “El Capitan” machine at Lawrence Livermore National Laboratory, which is based on AMD’s hybrid CPU-GPU devices, dubbed the Instinct MI300A, will also peak above 2 exaflops. How far above 2 exaflops is unclear, but El Capitan only needs to beat 2.007 exaflops to beat Aurora A21. With Lawrence Livermore playing second fiddle in recent years to the supercomputing performance of whatever machine Oak Ridge National Laboratory installed at about the same time, you can bet Lawrence Livermore is out to beat Argonne in the November 2023 Top500 supercomputer rankings. We think it could be by 10 percent or more if the economy cooperates.

Argonne’s researchers and scientists just want to get some science done, we imagine, and are looking forward to having a big beast to run their applications on.

All of those compute nodes are housed in 166 of Hewlett Packard Enterprise’s “Shasta” Cray EX cabinets – HPE bought Cray in May 2019 for $1.3 billion – with 64 compute blades per cabinet, in a total of eight rows of cabinets spread across an area the size of two basketball courts. In addition to the compute nodes, Aurora A21 has 1,024 storage nodes running Intel’s Distributed Asynchronous Object Storage, or DAOS, with 220 PB of capacity and 31 TB/sec of bandwidth riding on the Slingshot interconnect itself. The compute nodes are also linked by Slingshot, and HPE/Cray has moved to a converged compute and storage network in the pre-exascale and exascale machines it is building in the United States and Europe; there used to be one network for compute and another for storage.
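Here is a similarly simple consistency check on the cabinet and storage counts. The per-node numbers are plain averages for illustration only, not a statement about how DAOS actually distributes data across those nodes:

```python
# Cabinet and DAOS storage arithmetic for Aurora A21, using the counts cited above.
cabinets = 166
blades_per_cabinet = 64
print(cabinets * blades_per_cabinet)   # 10,624 compute blades, matching the installation total

daos_nodes = 1_024
capacity_pb = 220      # total DAOS capacity in PB
bandwidth_tb_s = 31    # total DAOS bandwidth in TB/sec over Slingshot

print(f"{capacity_pb / daos_nodes * 1_000:.0f} TB of capacity per storage node, on average")         # ~215 TB
print(f"{bandwidth_tb_s / daos_nodes * 1_000:.1f} GB/sec of bandwidth per storage node, on average") # ~30.3 GB/sec
```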

We are excited to see how Aurora A21 performs, and there is no reason it couldn’t hold its own against El Capitan or Frontier in terms of real-world performance. And if it does, there is hope for Intel’s revised HPC and GPU ambitions, which seem a bit more practical today than they did a few years ago.

Intel has killed off the next generation “Rialto Bridge” GPU and has pushed its own hybrid CPU-GPU devices out past 2025, and it has also killed off its “Gaudi” family of matrix math engines in favor of a re-envisioned “Falcon Shores” discrete GPU that picks up the integrated Ethernet interconnect and the matrix math units of the Gaudi line. It remains to be seen if this is all enough to get the next system deal at Argonne, which must be pretty miffed at Intel given all the changes and delays. Five years and three different architectures is a bitter little pill to swallow.

Argonne and Lawrence Livermore were both early enthusiasts of IBM’s BlueGene line of massively parallel machines, and Argonne was a strong supporter of the Knights compute engines and Omni-Path interconnects that Intel built as a kind of off-the-shelf replacement for the BlueGene machines. But now Argonne has been burned three times – once by IBM and twice by Intel – and may not want to be the sacrificial lamb that the U.S. Department of Energy needs so it can keep two or three different architectural options in the pipeline at any given time. The DOE hasn’t tapped Nvidia to provide compute engines for any of its current exascale systems, but that is always an option for future machines, which will likely weigh in at between 5 exaflops and 10 exaflops in the 2028 to 2030 time frame.

That still seems a long way off, but given how hard it is to squeeze performance – and cost – out of these massive machines, it really isn’t that far off. Argonne is no doubt thinking through its options, and Intel is no doubt trying to show that whatever follows the Falcon Shores GPU – perhaps an integrated CPU-GPU device like the AMD one in Lawrence Livermore’s El Capitan – will fill the role and fit the bill.

All we know for sure is that it is really Argonne’s job to push the boundaries, even if it gets cut a bit from time to time in doing so. That is what supercomputer centers do and have always done. It’s in the job description.

On the other hand, Argonne could take a sharp left turn, be the first US national lab to say to hell with it all, and move its applications to Microsoft Azure’s North Central US region outside of Chicago, running on hybrid CPU-GPU machines linked by InfiniBand or gussied-up Ethernet.

Stranger things have obviously happened with Argonne, you have to admit.

