VAST Data has been awarded a contract by the Texas Advanced Computing Center (TACC) for its single-tier, NFS-based flash storage system, winning out over traditional disk-based parallel HPC file systems such as Lustre.
Based at the University of Texas at Austin, TACC is building Stampede3, an open access supercomputer powered by Intel Max Series CPU and GPU nodes capable of delivering 10 petaFLOPS of performance. The hardware was funded by a $10 million grant from the National Science Foundation (NSF) and will be used by the scientific supercomputing research community.
Jeff Denworth, co-founder and CMO of VAST Data, said in a blog post, “TACC has selected VAST as the data platform for Stampede3, their next-generation large-scale research computer that will power applications for more than 10,000 US scientists and engineers. The purchase of Stampede precedes selection for their next really big system, which will likely be announced later this year.”
With 1,858 compute nodes, more than 140,000 cores, over 330 terabytes of RAM, and 13 petabytes of VAST Data storage, Stampede3 is poised to significantly increase the computing power available to the scientific community. The VAST flash system offers both scratch and nearline storage, replacing a DDN Lustre disk-based system in the earlier Stampede2.
The VAST storage provides 450 GBps of read bandwidth and serves as a combined scratch and nearline storage tier. After considering several storage options, including Lustre, DAOS, BeeGFS, and Weka, TACC chose VAST Data for its ability to handle expected AI/ML workloads that require fast random reads. Thanks to QLC NAND and data reduction at a ratio of 2:1, TACC found the cost of VAST’s flash affordable compared with traditional disk storage.
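A rough back-of-the-envelope sketch of what these figures imply follows. It uses only the numbers quoted above; the function names are illustrative, and it assumes (the article does not say) that the 13 PB figure is capacity before data reduction and that read bandwidth is split evenly across all nodes.

```python
# Figures quoted in the article.
READ_BANDWIDTH_GBPS = 450   # aggregate read bandwidth, GB/s
COMPUTE_NODES = 1_858       # Stampede3 compute nodes
CAPACITY_PB = 13            # VAST capacity, PB
REDUCTION_RATIO = 2.0       # 2:1 data reduction


def per_node_read_share(total_gbps: float, nodes: int) -> float:
    """Even share of aggregate read bandwidth if every node reads at once (GB/s)."""
    return total_gbps / nodes


def logical_capacity(physical_pb: float, ratio: float) -> float:
    """Logical data that fits, ASSUMING physical_pb is pre-reduction capacity (PB)."""
    return physical_pb * ratio


def cost_factor_per_logical_byte(ratio: float) -> float:
    """Cost per logical byte as a fraction of the cost per physical byte."""
    return 1.0 / ratio


if __name__ == "__main__":
    share = per_node_read_share(READ_BANDWIDTH_GBPS, COMPUTE_NODES)
    print(f"Per-node read share: {share:.3f} GB/s")
    print(f"Logical capacity: {logical_capacity(CAPACITY_PB, REDUCTION_RATIO):.0f} PB")
    print(f"Cost factor: {cost_factor_per_logical_byte(REDUCTION_RATIO):.2f}")
```

Under these assumptions, a 2:1 reduction ratio halves the effective cost per stored byte, which is the mechanism behind the affordability claim.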
Stampede3 is a hybrid or heterogeneous setup with several subsystems:
- High-end simulation: 560 Xeon Max Series CPUs (about 63,000 cores); these are Sapphire Rapids Gen 4 Xeon SPs with High Bandwidth Memory (HBM),
- AI/ML and graphics: 40 Max Series GPUs (formerly code-named Ponte Vecchio) in 10 Dell PowerEdge XE9640 servers, each with 128 GB HBM2e RAM,
- High-memory computation: 224 Gen 3 Xeon SP nodes carried over from the earlier Stampede2 system,
- Legacy and interactive computing: more than 1,000 Gen 2 Xeon SP nodes from Stampede2.
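As a quick consistency check, the subsystem counts can be compared with the 1,858-node total quoted earlier. This is only a sketch: it assumes the 560 figure counts nodes and that each of the 10 GPU servers counts as one node, neither of which the article states explicitly. The dictionary keys are illustrative labels.

```python
# Total compute nodes quoted for Stampede3.
TOTAL_NODES = 1_858

# Subsystem sizes from the list above (ASSUMED to be node counts).
KNOWN_SUBSYSTEMS = {
    "xeon_max": 560,      # high-end simulation
    "gpu_servers": 10,    # AI/ML and graphics (Dell PowerEdge XE9640)
    "gen3_xeon": 224,     # high-memory computation
}


def implied_legacy_nodes(total: int, known: dict[str, int]) -> int:
    """Nodes left over for the legacy/interactive subsystem."""
    return total - sum(known.values())


if __name__ == "__main__":
    legacy = implied_legacy_nodes(TOTAL_NODES, KNOWN_SUBSYSTEMS)
    print(f"Implied legacy nodes: {legacy}")
```

Under those assumptions the remainder is a little over 1,000 nodes, which agrees with the "more than 1,000" figure given for the legacy subsystem.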
Dan Stanzione, Executive Director of TACC, commented, “We believe the high-bandwidth memory of the Xeon Max CPU nodes will deliver better performance than any CPU our users have ever seen. They provide more than double the memory bandwidth performance per core over current 2nd and 3rd Gen Intel Xeon Scalable processor nodes on Stampede2.”
These processing systems and the storage facilities will be interconnected with a 400 Gbps Omni-Path Fabric network, providing a smooth transition for existing Stampede2 users as the system becomes Stampede3. The upgraded system is expected to operate from this year through 2029.
Stampede2 used a Lustre parallel file system running on 35 Seagate disk-based ClusterStor 300 arrays (Scalable Storage Units, or SSUs): a 33-SSU scratch system and a 2-SSU home system. These were supported by a 25 PB DDN Lustre back-end storage system called Stockyard, which serves the whole TACC site.
Lustre is used by more than half of the TOP500 supercomputing systems and is almost the standard supercomputing file system because it can serve read/write IO to thousands of compute nodes simultaneously. TACC chose VAST’s NFS over Lustre because VAST Data’s architecture provides parallel file system performance without the inherent complexity of a dedicated parallel file system. We’re told it also outperformed Lustre, handling an extremely IO-intensive nuclear physics workload faster and more efficiently.
On this workload, Lustre can only handle 350 client nodes before its metadata server becomes the bottleneck. TACC tested a VAST Data system and found that it also supported 350 client nodes on this workload, running 30 percent faster than the Lustre storage. It then connected the VAST storage to Frontera and scaled the client count through 500, 1,000, and 2,000 nodes up to 4,000, and the VAST storage continued to perform adequately.
Denworth noted that “20U hardware running VAST software can accommodate up to 50 racks of Dell servers.” During a VAST software upgrade, one of the VAST storage servers suffered a hardware failure, which left the overall storage running two versions of the VAST operating system. There was also a bug in the software installer. The storage kept working, minus one server and with two VAST OS versions, and the upgrade was completed once the installer bug and the faulty hardware had been fixed.
Denworth said: “Updates to the HPC file system are largely done offline (causing downtime) and it would be crazy to think about running in production with multiple versions of software on a common data platform.” There was no VAST system downtime during the outages, TACC said.
The VAST storage should be installed in September and Stampede3 should be up and running in production mode by March next year.
TACC also evaluated a flash file system software upgrade for Frontera, looking at Weka and VAST. Frontera will be replaced by the next flagship TACC supercomputer, the exascale-class Horizon, and VAST is now in the running for selection as a Horizon storage vendor.
Denworth commented, “Stampede3 will be the start of a great partnership between VAST and TACC, and we would like to thank them for their support and guidance as they chart a new path towards exascale AI and HPC.”
TACC has several supercomputer systems:
- The $60 million Dell-based Frontera, TACC’s flagship system, delivers 23.5 Linpack petaFLOPS from 8,008 Intel Xeon Cascade Lake-based nodes plus specialized subsystems. It ranks 21st in the TOP500 supercomputer list. It has 56 PB of capacity from 4 x disk-based DDN 18K EXAScaler storage arrays providing 300 GBps of bandwidth. There are also 72 x DDN IME flash servers with a capacity of 3 PB and a bandwidth of 1.5 TBps.
- The $30 million Dell-based Stampede2 offers 10.7 petaFLOPS using 4,200 Intel Knights Landing-based nodes and 1,736 Intel Xeon Skylake-based nodes. A capacity-class system at number 56 in the TOP500 list, it is the second-generation Stampede system and is being replaced by Stampede3. Stampede2 dates from 2017 and followed the original Stampede system from 2012.
- Lonestar5, for HPC and remote visualization jobs, delivers 301.8 teraFLOPS from >1,800 nodes and >22,000 cores, with a 12 PB Dell BeeGFS file storage system. It has now been replaced by Lonestar6.
- Wrangler is a smaller 62 teraFLOPS supercomputer for data-intensive work, like Hadoop, with 96 Intel Haswell nodes (24 cores and a minimum of 128 GB of DRAM per node), a 500 TB high-speed flash-based object storage system, and a 10 PB disk-based mass storage subsystem with a replicated location in Indiana.
- Stockyard2 is a global file system that provides a shared 10 PB DDN Lustre project workspace at 1 TB/user and 80 GBps of bandwidth.
- Ranch is a 100 PB tape archive that uses a Quantum tape library.
Stampede2 has been successful as an open science system, with more than 11,000 users working on more than 3,000 funded projects with more than 11 million simulations and data analysis tasks since its launch in 2017. This replicated the success of Stampede1, which ran more than 8 million simulations with more than 3 billion computing hours delivered to more than 13,000 users in more than 3,500 projects.
At some point in 2018, Stampede2 was equipped with 3D XPoint NVDIMMs as an experimental component in a small subset of the system.