Cavium Takes ARM to Petascale with Astra

The last few years have been unusually exciting for the data center, with Intel introducing Xeon Scalable and AMD re-entering the market with EPYC. On the ARM side, we’ve seen Qualcomm announce Centriq, Ampere talk about eMAG, and Cavium launch ThunderX2.

Among the available ARM options, it is Cavium that is gaining the most traction. ThunderX-based prototype systems have been used in various European efforts, including the Mont-Blanc project, the Isambard supercomputer from the University of Bristol, and a number of other projects. While the ARM ecosystem continues to improve, more significant investment is required. This is where Astra comes in. Astra is the fourth prototype system being built by Sandia National Laboratories as part of its Vanguard project, which is designed to deliver a future exascale ARM machine. It is part of a larger initiative designed to evaluate the viability of non-x86 architectures for HPC.

 

Sandia Roadmap (SNL)

All three prior prototypes, Hammer, Sullivan, and Mayer, were also ARM-based. Hammer was based on first-generation X-Gene by AppliedMicro while Sullivan was based on the original ThunderX processors. Mayer was built by HPE and Cavium last year. It consisted of 47 nodes using pre-production ThunderX2 parts.

Compute Node

Astra uses HPE’s Apollo 70 systems. These use a highly dense chassis architecture that fits in just 2U and consists of four dual-socket nodes.

Apollo 70 System (HPE)

Each node has two 1,600 W power supplies, a 1 Gbps Ethernet management port, and a Mellanox ConnectX-5 EDR link. Each node is a dual-socket design with two Cavium ThunderX2 processors, each with 28 cores operating at 2 GHz. Presumably, this is the ThunderX2 CN9975, but it could be an unannounced SKU.

We have recently discussed the ThunderX2 family. Those processors are based on the Vulcan microarchitecture and incorporate up to 32 cores. For Astra, Sandia is using 28-core parts operating at 2 GHz, likely due to a better performance/power efficiency design point. Each chip supports up to eight channels of DDR4 DIMMs with rates up to 2666 MT/s as well as 56 PCIe 3 lanes.

 
 
ThunderX2 Chip Overview (WikiChip)

The ThunderX2 CN9975 supports two-way multiprocessing. Communication between the two sockets is done over the second-generation Cavium Coherent Processor Interconnect (CCPI2), which provides 600 Gbps of aggregate bandwidth. For the Astra supercomputer, each socket uses one 8 GiB dual-rank DDR4-2666 DIMM per memory channel for a total of 64 GiB and 170.7 GB/s of aggregate memory bandwidth per socket. Each node has a single Mellanox ConnectX-5 VPI EDR InfiniBand card, designed for the Open Compute Project (OCP), providing the 100 Gb/s link.

Astra Node (WikiChip)
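
The per-socket and per-node memory figures quoted above follow from simple arithmetic. The sketch below reproduces them under two assumptions that the article does not state: each DDR4 channel has a 64-bit (8-byte) data bus, and DDR4-2666 runs at its nominal rate of 8000/3 MT/s. All constant and function names are illustrative.

```python
# Peak DDR4 bandwidth and capacity for one Astra socket and node.
# Assumptions (not stated in the article): a 64-bit (8-byte) data bus per
# channel, and the nominal DDR4-2666 rate of 8000/3 MT/s.

MEGATRANSFERS_PER_S = 8000 / 3   # DDR4-2666, nominally 2666.67 MT/s
BYTES_PER_TRANSFER = 8           # 64-bit channel (assumption)
CHANNELS_PER_SOCKET = 8          # from the article
SOCKETS_PER_NODE = 2             # from the article
DIMM_GIB = 8                     # one 8 GiB DIMM per channel


def channel_bandwidth_gbs() -> float:
    """Peak bandwidth of a single channel in GB/s (10^9 bytes per second)."""
    return MEGATRANSFERS_PER_S * 1e6 * BYTES_PER_TRANSFER / 1e9


socket_bw = CHANNELS_PER_SOCKET * channel_bandwidth_gbs()
node_bw = SOCKETS_PER_NODE * socket_bw
socket_mem_gib = CHANNELS_PER_SOCKET * DIMM_GIB
node_mem_gib = SOCKETS_PER_NODE * socket_mem_gib

print(f"channel: {channel_bandwidth_gbs():.2f} GB/s")
print(f"socket:  {socket_bw:.1f} GB/s, {socket_mem_gib} GiB")
print(f"node:    {node_bw:.2f} GB/s, {node_mem_gib} GiB")
```

Under these assumptions the script gives 21.33 GB/s per channel, 170.7 GB/s and 64 GiB per socket, and 341.33 GB/s and 128 GiB per node.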

Full Node Capabilities

With eight DIMMs per socket, each node has 128 GiB of memory feeding 56 cores with a total bandwidth of 341.33 GB/s per node. Those cores operate at up to 2 GHz, each with two 128-bit NEON units providing a peak theoretical performance of 8 double-precision FLOPS/cycle. This works out to 16 GFLOPS per core.

Full Node Capabilities

| | Socket | Node |
|---|---|---|
| Processors | 1 × CPU | 2 × CPU |
| Cores | 28 (112 threads) | 56 (224 threads) |
| FLOPS (SP) | 896 GFLOPS (28 × 32 GFLOPS) | 1,792 GFLOPS (2 × 28 × 32 GFLOPS) |
| FLOPS (DP) | 448 GFLOPS (28 × 16 GFLOPS) | 896 GFLOPS (2 × 28 × 16 GFLOPS) |
| Memory | 64 GiB DDR4 (8 × 8 GiB) | 128 GiB DDR4 (2 × 8 × 8 GiB) |
| Bandwidth | 170.7 GB/s (8 × 21.33 GB/s) | 341.33 GB/s (16 × 21.33 GB/s) |
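
The peak FLOPS figures can be derived from the vector width and clock rate. The sketch below is a rough model, not a description of the Vulcan pipeline: it assumes each 128-bit NEON unit can issue one fused multiply-add per lane per cycle, counted as 2 FLOPs, which is consistent with the 8 double-precision FLOPS/cycle quoted above. The function names are illustrative.

```python
# Rough peak-FLOPS model for a ThunderX2 core, socket, and node.
# Assumption: each NEON unit issues one fused multiply-add (2 FLOPs)
# per lane per cycle.

NEON_UNITS = 2        # per core, from the article
VECTOR_BITS = 128     # NEON register width
CLOCK_GHZ = 2.0       # from the article
CORES_PER_SOCKET = 28
SOCKETS_PER_NODE = 2


def flops_per_cycle(element_bits: int) -> int:
    """FLOPs per cycle per core for a given element width (32 or 64 bits)."""
    lanes = VECTOR_BITS // element_bits
    return NEON_UNITS * lanes * 2   # factor of 2: an FMA is a multiply plus an add


def peak_gflops(cores: int, element_bits: int) -> float:
    """Peak GFLOPS for `cores` cores running at CLOCK_GHZ."""
    return cores * CLOCK_GHZ * flops_per_cycle(element_bits)


for name, bits in (("DP", 64), ("SP", 32)):
    core = peak_gflops(1, bits)
    socket = peak_gflops(CORES_PER_SOCKET, bits)
    node = peak_gflops(CORES_PER_SOCKET * SOCKETS_PER_NODE, bits)
    print(f"{name}: core {core:.0f}, socket {socket:.0f}, node {node:.0f} GFLOPS")
```

These are theoretical peaks. Sustained performance depends on the workload, memory bandwidth, and the clock the cores actually hold.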

Compute Rack

The HPE Apollo 70 compute rack contains 18 chassis for a total of 72 compute nodes along with 3 InfiniBand switches. There is a single 36-port L1 switch per 6 chassis.

Rack (WikiChip)

With 72 nodes, there are 144 ThunderX2 processors per rack for a peak double-precision performance of 64.5 teraFLOPS.

Full Rack Capabilities

| | Node | Rack |
|---|---|---|
| Processors | 2 (2 × CPU) | 144 (72 × 2 × CPU) |
| Cores | 56 (224 threads) | 4,032 (16,128 threads) (72 × 56 cores) |
| FLOPS (SP) | 1,792 GFLOPS (2 × 28 × 32 GFLOPS) | 129 TFLOPS (72 × 2 × 28 × 32 GFLOPS) |
| FLOPS (DP) | 896 GFLOPS (2 × 28 × 16 GFLOPS) | 64.51 TFLOPS (72 × 2 × 28 × 16 GFLOPS) |
| Memory | 128 GiB DDR4 (2 × 8 × 8 GiB) | 9 TiB DDR4 (72 × 2 × 8 × 8 GiB) |

Full System

Astra comprises 36 racks for a total of 2,592 compute nodes and 5,184 processors. The overall interconnect design is a three-level fat tree with a 2:1 taper at L1. The 36 racks comprise 648 chassis and 108 L1 switches. There are three 540-port switches. Each is formed from 30 level 2 switches that provide 18 ports each (540 in total), with the remaining 18 links on each level 2 switch going to the 18 level 3 switches, one link per switch.

 
540-Port Switch (WikiChip)
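
The composition of a 540-port switch can be checked with a few lines of arithmetic. The sketch below assumes the level 2 switches are 36-port devices like the L1 switches; the article only implies this by describing 18 external ports plus "the remaining 18 links". The variable names are illustrative.

```python
# Port accounting for one 540-port switch built from L2 and L3 switches.
# Assumption: each L2 switch has 36 ports (implied, not stated, by the article).

L2_SWITCHES = 30
L2_PORTS = 36          # assumption
L2_EXTERNAL_PORTS = 18 # ports each L2 switch exposes to the outside
L3_SWITCHES = 18

external_ports = L2_SWITCHES * L2_EXTERNAL_PORTS     # ports visible to L1 switches
l2_uplinks = L2_PORTS - L2_EXTERNAL_PORTS            # links from each L2 up to L3
links_per_l2_l3_pair = l2_uplinks // L3_SWITCHES     # links between any L2 and any L3
l3_ports_used = L2_SWITCHES * links_per_l2_l3_pair   # ports occupied on each L3 switch

print(f"external ports:        {external_ports}")
print(f"uplinks per L2 switch: {l2_uplinks}")
print(f"links per L2-L3 pair:  {links_per_l2_l3_pair}")
print(f"ports used per L3:     {l3_ports_used}")
```

With equal numbers of external ports and uplinks on every L2 switch, the inside of the 540-port switch is not tapered under this assumption; the 2:1 taper is applied only at L1.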

With one L1 switch per 6 chassis, 24 of its 36 ports connect to compute nodes. The remaining 12 ports are used to link into the L2 switches, 4 links per 540-port switch.

Astra L1 to L2 (WikiChip)
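
The L1 port budget and the 2:1 taper follow directly from the numbers in the article. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction

# Port budget of one 36-port L1 switch and the resulting taper ratio.
L1_PORTS = 36
CHASSIS_PER_L1 = 6
NODES_PER_CHASSIS = 4
BIG_SWITCHES = 3            # the three 540-port switches
RACKS = 36
L1_PER_RACK = 3

downlinks = CHASSIS_PER_L1 * NODES_PER_CHASSIS       # one EDR link per node
uplinks = L1_PORTS - downlinks                       # ports left for L2
taper = Fraction(downlinks, uplinks)                 # downlink-to-uplink ratio
links_per_big_switch = uplinks // BIG_SWITCHES       # uplinks spread evenly

l1_switches = RACKS * L1_PER_RACK
ports_used_per_big_switch = l1_switches * links_per_big_switch

print(f"downlinks: {downlinks}, uplinks: {uplinks}, taper: {taper}:1")
print(f"links from each L1 to each 540-port switch: {links_per_big_switch}")
print(f"L1 switches: {l1_switches}")
print(f"ports used on each 540-port switch: {ports_used_per_big_switch}")
```

By this count, the compute racks occupy 432 of the 540 ports on each large switch. The article does not say how the remaining ports are used.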

With a little over five thousand processors, Astra will have a peak theoretical performance of 2.322 petaFLOPS, making it by far the most powerful ARM supercomputer built to date. In addition to the 36 compute racks, the full system includes 3 networking racks, 2 storage racks, a utility rack, and 12 HPE MCS-200 fan coil cooling units. The projected nominal power consumption of the system under LINPACK is 1.36 MW, with a projected peak wall power of slightly over 1.6 MW.

Astra Supercomputer (Sandia)
Astra Capabilities

| | Rack | System |
|---|---|---|
| Processors | 144 (72 × 2 × CPU) | 5,184 (36 × 72 × 2 × CPU) |
| Cores | 4,032 (16,128 threads) (72 × 56 cores) | 145,152 (580,608 threads) (36 × 72 × 56 cores) |
| FLOPS (SP) | 129 TFLOPS (72 × 2 × 28 × 32 GFLOPS) | 4.645 PFLOPS (36 × 72 × 2 × 28 × 32 GFLOPS) |
| FLOPS (DP) | 64.51 TFLOPS (72 × 2 × 28 × 16 GFLOPS) | 2.322 PFLOPS (36 × 72 × 2 × 28 × 16 GFLOPS) |
| Memory | 9 TiB DDR4 (72 × 2 × 8 × 8 GiB) | 324 TiB DDR4 (36 × 72 × 2 × 8 × 8 GiB) |
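
The rack and system totals are products of the per-level counts given in the article. The sketch below scales them up from a single socket. It assumes four hardware threads per core, inferred from the 112 threads listed for a 28-core socket; the `totals` function is hypothetical and exists only for this illustration.

```python
# Scale per-socket figures up to a rack and to the full Astra system.
# Assumption: 4 hardware threads per core (inferred from 28 cores / 112 threads).

CHASSIS_PER_RACK = 18
NODES_PER_CHASSIS = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 28
THREADS_PER_CORE = 4      # assumption
DP_GFLOPS_PER_CORE = 16
SP_GFLOPS_PER_CORE = 32
GIB_PER_SOCKET = 64


def totals(racks: int) -> dict:
    """Aggregate counts and peak figures for `racks` compute racks."""
    nodes = racks * CHASSIS_PER_RACK * NODES_PER_CHASSIS
    processors = nodes * SOCKETS_PER_NODE
    cores = processors * CORES_PER_SOCKET
    return {
        "nodes": nodes,
        "processors": processors,
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "dp_tflops": cores * DP_GFLOPS_PER_CORE / 1000,
        "sp_tflops": cores * SP_GFLOPS_PER_CORE / 1000,
        "memory_tib": processors * GIB_PER_SOCKET / 1024,
    }


rack = totals(1)
system = totals(36)
for label, t in (("rack", rack), ("system", system)):
    print(label, t)
```

Calling `totals(36)` gives 2,592 nodes, 5,184 processors, 145,152 cores, about 2.322 PFLOPS of peak double-precision performance, and 324 TiB of memory.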

Extended WikiChip Article: Astra


