Over the last couple of years, with funding from NEDO, PEZY has been designing a series of many-core MIMD processors known as the PEZY-SCx family. Last week, the small Japanese firm, in collaboration with ExaScaler, announced that it has once again reached the number one spot on the Green500 list, this time with the ZettaScaler-2.2 supercomputer. The company had previously reached the number one spot in June 2015 and June 2016 with its 1,024-core PEZY-SC and PEZY-SCnp processors.
Powering the ZettaScaler-2.2 is the PEZY-SC2. The SC2 is a second-generation chip featuring twice as many cores: 2,048 cores with 8-way SMT for a total of 16,384 threads. Operating at 1 GHz with 4 FLOPS per cycle per core, as with the SC, the SC2 has a peak performance of 8.192 TFLOPS (single-precision). Both prior chips were manufactured on TSMC’s 28HPC+ process. However, to fit the considerably higher core count within a reasonable power budget, PEZY decided to skip a generation and go directly to TSMC’s 16FF+ process.
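The peak figure follows directly from the core count, clock, and per-core throughput quoted above. A minimal sketch of that arithmetic in Python; the function name `peak_tflops` is our own illustrative choice, not anything from PEZY:

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    gflops = cores * clock_ghz * flops_per_cycle
    return gflops / 1000.0


# Figures quoted in the article for a fully enabled PEZY-SC2 at 1 GHz.
sc2_peak = peak_tflops(cores=2048, clock_ghz=1.0, flops_per_cycle=4)
sc2_threads = 2048 * 8  # 8-way SMT on every core

print(f"PEZY-SC2 peak: {sc2_peak:.3f} TFLOPS, {sc2_threads:,} threads")
```

The same function can be reused for any other core count or clock mentioned in the article.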
The SC incorporated two ARM926 cores. While that was sufficient for basic management and debugging, their processing power was inadequate for much more. The SC2 uses a hexa-core MIPS P-Class P6600 processor which shares the same memory address space as the PEZY cores, improving performance and reducing data transfer overhead. With the more powerful MIPS management cores, it is now also possible to eliminate the Xeon host processor entirely. However, PEZY has not done so yet.
Feeding the Beast
One of the bigger changes PEZY has made is getting rid of the “prefecture” units, which were used as synchronization units for preparing the L3. In the SC, the chip was divided into four “prefectures”, each containing 16 “cities” for a total of 256 cores per prefecture, along with its own L3 cache.
The SC2 eliminates all of this and instead introduces a unified last level cache (LLC) which is shared by all the cores as well as the six MIPS64 cores. Additionally, half of the memory controllers were removed on the SC2 and the remaining four were upgraded to support 64-bit DDR4-3200 for an aggregated memory bandwidth of 102.4 GB/s.
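The 102.4 GB/s figure can be reproduced from the channel count, bus width, and transfer rate. A short sketch, assuming the usual convention that DDR4-3200 means 3,200 mega-transfers per second per pin; the helper name `ddr_bandwidth_gbs` is hypothetical:

```python
def ddr_bandwidth_gbs(channels: int, bus_width_bits: int, transfers_mts: int) -> float:
    """Aggregate DRAM bandwidth in GB/s (decimal): channels x bytes per transfer x MT/s."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * bytes_per_transfer * transfers_mts / 1000.0


# Four 64-bit DDR4-3200 controllers, as described for the SC2.
ddr4_total = ddr_bandwidth_gbs(channels=4, bus_width_bits=64, transfers_mts=3200)
per_channel = ddr_bandwidth_gbs(channels=1, bus_width_bits=64, transfers_mts=3200)

print(f"per channel: {per_channel:.1f} GB/s, total: {ddr4_total:.1f} GB/s")
```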
First Use of TCI
In place of the four controllers that were removed, PEZY added four custom TCI ports. ThruChip Interface (TCI) is a 3D packaging interconnect technology developed at Keio University in Japan as an alternative to through-silicon vias (TSVs). Instead of using vertical interconnect accesses to connect multiple dies, TCI relies on wireless near-field inductive coupling. That is, TCI uses a magnetic field to penetrate through the semiconductor without the need for a physical medium such as a through-silicon electrical conductor.
The SC2 has four custom TCI-DRAM interfaces, allowing it to achieve an extremely high bandwidth of 512 GB/s per port for a total aggregate bandwidth of 2 TB/s. PEZY uses a TCI 3D DRAM chip called the UM-1 that is being developed by an affiliated company, UltraMemory. UltraMemory was founded in November 2013 for the purpose of designing ultra-wide DRAM using TCI. This is the first time a commercial chip has utilized TCI technology and, going purely by the “UM-1” part number, we can speculate that this is also UltraMemory’s first model based on this technology.
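To put the TCI numbers in perspective, the sketch below compares the aggregate TCI-DRAM bandwidth with the DDR4 bandwidth and the single-precision peak quoted earlier in the article. The bytes-per-FLOP ratio is our own derived metric, not a figure published by PEZY:

```python
# Figures quoted in the article.
TCI_PORTS = 4
TCI_GBS_PER_PORT = 512
DDR4_TOTAL_GBS = 102.4     # four 64-bit DDR4-3200 channels
SC2_PEAK_GFLOPS = 8192.0   # 8.192 TFLOPS single-precision at 1 GHz

tci_total_gbs = TCI_PORTS * TCI_GBS_PER_PORT      # 2,048 GB/s, i.e. roughly 2 TB/s
tci_vs_ddr4 = tci_total_gbs / DDR4_TOTAL_GBS      # how many times faster than DDR4
bytes_per_flop = tci_total_gbs / SC2_PEAK_GFLOPS  # memory bytes available per FLOP

print(f"TCI aggregate: {tci_total_gbs} GB/s")
print(f"TCI vs DDR4:   {tci_vs_ddr4:.0f}x")
print(f"bytes/FLOP:    {bytes_per_flop:.2f}")
```

Under these figures the TCI-DRAM path offers about twenty times the DDR4 bandwidth, or a quarter of a byte per single-precision FLOP at the full 1 GHz peak.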
To accommodate 2,048 cores, the number of cities was doubled to 128. The new high-level block diagram of the PEZY-SC2 chip should look similar to this:
Low-precision for Deep Learning
In areas such as deep learning and AI, high-precision calculations are not always necessary. The PEZY-SC did not support 16-bit floating-point operations. With the SC2, the processing elements gained support for 16-bit half-precision floating-point arithmetic in an attempt to make the chip better suited to deep learning workloads.
Announced last week, the ZettaScaler-2.2 (Gyoukou) features 7,056 PEZY-SC2 chips operating at a lower frequency of 700 MHz along with 45 W TDP 16-core Intel Xeon D host processors. Earlier this year, PEZY reported the die size to be roughly 620 mm², meaning yield would be a definite concern. Although those chips integrate 2,048 cores, only 1,984 cores are active in order to improve yield. This puts the peak performance of the currently installed chips at 5.555 TFLOPS for single-precision, roughly 32% less computational power than the theoretical maximum. The ZS-2.2 has a Linpack performance of 14.13 PFLOPS with a theoretical peak performance of 19.89 PFLOPS. The system consumed 962.3 kW, putting its performance per watt at 14.69 GFLOPS/W, surpassing the 14.11 GFLOPS/W of the TSUBAME 3.0 and placing it at rank 1 on the Green500 list. The next Top500 list is scheduled to be announced on November 12 at the 2017 SuperComputing Conference (SC17), which will be held in Denver, Colorado.
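The derated per-chip peak and the efficiency figure can be checked with a few lines of arithmetic. This is a sketch using only the rounded numbers quoted in the article, so the last digit may differ slightly from the officially reported values:

```python
# Figures quoted in the article for the ZettaScaler-2.2 as announced.
ACTIVE_CORES = 1984
CLOCK_GHZ = 0.7
FLOPS_PER_CYCLE = 4
FULL_CHIP_PEAK_TFLOPS = 8.192   # 2,048 cores at 1 GHz
LINPACK_PFLOPS = 14.13
SYSTEM_PEAK_PFLOPS = 19.89
POWER_KW = 962.3

# Per-chip single-precision peak with 1,984 active cores at 700 MHz.
chip_peak_tflops = ACTIVE_CORES * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000.0
reduction = 1.0 - chip_peak_tflops / FULL_CHIP_PEAK_TFLOPS

# PFLOPS -> GFLOPS is a factor of 1e6; kW -> W is a factor of 1e3.
gflops_per_watt = (LINPACK_PFLOPS * 1e6) / (POWER_KW * 1e3)
linpack_fraction = LINPACK_PFLOPS / SYSTEM_PEAK_PFLOPS

print(f"chip peak:        {chip_peak_tflops:.3f} TFLOPS ({reduction:.1%} below full chip)")
print(f"efficiency:       {gflops_per_watt:.2f} GFLOPS/W")
print(f"Linpack vs peak:  {linpack_fraction:.1%}")
```

With these rounded inputs the efficiency comes out near 14.68 GFLOPS/W, which is within rounding of the quoted 14.69 GFLOPS/W.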
For every 8 PEZY-SC2 chips, there is a single 16-core Xeon D processor. With 7,056 SC2 chips, we’re looking at 882 Xeon D chips for a total of 14,013,216 cores. It’s worth noting that with just over fourteen million cores, the ZettaScaler-2.2 will have the highest core count of any supercomputer in the Top500, surpassing the Chinese supercomputer Sunway TaihuLight by over 3 million cores. Keep in mind that the performance submission deadline for the Top500 was November 1 at 23:59 Pacific Time, so it is entirely possible PEZY managed to tune the system’s performance further since the original October 26 announcement.
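The core total can be reconstructed from the chip counts above. In the sketch below, the Sunway TaihuLight core count of 10,649,600 is taken from its Top500 listing rather than from this article, so treat it as an outside assumption:

```python
SC2_CHIPS = 7056
ACTIVE_CORES_PER_SC2 = 1984   # 2,048 physical cores, 1,984 enabled
SC2_PER_XEON = 8
CORES_PER_XEON = 16
TAIHULIGHT_CORES = 10_649_600  # assumption: figure from the Top500 listing, not this article

xeon_chips = SC2_CHIPS // SC2_PER_XEON
pezy_cores = SC2_CHIPS * ACTIVE_CORES_PER_SC2
xeon_cores = xeon_chips * CORES_PER_XEON
total_cores = pezy_cores + xeon_cores
lead_over_taihulight = total_cores - TAIHULIGHT_CORES

print(f"Xeon D chips: {xeon_chips}")
print(f"total cores:  {total_cores:,}")
print(f"lead over TaihuLight: {lead_over_taihulight:,}")
```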
The high efficiency can be attributed to a number of things, including the move to the more energy-efficient 16nm FinFET process as well as the novel use of liquid immersion cooling, which reduces the chip temperature and consequently the leakage current. The ZettaScaler is an incredibly dense system. In the photo above there are 26 liquid immersion cooling tanks. However, at this time, with 7,056 PEZY-SC2 chips, only 13.8 are filled, meaning the current system is operating at roughly half of its capacity.
PEZY is not planning on stopping any time soon. Earlier this year the company laid out their future roadmap which extends into the 2020s.
| | PEZY-SC | PEZY-SC2 | PEZY-SC3 | PEZY-SC4 |
|---|---|---|---|---|
| Die | 412 mm² | 620 mm² | 700 mm² | 740 mm² |
| Voltage | 0.9 V | 0.8 V | 0.65 V | 0.55 V |
| Clock | 733 MHz | 1 GHz | 1.33 GHz | 1.6 GHz |
| Wide-IO | N/A | 4 x 1,024 bit | 8 x 2,048 bit | 8 x 4,096 bit |
| Peak Wide-IO | N/A | 2.1 TB/s | 12.2 TB/s | 24.4 TB/s |
| Efficiency | 6.7 GFLOPS/W | 15 GFLOPS/W | 40 GFLOPS/W | 60 GFLOPS/W |
The PEZY-SC3 will be introduced with the ZettaScaler-3.0 supercomputer in late 2019. PEZY expects the system to exceed 1 EFLOPS. With the help of ExaScaler, PEZY hopes to expand the system to about 100 cooling tanks, which should give you an idea of just how many of those chips they intend to use. Even with existing PEZY-SC2 processors, that’s sufficient to support over 100 million cores. PEZY also hopes to widen their TCI-DRAM interfaces and double the number of ports in order to increase memory bandwidth roughly tenfold by the time the PEZY-SC4 is introduced. Both the SC3 and SC4 are expected to replace the standard PCIe controllers with silicon photonics (likely optical PCIe). In addition to those features, PEZY has been considering the use of multi-die chips in order to further increase the number of cores.
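The “over 100 million cores” estimate can be sanity-checked by scaling the current installation. This is a rough sketch that assumes a 100-tank system would keep the same chip density per tank as today’s 7,056 chips in 13.8 tanks; PEZY has not stated that this is how the expansion would work:

```python
CURRENT_CHIPS = 7056
CURRENT_TANKS_FILLED = 13.8
TOTAL_TANKS_TODAY = 26
TARGET_TANKS = 100
CORES_PER_CHIP_FULL = 2048
CORES_PER_CHIP_ACTIVE = 1984

fill_fraction = CURRENT_TANKS_FILLED / TOTAL_TANKS_TODAY

# Assumption: chip density per tank stays the same as in the current system.
chips_per_tank = CURRENT_CHIPS / CURRENT_TANKS_FILLED
projected_chips = int(chips_per_tank * TARGET_TANKS)

projected_cores_full = projected_chips * CORES_PER_CHIP_FULL
projected_cores_active = projected_chips * CORES_PER_CHIP_ACTIVE

print(f"tanks filled today: {fill_fraction:.0%}")
print(f"chips per tank:     {chips_per_tank:.0f}")
print(f"100-tank estimate:  {projected_chips:,} chips, "
      f"{projected_cores_active:,} to {projected_cores_full:,} cores")
```

Under that assumption, both the full 2,048-core and the yield-derated 1,984-core variants land above 100 million cores.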
Whether PEZY will succeed with their highly aggressive roadmap remains to be seen. Nonetheless, this is one company really worth keeping an eye on!