It has been slightly over three years since Intel acquired Nervana Systems. Over the last half-year, the company lifted the lid on the architecture of both its inference accelerator, codenamed Spring Hill, and its training accelerator, codenamed Spring Crest. At the company’s annual AI Summit, held late last month, Intel’s VP and GM of the AI Products Group, Naveen Rao, announced that both products are now in production and shipping to customers. Despite the dozens of AI startups in the space, Intel joins an exclusive list of fewer than a handful of companies whose dedicated data-center AI ASICs are actually shipping.
Intel has been pursuing AI-related features for a number of years now. In 2017, the company attributed around $1B of revenue to AI. This year, Intel says, AI revenue will be around $3.5B. At the edge, the company recently announced its 3rd-generation Movidius chip. With the launch of its Cascade Lake Xeons earlier this year, Intel augmented the cores with the AVX-512 VNNI extension for accelerating inference. The launch of its NNPs is the latest thrust to compete on real performance and power efficiency against existing GPU-based products.
WikiChip has previously covered the architecture of both products in detail.
WikiChip Microarchitecture Articles:
The Intel Nervana NNP for training (NNP-T) currently comes in two configurations: a PCIe card or an OAM card. The difference between the two is their cooling capacity and, thus, performance. The NNP-T 1300 is the PCIe card-based SKU and operates at 950 MHz, while the OAM-based NNP-T 1400 operates at 1.1 GHz. The chip is fabricated by TSMC on its 16-nanometer process. The full Spring Crest chip integrates 24 tensor processor clusters (TPCs), each with 2.5 MiB of local scratchpad memory. The NNP-T 1400 offers all 24 TPCs along with the full 60 MiB of scratchpad SRAM. The NNP-T 1300 is a slightly lower-binned chip with 2 TPCs disabled, leaving 22 TPCs and 55 MiB of SRAM. Both models come with 32 GiB of HBM2 (2.4 Gbps). Assuming all else is equal, the NNP-T 1400 should be up to 26% faster.
Intel NNP-T Models

| Model | NNP-T 1300 | NNP-T 1400 |
|---|---|---|
| Form Factor | PCIe card | OAM card |
| Interface | PCIe Gen4 x16 | PCIe Gen4 x16 |
| Frequency | 950 MHz | 1,100 MHz |
| TDP | 300 W | 375 W |
| SRAM | 55 MiB | 60 MiB |
| HBM2 | 32 GiB | 32 GiB |
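The up-to-26% figure falls out of multiplying the frequency uplift by the extra enabled TPCs, assuming peak throughput scales linearly with both. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the NNP-T 1400's peak-throughput advantage
# over the NNP-T 1300, assuming performance scales linearly with clock
# frequency and with the number of enabled TPCs.

freq_1300, freq_1400 = 950, 1100   # MHz
tpcs_1300, tpcs_1400 = 22, 24      # enabled tensor processor clusters

speedup = (freq_1400 / freq_1300) * (tpcs_1400 / tpcs_1300)
print(f"{(speedup - 1) * 100:.0f}%")  # → 26%
```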
Both models expose 16 inter-chip links (ICLs) for scale-out. “With our NNP-T, we have 95% scaling capabilities on important workloads such as ResNet-50, which are relatively small, but also for things such as BERT, which is about natural language understanding. We show very little degradation across 32 chips. This is measured performance,” said Naveen. The NNP-T 1300 is slightly more restricted: it only supports the ring topology. We spotted a couple of NNP-T systems by Supermicro at its booth at Supercomputing 2019. Supermicro’s first NNP-T system features two Cascade Lake CPUs along with eight PCIe cards and up to 6 TB of DDR4 memory. This is very similar to some of its other GPU-based chassis, which are also a 4U design. It’s worth adding that since current Intel Xeons do not support PCIe Gen4, these accelerators fall back to Gen3, leaving half of the potential peak bandwidth on the table.
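For context, “95% scaling” is conventionally the measured speedup divided by the ideal linear speedup. A minimal sketch, with illustrative throughput numbers (not Intel’s measurements):

```python
# Hypothetical sketch: scaling efficiency = measured speedup / ideal speedup.
# The throughput figures below are made up for illustration.

def scaling_efficiency(throughput_1chip: float, throughput_nchips: float, n: int) -> float:
    measured_speedup = throughput_nchips / throughput_1chip
    return measured_speedup / n  # 1.0 would be perfect linear scaling

# e.g. if one chip sustained 1,000 images/s and 32 chips sustained 30,400 images/s:
print(f"{scaling_efficiency(1000, 30400, 32):.0%}")  # → 95%
```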
Just like the NNP-T 1300, the NNP-T 1400 supports the ring topology, but it can also support an array of more complex topologies such as fully connected and hybrid cube mesh. The system Supermicro had on display at Supercomputing featured eight OAM cards connected in a hybrid cube mesh topology.
Supermicro says both systems are ready for production.
POD Reference Design
The ability to scale plays a big role with the Nervana NNP-Ts. “Scale-out is probably the most important problem in training,” Naveen said. Models are ballooning, roughly doubling in size every 3.5 months. In other words, the size of AI models is increasing at a rate of about 10x per year – faster than any other technology. Introducing better architectures and improved chips alone is unlikely to keep up with this rate of growth in model complexity. Being able to scale out hardware as improved hardware is released will likely be the only way to cope with this kind of complexity. To that end, Naveen also announced the ten-rack NNP-T POD, which was co-designed with Supermicro. The POD reference design features 10 racks with 6 nodes per rack (each with the eight OAM cards we showed above) for a total of 480 NNP-Ts per system.
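The 3.5-month doubling time is where the roughly-10x-per-year figure comes from, since a year holds about 3.4 doubling periods:

```python
# Compounding a doubling every 3.5 months over a 12-month year.
doublings_per_year = 12 / 3.5          # ≈ 3.43 doublings
growth_per_year = 2 ** doublings_per_year
print(f"{growth_per_year:.1f}x")       # → 10.8x, i.e. roughly 10x per year
```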
The Intel Nervana NNP for inference (NNP-I) is a whole different beast with an entirely different microarchitecture. This chip is fabricated on the company’s 10-nanometer process and comes in two different configurations: an M.2 card and a PCIe card. The difference is in the TDP. Whereas the NNP-I 1100 comes on an M.2 card and supports a max TDP of 12 W, the NNP-I 1300 comes on a PCIe card with a max TDP of 75 W. The PCIe card actually has two chips on board. Intel had previously said that Spring Hill was designed to span from 10 W to 50 W. Current models sit somewhere between those two points, likely binned for better power-performance sweet spots. It’s worth adding that unlike the NNP-T, both models come with all 12 inference compute engines (ICEs) enabled along with the two Sunny Cove cores.
In terms of peak performance per watt, the NNP-I 1100 delivers a power-performance efficiency of 4.17 TOPS/W, whereas the dual-chip NNP-I 1300 achieves 2.27 TOPS/W, albeit at a much higher TDP. Both figures are fairly respectable when compared to existing GPU-based products on the market.
Intel NNP-I Models

| Model | NNP-I 1100 | NNP-I 1300 |
|---|---|---|
| Form Factor | M.2 card | PCIe card |
| Interface | PCIe Gen3 x8 | PCIe Gen3 x8 |
| Chips | 1x NNP-I | 2x NNP-I |
| TDP | 12 W | 75 W |
| Max Perf | 50 TOPS | 170 TOPS |
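The efficiency figures quoted above follow directly from the table, dividing peak TOPS by max TDP:

```python
# Peak power-performance efficiency derived from peak TOPS and max TDP.
nnp_i_1100 = 50 / 12    # single-chip M.2 card
nnp_i_1300 = 170 / 75   # dual-chip PCIe card
print(f"NNP-I 1100: {nnp_i_1100:.2f} TOPS/W")  # → 4.17 TOPS/W
print(f"NNP-I 1300: {nnp_i_1300:.2f} TOPS/W")  # → 2.27 TOPS/W
```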
EDSFF Form Factor
One of the surprise announcements during the AI Summit was that Intel will be offering the NNP-I in an EDSFF (ruler) form factor. This product aims squarely at the highest compute density possible for inference. “This is nearly 4x of the Nvidia T4 system in terms of physical density. It’s designed for power efficiency; fitting in a power budget and scaling from 10 to 15 watts in each chip for network edge and cloud,” said Naveen.
Intel hasn’t announced specific models yet, but the game here is density. The rulers will come with a 10-35 W TDP range, and 32 NNP-Is in the ruler form factor can be packed into a single 1U chassis. Below is an early production chassis by Supermicro which was shown at Supercomputing 2019. A direct comparison is a little hard to make, but Supermicro has a pretty powerful inference box based on the Nvidia T4 which packs a whopping 20 T4s into a 4U chassis. In a live demo, the 32 NNP-Is in a 1U were able to achieve 3.75x the performance (in terms of images/second on ResNet-50) of the 20 T4s in the 4U design.
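Normalizing that demo by rack space makes the density story clearer. Only the 3.75x throughput ratio and the chassis sizes come from the demo; the per-rack-unit figure is our own arithmetic:

```python
# Per-rack-unit comparison of the demoed systems: 32 NNP-Is in a 1U chassis
# versus 20 Nvidia T4s in a 4U chassis. Throughput is normalized to the
# T4 box (images/s on ResNet-50).

nnp_i_perf, nnp_i_units = 3.75, 1   # relative throughput, rack units
t4_perf, t4_units = 1.0, 4

ratio = (nnp_i_perf / nnp_i_units) / (t4_perf / t4_units)
print(f"{ratio:.0f}x throughput per rack unit")  # → 15x
```

On raw accelerator count, the same normalization gives 32 chips per rack unit versus 5.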
Currently, there is no info on when we can expect the EDSFF NNP-I variant to be available. The other NNP-I and NNP-T models are shipping today. Supermicro says that its systems are ready for production deployment but general availability will come later in 2020.