Today, Arm is launching its latest-generation big core, the Cortex-A720, replacing the Cortex-A715, which was launched last year. Codenamed Hunter, the Cortex-A720 continues Arm's yearly cadence of new big cores with improved performance and energy efficiency.
This article is part of a series of articles from Arm’s Client Tech Day:
- Arm Launches Next-Gen Efficiency Core; Cortex-A520
- Arm Introduces A New Big Core, The Cortex-A720
- Arm Introduces The Cortex-X4, Its Newest Flagship Performance Core
The Cortex-A720 is the company’s most versatile performance core, serving as the big core in a typical DSU configuration and often coupled with the Cortex-A500-series little cores. Compared to the A715, Arm says the new core offers as much as a 20% improvement in power efficiency.
Arm’s yearly cadence means chip architects have to limit the scope of improvements for any given generation to meet timelines. Some years, Arm focuses on larger pipeline changes such as the decode stage and the pipeline width and length, while in other years the focus is more on energy efficiency and area improvements. The Cortex-A720 now supports the Armv9.2 ISA and features 32/64 KiB L1 caches and a 128-512 KiB private L2. Whereas the last generation, the Cortex-A715, saw various large pipeline changes such as a larger decode thanks to Arm dropping 32-bit architecture support, the new Cortex-A720 primarily focuses on power efficiency without making any major changes in terms of depths and widths.
Arm made various enhancements to the front-end of the A720. The first is a reduction in the branch mispredict penalty. Here Arm says it has shaved a whole cycle off the 12-cycle penalty of the Cortex-A715, a change that is said to provide significant real-world application benefits. The second improvement deals with predictions. Arm has spent quite a bit of time improving the Cortex-A7xx series branch prediction over the past few generations. Starting with the Cortex-A710, the big core was capable of predicting two unconditional branches per cycle. Last generation, designers enhanced that to fully support conditional branches as well. Here, in the Cortex-A720, Arm improved the 2-taken branch prediction further. Arm says that structural optimizations were made to improve efficiency while also substantially reducing power without impacting performance.
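To get a feel for what a one-cycle reduction in mispredict penalty can mean, here is a rough back-of-the-envelope sketch. The 12-cycle figure comes from Arm's description of the Cortex-A715, and 11 cycles is simply that figure minus the one cycle Arm says it shaved. The branch fraction and mispredict rate are hypothetical workload numbers chosen for illustration, not Arm data.

```python
def mispredict_cycles_per_kilo_instr(branch_fraction, mispredict_rate, penalty_cycles):
    """Cycles lost to branch mispredictions per 1,000 instructions."""
    return 1000 * branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload (assumption): 20% of instructions are branches,
# and 3% of those branches are mispredicted.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.03

# 12 cycles is the stated A715 penalty; 11 assumes exactly one cycle was removed.
a715_loss = mispredict_cycles_per_kilo_instr(BRANCH_FRACTION, MISPREDICT_RATE, 12)
a720_loss = mispredict_cycles_per_kilo_instr(BRANCH_FRACTION, MISPREDICT_RATE, 11)
cycles_saved = a715_loss - a720_loss

if __name__ == "__main__":
    print(f"A715-style penalty: {a715_loss:.1f} cycles lost per 1,000 instructions")
    print(f"A720-style penalty: {a720_loss:.1f} cycles lost per 1,000 instructions")
    print(f"Difference:         {cycles_saved:.1f} cycles per 1,000 instructions")
```

The saving scales linearly with how branchy and how hard to predict the code is, which is consistent with the claim that the benefit shows up mostly in real-world applications rather than in simple loops.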
On the back-end of the Cortex-A720, Arm says it pipelined the FDIV/FSQRT unit. The result is a substantial performance improvement with no meaningful area impact. Data transfers between the FP/vector unit and the integer units have also been optimized, reducing latency. Transfer speeds between the vector and the general-purpose register files were equally optimized. The Cortex-A720 also improved store data latencies with earlier availability.
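The benefit of pipelining a long-latency unit such as FDIV/FSQRT can be illustrated with a simple throughput model. This is only a sketch: the 10-cycle latency and the one-operation-per-cycle issue rate are hypothetical values, since Arm has not disclosed the actual latency or the degree of pipelining.

```python
def total_cycles(num_ops, latency, initiation_interval):
    """Cycles to finish num_ops independent operations on a single unit.

    latency: cycles from issue to result for one operation.
    initiation_interval: cycles between successive issues to the unit.
    """
    if num_ops == 0:
        return 0
    return latency + (num_ops - 1) * initiation_interval


# Hypothetical numbers (assumptions, not Arm figures).
LATENCY = 10   # cycles per divide
NUM_OPS = 8    # independent divides in flight

# Unpipelined: the next divide cannot start until the previous one finishes.
unpipelined = total_cycles(NUM_OPS, LATENCY, initiation_interval=LATENCY)

# Idealized fully pipelined unit: a new divide can start every cycle.
pipelined = total_cycles(NUM_OPS, LATENCY, initiation_interval=1)

if __name__ == "__main__":
    print(f"Unpipelined: {unpipelined} cycles")
    print(f"Pipelined:   {pipelined} cycles")
```

Note that the latency of a single operation is unchanged in this model. The gain comes entirely from overlapping independent operations, so code with long chains of dependent divides would see little benefit.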
On the memory side, the Cortex-A720 reduced the L2 cache hit latency to 9 cycles, down from 10 in the Cortex-A715. Prefetchers are a continuing improvement path for Arm. In the Cortex-A720, Arm said it has added a new L2 spatial-prefetch engine. Overall, Arm said it made generational accuracy and coverage improvements to existing prefetchers.
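As a rough illustration of how the one-cycle L2 improvement feeds into average load latency, the sketch below applies the textbook average memory access time formula. Only the 10-cycle and 9-cycle L2 hit latencies come from Arm. The L1 hit latency, the miss rates, and the cost of going beyond the L2 are hypothetical values, and the model ignores overlap between outstanding misses.

```python
def average_access_cycles(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, beyond_l2):
    """Simplified average memory access time in cycles for a two-level hierarchy."""
    l1_miss_cost = l2_hit + l2_miss_rate * beyond_l2
    return l1_hit + l1_miss_rate * l1_miss_cost


# Hypothetical parameters (assumptions, not Arm figures).
L1_HIT = 4            # cycles
L1_MISS_RATE = 0.05   # 5% of loads miss the L1
L2_MISS_RATE = 0.20   # 20% of L1 misses also miss the L2
BEYOND_L2 = 100       # extra cycles when the L2 misses

# L2 hit latencies as stated by Arm: 10 cycles (A715) and 9 cycles (A720).
a715_amat = average_access_cycles(L1_HIT, L1_MISS_RATE, 10, L2_MISS_RATE, BEYOND_L2)
a720_amat = average_access_cycles(L1_HIT, L1_MISS_RATE, 9, L2_MISS_RATE, BEYOND_L2)

if __name__ == "__main__":
    print(f"10-cycle L2: {a715_amat:.2f} cycles per access")
    print(f" 9-cycle L2: {a720_amat:.2f} cycles per access")
```

Under this model the saving equals the L1 miss rate times one cycle, so workloads that frequently spill out of the L1 but still fit in the L2 stand to gain the most.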
One interesting change that comes with the new Cortex-A720 is a dual configuration. The Cortex-A720 comes in an area-optimized configuration as well as the full configuration. Under the area-optimized configuration, the Cortex-A720 comes at no area cost compared to the Cortex-A78 while offering a 10% improvement. Under the full configuration, the Cortex-A720 offers up to a 20% improvement in energy efficiency over the Cortex-A715, albeit at a higher area cost.