Last year Arm introduced the Cortex-A710, the company’s first ARMv9 implementation in a big core. As has been tradition around May/June over the past few years, today Arm is introducing its next-generation successor to the Cortex-A710: the Cortex-A715, formerly known as Makalu.
This article is part of a series of articles covering Arm’s Client Tech Day 2022.
- Arm Refreshes The Cortex-A510, Squeezes Higher Efficiency
- Arm Introduces The Cortex-A715
- Arm Unveils Next-Gen Flagship Core: Cortex-X3
Succeeding the Cortex-A710 as the newest big core, the A715 supports largely the same ARMv9.0 ISA with several enhancements. Perhaps more critically, the new core supports only AArch64, dropping 32-bit support altogether. The design principles for the A715 remain similar to the prior big core: improve performance by a larger ratio than the cost in power and area. With this iteration, the performance emphasis was placed on improving throughput without significantly widening the pipeline or extending its depth (although both took place). Finally, Arm engineers introduced targeted improvements, such as branch predictor and prefetching enhancements, that were inspired by earlier Cortex-X designs.
Compared to the Cortex-A710, the new A715 is said to deliver a 5% performance improvement at iso-power. Likewise, at the same performance level as the A710, the A715 consumes 20% less power. Both comparisons are done at iso-process. Put differently, Arm says that the new Cortex-A715 can deliver the same performance as the first-generation Cortex-X1 core. The X1 was Arm’s flagship performance core in 2020.
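To put the two claims on a common footing, they can be converted into performance-per-watt ratios. The sketch below is only back-of-the-envelope arithmetic on the figures quoted above; the helper name `perf_per_watt_gain` is our own, and it assumes the two quoted numbers are exact ratios relative to the A710.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative performance-per-watt versus the baseline core.

    perf_ratio  -- new performance / baseline performance
    power_ratio -- new power / baseline power
    """
    return perf_ratio / power_ratio


# Claim 1: 5% more performance at the same power (iso-power).
iso_power = perf_per_watt_gain(perf_ratio=1.05, power_ratio=1.00)

# Claim 2: the same performance at 20% less power (iso-performance).
iso_perf = perf_per_watt_gain(perf_ratio=1.00, power_ratio=0.80)

print(f"iso-power: {iso_power:.2f}x perf/W, iso-performance: {iso_perf:.2f}x perf/W")
```

Under those assumptions the efficiency gain is about 1.05x at the iso-power point and about 1.25x at the iso-performance point, which is consistent with the emphasis on power reduction in this generation.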
Overall, it’s clear that power reduction was more important in this generation, especially in sustained use cases. What’s a bit unusual about this core is that the performance improvement seems underwhelming. It’s not unheard of for Arm to alternate between a large performance uplift and a large power reduction (at a much lower performance uplift), but in this particular case, we were expecting a much bigger uplift given their 2020 Arm TechCon announcement (later reiterated at their Vision Day last year), which promised up to 30% higher single-core performance over the Cortex-A78. Compared to the A78, in terms of IPC, we’re somewhere around 15%. It’s unclear why the discrepancy is so big. Nonetheless, the DVFS curve below shows good power-efficiency gains across the entire performance spectrum.
Behind the scenes, quite a bit changed in a single generation. The vast majority of changes took place in the front end of the core and in the memory subsystem.
Arm spends a lot of time refining its prefetchers and branch predictors. It’s part of the reason the company can maintain relatively small cache sizes. In this iteration, Arm doubled the direction predictor capacity and improved its accuracy. In the prior generation, the A710, the core was able to predict two unconditional branches per cycle. Now, in the A715, this capability has been extended to support conditional branches. In other words, whereas the A710 could handle two taken unconditional branches but only one taken conditional branch per cycle, the A715 can now do two.
The other improvement in the A715 is a 3-stage prediction scheme for fast turnaround. Whereas previously Arm had a fast 0-cycle L0 predictor and a slower 2-cycle prediction structure, the A715 breaks prediction into three stages by adding a new intermediate structure with a 1-cycle turnaround, reducing the latency to get predictions.
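A toy model helps show why an intermediate 1-cycle stage lowers the average prediction turnaround. The sketch below is purely illustrative: the per-stage fractions are hypothetical numbers chosen for the example (Arm did not disclose hit rates), and `expected_turnaround` is our own helper, not anything from Arm.

```python
def expected_turnaround(stages):
    """Average prediction latency in cycles.

    stages -- list of (latency_cycles, fraction_of_predictions_served) pairs;
              the fractions must sum to 1.
    """
    total = sum(fraction for _, fraction in stages)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    return sum(latency * fraction for latency, fraction in stages)


# Hypothetical split: half of all predictions hit the 0-cycle structure.
two_stage = [(0, 0.5), (2, 0.5)]              # 0-cycle + 2-cycle (older scheme)
three_stage = [(0, 0.5), (1, 0.3), (2, 0.2)]  # adds a 1-cycle middle stage

old_latency = expected_turnaround(two_stage)
new_latency = expected_turnaround(three_stage)
print(f"two-stage: {old_latency:.2f} cycles, three-stage: {new_latency:.2f} cycles")
```

With these made-up fractions the average drops from 1.0 to 0.7 cycles. The real benefit depends on how many predictions the new intermediate structure can actually serve.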
With the higher-capacity branch predictor producing higher branch request bandwidth, it’s possible to encounter more instances where two separate instruction streams are fetched. To accommodate this, the A715 now supports higher instruction cache lookup bandwidth, with up to twice the tag lookups per cycle.
Pure 64-bit Enables Different Tradeoffs
The new Cortex-A715 is a pure AArch64 implementation, which means the design team can get rid of various architectural quirks and inefficiencies that came with the 32-bit architecture. Arm says that due to the more regular nature of AArch64, the new decoders can not only be designed and optimized more efficiently, but they are also considerably smaller. In fact, Arm says the new decoders are actually “4x smaller than the ones found in the Cortex-A710 with power-saving to match” which is quite remarkable.
A lot of changes took place along with those new decoders. First, Arm took the instruction fusion mechanism and moved it directly to the instruction cache. Previously, the A710 did it specifically at the MOP cache. This means that all applications can now take advantage of fused instructions at the fetch level (i.e., benefiting from the higher effective instruction throughput). Second, some instructions could previously be handled only by specific decoders. Now all decoders can handle all operations.
Due to the smaller AArch64 decoder size, Arm added a 5th decode lane. In other words, the new A715 fetch/decode bandwidth now matches the A710 MOP bandwidth, while the instruction cache gained the MOP fusion capabilities. By moving many of the benefits of the MOP cache to the instruction cache and adding the new decode lane, Arm says it was able to achieve similar performance without the MOP cache. For this reason, it was removed. Removing the cache also offered some area and power gains, although in terms of performance, the two sides of this fairly large design swap largely cancel each other out.
On the memory subsystem side, the Cortex-A715 grew the load replay queue. This is the structure that holds the issued load accesses. Arm also doubled the number of data cache banks. With more banks, there are now more read/write ports, allowing for a higher number of concurrent data accesses. The last change in the A715 is that there are now 50% more L2 TLB entries. Along with that, Arm says that each entry can now store double the Virtual Addresses (VA), which means that under the right conditions it’s possible to achieve up to 3x the effective TLB reach of the Cortex-A710.
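The 3x figure follows directly from multiplying the two changes: 1.5x the entries times 2x the addresses per entry. The sketch below works through that arithmetic. The baseline entry count and the 4 KiB page size are hypothetical placeholders chosen for illustration, not numbers from Arm's disclosure, and "addresses per entry" is modeled simply as pages mapped per entry.

```python
def tlb_reach_bytes(entries: int, pages_per_entry: int, page_size: int) -> int:
    """Best-case amount of memory the TLB can map at once."""
    return entries * pages_per_entry * page_size


PAGE_SIZE = 4 * 1024     # assumption: 4 KiB pages
BASELINE_ENTRIES = 1024  # hypothetical baseline L2 TLB entry count

# Baseline: one page mapped per entry.
base_reach = tlb_reach_bytes(BASELINE_ENTRIES, 1, PAGE_SIZE)

# New core: 50% more entries, each able to cover two pages in the best case.
new_entries = BASELINE_ENTRIES * 3 // 2
new_reach = tlb_reach_bytes(new_entries, 2, PAGE_SIZE)

reach_ratio = new_reach / base_reach
print(f"baseline: {base_reach // 2**20} MiB, new: {new_reach // 2**20} MiB, "
      f"ratio: {reach_ratio:.1f}x")
```

The ratio is independent of the placeholder values; only the 1.5x and 2x factors matter. The "up to" qualifier applies because an entry only covers twice as much when the pages it holds meet whatever pairing condition the hardware requires.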
Looking forward, Arm disclosed two new cores for the next two years: Hunter and Chaberton. Software support for Neoverse Demeter and Cortex Hunter & Hayes started getting pushed out late last year.