Today Arm is introducing a complete portfolio of IPs for the mobile market which includes a new little Armv9 CPU, a new big Armv9 CPU, a new flagship performance Armv9 CPU, new Mali GPUs, and even a new DSU. The last thing that’s needed to interconnect everything together is a coherent interconnect IP and a more comprehensive SoC transport interconnect. That’s where the new CoreLink CI-700 and the NI-700 come into play.
This article is part of a series of articles covering Arm’s Tech Day 2021.
- Arm Unveils Next-Gen Armv9 Big Core: Cortex-A710
- Arm Unveils Next-Gen Armv9 Little Core: Cortex-A510
- Arm Launches Its New Flagship Performance Armv9 Core: Cortex-X2
- Arm Launches The DSU-110 For New Armv9 CPU Clusters
- Arm Launches New Coherent And SoC Interconnects: CI-700 & NI-700
It has only been a week since we detailed Arm’s latest cache-coherent mesh interconnect for the enterprise market. The CMN-700 was designed primarily for things such as large core-count SoCs for the server market. Today’s launch of the CI-700 is a similar-purposed interconnect that better targets the client market.
The CoreLink CI-700 coherent interconnect is actually based on the recently launched CMN-700 enterprise-grade mesh network. Unlike the CMN-700, the CI-700 is a custom variant tailored for client devices and comes with additional efficiency optimizations specifically for the mobile consumer market. With that in mind, the CI-700 is a fully coherent interconnect supporting up to eight DSUs as well as up to 24 AMBA ACE-Lite or AXI managers such as accelerators or DMA devices. It also supports up to eight memory interfaces, which can be either CHI or ACE-Lite, and up to four ACE-Lite interfaces for peripherals.
The new CI-700 implements a system-level cache (SLC) with a snoop filter, which helps reduce power and improve performance. The cache is exclusive of the DSU clusters’ caches, so its capacity is effectively added on top of the DSU capacity. It is also a true system-level cache, capable of caching memory transactions not just from the CPUs but also from the GPU, any other interconnected accelerator, and other high-bandwidth devices. The SLC supports MPAM cache partitioning, a feature that helps ensure predictable performance by reserving cache capacity for certain devices or address spaces. For example, to prevent the GPU from consuming the entire cache, MPAM can reserve a certain capacity for the CPUs, so that no single device starves the others of system resources.
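The partitioning idea can be illustrated with a toy model. This is only a sketch and not Arm’s MPAM interface: the `PartitionedCache` class, the per-requester line caps, and the LRU policy are all assumptions made for illustration. Capping the GPU below the full capacity leaves room that the CPU’s lines can keep.

```python
from collections import OrderedDict


class PartitionedCache:
    """Toy fully associative LRU cache with a per-requester capacity cap.

    Hypothetical model for illustration only; caps are assumed to be >= 1.
    """

    def __init__(self, capacity_lines, max_lines_per_requester):
        self.capacity_lines = capacity_lines
        self.max_lines = dict(max_lines_per_requester)
        self.lines = OrderedDict()  # address -> owner, oldest entry first

    def _owned(self, requester):
        # Addresses owned by this requester, oldest first.
        return [addr for addr, owner in self.lines.items() if owner == requester]

    def access(self, requester, address):
        """Return True on a hit, False on a miss (the line is then allocated)."""
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            return True
        owned = self._owned(requester)
        if len(owned) >= self.max_lines[requester]:
            # Requester is at its cap: it may only evict its own oldest line.
            del self.lines[owned[0]]
        elif len(self.lines) >= self.capacity_lines:
            # Cache is full: evict the globally oldest line.
            self.lines.popitem(last=False)
        self.lines[address] = requester
        return False


def cpu_hits_after_gpu_stream(gpu_cap):
    """Count CPU hits on two lines after the GPU streams 100 unique lines."""
    cache = PartitionedCache(8, {"cpu": 8, "gpu": gpu_cap})
    cpu_lines = [0x1000, 0x2000]
    for addr in cpu_lines:
        cache.access("cpu", addr)
    for i in range(100):
        cache.access("gpu", 0x100000 + 64 * i)
    return sum(cache.access("cpu", addr) for addr in cpu_lines)
```

In this model, a GPU cap of 6 out of 8 lines lets both CPU lines survive the GPU stream, while an uncapped GPU (cap of 8) evicts them.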
The new CI-700 is designed to run at around 1 GHz and up to 2 GHz in high-performance implementations.
As in the design of the CMN-700, the CI-700 is based around a crosspoint (XP) router. The standard XP has four ports connecting to other XPs and two ports for connecting IPs. With the new CI-700, Arm added two new types of XPs. The first supports only two interfaces to other XPs but comes with four ports for connecting IPs. The second is a singleton with all six connections going to IPs. The two new types were added to enable a higher ratio of IP blocks to mesh connectivity than the original XP offers, mainly because mobile devices have lower bandwidth requirements than servers (when Arm first developed the CMN-600 technology, the optimizations were centered on servers).
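The port split of the three XP types can be written down as a small table. The sketch below only restates the counts given above; the type names (`STANDARD`, `TWO_MESH_PORT`, `SINGLETON`) are labels made up for illustration and are not Arm’s terminology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XPType:
    name: str
    mesh_ports: int    # ports connecting to other XPs
    device_ports: int  # ports connecting to IP blocks


# Port counts as described in the article; names are hypothetical labels.
STANDARD = XPType("standard", mesh_ports=4, device_ports=2)
TWO_MESH_PORT = XPType("two-mesh-port", mesh_ports=2, device_ports=4)
SINGLETON = XPType("singleton", mesh_ports=0, device_ports=6)


def total_device_ports(xps):
    """Sum the IP-facing ports over a collection of XPs."""
    return sum(xp.device_ports for xp in xps)
```

Every type has six ports in total; the new types simply trade mesh links for device ports. For example, four standard XPs expose 8 device ports, whereas four of the two-mesh-port variety expose 16.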
The CI-700 was designed to be scalable and configurable, capable of going from a single crosspoint to a large 4×3 mesh. It’s worth pointing out that Arm has an aggregator component that can connect two devices to a single XP, so a single XP can be expanded beyond just the immediate device ports it has. At the other end, the most extensive configuration is a two-dimensional mesh with a maximum size of 4×3. Many applications don’t need a very large mesh, and 4×3 is almost certainly overkill for most mobile devices. Arm’s own reference platform for premium smartphones, for example, uses a 2×2 mesh of XPs.
The system-level cache on the CI-700 can be configured with anywhere from 1 to 8 slices of up to 4 MiB each, for a maximum of 32 MiB. The snoop filter is also configurable, which helps improve power consumption and performance. Arm recommends a snoop filter that’s twice the size of the cache, or up to 8 MiB per slice, so that it covers twice the address space of the cache due to the way the snoop filter is constructed.
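The sizing arithmetic can be sketched as follows. The limits (1 to 8 slices, up to 4 MiB per slice, snoop filter coverage of twice the cache) come from the text above; the function name, the returned keys, and the assumption that all slices are the same size are illustrative only.

```python
MAX_SLICES = 8
MAX_MIB_PER_SLICE = 4
SNOOP_FILTER_RATIO = 2  # recommended coverage relative to the SLC size


def slc_config(slices, mib_per_slice):
    """Return total SLC capacity and recommended snoop filter coverage in MiB."""
    if not 1 <= slices <= MAX_SLICES:
        raise ValueError("slices must be between 1 and 8")
    if not 0 < mib_per_slice <= MAX_MIB_PER_SLICE:
        raise ValueError("slice size must be at most 4 MiB")
    sf_per_slice = SNOOP_FILTER_RATIO * mib_per_slice
    return {
        "slc_total_mib": slices * mib_per_slice,
        "snoop_filter_coverage_per_slice_mib": sf_per_slice,
        "snoop_filter_coverage_total_mib": slices * sf_per_slice,
    }
```

At the maximum configuration, `slc_config(8, 4)` gives 32 MiB of SLC with 8 MiB of snoop filter coverage per slice.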
One of the ways the CI-700 can reduce power consumption is by reducing external memory accesses. Arm’s example compares a Mali-G710 system without a system-level cache against one with 8 MiB of SLC, a capacity commonly found in premium smartphones today. With the 8 MiB of SLC, the system exhibits around a 28% reduction in external memory bandwidth. Additionally, although the SLC itself consumes power, the reduction in external memory power yields an 8% reduction in net system power, which directly translates to better battery life.
Just as the new DSU-110 added support for the Memory Tagging Extension (MTE) with a focus on performance, the new CI-700 also significantly enhances the performance of MTE. When MTE is in use, every transaction carries a 4-bit tag. Memory can only be accessed with a matching tag, which is checked on each access. Within the CI-700’s system cache, the tag is stored along with the data and is likewise checked for a match. When data in the cache is written back to memory, the data and the tag are written as two separate memory accesses because they are stored in separate areas of memory. For that reason, the CI-700 has a tag cache of configurable size, which significantly reduces the bandwidth consumed by tags and coalesces tag updates into a single memory access, considerably improving performance.
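A rough sketch of why coalescing tag writes helps, under stated assumptions. The 4-bit tag is from the text; the 16-byte tag granule is from the Arm architecture rather than this article; the 64-byte line size and the coalescing model are assumptions for illustration and do not describe the CI-700’s actual tag cache.

```python
LINE_BYTES = 64      # assumed cache line size
GRANULE_BYTES = 16   # MTE tag granule (Arm architecture)
TAG_BITS = 4         # one 4-bit tag per granule

# Tag bytes needed for one data line, and data lines covered by one tag line.
TAG_BYTES_PER_LINE = (LINE_BYTES // GRANULE_BYTES) * TAG_BITS // 8
LINES_PER_TAG_LINE = LINE_BYTES // TAG_BYTES_PER_LINE


def tag_matches(pointer_tag, memory_tag):
    """An access is allowed only if the two 4-bit tags are equal."""
    return (pointer_tag & 0xF) == (memory_tag & 0xF)


def memory_writes(line_addresses, coalesce_tags):
    """Count memory writes needed to write back the given unique data lines.

    Without coalescing, each data write is paired with its own tag write.
    With coalescing (toy model), tags that fall into the same line of tag
    storage are merged into one write.
    """
    data_writes = len(line_addresses)
    if coalesce_tags:
        tag_lines = {addr // LINE_BYTES // LINES_PER_TAG_LINE
                     for addr in line_addresses}
        tag_writes = len(tag_lines)
    else:
        tag_writes = len(line_addresses)
    return data_writes + tag_writes
```

Under these assumptions, each 64-byte data line needs only 2 bytes of tags, so one line of tag storage covers 32 data lines. Writing back 32 consecutive lines then costs 64 memory writes without coalescing and 33 with it.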
The NI-700 is a new flexible packetized network-on-chip interconnect for both high-bandwidth accelerators and the rest of the SoC connectivity such as other peripherals. It’s applicable to just about every market. It can be used with the CI-700, CMN-700, or on its own. The NI-700 consists of a network of routers (round dots) connected to interfaces (rectangles) with links that go between them.
On the NI-700, all transactions from AMBA CHI or AXI are converted to a packetized format, which helps reduce the wire count by 30% on average. This also reduces routing congestion, which helps with the physical design. The NI-700 supports multiple clock and power domains, and it’s designed to be implementable on modern processes at up to around 1 GHz fairly easily. It also supports the AMBA standard along with the recent security and reliability features it offers.
Integrated Device Management (IDM)
Integrated Device Management (IDM) is a feature on the CI-700 that increases the robustness of the system, thereby reducing the need to reboot devices. It increases the uptime of the SoC by detecting devices in the system that are not responding and trying to recover from the issue. IDM will identify, log, and report devices that cause a time-out by not responding. When that happens, it will isolate the problematic device from the rest of the system. It will then complete the stalled transaction to make sure protocols are not violated. Finally, it will inform software and allow it to take remedial action, such as resetting the device or powering it up (e.g., after aggressive power management has powered it down). Overall, Arm expects this feature to reduce the number of times a user ultimately has to reboot a device (such as a set-top box or WiFi router) in order to fix an issue. IDM can also enable a new approach to system power management by not requiring all code to always have full knowledge of the power system. Instead, software can access the device as normal, and the IDM could automatically try to reboot the device, with everything happening transparently to the software, albeit at the cost of some delay.
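The sequence described above (detect the time-out, log it, isolate the device, complete the stalled transaction, notify software) can be sketched as a small simulation. Everything here is a hypothetical model for illustration: the `Device` and `DeviceMonitor` classes, the cycle-based time-out, and the `reset_handler` callback are not Arm’s hardware or software interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Device:
    name: str
    latency_cycles: Optional[int]  # None models a device that never responds
    isolated: bool = False


@dataclass
class DeviceMonitor:
    timeout_cycles: int
    on_fault: Optional[Callable[[Device], None]] = None
    log: List[str] = field(default_factory=list)

    def access(self, device):
        """Return "ok" for a normal response, "error" for a terminated one."""
        if device.isolated:
            return "error"  # isolated devices are kept off the rest of the system
        responds = (device.latency_cycles is not None
                    and device.latency_cycles <= self.timeout_cycles)
        if responds:
            return "ok"
        self.log.append(device.name)   # identify and log the offender
        device.isolated = True         # isolate it from the system
        if self.on_fault is not None:
            self.on_fault(device)      # inform software
        return "error"                 # complete the stalled transaction


def reset_handler(device):
    """Example software remedy: reset the device and bring it back online."""
    device.latency_cycles = 10
    device.isolated = False
```

In this model, the first access to a hung device completes with an error rather than stalling forever, and a later access succeeds once the handler has reset the device.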