ISSCC 2018: AMD’s Zeppelin; Multi-chip routing and packaging

Scalable Data Fabric (SDF)

The I/O Hub interfaces with the SDF through the I/O Master/Slave (IOMS) interface. Likewise, the two CCXs interface with the fabric through the Cache-Coherent Master (CCM) interfaces. The IOMS and the CCMs are the only interfaces capable of making DRAM requests. The DRAM itself is attached to the DDR4 interface, which sits behind the Unified Memory Controller (UMC); the UMC, in turn, communicates with the SDF.

WikiChip’s diagram of the SDF transport layer.
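As a rough mental model of the endpoints described above (the names and structure here are purely illustrative, not AMD's internal naming), the fabric clients can be sketched as follows:

```python
# Illustrative sketch of Zeppelin's SDF endpoints (descriptive names,
# not AMD's identifiers). Only the CCMs and the IOMS may originate
# DRAM requests; the UMCs service them.
from dataclasses import dataclass

@dataclass
class FabricPort:
    name: str
    attaches: str            # block behind this SDF port
    can_request_dram: bool   # may this port originate DRAM requests?

SDF_PORTS = [
    FabricPort("CCM0", "CCX0 (4 cores)", can_request_dram=True),
    FabricPort("CCM1", "CCX1 (4 cores)", can_request_dram=True),
    FabricPort("IOMS", "I/O Hub",        can_request_dram=True),
    FabricPort("UMC0", "DDR4 channel A", can_request_dram=False),
    FabricPort("UMC1", "DDR4 channel B", can_request_dram=False),
]

requesters = [p.name for p in SDF_PORTS if p.can_request_dram]
print(requesters)  # ['CCM0', 'CCM1', 'IOMS']
```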

The Coherent AMD socKet Extender (CAKE) module translates the request and response formats used by the SDF transport layer to and from the serialized format used by the IF Inter-Socket (IFIS) SerDes and the IF On-Package (IFOP) SerDes.
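To make CAKE's role concrete, here is a minimal sketch of an encode/decode pair sitting between the transport layer and a SerDes link. The field layout, names, and sizes are hypothetical; AMD has not published the actual CAKE packet format.

```python
import struct

# Hypothetical on-fabric request: (requester id, target die, address, is_write).
# CAKE's real packet format is not public; this only shows the encode/decode
# role it plays between the SDF transport layer and the SerDes links.
REQ_FMT = ">BBQB"   # big-endian: u8, u8, u64, u8

def cake_encode(requester: int, target_die: int, addr: int, is_write: bool) -> bytes:
    """Serialize an SDF request for transmission over an IFOP/IFIS SerDes."""
    return struct.pack(REQ_FMT, requester, target_die, addr, int(is_write))

def cake_decode(payload: bytes):
    """Reconstruct the SDF request on the remote die's CAKE module."""
    requester, target_die, addr, is_write = struct.unpack(REQ_FMT, payload)
    return requester, target_die, addr, bool(is_write)

wire = cake_encode(requester=0, target_die=1, addr=0x1234_5678, is_write=False)
assert cake_decode(wire) == (0, 1, 0x1234_5678, False)
```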

Local Access

Under a local access, a core request goes through the CCX and CCM, across the fabric to the local UMC, and on to the local DRAM channel. The read data then follows the same path in reverse back to the core. A round-trip on a system with a CPU frequency of 2.4 GHz and DDR4-2666 19-19-19 memory (i.e., a MEMCLK of 1333 MHz) takes roughly 90 nanoseconds.

WikiChip’s diagram of the SDF transport layer with a local access path shown.
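For a sense of scale, the reported ~90 ns figure can be converted into cycles of the two clocks mentioned above. This is simple arithmetic on the numbers quoted; the hop list merely restates the path described in the text.

```python
# Convert the reported ~90 ns local round-trip into clock cycles for the
# clocks mentioned above (2.4 GHz core clock, 1333 MHz MEMCLK).
LOCAL_ROUND_TRIP_NS = 90          # figure reported for a local access
CORE_CLK_GHZ = 2.4
MEMCLK_MHZ = 1333

core_cycles = LOCAL_ROUND_TRIP_NS * CORE_CLK_GHZ          # ns * cycles/ns
mem_cycles = LOCAL_ROUND_TRIP_NS * MEMCLK_MHZ / 1000      # ns * cycles/ns

# Hop sequence for a local access, as described in the text.
LOCAL_PATH = ["core", "CCX", "CCM", "SDF", "UMC", "DRAM channel"]

print(" -> ".join(LOCAL_PATH))
print(f"~{core_cycles:.0f} core cycles, ~{mem_cycles:.0f} MEMCLK cycles")
# ~216 core cycles, ~120 MEMCLK cycles
```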

Non-Local Access

In the case of EPYC and Ryzen Threadripper, where more than a single Zeppelin die is used, a memory access may have to be routed to a neighboring Zeppelin. Regardless of which die is targeted, the path is always the same. A local core request is routed through the CCX and CCM to the CAKE module, which encodes the request and sends it through the SerDes to a CAKE module on a remote die. The remote CAKE decodes the request and sends it to the appropriate UMC, which accesses its DRAM channel. The response is then routed in reverse order back to the request-originating core. A round-trip on a system with a CPU frequency of 2.4 GHz and DDR4-2666 19-19-19 memory (i.e., a MEMCLK of 1333 MHz) to a die on a different socket in a two-way multiprocessing configuration takes roughly 200 nanoseconds, while a round-trip across dies on the same package takes roughly 145 nanoseconds.

WikiChip’s diagram of data flow across multiple Zeppelin SoCs.
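The three reported round-trip figures can be placed side by side to show the cost of each extra SerDes crossing; the deltas below are derived directly from the numbers quoted above.

```python
# Reported round-trip latencies for a 2.4 GHz / DDR4-2666 system (ns).
ROUND_TRIP_NS = {
    "local die":                      90,   # CCM -> SDF -> local UMC
    "remote die, same package":      145,   # one IFOP (on-package) crossing
    "remote die, different socket":  200,   # one IFIS (inter-socket) crossing
}

baseline = ROUND_TRIP_NS["local die"]
for path, ns in ROUND_TRIP_NS.items():
    print(f"{path:30s} {ns:4d} ns  (+{ns - baseline} ns vs. local)")
# local die                        90 ns  (+0 ns vs. local)
# remote die, same package        145 ns  (+55 ns vs. local)
# remote die, different socket    200 ns  (+110 ns vs. local)
```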

Incidentally, the longest possible path consists of two hops: one to the adjacent socket and another to a neighboring die on that package. AMD did not report the round-trip latency for this scenario.

WikiChip’s diagram of the longest path possible across multiple Zeppelin SoCs.

The difference in latency boils down to the type of SerDes used for the access. It’s worth pointing out that AMD’s “Smart Prefetch” (marketed under AMD’s “SenseMI”) in the core complexes helps greatly mitigate the latency of requests to memory attached to remote dies.

I/O Subsystem

There are two x16 high-speed SerDes links located at the upper-left and lower-right corners of the die. Both links are MUX’ed between the Infinity Fabric Inter-Socket (IFIS) controller and the PCIe controller. Additionally, the lower-right link is also MUX’ed with the SATA controller. When Infinity Fabric is the selected protocol, the entire link (i.e., all 16 lanes) is dedicated to it. When the PCIe protocol is selected, up to 8 PCIe ports of varying widths are possible. For the link where the SATA controller is also an option, up to 8 of the 16 lanes can be used for SATA. Note that a mixed configuration is possible: if only a subset of the SATA ports is used, the remaining lanes can still be used as standard PCIe lanes.

WikiChip’s diagram of the possible SerDes MUXing and bifurcation options for the Zeppelin SoCs.

Though this wasn’t in the presentation, which predated AMD’s announcement of the EPYC Embedded 3000 and Ryzen Embedded V1000 SoCs, the bottom links can also be MUX’ed with the Ethernet controllers. They can be configured with up to 8 SATA lanes and up to 4 x 10GbE ports, or a mixed configuration with PCIe.
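Putting the two paragraphs above together, the per-link protocol options can be summarized in a small sketch. This is a simplified model of the options described here, not an exhaustive list of AMD's supported bifurcation modes.

```python
# Simplified model of the protocol options behind each x16 SerDes link,
# based on the description above (not an exhaustive bifurcation list).
LINK_OPTIONS = {
    "upper-left x16": {
        "IFIS": "all 16 lanes",
        "PCIe": "up to 8 ports of varying widths",
    },
    "lower-right x16": {
        "IFIS":  "all 16 lanes",
        "PCIe":  "up to 8 ports of varying widths",
        "SATA":  "up to 8 lanes (remaining lanes may stay PCIe)",
        "10GbE": "up to 4 ports (Embedded parts; mixable with PCIe/SATA)",
    },
}

for link, options in LINK_OPTIONS.items():
    print(link)
    for proto, detail in options.items():
        print(f"  {proto:5s} {detail}")
```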

AMD says that the muxing logic added to support these features contributes less than one channel clock of latency to the latency-sensitive Infinity Fabric path.

6 Comments on "ISSCC 2018: AMD’s Zeppelin; Multi-chip routing and packaging"
Jerry
Guest

Why didn’t AMD use three IFOP SerDes to connect the two dies in Threadripper? Is it because the other two SerDes are on the other side of the die and routing would be problematic, since it would need to go under the die? Given those are single-ended, I could see noise problems with that, I guess.

Lu Ji
Guest

They use low-swing single-ended signaling, so there is no way they could go under the noisy die without pumping up the voltage and sacrificing all the low-power attributes they designed it for in the first place.

– Lu

Tri
Guest

Wait, but there is an IFOP right next to the DDR MC that could have been used, which does not have to be routed under the die. Illustration: https://imgur.com/lrXKCM9

Paul Dougherty
Guest

I think Intel might have higher yields on Skylake X than AMD would with a monolithic Epyc because they’ve been on the same process for ~4 years? šŸ˜‰

Luc Boulesteix
Guest

Well, Samsung/GF 14nm LPP is a fairly mature process at this point. It does seem AMD is getting yield issues on their larger dies, but to be fair, we don’t really know how Intel is doing there either.

Jeff
Guest

What happened to the home agent/coherence controller residing with DRAM, as mentioned in Kevin Lepak’s HC29 presentation? I am assuming that the CCMs serve only as the request initiators and that line state control is still handled inside the UMCs, but I’m curious why this detail would get buried or changed since August.