Pushing AMD's Infinity Fabric to Its Limit

(chipsandcheese.com)

227 points | by klelatti 2 days ago

33 comments

  • majke 2 days ago

    This has puzzled me for a while. The cited system has 2x89.6 GB/s bandwidth. But a single CCD can do at most 64GB/s of sequential reads. Are claims like "Apple Silicon having 400GB/s" meaningless? I understand a typical single logical CPU can't do more than 50-70GB/s, and it seems like a group of CPUs typically shares a memory controller which is similarly limited.

    To rephrase: is it possible to reach 100% memory bandwidth utilization with only 1 or 2 CPUs doing the work per CCD?

    • ryao a day ago

      On Zen 3, I am able to use nearly the full 51.2GB/sec from a single CPU core. I have not tried using two, as I got so close to 51.2GB/sec that I assumed going higher was not possible. Off the top of my head, I got 49-50GB/sec, but I last measured a couple of years ago.

      By the way, if the cores were able to load things at full speed, they would be able to use 640GB/sec each: that is 2 AVX-512 loads of 64 bytes each per cycle at 5GHz. Of course, they are never able to do this due to memory bottlenecks. Maybe Intel’s Xeon Max series with HBM can, but I would not be surprised to see an unadvertised internal bottleneck there too. That said, it is so expensive and rare that few people will ever run code on one.
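
      For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a single-core streaming-read benchmark (an illustration, not my actual benchmark code; it assumes Linux/POSIX clock_gettime and a 4 GiB buffer, and should be compiled with -O3 -march=native so the sum loop vectorizes):

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define BUF_BYTES (1ULL << 32)  /* 4 GiB, far larger than any cache */

        int main(void) {
            uint64_t n = BUF_BYTES / sizeof(uint64_t);
            uint64_t *buf = malloc(BUF_BYTES);
            if (!buf) return 1;
            for (uint64_t i = 0; i < n; i++) buf[i] = i;  /* fault in every page first */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            uint64_t sum = 0;
            for (uint64_t i = 0; i < n; i++) sum += buf[i];  /* sequential streaming read */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            /* print the sum so the loop cannot be optimized away */
            printf("%.1f GB/s (sum %llu)\n", BUF_BYTES / secs / 1e9, (unsigned long long)sum);
            free(buf);
            return 0;
        }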

      • buildbot a day ago

        People have studied the Xeon Max! Spoiler: yes, it's limited to ~23GB/s per core. It can't get anywhere near the theoretical bandwidth of the HBM even with all cores active. It's a pretty bad design in my opinion.

        https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...

        • electricshampo1 a day ago

          It is still an integer factor better in overall total BW than DDR5 SPR (Sapphire Rapids); I think they went for minimal investment and time to market for the SPR-with-HBM product rather than heavy investment to hit full bandwidth utilization. That may have made sense for Intel given the overall business context.

    • KeplerBoy a day ago

      Aren't those 400 GB/s a figure which only applies when the GPU, with its much wider interface, is accessing the memory?

      • bobmcnamara a day ago

        That figure is at the memory controller.

        It applies as a maximum speed limit at all times, but it's unlikely that the CPUs alone would cause the memory controller to reach it. Why it matters is that latency increases whenever other bus controllers are competing for bandwidth, though I don't think Apple has documented its internal bus architecture or the performance counters necessary to observe this.

      • doctorpangloss a day ago

        Another POV is that maybe the max memory bandwidth figure is too vague to guide people optimizing libraries. It would be nice if Apple Silicon were as fast as "400GB/s" sounds. Grounded closer to reality: the parts are 65W.

        • KeplerBoy a day ago

          But those 65 watts deliver state-of-the-art FLOPS/watt.

    • jmb99 a day ago

      > The cited system has 2x89.6 GB/s bandwidth.

      The following applies for certain only to the Zen4 system; I have no experience with Zen5.

      That is the theoretical max bandwidth of the DDR5 memory (/controller) running at 5600 MT/s (roughly: 5600MT/s × 8 bytes/T per channel × 2 channels = 89.6GB/s). There is also a bandwidth limitation between the memory controller (IO die) and the cores themselves (CCDs), along the Infinity Fabric. Infinity Fabric runs at a different clock speed than the cores, their cache(s), and the memory controller; by default, 2/3 of the memory controller's clock. So, if the Memory controller's CLocK (MCLK) is 2800MHz (for 5600MT/s), the FCLK (Infinity Fabric CLocK) will run at 1866.66MHz. With 32 bytes per clock of read bandwidth, you get 59.7GB/s maximum sequential memory read bandwidth per CCD<->IOD interconnect.

      Many systems (read: motherboard manufacturers) will overclock the FCLK when applying automatic overclocking, such as when selecting XMP/EXPO profiles; I believe some EXPO profiles include an FCLK overclock as well. (Note that 5600MT/s RAM is itself overclocked; the fastest officially supported Zen4 memory speed is 5200MT/s, and most kits default to slower JEDEC speeds until overclocked with their built-in profiles.) In my experience, Zen4 will happily accept FCLK up to 2000MHz, while Zen4 Threadripper (7000 series) seems happy up to 2200MHz. This particular system has the FCLK overclocked to 2000MHz, which will hurt latency[0] (due to not being 2/3 of MCLK) but increase bandwidth: 2000MHz × 32 bytes/cycle = 64GB/s read bandwidth, as quoted in the article.
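
      To make the arithmetic concrete, a tiny sketch that just prints these theoretical limits (illustrative values matching the numbers in this comment, nothing from the article itself):

        #include <stdio.h>

        int main(void) {
            /* Dual-channel DDR5: 128-bit bus, 16 bytes per transfer. */
            double dram_bw = 5600e6 * 16;   /* 89.6 GB/s at the memory controller */
            /* Each CCD<->IOD link reads 32 bytes per fabric clock. */
            double ccd_bw  = 2000e6 * 32;   /* 64.0 GB/s with FCLK at 2000MHz */
            printf("DRAM: %.1f GB/s, per-CCD read: %.1f GB/s\n",
                   dram_bw / 1e9, ccd_bw / 1e9);
            /* The Threadripper mentioned below: 8 channels at 5800MT/s vs 4 CCDs at 2200MHz FCLK. */
            printf("TR DRAM: %.1f GB/s, 4-CCD read: %.1f GB/s\n",
                   8 * 5800e6 * 8 / 1e9, 4 * 2200e6 * 32 / 1e9);
            return 0;
        }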

      First: these are theoretical maximums. Even the most "perfect" benchmark won't hit them, and if one does, there are other variables at play that aren't being taken into account (likely lower-level caches). You will never, ever see the theoretical maximum memory bandwidth in any real application.

      Second: no, it is not possible to see maximum memory bandwidth on Zen4 from only one CCD, assuming you have sufficiently fast DDR5 that the FCLK cannot match the MCLK. This is an architectural limitation, although rarely hit in practice for most of the target market. A dual-CCD chip has enough Infinity Fabric bandwidth to saturate the memory before the fabric becomes the bottleneck (but, as alluded to in the article, unless tuned incredibly well you'll likely run into contention issues and hit either a latency or a bandwidth wall in real applications). My quad-CCD Threadripper can achieve nearly 300GB/s, due to having 8 (technically 16) DDR5 channels operating at 5800MT/s and FCLK at 2200MHz; I would need an octo-CCD chip to achieve maximum memory bandwidth utilization.

      Third: no, claims like "Apple Silicon having 400GB/s" are not meaningless. Those numbers are derived in exactly the same way as above, and the same way Nvidia determines the maximum memory bandwidth of its GPUs. Platform differences (especially CPU vs GPU, but even CPU vs CPU, since Apple, AMD, and Intel all have very different topologies) make the numbers incomparable to each other directly. As an example, Apple Silicon can probably achieve higher per-core memory bandwidth than Zen4 (or 5), but also shares bandwidth with the GPU; this may not be great for gaming, where memory bandwidth requirements are high for both the CPU and GPU, but may be fine for ML inference, since the CPU sits mostly idle while the GPU does most of the work.

      [0] I'm surprised the author didn't mention this. I can only assume they didn't know about it, and haven't tested other frequencies or read much on the overclocking forums about Zen4. Which is fair enough; it's a very complicated topic with a lot of hidden nuances.

      • bpye a day ago

        > Note that 5600MT/s RAM is overclocked; the fastest officially supported Zen4 memory speed is 5200MT/s

        This specifically did change in Zen 5; the maximum supported speed is now 5600MT/s.

    • neonsunset a day ago

      Easily; the memory subsystem on AMD's consumer parts is embarrassingly weak (as it is on desktop and portable consumer devices in general, save for Apple's and select bespoke designs).

    • jeffbee a day ago

      There are large differences in load/store performance across implementations. On Apple Silicon, for example, a single M1 Max core can stream about 100GB/s all by itself. This is a significant advantage over competing designs that are built to hit that kind of memory bandwidth only with all-cores workloads. For example, five generations of Intel Xeon processors, from Sandy Bridge through Skylake, were built to achieve about 20GB/s streams from a single core. That is one reason why the M1 was so exceptional at the time it was released: the 1T memory performance is much better than what you get from everyone else.

      As for claims of the M1 Max having > 400GB/s of memory bandwidth, this isn't achievable from the CPUs alone. You need all CPUs and GPUs running full tilt to hit that limit. In practice you can hit maybe 250GB/s from the CPUs if you bring them all to bear, including the efficiency cores. This is still extremely good performance.

      • majke a day ago

        I don't think a single M1 CPU core can do 100GB/s. This source says 68GB/s peak: https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

        • wizzard0 a day ago

          btw, what's about as important is that in practice you don't need to write super clever code to do that; these 68GB/s are easy to reach with textbook code, without any cleverness

          • zamadatix a day ago

            68 GB/s of memory read/write can be easily reached (assuming the memory bandwidth is there to reach it with) on any current architecture by running a basic loop adding 64-bit scalars. What could be even less clever than that?

            • namibj 12 hours ago

              Needs to be more than one accumulator.
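
              That is, a sum reduction with a single accumulator is serialized on add latency, while independent accumulators keep loads in flight. An illustrative fragment (assuming a uint64_t array a of length size, as in the sibling snippet):

                // Latency-bound: every add waits on the previous one.
                uint64_t sum = 0;
                for (uint64_t i = 0; i < size; i++) sum += a[i];

                // Bandwidth-bound: four independent dependency chains.
                uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
                for (uint64_t i = 0; i + 4 <= size; i += 4) {
                    s0 += a[i]; s1 += a[i+1]; s2 += a[i+2]; s3 += a[i+3];
                }
                uint64_t sum4 = s0 + s1 + s2 + s3;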

              • zamadatix 3 hours ago

                I mean:

                  const uint64_t size = // Some large value
                  uint64_t a[size] = // Some random values
                  uint64_t b[size] = // Some random values
                  uint64_t c[size] = {0};
                
                  uint64_t i = 0;
                  while(i < size) {
                    c[i] = a[i] + b[i];
                    i++;
                  }
                
                  // Disable all optimizations so the above isn't optimized away/vectorized
                
                That's the world's simplest loop, with 16 bytes of memory read per iteration, so even if your core is a piece of crap that averages a single increment and addition per cycle, it just needs to run at ~4.3 GHz to still pass the bar anyway. Running this code on my MacBook and my x86 desktop with compiler optimizations off, I'm not seeing either fail to reach 64 GB/s.
        • jeffbee a day ago

          That's the plain M1. The Max can do a bit more. Same site since you favor it: https://www.anandtech.com/show/17024/apple-m1-max-performanc...

          • majke a day ago

            > From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors, we had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been on the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit to what the CPU cluster can do.

            Wow

  • Agingcoder 2 days ago

    Proper thread placement and NUMA handling have a massive impact on modern AMD CPUs - significantly more so than on Xeon systems. This might be anecdotal, but I’ve seen performance improve by 50% on some real-world workloads.

    • bob1029 a day ago

      NUMA feels like a really big deal on AMD now.

      I recently refactored an evolutionary algorithm from Parallel.ForEach over one gigantic population to an isolated population+simulation per thread. The difference is so dramatic (100x+) that the loss of large-scale population dynamics seems to be more than offset by the number of iterations you can achieve per unit time.
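
      As a hypothetical sketch of the shape of that refactor (the original is .NET's Parallel.ForEach; this pthreads rendition only illustrates the structure, with made-up constants and a stand-in fitness rule):

        #include <pthread.h>
        #include <stdint.h>
        #include <stdio.h>

        #define THREADS 16      /* hypothetical values throughout */
        #define POP     1024
        #define GENS    100000

        /* Everything a thread touches lives in its own Island: no shared
           population, so no cross-thread (or cross-CCD) traffic while evolving. */
        typedef struct {
            uint64_t rng;           /* thread-private PRNG state */
            double genomes[POP];    /* thread-private population */
            double best;
        } Island;

        static uint64_t next_rand(uint64_t *s) {  /* xorshift64, for illustration */
            *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
            return *s;
        }

        static void *evolve(void *arg) {
            Island *isl = arg;
            for (int g = 0; g < GENS; g++) {
                uint64_t i = next_rand(&isl->rng) % POP;
                /* stand-in for mutation + selection within this island only */
                double candidate = isl->genomes[i] + (double)(next_rand(&isl->rng) % 1000) / 1e6;
                if (candidate > isl->best) isl->best = candidate;
                isl->genomes[i] = candidate;
            }
            return NULL;
        }

        int main(void) {
            static Island islands[THREADS];
            pthread_t tid[THREADS];
            for (int t = 0; t < THREADS; t++) {
                islands[t].rng = t + 1;  /* nonzero seed per island */
                pthread_create(&tid[t], NULL, evolve, &islands[t]);
            }
            for (int t = 0; t < THREADS; t++) pthread_join(tid[t], NULL);
            for (int t = 0; t < THREADS; t++) printf("island %d best %f\n", t, islands[t].best);
            return 0;
        }

      Each island only ever touches memory it owns, so the OS can keep a thread and its data on one CCD and the fabric sees no cross-thread traffic during evolution.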

      Communicating information between threads of execution should be assumed to be growing more expensive (in terms of latency) as we head further in this direction. More threads is usually not the answer for most applications. Instead, we need to back up and review just how fast one thread can be when the dependent data is in the right place at the right time.

      • Agingcoder a day ago

        Yes - I almost view the server as a small cluster in a box with an internal network, and the associated performance impact when you start going out of the box

      • bobmcnamara a day ago

        Is cross-thread latency more expensive in absolute time, or more expensive relative to things like local core throughput?

        • bob1029 a day ago

          Time and throughput are inseparable quantities. I would interpret "local core throughput" as being the subclass of timing concerns wherein everything happens in a smaller physical space.

          I think a different way to restate the question would be: What are the categories of problems for which the time it takes to communicate cross-thread more than compensates for the loss of cache locality? How often does it make sense to run each thread ~100x slower so that we can leverage some aggregate state?

          The only headline use cases I can come up with for using more than <modest #> of threads are hosting VMs in the cloud and running simulations/rendering in an embarrassingly parallel manner. I don't think gaming benefits much beyond a certain point - humans have their own timing issues. Hosting a web app and ferrying the user's state between 10 different physical cores under an async call stack is likely not the most ideal use of the computational resources, and this scenario will further worsen as inter-thread latency increases.

    • hobs a day ago

      When I cared more about hardware configuration for databases on big virtual machine hosts, not configuring NUMA was an absolute performance killer: more than 50% of performance on almost any hardware, because as soon as you left the socket, the interconnect suuuuucked.

  • cebert 2 days ago

    George’s detailed analysis always impresses me. I’m amazed with his attention to detail.

    • geerlingguy 2 days ago

      It's like the Anandtech of old, though the articles usually lag product launches a little more. Probably due to lack of resources (in comparison to Anandtech at its height).

      I feel like I've learned a bit after every deep dive.

      • ip26 a day ago

        He goes far deeper than I remember Anandtech going.

    • IanCutress a day ago

      Just to highlight, this one's Chester :)

  • AbuAssar 2 days ago

    Great deep dive into AMD's Infinity Fabric! The balance between bandwidth, latency, and clock speeds shows both clever engineering and limits under pressure. Makes me wonder how these trade-offs will evolve in future designs. Thoughts?

    • Cumpiler69 2 days ago

      IMHO these internal and external high-speed interconnects will become more and more important in the future: Moore's law is dying, GHz aren't increasing, and newer fab nodes are becoming monstrously expensive, so connecting cheaply-made dies together is the only way to scale compute performance for consumer applications where cost matters. Apple did the same on the high-end M chips.

      The only challenge is that the SW also needs to be rewritten to use these new architectures efficiently, otherwise we see performance decreases instead of increases.

      • sylware a day ago

        You would need fine-grained hardware configuration from the software, based on that very software's semantics and task - if that's even possible in a shared hardware environment.

        Video game consoles with a shared GPU (for 3D) and CPU memory pool had to choose: favor the GPU with high bandwidth and high latency, or the CPU with low latency and lower bandwidth. Since a video game console is mostly about the GPU, they went for GDDR, namely high bandwidth with high latency.

        On Linux, you have alsa-lib, which handles sharing the audio device among the various applications. They had to choose a reasonable default hardware configuration for all: it is currently stereo 48kHz, and it is moving to the 'maximum number of channels' at up to 48kHz, with left and right channels.
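
        For instance, an application that just wants the shared default device asks for that negotiated configuration; a minimal sketch using alsa-lib's snd_pcm_set_params convenience call (build with -lasound; the 100ms latency target is an arbitrary example value):

          #include <alsa/asoundlib.h>

          int main(void) {
              snd_pcm_t *pcm;
              if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
                  return 1;
              /* Ask for the common denominator: interleaved stereo S16 at 48kHz.
                 alsa-lib converts/resamples if the hardware is configured differently. */
              if (snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                                     SND_PCM_ACCESS_RW_INTERLEAVED,
                                     2,           /* channels */
                                     48000,       /* rate */
                                     1,           /* allow software resampling */
                                     100000) < 0) /* 100ms latency target */
                  return 1;
              snd_pcm_close(pcm);
              return 0;
          }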
