Cache sizes, RAM, power consumption, compute cluster configurations – everything makes sense and everything is balanced. So far, Apple is the only vendor that makes really pragmatic hardware with true linear scaling.

---

I saw this quantified (I think at AnandTech).

---

Correct. The numbers we have are from their M1 Max deep dive, with the M1 Ultra being two M1 Max chips fused together.

> Adding a third thread there's a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It's only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s.

> I haven't seen the GPU use more than 90GB/s (measured via system performance counters). While I'm sure there's some productivity workload out there where the GPU is able to stretch its legs, we haven't been able to identify them yet.

> That leaves everything else which is on the SoC, media engine, NPU, and just workloads that would simply stress all parts of the chip at the same time.

The new media engine on the M1 Pro and Max is now able to decode and encode ProRes RAW formats. The above clip is a 5K 12-bit sample with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real time, it's able to do it at multiple times the speed, with seamless immediate seeking. Doing the same thing on my 5900X machine results in single-digit frame rates. The SoC DRAM bandwidth while seeking around was around 40-50GB/s – I imagine that workloads that stress the CPU, GPU, and media engines all at the same time would be able to take advantage of the full system memory bandwidth, and allow the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.

---

This linear scaling per-core doesn't match my experience with using the AMX co-processor. In my experience, on the M1 and the M1 Pro, there is a limited number of AMX co-processors that is independent of the number of cores within the M1 chip. I wrote an SO answer exploring some of the performance implications of this, and since I wrote that, another more knowledgeable poster has added more information. One of the key takeaways is that there appears to be one AMX co-processor per "complex", leading us to hypothesize that the M1 Pro contains 2 AMX co-processors.

This is supported by taking the code found in the gist linked from my SO answer and running it on my M1 Pro. Compiling it, we get `dgesv_accelerate`, which uses Accelerate to solve a medium-size linear algebra problem that typically takes ~8s to finish on my M1 Pro. While running, `htop` reports that the process is pegging two cores (analogous to the result in my original SO answer, where the M1 pegged one core; this supports the idea that the M1 Pro contains two AMX co-processors). If we run two `dgesv_accelerate` processes in parallel, we see that they take ~15 seconds to finish. So there is some small speedup, but it's very small. And if we run four processes in parallel, we see that they take ~32 seconds to finish.

All in all, the kind of linear scaling shown in the article doesn't map well to the limited number of AMX co-processors available in Apple hardware, as we would expect the M1 Max to contain maybe 8 co-processors at most. This means we should see parallelism step up in 8 steps, rather than the 20 steps shown in the graph.

Everything I just said is true assuming that a single process running well-optimized code can completely saturate an AMX co-processor. That is consistent with the tests that I've run, and I'm assuming that the CFD solver he's running is well-written and making good use of the hardware (it does seem to be doing so from the shape of his graphs!). If this were not the case, one could argue that increasing the number of threads could allow multiple threads to more effectively share the underlying AMX co-processor, and we could get the kind of scaling seen in the article. However, in my experiments, I have found that Accelerate very nicely saturates the AMX resources and there is nothing left over for further sharing (as shown in the dgesv examples).

Finally, as a last note on performance, we have found that using OpenBLAS to run numerical workloads directly on the Performance cores (not using the AMX instructions at all) is competitive on larger linear algebra workloads. So it's not too crazy to assume that these results are independent of the AMX's abilities!