Quite right, I don't see an issue here for AMD going forward; they can keep tweaking the number of CCXes per chiplet while still maintaining a ring architecture per CCX.
I still think 12-Core CCD is the realistic future.
The math to make a 12-Core CCD is easier and more flexible.
Dual 8-Core CCX to make a CCD doesn't seem practical from a CCD perspective.
Especially if you want flexible packaging to make a CCD. A 12-core CCD allows: 3x 4-core CCXes, 2x 6-core CCXes, an 8-core CCX + a 4-core CCX, or 1x 12-core CCX.
This gives you a lot of flexibility on the consumer end when designing Ryzen, while upping the core count per CCD to a reasonable number and maintaining reasonable physical size / complexity.
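If AMD did allow mixed CCX sizes within a CCD, the combinations are easy to enumerate. A toy Python sketch (purely illustrative; the candidate CCX sizes are my assumption, not anything AMD has announced):

```python
from itertools import combinations_with_replacement

# Toy sketch: enumerate the ways a 12-core CCD could be partitioned
# into CCXes, assuming plausible CCX sizes of 4, 6, 8, or 12 cores.
CCX_SIZES = [4, 6, 8, 12]
CCD_CORES = 12

configs = []
for n in range(1, CCD_CORES // min(CCX_SIZES) + 1):
    for combo in combinations_with_replacement(CCX_SIZES, n):
        if sum(combo) == CCD_CORES:
            configs.append(combo)

for c in sorted(configs):
    print(" + ".join(f"{x}-core" for x in c))
```

Running this yields exactly the four options listed above (3x 4-core, 4+8, 2x 6-core, 1x 12-core).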
Moving from the 8-core unified CCX of Zen 3 to 3x 4-core or 2x 6-core CCXs could mean a performance regression for games and emulators that use 8 cores, because of the added latency. I only like a 12-core unified CCX, and Cutress seems to think that's not happening. So one or two 8-cores it is. If the CCXs can have differing core counts, that might be ideal so you can have 8+4.
The Strix Point APU was rumored to have 8+4 big/small cores. If AMD goes big/small and brings it to desktop, all bets could be off.
We know that Zen 4 desktop and Epyc (Raphael, Genoa/Bergamo) will use 8-core chiplets again on 5nm. I don't think we know anything concrete after that. Just a good assumption that core counts will go up for Zen 5 desktop if they stay at 16 cores on Raphael. It's entirely possible that AMD will also pursue small cores on desktop, seeing as Alder Lake is doing that in 2021.
Alder Lake's small cores will also boost performance per die area, assuming it works right.
Symmetry is still pretty important in the Zen architecture. AMD would most likely use a dual CCX consisting of 8-cores each in a CCD. AMD can then just disable 2 cores in each to have a 12-core CCD consisting of 2x 6-core CCXes. Mainstream AM5 could remain at 16-cores max, but would drop a chiplet, which reduces cost/complexity of package and leaves more chiplets available for EPYC.
3D V-cache could link both CCXes together to access a larger, unified L3/L4 cache (L3 would probably still be contained within each CCX, but V-cache could have ring interconnect links built-in) without hitting IO die and UMCs (until dataset is too large and/or TLB starts thrashing). This would stem the latency penalty of going back to a dual CCX design per chiplet, at least versus older Zen 2 hard CCX partitions. Of course, adding another cache hierarchy reduces bandwidth and increases latency, but it's still faster than having to hit IF PHYs to access IO die, hit UMCs for data, then return to CCX. Depends on how it's implemented I guess, but current V-cache is a direct connection to existing unified L3 in an 8-core CCD, so it acts just like an expanded L3. It's definitely more difficult to connect 2 separate L3s, but I'm pretty sure it can be done as a "virtual" unified L4 or something.
Zen 4 will use 8-core chiplets again, with Ryzen maxing out at 16 cores or maybe 24 cores with 3 chiplets. It seems unlikely that Zen 5 Ryzen would max out at 16 cores after Mark Papermaster hinted at future core count increases, and given renewed competition from Intel. If AMD uses a 16-core CCD/chiplet, maybe the platform will go straight to 32 cores using 2 chiplets.
I'm wondering: the chiplet approach would also allow for some sort of hybrid CPU with performance and efficiency cores, so maybe we'll see something completely different.
If you look at the images for the Alder Lake desktop, mobile, and ultra mobile designs, it groups 4 small cores next to 4, 3, or 1 big core, and then it has 2 of those groups for 8+8, 6+8, and 2+8 total. Having them grouped together lets a process get switched between them with low latency. An AMD big/small implementation for desktop could have both types of cores on the same chiplet. A wrinkle is that it might not be optimal for server CPUs. We also don't know if or when AMD might bring big/small to desktop.
AMD could use more advanced packaging technologies that don't have the limitations that the current chiplets have, and that could change the design as well.
Hahaha - it's actually funny. We've had how many Intel marketing articles in the last year, and the ONE we get about AMD is about how their current architecture is not going to support adding more cores in the future, i.e. a negative spin. Unfortunately for Ian and his insightful analysis, this was covered about 4 years ago by Jim@Adored.
Roughly around the same time, Ian published his own conclusion that the core wars would take a back seat to interconnect. It became quickly evident with MCM designs that interconnect would use up more and more power budget. Everyone could see this, it was time for readers to look at modern chips more holistically and not just as cores and clock speed.
"Rest assured, once AMD and Intel have finished fighting over cores, the next target on their list will be this interconnect."
So, you missed all the Zen microarchitecture deep dives, announcements, EPYC announcements, reviews, analysis, and Lisa Su/Mark Papermaster/Forrest Norrod interviews I've done over the previous years then?
Also negative spin? What? If anything here I'm highlighting potential directions to go. Would you prefer if I was facetious and praised the ring bus and exclaimed that it should be extended to 16 cores?
Ian, you should just ignore fanboyism in the comment section. I find your analysis to be interesting. I am sure AMD and Intel have their own strategy towards core counts.
Not to mention that with AI for creating chips, we might end up with designs that would be totally impossible for humans to design... and this is coming in the next couple of years.
On my side, I believe AMD is probably working on a 12-core chiplet for Zen 4. Looking at the last 4 years, AMD just keeps bringing innovations. Like Lisa said, it is an incremental evolution toward their ultimate vision.
No I didn't, I read them all. My point was and is that AnandTech will literally print any marketing presentation Intel throws out there. The purpose of these marketing articles was to keep mindshare on Intel, despite them having very little to compete with AMD. The negative spin aspect has been ongoing for some time, and has also been noticed by others. I am not a 'fanboy', but I am certainly a consumer who has not been impressed with Intel over the years (compiler shenanigans, high pricing, fusing off CPU features, 4 cores for a decade, benchmarking, paying the large OEMs not to use AMD CPUs, etc.). Maybe I'm wrong in this instance, but on first read (plus headline) I got the sense that mindshare was being directed against AMD, especially given that it is clear AMD has been experimenting with connecting their CPUs through the interposer on a larger (65nm) node for some time.
That's just your own predisposed bias showing, unfortunately. This topic is literally talking about the directions AMD has been researching for when they go beyond rings, now that we know Zen 3 is some sort of ring/bisected ring. It's literally describing the innovation and R&D that AMD is putting into its topology. If you're not getting that from this article, then I've got some bad news about your personal biases.
Note that I'm the one that writes the AMD/Intel CPU topics here and it's _always_ through a critical lens. I rake each one over the coals for all the details, and I always put in the difference between proven features and claims vs speculative claims. I actively avoid putting down specific claimed benchmark data if at all possible. Perhaps you have AnandTech confused with somewhere else.
Well, respectfully there always seems to be an Intel bias here and as I said previously I’m not the only one who has noticed this. I admit maybe I do have an AMD bias but that is a conscious decision based on past Intel behaviour.
Could it just be the timing of things? The flood of Intel-related content comes when Intel has just had a string of announcements or a convention or show, i.e. Hot Chips or some other trade show. That's when these "Intel bias" comments seem to happen.
It's just rabid fanboys. Anand recently redid a test on epyc processors because they had weird results where the IOD was eating up way too much power at idle, and their new test showed even larger performance gains over intel and solidifies AMD's dominance in performance. Of course fanboys will just ignore that and claim the site is intel biased because they interview people from intel and give them a fair review instead of just trashing them at any opportunity.
Whether consciously chosen or not, your bias is significantly skewing your reading of this article. (Not that it matters, but I'm quite partial to AMD myself, for many of the same reasons you state.) It is in no way critical or negative, it is exploratory and factual. The questions you are reacting to - such as the title - do not show any evidence of bias, but seem like earnestly curious questions about the way forward. There has to be one, after all, and if they are currently using a ring bus, that has specific implications. This is not critical or negative. It is simply a statement of fact, and asking the subsequent question of "how will they move forward from this?" The article also details AMD's commitment to advanced packaging and chip production methods, and how these have the potential for never seen before methods of overcoming these challenges. If that reads as negative to you, you are inferring a tone that isn't present in the article.
Ian, I am reading most of your articles; you are doing a super job, no question about it. However, I think you should spend some time clarifying this issue.
I came to think that you are rather a fan of Intel when you said that you have personally asked Intel to change the name of their chips. Why in the world would a writer ask Intel to change their marketing? Not to mention that the "Intel 7" name is perceived by many as a cheap marketing tactic. But let's not speak about the accuracy or validity of that name; my question is: how was that part of your conversation with Intel?
I am not going to say that you are against AMD or TSMC, but have you ever suggested anything to AMD on improving their marketing?
Stupid objection. Informed people should make the effort to tell corporations more about what products should be made rather than being passive.
My good friend recently sent a message to Honda demanding that it have more respect for its design team in terms of aesthetics and less focus on copying the industry trend of immature angry-looking vehicles (for the plebs — rich people can buy pleasant-looking designs).
The difference between Cutress making a suggestion to Intel and my friend making a suggestion to any major corporation is that he can bypass the corporate communications firewall — the thing designed to keep the great unwashed out of the picture (except for their wallets, of course).
I'm not privy to Ian's thinking, but the answer to this seems pretty obvious: he has contacts, and Intel's naming is terrible. Also, when he talks about asking them to change the naming of their chips he's talking about product names, not node names, i.e. Core i9-1123456789hqx number salad naming, which is confusing garbage stupidity on so many levels.
Though to be fair, while changing the naming of their node from "10nm" to "7" does come off as weird, it only really highlights that node names (at all fabs!) are nonsense anyway, and not really representative of any feature size in the node. And given that Intel's latest node is pretty similar in its features to other fabs' "7nm" nodes, naming them similarly is less confusing overall (you no longer need the "10 can be as good as 7 depending on where it's coming from" arguments) and thus benefits everyone.
In other words: both asking them to change their product naming system and embracing the node renaming is a positive attitude towards clearer, less confusing naming practices. And that's a good thing, regardless of your views of Intel. I haven't bought an Intel product since my Core2Quad back in 2008 and don't have any plans to do so in the near future, but I still appreciate reductions in stupidity on their part.
Because what Intel has in the pipeline and has released info on is more interesting (it's a much larger change from their current offerings) and is much closer to release (AMD's not saying much about Zen 4 or RDNA3, and the only thing new coming any time soon are the chips with extra L3 cache)? If they were reporting misleading numbers in actual reviews, that would be a different story.
More interesting in this case also means more complex, with larger changes, which means higher risk. Given the issues Windows has had with core priority and proper scheduling, it will be interesting to see how they handle this.
Intel has a pretty good track record of cooperating with Microsoft (and other vendors) to optimise for architectures before release - e.g. SSE vs 3DNow!. AMD generally takes the release-first-optimise-later approach ('fine wine').
Where does it say their current architecture is not going to support adding more cores? The article talks about ring scalability, and gives an example of Intel scaling the ring to 12 cores. Intel has also been using 10 core rings on Comet Lake.
Sure, we don't see 16 cores on a single ring. But you're probably looking at a much bigger die at that point, and the point of chiplets is to avoid large dies.
Never mind, I see that you were just shifting between talking about core-to-core and socket-to-socket topologies without really explaining that. Or stating which level of the Sapphire Rapids interconnect hierarchy you were referring to (core-to-core, tile-to-tile, or socket-to-socket).
Looking at the slide you posted doesn't make me think ring bus. It makes me think the L3 cache is the interconnect. Maybe I misunderstood something about the architecture, but all cores can write to any part of the L3, right? So they could also read any part of the L3? Why would you need any other way to connect the cores if you can access any part of the L3 at comparable latency? Is the L3 slower than core-to-core? Is it lower bandwidth?
Beyond the technical issues of how you do it in chips, I think there is another big question: does AMD finally move to multiple CCX designs across its product stack, or does it stick with one CCX design? So far AMD has only had a single CCX design across its entire product stack, so everything from laptop APUs with integrated graphics up to their biggest Epyc server CPUs has used the same base CCX. When it was 4 cores, their lowest-end consumer CPUs and APUs could have a single 4-core CCX. With Zen 3 going to an 8-core CCX, those were pushed up to an 8-core CCX. Going to a 16-core CCX for future Epyc chips might be nice on that end, but making the minimum CCX size 16 cores on the consumer end might be a bit much. Given that, I think the CCX size will stay at 8 for a while, and they might either go with more chiplets or a larger chiplet with 2x 8-core CCXs.
APUs don't use chiplets. They are monolithic to save on power.
My money would be on AMD continuing to use small chiplets and relying on more packaging tricks to increase the number of cores per processor. This will let them continue the extremely effective strategy of using the same die across desktop, workstation, and server products.
They don't use chiplets, but so far they have always used the same base CCX design that other chips have used. Zen 1/2 had a 4-core CCX, and the APUs all had a single CCX with up to 4 cores plus graphics, memory, IO, etc. With Zen 3 the CCX size changed to 8 cores, the APUs moved to that same 8-core CCX, and we got 8-core APUs in our laptops. Going to 12- or 16-core CCXs for Epyc makes sense, but a 12- or 16-core CCX on a laptop APU seems like it might be a bit of overkill. Certainly they could move to having two different CCX designs, but that seems like a fair bit of extra work that they have tried to avoid so far. For Ryzen chips AMD has done a lot to reuse things as much as they can from top to bottom: using the same chiplet across the board, and using the same CCX in both chiplets and APUs. I'd bet they probably also reused the memory controller design between the APUs and the chiplet/IO die, etc. Being the smaller company with lower CPU volume means they benefit more from these things, whereas Intel can afford 3 different designs just for their server CPUs because they have the volume and size to support that.
That seems like overkill. I can see them having a 4-core CCX design for ultra-mobile / low power designs, and maybe a 12-core CCX for server, eventually moving down to desktop level.
Small correction - Zen 2 APUs had up to 8 cores, with a dual-CCX design. I have an 8-core notebook based around the 4800H.
I do agree that adding more cores to the CCX would create something of a problem in terms of designs, but it's likely they're going to respond to that by increasing the number of parallel design teams. After all, the Strix Point APU will reportedly be using small cores, and there's not yet any sign of that in the chiplets for desktop / server.
I feel they're going to continue with an 8-core CCX for a while, though of course two CCX variants is not impossible either. Depends on how of much work they're willing to do.
Nice to see a technical article. Intel showed that 10 cores (plus iGPU and IO) on a ring can work in client CPUs that are sensitive to latency. I think AMD staying at 8 cores per chiplet is driven by die size issues, rather than interconnect.
The 8-core chiplets are relatively small, and AMD/TSMC has had great yields to the point where they are disabling good cores to make a 6-core and products like the Ryzen 3 3300X don't make sense. TSMC 5nm is apparently better than 7nm was at the same point in its development.
Stuffing 16 cores into the chiplet could actually help with binning. 8 cores with low latency can be the "unit" for the foreseeable future, so dual-CCX with 8 cores each is fine.
One interposer layer will be for simple 2D mesh connectivity between CCDs. Another layer will be for a direct connection to the central I/O die. A final layer will be for a direct connection to L4 cache.
Upper Limit of design based on leaked CPU PCB Substrate size is shown above
NOTE:
- Eventually each core will get to SMT4, then SMT8, at some point with future iterations of Zen
- 12-core CCDs will eventually be made by either 3x 4-core CCXes, 2x 6-core CCXes, an 8-core + 4-core CCX, or 1x 12-core CCX
- EPYC will go from 96 cores with 12 CCDs in the rumored Genoa to:
+ My hypothetical / speculative configurations based on future 12-core CCDs:
 - 12-core CCD x 12 CCDs per EPYC CPU = 144 cores
 - 12-core CCD x 18 CCDs per EPYC CPU = 216 cores
 - 12-core CCD x 24 CCDs per EPYC CPU = 288 cores
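The core-count arithmetic behind these speculative configurations is simple to sanity-check (illustrative only; the CCD counts are the commenter's guesses, not roadmap figures):

```python
# Quick arithmetic check of the hypothetical EPYC configurations above.
CCD_CORES = 12
totals = {ccds: ccds * CCD_CORES for ccds in (12, 18, 24)}
for ccds, cores in totals.items():
    print(f"{ccds} CCDs x {CCD_CORES} cores = {cores} cores")
```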
A rather boring read. If you want to see how interconnect can be tricky, and creative ways to do it, take a look at larger FPGAs. I really think introducing another level or levels of victim cache will do very little in the long term other than raising the price. A larger L2, even if shared between two cores, makes much more sense. An async multi-level bi-directional ring bus is the way to go, but towards RAM, which needs to shift towards HBM or HMC. Somehow the experience with DSPs regarding optimal word length and the ability to feed them efficiently should be a guideline for the ring bus for all processing blocks (SIMD, general purpose, DSP, GPU...).
So, how three-dimensional do those butter donuts get? Also, how much overhead in both silicon and power does managing those many cores in many chiplets add? Since we're using food analogies, where do you think the butter zone for cores per chiplet and chiplets per CPU is, at least currently? Lastly, here's a wild thought: expand something like Alder Lake's Thread Director to actively manage a 3D meshed chiplet-based CPU, especially if it uses big and little cores. When there still was a Sun, their slogan was "the network is the computer"; looks like that's now very much true for the CPU as well.
Rocket Lake actually has 12 ring stops: the GPU has two stops and the PCIe IO complex has one on its own. (Also why use an ancient AGP graphics card in the graphic?) The largest ring complex Intel has built was for Comet Lake at 14 stops (10 cores, 2 GPU, 1 IO, 1 memory). The speculative theory is that Intel is limited to a total of 16 stops on the ring. As mentioned, legacy Xeons had begun to add separate ring systems to scale upward in core counts but those two never exceeded 14 hops around. The oddity was Westmere-EX which wasn't fully bi-directional and had some really weird latency characteristics.
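One way to see why ring stop counts are thought to top out around 16 is the average hop distance between stops on a bi-directional ring, which grows roughly linearly with stop count. A toy model (my own sketch, assuming one cycle per hop and traffic always taking the shorter direction):

```python
# Toy model: average hop count between stops on a bi-directional ring,
# assuming one cycle per hop and messages taking the shorter direction.
def avg_hops(stops: int) -> float:
    total = 0
    for src in range(stops):
        for dst in range(stops):
            if src != dst:
                d = abs(src - dst)
                total += min(d, stops - d)   # shorter way around the ring
    return total / (stops * (stops - 1))

# Average latency grows roughly linearly with stop count, which is one
# intuition for why large core counts move to meshes or multiple rings.
for stops in (8, 12, 14, 16):
    print(f"{stops} stops: {avg_hops(stops):.2f} average hops")
```

For an even number of stops n this works out to (n/2)^2 / (n-1) hops on average, so a 14-stop Comet Lake ring averages nearly twice the hop count of an 8-stop ring.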
I have a quibble about crossbars: the reason why they're generally avoided is that they become incredibly complex as the number of nodes on them increases. Pretty much every generation of CCX would require its own custom-sized crossbar as core count increases. It is common practice to over-design a crossbar and then let it fill up as product lines evolve/die shrinks become possible. I do think we'll see AMD using a crossbar, but on the IO die, where the number of nodes will not radically change over time.
The interposer can indeed host a complex topology between dies. However, I would argue that it'd be wise to go with a more active interposer using a crossbar instead of a more exotic mesh-like topology. The crossbar would produce heat, but it'd be relatively small. Such a crossbar interposer would be a per-product design, but each main passive interposer is already a per-product design to begin with.
One thing glossed over with AMD's current topologies is how many links exist between the IO die and each CCD. In Rome there were two links between the IO die and each CCD, to match the number of CCXes inside the CCD. Presumably with one CCX per CCD in Milan, there would be potential to double the number of CCDs/CCXes, but at the expense of per-CCX bandwidth. (This is where the 3D V-Cache comes into play to help counter the bandwidth issue.)
AMD does have another method of increasing the number of cores in a system: increase socket count. It has puzzled me why Rome/Milan do not support a three-socket system. Yes, it is a bit odd, but the bandwidth and links are there to do a solid point-to-point topology. Adding more sockets to a system is also viable for Intel in these large complex servers, as they are expected to migrate to optical interconnects using on-package silicon photonics at some point (Intel has demoed a proof of concept of this in the past).
Speaking of the Intel side of things, it boggles my mind that they have been stuck on Skylake-SP/Cascade Lake-SP for so long in servers and simply didn't shatter the design into CPU chiplets + IO chiplets linked together via EMIB/interposer. Sapphire Rapids isn't even following this, as it includes some IO functionality on each die, more akin to AMD's Naples design. I fathom we'll see this post-Sapphire Rapids, but Intel has certainly taken their time and transitioned to a defensive posture.
Intel's recent patent app 20210263880, " DISAGGREGATED DIE WITH INPUT/OUTPUT (I/O) TILES ", surrounds small processor tiles with narrow IO tiles, connected by emib, in a pattern that can scale to a larger extent than the single IO tile.
I think you are comparing Intel's monolithic traditional designs with AMD's state of the art chiplet designs. You seem to be comparing a single AMD CCX to an entire Intel CPU, and trying really, really hard to see this as some sort of negative for AMD. But the fact is, of course, that AMD is not limited to a single CCX in a given CPU--which seems to me the whole point of a chiplet approach being a net positive over a monolithic design. You are also assuming some sort of equity between Intel's tiles and AMD's chiplets, an equity that has yet to be demonstrated. We know all about chiplets as AMD has been shipping them for years--we know nothing about Alder Lake apart from marketing chatter thus far--not even a shipping date. It's almost like Intel is whispering in your ear "Here is where AMD is limited" which you aren't questioning too closely...;) It remains to be seen what Intel will come up with to attempt to compete, and of course AMD is looking at Zen4 & Zen5, which further muddles any competitive picture from Intel at present.
You know, of course, that writing articles titled with question marks is bad form grammatically and journalistically. The reason it's disdained by real journalists is because it usually winds up with the article failing to answer the question raised in the title--as is true here. If you can answer the question you wish to pose, then a declarative title, such as, "An AMD chiplet has a core-count limit" or "An AMD chiplet has no core-count limit" would be much better. And of course the article would go ahead and present the limit--or present evidence that the core count need not be limited. You are not alone in writing articles with question marks these days, especially when the question posed in the title isn't answered. (So, I'm not picking on you in particular...;))
scaling up through multiple CCXs is quite literally in the article, and it wouldn't make much sense to do after spending all their time designing a unified bus. not to mention their current design is the first in years that handily beats intel's designs in all performance metrics, so why intentionally regress just to get more cores?
You're clearly too biased to make a coherent comment. Assuming you even read the article from top to bottom. Betteridge's Law doesn't really apply any more; it's a feature of an older era
Really weird to see the number of people leaving comments dedicated to reading this in the worst spirit possible. Your criticism of the headline is especially asinine: your suggestions both involve more bias, and the article answered its posed question entirely to my satisfaction (there are limits, and there are ways for them to address those limits).
Well, this article fails to mention several things. First, the biggest advantage of going to chiplets is the increased performance and yield due to binning, i.e. smaller chips have a lower chance of an error ruining the chip or degrading performance. Secondly, one big avenue not mentioned is active interposers. Here they are portrayed as more or less passive; there are a lot of interesting possibilities in putting logic into the actual interposer. This article is pretty bad at covering this highly interesting area, focusing on only one aspect and missing several key points. "AdoredTV" on YouTube has several videos covering this exact topic much better.
1) Chiplets. Oh dear god we've covered that 50-100 times on the site. It was assumed as given. 2) Active interposers - did you not read the second half of the article? That's what this is _all_ about.
This comment is pretty bad at actually reading the content it's on.
I could understand this piece a bit, but it's hard to get everything right. That said, there is a fundamental difference between AMD's ring and Intel's.
AMD has an I/O die advantage over Intel. Intel maxed out at 10C/20T with Comet Lake, including all the PCIe/iGPU/memory stops, while AMD maxes out at 16C/32T because of the I/O die. That advantage is not there for Intel, which is why their mesh is garbage: it has significantly higher latency and power and lower performance once you add x86 core performance (Ice Lake). AMD, on the other hand, relies on large caches, just like Apple, which also uses high-cache designs.
AMD cannot scale past an 8C CCD on Zen 3 Ryzen or EPYC at the moment. On top of that, seeing your old piece on 3D V-Cache from Hot Chips, it's clear that AMD is going to scale in 3D, not 2D. That means they are going to stack the cores on top. Thermal issues are the first problem I could see there.
Finally, Intel SPR is probably the first that could scale to a 14C/28T scenario for mainstream. But Intel did not make it; instead they added crap cores and IDT to the equation on ADL, because their dense 10nm / Intel 7 has insane power consumption again, so they did not add real cores. Now we will never know how fast Intel's ring is past the Skylake designs; look at RKL, an utter failure that literally ruined the core scaling, with a crap IMC and garbage TDP. On SPR they are again NUMA like Zen 1 and have no central I/O die, as I said above. Intel cannot decouple the I/O from the cores; AMD can, which offers them superior flexibility. You should have mentioned that.
Genoa will scale the I/O die if your theory that AMD cannot scale beyond an 8C CCD is correct. Otherwise we will see a reality-shattering 12C CCD. A 12C CCD with high-speed IF and memory would be a beast in mainstream AND server/HEDT.
On Intel side new EMIB tile looks interesting but I want to see how it performs.
The three-layer variant with cache/cores/mesh really got stuck in my head; I find it very credible.
I'd love to see HBM or eDRAM variants of the SRAM V-Cache, but I can easily imagine myself adding an SRAM V-Cache to my CPU basket soon.
For CCX variants, I'd also bet on 12-cores, but I am completely at loss on how this will play out on the "low-end".
After playing around with an 8-core 5800U based notebook at 15 Watts, I can't see myself paying even a penny for 12 cores. Above 40 Watts, sure, at 100 Watts yeah, but at 15 Watts: please wake me again tomorrow.
I've had 18 (Haswell) cores on one of my workstations for years, but the 8 cores of a Ryzen 5800X match it on all thready workloads.
I struggle to imagine reaping the benefit of having 128-512 x86 4GHz cores on my desktop workstations for the majority of my daily workloads with current software.
I've worked with CUDA ML workloads for years, which use 4K cores per socket and may scale to dozens of sockets. Those problems are a match for these high core counts, and even for special architectures.
Excel (or databases) could put a CPU into every field and provide a real supercomputing calculus... but it doesn't, just like every little object in a browser's DOM won't do either.
On servers or with microservices, that matters much less, but for our workhorses aka workstations, I really see the worst of troubles ahead, because too few applications are able to exploit parallelism at these new levels.
By the time they might, hardware will have iterated several generations, and then it's anyone's guess where that might have led...
Somehow I'd just much rather have terahertz hardware, or at least memristors...
I do not understand why "three connections come with a power trade-off" with respect to "two connections". Is this concerning static power (i.e. leakage etc) or dynamic power?
Is this increase in power from using more connections not compensated by needing fewer clock cycles to transfer the signal from one node to another? I.e. the trade-off between more transistors active in one cycle versus fewer transistors active over more than one cycle?
Or are elements of these bus/mesh networks operating asynchronously? Prof. Furber and Amulet come to mind.
The article is less than educational on how such a bus or mesh is constructed in terms of transistors.
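At the level of wiring rather than transistors, the trade-off can at least be counted: each extra connection per node adds links, and thus more wires toggling per transfer (dynamic power) plus more area. A rough back-of-envelope sketch (my own simplification, not from the article):

```python
import math

# Link counts for n nodes under common on-chip topologies.
# More links per node generally means more area and more wires
# toggling (dynamic power), in exchange for fewer hops per message.
def ring_links(n: int) -> int:
    return n                        # each node has 2 connections; n links total

def crossbar_links(n: int) -> int:
    return n * (n - 1) // 2         # all-to-all: one link per node pair

def mesh_links(n: int) -> int:
    k = math.isqrt(n)               # assume a square k x k mesh
    return 2 * k * (k - 1)

for n in (4, 16, 64):
    print(f"{n:2d} nodes: ring={ring_links(n)}, "
          f"mesh={mesh_links(n)}, crossbar={crossbar_links(n)}")
```

The crossbar's quadratic link growth versus the mesh's roughly linear growth is one reason crossbars only make sense at small node counts.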
"too few applications are able to exploit parallelism at these new levels"
That seems to be rapidly addressed for any solution that demonstrates significant time savings.
Looks like the CPU memory controller bottleneck is now a problem for the large chips trying to do AI processing.
The Habana Gaudi architecture, with a RoCE controller and 10x 100GbE per package, might be the way to go for CPUs and GPUs as well.
An interesting video discussed on reddit points to limitations in OS interrupt processing as a culprit, currently limiting processing when using the latest gen Optane SSDs. See the associated youtube video "More than 15 MILLION IOPs on Xeon 8380s: The State of IO 2021".
Nice article, thank you. Well, it's midnight and I'm ready to hit the sack, so here's a mad scientist thought for the cores. Stack 'em in 4D, with a timelike separation, using hitherto unseen tech. Even better, let the cores stack with themselves. Only, the bi-directional links might be a bit of a problem: sending bits forward, no sweat; but backwards, might need a flux capacitor for that! And causality, oh, disaster ;)
Haven't read the article yet, but just want to say thank you Anandtech. For the longest time I was trying to research NOC based crossbars, but public explanations online are scarce.
This article is tech writing at its absolute best. It takes a very complex, subtle concept, breaks it down, accompanies it with terrific diagrams, then lays out a well articulated discussion. In all honesty, I'm blown away. Nice work Dr Cutress! Hat tip to you.
After 4 years of Advanced Micro Devices slaying Goliath, Ian Cutress is STILL an Intel shill. Go figure, an article about AMD laden with Intel crap. Wake up Ian, I'm starting to think you went to the same school as Mark Hibben
Absolute hogwash. Not sure what you think you're doing by posting biased garbage like this, but it does the opposite of making your case for you. Please don't.
Ian Cutress STILL gets interviews with AMD's C-level suite and is a primary AMD press partner. If what you believed was truly the case, why would AMD put up with it?
Dr. Ian gets really irritated by these comments you can tell. On the other hand, I'm personally impressed by the effort put into the explanations. Thank you for this content.
I wonder if they would try anything new after 16 cores / 32 threads becomes a household thing. Wouldn't it be more interesting to develop a wider, shorter-pipeline core and stick to the same core count, like Intel did going from NetBurst to Dothan? I don't think there is a lot to gain going from 32 to 64 threads. I know how dual core was a tremendous uplift in user experience, as were SSDs. But going from dual core with hyperthreading to quad core with hyperthreading was already, meh. Nice the moments you need some horsepower, but useless when just doing office work.
That's a true Anand article, but the thing is, as Intel, AMD and Arm compete they must produce faster and/or more efficient chips, and since they benchmark their products beyond the enthusiast's wildest dreams to inform future designs, bottlenecks are minimized in each design; even the rate of consumer CPU core growth follows in step with memory speeds, i.e. DDR5. The professional and server side will feel the bottlenecks, not the consumer. The question is, once today's dual core becomes a 16-core, and today's 8-core becomes a 64-core, how will they then improve speed; or is that the magical bye-bye-silicon moment?
shabby - Tuesday, September 7, 2021 - link
What is this... www.inteltech i mean www.amdtech.com?!?
nandnandnand - Tuesday, September 7, 2021 - link
RIP 12-core chiplet. Hello to dual-CCX 16-core chiplet.
Count the number of AnandTech logos on this page.
Eneq - Tuesday, September 7, 2021 - link
Quite right, I don't see an issue here for AMD going forward; they can keep tweaking the number of CCXs per chiplet while still maintaining a ring architecture per CCX.
pats1111 - Thursday, September 9, 2021 - link
As well as architectural superiority over Intel, famous for screwing the consumer and datacenter for decades........
Kamen Rider Blade - Tuesday, September 7, 2021 - link
I still think a 12-Core CCD is the realistic future. The math to make a 12-Core CCD is easier and more flexible.
Dual 8-Core CCX to make a CCD doesn't seem practical from a CCD perspective.
Especially if you want flexible packaging to make a CCD.
12-Core CCX allows:
3x 4-Core CCX's
2x 6-Core CCX's
an 8-Core CCX + 4-Core CCX
1x 12-Core CCX
This gives you a lot of flexibility on the consumer end when designing Ryzen while upping the core count on a CCD to a reasonable # while maintaining reasonable physical size / complexity.
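A quick sketch of the combination math above, assuming the hypothetical CCX sizes from the comment (4, 6, 8, and 12 cores — speculation, not an AMD roadmap):

```python
# Enumerate the ways to fill a 12-core CCD from the CCX sizes listed
# above. The sizes are the comment's speculation, not AMD specs.
from itertools import combinations_with_replacement

CCX_SIZES = [4, 6, 8, 12]
TARGET_CORES = 12

combos = [c for n in range(1, 4)
          for c in combinations_with_replacement(CCX_SIZES, n)
          if sum(c) == TARGET_CORES]
for combo in combos:
    print(" + ".join(f"{x}-core CCX" for x in combo))
```

This reproduces the four layouts listed: 1x 12-core, 8+4, 2x 6-core, and 3x 4-core.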
zodiacfml - Wednesday, September 8, 2021 - link
Yes! I expected this on 5nm, but it turns out AMD won't be having 5nm for a while.
Kamen Rider Blade - Wednesday, September 8, 2021 - link
AMD will get to TSMC 5nm eventually.
nandnandnand - Wednesday, September 8, 2021 - link
Moving from the 8-core unified CCX of Zen 3 to 3x 4-core or 2x 6-core CCXs could mean a performance regression for games and emulators that use 8 cores, because of the added latency. I only like a 12-core unified CCX, and Cutress seems to think that's not happening. So one or two 8-cores it is. If the CCXs can have differing core counts, that might be ideal so you can have 8+4.
The Strix Point APU was rumored to have 8+4 big/small cores. If AMD goes big/small and brings it to desktop, all bets could be off.
Kamen Rider Blade - Wednesday, September 8, 2021 - link
The Strix Point APU is a monolithic SoC designed for Mobile first and DeskTop APU second. So any BIG/little core implementation is meant for power savings and isn't a priority on DeskTop yet.
nandnandnand - Wednesday, September 8, 2021 - link
We know that Zen 4 desktop and Epyc (Raphael, Genoa/Bergamo) will use 8-core chiplets again on 5nm. I don't think we know anything concrete after that. Just a good assumption that core counts will go up for Zen 5 desktop if they stay at 16 cores on Raphael. It's entirely possible that AMD will also pursue small cores on desktop, seeing as Alder Lake is doing that in 2021.
Alder Lake's small cores will also boost performance per die area, assuming it works right.
JasonMZW20 - Saturday, September 11, 2021 - link
Symmetry is still pretty important in the Zen architecture. AMD would most likely use a dual CCX consisting of 8-cores each in a CCD. AMD can then just disable 2 cores in each to have a 12-core CCD consisting of 2x 6-core CCXes. Mainstream AM5 could remain at 16-cores max, but would drop a chiplet, which reduces cost/complexity of package and leaves more chiplets available for EPYC.
3D V-cache could link both CCXes together to access a larger, unified L3/L4 cache (L3 would probably still be contained within each CCX, but V-cache could have ring interconnect links built-in) without hitting IO die and UMCs (until dataset is too large and/or TLB starts thrashing). This would stem the latency penalty of going back to a dual CCX design per chiplet, at least versus older Zen 2 hard CCX partitions. Of course, adding another cache hierarchy reduces bandwidth and increases latency, but it's still faster than having to hit IF PHYs to access IO die, hit UMCs for data, then return to CCX. Depends on how it's implemented I guess, but current V-cache is a direct connection to existing unified L3 in an 8-core CCD, so it acts just like an expanded L3. It's definitely more difficult to connect 2 separate L3s, but I'm pretty sure it can be done as a "virtual" unified L4 or something.
nandnandnand - Sunday, September 12, 2021 - link
Zen 4 will use 8-core chiplets again, with Ryzen maxing out at 16 cores or maybe 24 cores with 3 chiplets. It seems unlikely that Zen 5 Ryzen would max out at 16 cores after Mark Papermaster hinted at future core count increases, and given renewed competition from Intel. If AMD uses a 16-core CCD/chiplet, maybe the platform will go straight to 32 cores using 2 chiplets.
nils_ - Monday, September 13, 2021 - link
I'm wondering, the chiplet approach would also allow for some sort of hybrid CPU with performance and efficiency cores, so maybe we'll see something completely different.
nandnandnand - Monday, September 13, 2021 - link
If you look at the images for the Alder Lake desktop, mobile, and ultra mobile designs, it groups 4 small cores next to 4, 3, or 1 big core, and then it has 2 of those groups for 8+8, 6+8, and 2+8 total. Having them grouped together lets a process get switched between them with low latency. An AMD big/small implementation for desktop could have both types of cores on the same chiplet. A wrinkle is that it might not be optimal for server CPUs. We also don't know if or when AMD might bring big/small to desktop.
AMD could use more advanced packaging technologies that don't have the limitations that the current chiplets have, and that could change the design as well.
Geef - Friday, September 10, 2021 - link
Luckily they didn't use smiling Poo emojis.
DannyH246 - Tuesday, September 7, 2021 - link
hahaha - it's actually funny. We've had how many Intel marketing articles in the last year, and the ONE we get about AMD is about how their current architecture is not going to support adding more cores in the future, i.e. a negative spin. Unfortunately for Ian and his insightful analysis - this was covered about 4 years ago by Jim@Adored.
Jorgp2 - Tuesday, September 7, 2021 - link
>Unfortunately for Ian and his insightful analysis - this was covered about 4 years ago by Jim@Adored.
Sure it was.
DannyH246 - Tuesday, September 7, 2021 - link
My mistake... it was 3 years ago.
https://www.youtube.com/watch?v=G3kGSbWFig4
quocka - Tuesday, September 7, 2021 - link
Roughly around the same time, Ian published his own conclusion that the core wars would take a back seat to interconnect. It became quickly evident with MCM designs that interconnect would use up more and more of the power budget. Everyone could see this; it was time for readers to look at modern chips more holistically and not just as cores and clock speed.
"Rest assured, once AMD and Intel have finished fighting over cores, the next target on their list will be this interconnect."
https://www.anandtech.com/show/13124/the-amd-threa...
This isn't about AMD hitting a wall, it's about how they'll climb over it.
Ian Cutress - Tuesday, September 7, 2021 - link
So, you missed all the Zen microarchitecture deep dives, announcements, EPYC announcements, reviews, analysis, and Lisa Su/Mark Papermaster/Forrest Norrod interviews I've done over the previous years then?
Also negative spin? What? If anything, I'm highlighting potential directions to go here. Would you prefer if I was facetious and praised the ring bus and exclaimed that it should be extended to 16 cores?
eva02langley - Tuesday, September 7, 2021 - link
Ian, you should just ignore fanboyism in the comment section. I find your analysis to be interesting. I am sure AMD and Intel have their own strategies towards core counts.
Not to mention that with AI for creating chips, we might end up with designs that are totally impossible to be designed by men... and this is coming in the next couple of years.
On my side, I believe AMD is probably working on a 12-core chiplet for Zen 4. Looking at the last 4 years, AMD just keeps bringing innovations. Like Lisa said, it is an incremental evolution toward their ultimate vision.
DannyH246 - Tuesday, September 7, 2021 - link
No I didn't, I read them all. My point was and is that Anandtech will literally print any marketing presentation Intel throws out there. The purpose of these marketing articles was to keep mindshare on Intel, despite them having very little to compete with AMD. The negative spin aspect has been ongoing for some time, and has also been noticed by others. I am not a 'Fanboy' but I am certainly a consumer who has not been impressed with Intel over the years (compiler shenanigans, high pricing, fusing CPU features, 4 cores for a decade, benchmarking, paying the large OEMs to not use AMD CPUs, etc.). Maybe I'm wrong in this instance, but on first read (plus headline) I got the sense that mindshare was being directed against AMD. Especially given that it is clear AMD have been experimenting with connecting their CPUs through the interposer on a larger (65nm) node for some time.
Ian Cutress - Tuesday, September 7, 2021 - link
That's just your own predisposed bias showing, unfortunately. This topic is literally talking about the directions AMD have been researching for when they go beyond rings, now that we know Zen 3 is some sort of ring/bisected ring. It's literally describing the innovation and R&D that AMD is putting into its topology. If you're not getting that from this article, then I've got some bad news about your personal biases.
Note that I'm the one that writes the AMD/Intel CPU topics here and it's _always_ through a critical lens. I rake each one over the coals for all the details, and I always put in the difference between proven features/claims and speculative claims. I actively avoid putting down specific claimed benchmark data if at all possible. Perhaps you have AnandTech confused with somewhere else.
DannyH246 - Tuesday, September 7, 2021 - link
Well, respectfully, there always seems to be an Intel bias here, and as I said previously I'm not the only one who has noticed this. I admit maybe I do have an AMD bias, but that is a conscious decision based on past Intel behaviour.
Qasar - Tuesday, September 7, 2021 - link
could it be just the timing of something? like the flood of intel related content is because intel just had some strong of announcement and a convention or show, ie hot chips or some other trade show? thats when these "intel bias" comments you refer to seem to get posted.
Qasar - Tuesday, September 7, 2021 - link
er some string of announcements at a convention or show :-)
whatthe123 - Wednesday, September 8, 2021 - link
It's just rabid fanboys. AnandTech recently redid a test on Epyc processors because they had weird results where the IOD was eating up way too much power at idle, and their new test showed even larger performance gains over Intel and solidified AMD's dominance in performance. Of course fanboys will just ignore that and claim the site is Intel-biased because they interview people from Intel and give them a fair review instead of just trashing them at any opportunity.
warreo - Wednesday, September 8, 2021 - link
The only thing I've noticed is that every AMD fanboy on here thinks AT is an Intel shill and every Intel fanboy on here thinks AT is an AMD shill.
Valantar - Wednesday, September 8, 2021 - link
Whether consciously chosen or not, your bias is significantly skewing your reading of this article. (Not that it matters, but I'm quite partial to AMD myself, for many of the same reasons you state.) It is in no way critical or negative, it is exploratory and factual. The questions you are reacting to - such as the title - do not show any evidence of bias, but seem like earnestly curious questions about the way forward. There has to be one, after all, and if they are currently using a ring bus, that has specific implications. This is not critical or negative. It is simply a statement of fact, and asking the subsequent question of "how will they move forward from this?" The article also details AMD's commitment to advanced packaging and chip production methods, and how these have the potential for never seen before methods of overcoming these challenges. If that reads as negative to you, you are inferring a tone that isn't present in the article.
derbarimdiger - Thursday, September 9, 2021 - link
Ian, I am reading most of your articles; you are doing a super job, no question about it. However, I think that you should spend some time clarifying this issue.
I came to think that you are rather a fan of Intel when you said that you have personally asked Intel to change the name of their chips. Why in the world would a writer ask Intel to change its marketing? Not to mention that the "Intel 7" name is perceived by many as a cheap marketing tactic. But let's not speak about the accuracy or validity of that name; my question is: how is that part of your conversation with Intel?
I am not going to say that you are against AMD or TSMC, but have you ever suggested anything to AMD on improving their marketing?
Oxford Guy - Thursday, September 9, 2021 - link
Stupid objection. Informed people should make the effort to tell corporations more about what products should be made rather than being passive.
My good friend recently sent a message to Honda demanding that it have more respect for its design team in terms of aesthetics and less focus on copying the industry trend of immature, angry-looking vehicles (for the plebs — rich people can buy pleasant-looking designs).
The difference between Cutress making a suggestion to Intel and my friend making a suggestion to any major corporation is that he can bypass the corporate communications firewall — the thing designed to keep the great unwashed out of the picture (except for their wallets, of course).
GeoffreyA - Monday, September 13, 2021 - link
Spot on, and the remark on cars too.
Valantar - Friday, September 10, 2021 - link
I'm not privy to Ian's thinking, but the answer to this seems pretty obvious: he has contacts, and Intel's naming is terrible. Also, when he talks about asking them to change the naming of their chips he's talking about product names, not node names, i.e. Core i9-1123456789hqx number-salad naming, which is confusing garbage stupidity on so many levels.
Though to be fair, while changing the naming of their node from "10nm" to "7" does come off as weird, it only really highlights that node names (at all fabs!) are nonsense anyway, and not really representative of any feature size in the node. And given that Intel's latest node is pretty similar in its features to other fabs' "7nm" nodes, naming them similarly is less confusing overall (you no longer need the "10 can be as good as 7 depending on where it's coming from" arguments) and thus benefits everyone.
In other words: both asking them to change their product naming system and embracing the node renaming is a positive attitude towards clearer, less confusing naming practices. And that's a good thing, regardless of your views of Intel. I haven't bought an Intel product since my Core2Quad back in 2008 and don't have any plans to do so in the near future, but I still appreciate reductions in stupidity on their part.
StevoLincolnite - Tuesday, September 7, 2021 - link
I moderate some big sites/forums and groups.
I can assure you Ian, there is absolutely NOTHING you can do to please everyone.
You do a good job, there is a reason why I have visited Anandtech almost daily for decades.
drothgery - Tuesday, September 7, 2021 - link
Because what Intel has in the pipeline and has released info on is more interesting (it's a much larger change from their current offerings) and is much closer to release (AMD's not saying much about Zen 4 or RDNA3, and the only thing new coming any time soon is the chips with extra L3 cache)? If they were reporting misleading numbers in actual reviews, that would be a different story.
Eneq - Tuesday, September 7, 2021 - link
More interesting in this case is also more complex, and larger changes mean higher risk. Given the issues Windows has had with core priority and proper scheduling, it will be interesting to see how they handle this.
edzieba - Wednesday, September 8, 2021 - link
Intel has a pretty good track record of cooperating with Microsoft (and other vendors) to optimise for architectures before release - e.g. SSE vs 3DNow. AMD generally takes the release-first-optimise-later approach ('fine wine').
lmcd - Wednesday, September 8, 2021 - link
That's not a fair statement considering that MS has given Intel preferential treatment. At times it has been deserved but at times it has not.
Spunjji - Friday, September 10, 2021 - link
Intel has enjoyed a much larger budget for developer relations, and (historically) a closer relationship with Microsoft.
hansmuff - Tuesday, September 7, 2021 - link
Got a link?
chlamchowder - Tuesday, September 7, 2021 - link
Where does it say their current architecture is not going to support adding more cores? The article talks about ring scalability, and gives an example of Intel scaling the ring to 12 cores. Intel has also been using 10-core rings on Comet Lake.
Sure, we don't see 16 cores on a single ring. But you're probably looking at a much bigger die at that point, and the point of chiplets is to avoid large dies.
Kamen Rider Blade - Tuesday, September 7, 2021 - link
This is why I think 12 cores is the maximum that AMD will scale up to on a per-CCX or CCD basis.
12 cores on a Ring-Bus is pretty damn good, any more will have too much latency / bus contention.
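The latency half of that argument is easy to put numbers on. A minimal sketch, assuming an idealized bidirectional ring where traffic takes the shorter direction (this ignores per-hop costs and the extra GPU/IO stops real rings carry):

```python
def avg_ring_hops(n: int) -> float:
    """Mean shortest-path hop count between two distinct stops on a
    bidirectional ring of n stops (traffic can go either direction)."""
    distances = [min(d, n - d) for d in range(1, n)]
    return sum(distances) / (n - 1)

# Average hops keep climbing roughly with n/4 as the ring grows, which
# is the latency/contention argument against stretching a ring further.
for stops in (8, 10, 12, 16):
    print(f"{stops} stops: {avg_ring_hops(stops):.2f} average hops")
```

The average distance grows linearly with stop count, so every added core makes the whole ring a little slower for everyone.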
gonsalvg - Tuesday, September 7, 2021 - link
gotta be careful using their shade of blue without permission!
Oxford Guy - Thursday, September 9, 2021 - link
Ah... another illustration of the ‘first comment, worst comment’ principle.
Spunjji - Friday, September 10, 2021 - link
Sadly so
pats1111 - Thursday, September 9, 2021 - link
Precisely my sentiments Shabby
Stuka87 - Tuesday, September 7, 2021 - link
Some really great details in this article that at least I had not really thought about previously. Great work Ian.
dwillmore - Tuesday, September 7, 2021 - link
I love the i740 photo in the Rocket Lake ring bus diagram! Well done! I'm down with AGP, yeah, you know me!
Spunjji - Friday, September 10, 2021 - link
Yeah, I enjoyed that little touch too :D
misel228 - Tuesday, September 7, 2021 - link
I love how you used an old AGP card for the GPU :D
yankeeDDL - Wednesday, September 8, 2021 - link
I noticed that too! I told myself: wow, that card sure looks like one of the old ones!
repoman27 - Tuesday, September 7, 2021 - link
“In next-generation Sapphire Rapids, Intel is giving each CPU 4 connections, for 1.43 average hops.”Ian, I think you meant Alder Lake, no? Pretty sure Sapphire Rapids is mesh.
Jorgp2 - Tuesday, September 7, 2021 - link
He's talking about the UPI links.
repoman27 - Tuesday, September 7, 2021 - link
Never mind, I see that you were just shifting between talking about core-to-core and socket-to-socket topologies without really explaining that. Or stating which level of the Sapphire Rapids interconnect hierarchy you were referring to (core-to-core, tile-to-tile, or socket-to-socket).
Rogerdodge - Tuesday, September 7, 2021 - link
Looking at the slide you posted doesn't make me think ring bus. It makes me think the L3 cache is the interconnect. Maybe I misunderstood something about the architecture, but all cores can write to any part of the L3, right? So they could also read any part of the L3? Why would you need any other way to connect the cores if you can access any part of the L3 at comparable latency? Is the L3 slower than core to core? Is it lower bandwidth?
Rogerdodge - Tuesday, September 7, 2021 - link
I looked again and see that the slide specifically says ring bus... now I feel kinda stupid...
xenol - Tuesday, September 7, 2021 - link
I have to ask, but where did the name "ButterDonut" come from?
Ian Cutress - Tuesday, September 7, 2021 - link
A Torus is basically a donut, mixed with the Butterfly topology = ButterDonut.
xenol - Tuesday, September 7, 2021 - link
And here I was thinking it was something food related. I guess this is what I get for reading this before having breakfast.
Kevin G - Tuesday, September 7, 2021 - link
He hasn't had enough chips to eat today.
Oxford Guy - Thursday, September 9, 2021 - link
Rumour has it that it was commissioned by law enforcement.
Spunjji - Friday, September 10, 2021 - link
I'd have put my vote in for Buttorus
kpb321 - Tuesday, September 7, 2021 - link
Beyond the technical issues of how you do it in chips, I think there is another big question: does AMD finally move to multiple CCX designs across its product stack, or does it stick with one CCX design? So far AMD has had only a single CCX design across its entire product stack, so everything from laptop APUs with integrated graphics up to their biggest Epyc server CPUs has used the same base CCX. When it was 4 cores, their lowest end consumer CPUs and APUs could have a single 4 core CCX. With Zen 3 going to an 8 core CCX, those were pushed up to an 8 core CCX. Going to a 16 core CCX for future Epyc chips might be nice on that end, but making the minimum CCX size 16 cores on the consumer end might be a bit much. Given that, I think the CCX size will stay at 8 for a while, and they might either go with more chiplets or a larger chiplet with 2 8-core CCXs.
Thanny - Tuesday, September 7, 2021 - link
APUs don't use chiplets. They are monolithic to save on power.
My money would be on AMD continuing to use small chiplets and relying on more packaging tricks to increase the number of cores per processor. This will let them continue the extremely effective strategy of using the same die across desktop, workstation, and server products.
kpb321 - Tuesday, September 7, 2021 - link
They don't use chiplets, but so far they have always used the same base CCX design that other chips have used. Zen 1/2 had a 4-core CCX, and the APUs all had a single CCX with up to 4 cores plus graphics, memory, IO etc. With Zen 3 the CCX size changed to 8 cores, the APUs moved to that same 8-core CCX, and we got 8-core APUs in our laptops. Going to 12- or 16-core CCXs for Epyc makes sense, but a 12- or 16-core CCX on a laptop APU seems like it might be a bit of overkill. Certainly they could move to having two different CCX designs, but that seems like a fair bit of extra work that they have tried to avoid so far. For Ryzen chips AMD has done a lot to reuse things as much as they can from top to bottom: using the same chiplet across the board, using the same CCX in both chiplets and APUs. I'd bet they probably also reused the memory controller design between APU and chiplet/IO die etc. Being the smaller company with lower CPU volume means they benefit more from these things, whereas Intel can afford 3 different designs just for their server CPUs because they have the volume and size to support that.
Kamen Rider Blade - Tuesday, September 7, 2021 - link
I think AMD will be going with multiple CCX configurations in the future:
4-core CCX
6-core CCX
8-core CCX
12-core CCX
Thanks to 3D V-cache, they won't be starved of L3, even in smaller Core/CCX configurations.
Spunjji - Friday, September 10, 2021 - link
That seems like overkill. I can see them having a 4-core CCX design for ultra-mobile / low power designs, and maybe a 12-core CCX for server, eventually moving down to desktop level.
Spunjji - Friday, September 10, 2021 - link
Small correction - Zen 2 APUs had up to 8 cores, with a dual-CCX design. I have an 8-core notebook based around the 4800H.
I do agree that adding more cores to the CCX would create something of a problem in terms of designs, but it's likely they're going to respond to that by increasing the number of parallel design teams. After all, the Strix Point APU will reportedly be using small cores, and there's not yet any sign of that in the chiplets for desktop / server.
GeoffreyA - Saturday, September 11, 2021 - link
I feel they're going to continue with an 8-core CCX for a while, though of course two CCX variants are not impossible either. Depends on how much work they're willing to do.
GeoffreyA - Saturday, September 11, 2021 - link
Or, if not 8, it seems AMD's style to pump it up to 16 in one go. I doubt 12 cores.
Kamen Rider Blade - Tuesday, September 7, 2021 - link
I think Mobile APUs will remain monolithic SoCs due to power requirements.
DeskTop & Enterprise will fully go chiplet to optimize for their power parameters / capabilities.
chlamchowder - Tuesday, September 7, 2021 - link
Nice to see a technical article. Intel showed that 10 cores (plus iGPU and IO) on a ring can work in client CPUs that are sensitive to latency. I think AMD staying at 8 cores per chiplet is driven by die size issues, rather than interconnect.
nandnandnand - Tuesday, September 7, 2021 - link
The 8-core chiplets are relatively small, and AMD/TSMC has had great yields, to the point where they are disabling good cores to make 6-core parts and products like the Ryzen 3 3300X don't make sense. TSMC 5nm is apparently better than 7nm was at the same point in its development.
Stuffing 16 cores into the chiplet could actually help with binning. 8 cores with low latency can be the "unit" for the foreseeable future, so dual-CCX with 8 cores each is fine.
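The binning argument can be put in rough numbers with simple binomial math. A hedged sketch, assuming independent per-core defects at a made-up 5% rate (purely illustrative, not real TSMC yield data):

```python
from math import comb

def p_at_least_good(total: int, needed: int, p_defect: float) -> float:
    """Probability that at least `needed` of `total` cores are defect-free,
    assuming independent per-core defects with probability p_defect."""
    q = 1.0 - p_defect
    return sum(comb(total, k) * q**k * p_defect**(total - k)
               for k in range(needed, total + 1))

P_DEFECT = 0.05  # assumed per-core defect rate, illustrative only
for needed in (16, 12, 8):
    print(f"P(at least {needed}/16 good cores) = "
          f"{p_at_least_good(16, needed, P_DEFECT):.3f}")
```

Under these assumptions, a 16-core die that can also ship as a 12- or 8-core part salvages nearly every die that isn't fully functional, which is the appeal of bigger chiplets for binning.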
Kamen Rider Blade - Tuesday, September 7, 2021 - link
Here's what I think the Inter-Core connectivity could look like per CCX/CCD:
https://i.imgur.com/MPw7c9f.png
Here's what I think future EPYC CCD Layouts would look like
https://i.imgur.com/cIz2y92.png
A layer of Interposer will be for simple 2D mesh connectivity between CCD's
Another layer of Interposer will be for direct connection to the Central I/O die
A final layer of Interposer will be for a direct connection to L4 Cache
Upper Limit of design based on leaked CPU PCB Substrate size is shown above
NOTE:
- Eventually each core will get to SMT4, then SMT8 at some point, with future iterations of Zen
- 12-core CCD's will eventually be made by either (3x 4-Core CCX's, 2x 6-Core CCX's, 8-Core+4-Core CCX's, or 1x 12-Core CCX)
- EPYC will go from 96 Cores with 12 CCD's in the rumored Genoa to:
+ My Hypothetical / Speculative configurations based on future 12-Core CCD's:
_- 12-Core CCD x 12 CCD's per EPYC CPU = 144 Core
_- 12-Core CCD x 18 CCD's per EPYC CPU = 216 Core
_- 12-Core CCD x 24 CCD's per EPYC CPU = 288 Core
flgt - Tuesday, September 7, 2021 - link
Interesting article Dr. Cutress, thanks for putting it together.
ZolaIII - Tuesday, September 7, 2021 - link
A rather boring read. If you want to see how interconnecting can be tricky, and creative ways to do it, take a look at larger FPGAs. I really think introducing another level or levels of victim cache will do very little in the long term other than raising the price.
A larger L2, even if shared between two cores, makes much more sense. An async multi-level bi-directional ring bus is the way to go, but towards RAM, which needs to shift towards HBM or HMC. Somehow the experience with DSPs regarding optimal word length and the ability to feed it efficiently should be a guideline regarding the ring bus for all processing blocks (SIMDs, general purpose, DSP, GPU...).
Wereweeb - Tuesday, September 7, 2021 - link
I still don't get why people focus so much on HBM while ignoring the OMI.
Thud2 - Tuesday, September 7, 2021 - link
Those maps of the elements in each design are starting to read like the Kama Sutra.
GeoffreyA - Friday, September 10, 2021 - link
It's the congress of the Elephant, coupled with striking of the cores.
Thud2 - Tuesday, September 7, 2021 - link
Sorry, off topic but anyone have any updates on how Ryan is doing?
eastcoast_pete - Tuesday, September 7, 2021 - link
So, how three-dimensional do those butter donuts get? Also, how much overhead in both silicon and power usage does managing those many cores in many chiplets add? Since we're using food analogies, where do you think the butter zone for cores per chiplet and chiplets per CPU is, at least currently?
Lastly, here's a wild thought: expand something like Alder Lake's Thread Director to actively manage a 3D meshed chiplet-based CPU, especially if it uses big and little cores. When there still was a SUN, their slogan was "the network is the computer"; looks like that's now very much true for the CPU as well.
Kevin G - Tuesday, September 7, 2021 - link
Rocket Lake actually has 12 ring stops: the GPU has two stops and the PCIe IO complex has one of its own. (Also, why use an ancient AGP graphics card in the graphic?) The largest ring complex Intel has built was for Comet Lake at 14 stops (10 cores, 2 GPU, 1 IO, 1 memory). The speculative theory is that Intel is limited to a total of 16 stops on the ring. As mentioned, legacy Xeons had begun to add separate ring systems to scale upward in core counts, but those two never exceeded 14 hops around. The oddity was Westmere-EX, which wasn't fully bi-directional and had some really weird latency characteristics.
I have a quibble about crossbars: the reason why they're generally avoided is that they become incredibly complex as the number of nodes on them increases. Pretty much every generation of CCX would require its own custom-sized crossbar as core count increases. It is common practice to over-design a crossbar and then let it fill up as product lines evolve/die shrinks become possible. I do think we'll see AMD using a crossbar, but on the IO die where the number of nodes will not radically change over time.
The interposer can indeed hold a complex topology between dies. However, I would argue that it'd wise to go with a more active interposer using a crossbar instead of a more exotic mesh-like topology. The crossbar would produce heat but it'd be relatively small. Such a cross bar interposed would be a per-product which each main passive interposer already is a per-product design to begin with.
One thing glossed over with AMD's current topologies is how many links exist between the IO die and each CCD. In Rome there were two links between the IO die and each CCD, to match the number of CCXs inside the CCD. Presumably, with one CCX per CCD in Milan, there would be potential to double the number of CCDs/CCXs, but at the expense of per-CCX bandwidth. (This is where the 3D V-Cache comes into play to help counter the bandwidth issue.)
AMD does have another method of increasing the number of cores in a system: increase the socket count. It has puzzled me why Rome/Milan do not support a three-socket system. Yes, it is a bit odd, but the bandwidth and links are there to do a solid point-to-point topology. Adding more sockets is also viable for Intel in these large, complex servers, as they are expected to migrate to optical interconnects using on-package silicon photonics at some point (Intel has demoed a proof of concept of this in the past).
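For the multi-socket idea above, the link budget is easy to check (my own arithmetic, not from the comment): a fully connected point-to-point topology over n sockets needs n(n-1)/2 links, so a three-socket box needs only one more link than a two-socket one.

```python
# Sketch: inter-socket link count for a fully connected topology.
def p2p_links(n: int) -> int:
    # Each unordered pair of sockets gets one direct link.
    return n * (n - 1) // 2

for sockets in (2, 3, 4, 8):
    print(sockets, "sockets ->", p2p_links(sockets), "links")
```

This is why the jump from 2 to 3 sockets (1 to 3 links) is cheap, while 8 sockets (28 links) is where glueless point-to-point stops scaling.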
Speaking of the Intel side of things, it boggles my mind that they have been stuck on Skylake-SP/Cascade Lake-SP in servers for so long, yet simply haven't shattered the design into CPU chiplets + IO chiplets linked together via EMIB/interposer. Sapphire Rapids isn't even following this, as it includes some IO functionality on each die, more akin to AMD's Naples design. I fathom we'll see this post-Sapphire Rapids, but Intel has certainly taken their time and transitioned to a defensive posture.
JayNor - Tuesday, September 7, 2021 - link
Intel's recent patent app 20210263880, "DISAGGREGATED DIE WITH INPUT/OUTPUT (I/O) TILES", surrounds small processor tiles with narrow IO tiles, connected by EMIB, in a pattern that can scale to a larger extent than the single IO tile.
Ian Cutress - Tuesday, September 7, 2021 - link
Note, I can count 17 ring stops on the first Broadwell XCC ring.
WaltC - Tuesday, September 7, 2021 - link
I think you are comparing Intel's traditional monolithic designs with AMD's state-of-the-art chiplet designs. You seem to be comparing a single AMD CCX to an entire Intel CPU, and trying really, really hard to see this as some sort of negative for AMD. But the fact is, of course, that AMD is not limited to a single CCX in a given CPU--which seems to me the whole point of a chiplet approach being a net positive over a monolithic design. You are also assuming some sort of equity between Intel's tiles and AMD's chiplets, an equity that has yet to be demonstrated. We know all about chiplets, as AMD has been shipping them for years--we know nothing about Alder Lake apart from marketing chatter thus far, not even a shipping date. It's almost like Intel is whispering in your ear, "Here is where AMD is limited," which you aren't questioning too closely...;) It remains to be seen what Intel will come up with to attempt to compete, and of course AMD is looking at Zen 4 & Zen 5, which further muddles any competitive picture from Intel at present.
You know, of course, that writing articles titled with question marks is bad form grammatically and journalistically. The reason it's disdained by real journalists is that it usually winds up with the article failing to answer the question raised in the title--as is true here. If you can answer the question you wish to pose, then a declarative title, such as "An AMD chiplet has a core-count limit" or "An AMD chiplet has no core-count limit", would be much better. And of course the article would go on to present the limit--or present evidence that the core count need not be limited. You are not alone in writing articles with question marks these days, especially when the question posed in the title isn't answered. (So, I'm not picking on you in particular...;))
whatthe123 - Tuesday, September 7, 2021 - link
Scaling up through multiple CCXs is quite literally in the article, and it wouldn't make much sense to do after spending all their time designing a unified bus. Not to mention their current design is the first in years that handily beats Intel's designs in all performance metrics, so why intentionally regress just to get more cores?
Ian Cutress - Tuesday, September 7, 2021 - link
You're clearly too biased to make a coherent comment. Assuming you even read the article from top to bottom. Betteridge's Law doesn't really apply any more; it's a feature of an older era.
Spunjji - Friday, September 10, 2021 - link
Really weird to see the number of people leaving comments dedicated to reading this in the worst spirit possible. Your criticism of the headline is especially asinine: your suggestions both involve more bias, and the article answered its posed question entirely to my satisfaction (there are limits, and there are ways for them to address those limits).
boozed - Wednesday, September 8, 2021 - link
Misaligned ButterDonut.
Hixbot - Wednesday, September 8, 2021 - link
Sounds tasty.
sweMike - Wednesday, September 8, 2021 - link
Well, this article fails to mention several things. First, the biggest advantage of going to chiplets is the increased performance and yield due to binning, i.e. smaller chips have a lower chance of an error ruining the chip or degrading performance. Secondly, one big avenue not mentioned is active interposers. Here they are portrayed as more or less passive. There are a lot of interesting possibilities with putting logic into the actual interposer. This article is pretty bad at covering this highly interesting area and focuses on only one aspect, missing several key points. "AdoredTV" on YouTube has several videos covering this exact topic much better.
Oxford Guy - Thursday, September 9, 2021 - link
'There are a lot of interesting possibilities with putting logic into the actual interposer.'
The article as I read it included that bit.
Ian Cutress - Sunday, September 12, 2021 - link
1) Chiplets. Oh dear god, we've covered that 50-100 times on the site. It was assumed as given.
2) Active interposers - did you not read the second half of the article? That's what this is _all_ about.
This comment is pretty bad at actually reading the content it's on.
Silver5urfer - Wednesday, September 8, 2021 - link
I could understand this piece a bit, but it's hard to get everything right. That said, there is a fundamental difference between AMD's ring and Intel's.
AMD has an I/O die advantage over Intel. Intel maxed out at 10C/20T with CML, including all PCIe/iGPU/memory on the ring, while AMD maxes out at 16C/32T because of the I/O die. That advantage is not there for Intel, which is why their mesh is garbage: it has significantly higher latency and power and lower performance when you add x86 core perf (Ice Lake). AMD, on the other hand, relies on large caches, just like Apple, which also has high-cache designs.
AMD cannot scale past an 8C CCD on Zen 3 Ryzen or EPYC at the moment. On top of that, seeing your old piece on 3D V-Cache from Hot Chips, it's clear that AMD is going to scale in 3D, not 2D. That means they are going to stack the cores on top. Thermal issues are the first problem I could see there.
Finally, Intel SPR is probably the first to scale to a 14C/28T scenario for mainstream. But Intel did not make it; instead they added crap cores and IDT to the equation on ADL, because their dense 10nm / Intel 7 has insane power consumption, so again they did not add real cores. Now we will never know how fast Intel's ring is past the Skylake designs; look at RKL, an utter failure that literally ruined the core scaling, with a crap IMC and garbage TDP. On SPR they are again NUMA like Zen 1 and have no central I/O die, as I said above. Intel cannot decouple the I/O from the cores; AMD can, which offers them superior flexibility. You should have mentioned that.
Genoa will scale the I/O die if your theory that AMD cannot scale beyond an 8C CCD is correct. Else we will see a reality-shattering 12C CCD. A 12C CCD with high-speed IF and memory would be a beast in mainstream AND server/HEDT.
On Intel side new EMIB tile looks interesting but I want to see how it performs.
529th - Wednesday, September 8, 2021 - link
Do these topologies change even within a certain generation? E.g. 5600X vs. a 5800X? What's the 4-core version, btw?
Thanks for the write-up! Enjoyed the reading.
abufrejoval - Wednesday, September 8, 2021 - link
The three-layer variant with cache/cores/mesh really got stuck in my head; I find it very credible.
I'd love to see HBM or eDRAM variants of the SRAM V-Cache, but I can easily imagine myself adding an SRAM V-Cache to my CPU basket soon.
For CCX variants, I'd also bet on 12-cores, but I am completely at loss on how this will play out on the "low-end".
After playing around with an 8-core 5800U based notebook at 15 Watts, I can't see myself paying even a penny for 12 cores. Above 40 Watts, sure, at 100 Watts yeah, but at 15 Watts: please wake me again tomorrow.
I've had 18 (Haswell) cores on one of my workstations for years, but the 8 cores of a Ryzen 5800X match it on all thready workloads.
I struggle to imagine reaping the benefit of having 128-512 x86 4GHz cores on my desktop workstations for the majority of my daily workloads with current software.
I've worked with CUDA ML workloads for years, which use 4K cores per socket and may scale to dozens of sockets. Their problems match these high core counts and even special architectures.
Excel (or databases) could put a CPU into every field and provide a real supercomputing calculus... but it doesn't, just like every little object in a browser's DOM won't do either.
On servers or with microservices, that matters much less, but for our workhorses aka workstations, I really see the worst of troubles ahead, because too few applications are able to exploit parallelism at these new levels.
By the time they might, hardware will have iterated several generations, and then it's anyone's guess where that might have led...
Somehow I'd just much rather have terahertz hardware or at least memristors...
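The worry above about software failing to exploit 128-512 cores is essentially Amdahl's law. A small sketch (my own illustration; the 90% parallel fraction is an assumed figure, not from the comment) shows how quickly speedup flattens:

```python
# Amdahl's law: if a fraction p of a workload parallelizes perfectly,
# the speedup on n cores is 1 / ((1 - p) + p / n).
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Assuming a desktop app that is 90% parallel (illustrative value only):
for cores in (8, 16, 128, 512):
    print(cores, "cores ->", round(amdahl_speedup(0.90, cores), 2), "x")
```

With 10% serial work, 512 cores deliver under 10x over one core, which is the mismatch between core counts and typical desktop software that the comment is pointing at.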
ShaftedByHaswell - Wednesday, September 8, 2021 - link
I do not understand why "three connections come with a power trade-off" with respect to "two connections". Is this concerning static power (i.e. leakage, etc.) or dynamic power?
Is this increase in power from using more connections not compensated by needing fewer clock cycles to transfer the signal from one node to another? I.e. the trade-off between more transistors active in one cycle versus fewer transistors active during more than one cycle?
Or are elements of these bus/mesh networks operating asynchronously, prof. Furber and Amulet being in mind?
The article is less than educational on how such a bus or mesh is constructed in terms of transistors.
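One way to reason about the static-versus-dynamic question above is a toy model (entirely my own assumption, not from the article): charge each link port per node a fixed static cost, and each hop a message takes a dynamic cost. More links per node cut the average hop count but raise the always-on cost; the constants below are arbitrary placeholders.

```python
# Toy power model for the links-vs-hops trade-off. All cost constants are
# made up for illustration; real values depend on the process and PHY.
def ring_avg_hops(n: int) -> float:
    # Average shortest-path hop count on a bidirectional ring of n stops.
    return sum(min(k, n - k) for k in range(1, n)) / (n - 1)

def relative_power(n: int, links_per_node: int, avg_hops: float,
                   static_per_link: float = 1.0,
                   dyn_per_hop: float = 1.0) -> float:
    # Static term: every port burns power whether or not it's used.
    # Dynamic term: each hop of an average message costs energy.
    return n * links_per_node * static_per_link + avg_hops * dyn_per_hop

# 16 nodes: a 2-link-per-node ring vs. a hypothetical 3-link-per-node
# topology assumed (for illustration) to halve the average hop count.
ring = relative_power(16, 2, ring_avg_hops(16))
dense = relative_power(16, 3, ring_avg_hops(16) / 2)
print("ring:", round(ring, 2), "denser:", round(dense, 2))
```

With these placeholder constants the static port cost dominates, which is one plausible reading of why "three connections come with a power trade-off": the extra always-on port can outweigh the cycles saved per transfer.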
JayNor - Wednesday, September 8, 2021 - link
"too few applications are able to exploit parallelism at these new levels"That seems to be rapidly addressed for any solution that demonstrates significant time savings.
Looks like the CPU memory controller bottleneck is now a problem for the large chips trying to do AI processing.
The Habana Gaudi architecture, with ROCE controller and 10x 100Gbe per package might be the way to go for CPUs and GPUs as well.
An interesting video discussed on reddit points to limitations in OS interrupt processing as a culprit, currently limiting processing when using the latest gen Optane SSDs. See the associated youtube video "More than 15 MILLION IOPs on Xeon 8380s: The State of IO 2021".
GeoffreyA - Wednesday, September 8, 2021 - link
Nice article, thank you. Well, it's midnight and I'm ready to hit the sack, so here's a mad scientist thought for the cores. Stack 'em in 4D, with a timelike separation, using hitherto unseen tech. Even better, let the cores stack with themselves. Only, the bi-directional links might be a bit of a problem: sending bits forward, no sweat; but backwards, might need a flux capacitor for that! And causality, oh, disaster ;)
Dr_Mobeyos - Wednesday, September 8, 2021 - link
Haven't read the article yet, but just want to say thank you Anandtech. For the longest time I was trying to research NOC-based crossbars, but public explanations online are scarce.
Bik - Thursday, September 9, 2021 - link
After Ian launched potatotech on YouTube, his articles' titles are getting better and better as a result!
ajcarroll - Thursday, September 9, 2021 - link
This article is tech writing at its absolute best. It takes a very complex, subtle concept, breaks it down, accompanies it with terrific diagrams, then lays out a well-articulated discussion. In all honesty, I'm blown away. Nice work, Dr Cutress! Hat tip to you.
pats1111 - Thursday, September 9, 2021 - link
After 4 years of Advanced Micro Devices slaying Goliath, Ian Cutress is STILL an Intel shill. Go figure, an article about AMD laden with Intel crap. Wake up, Ian; I'm starting to think you went to the same school as Mark Hibben.
Spunjji - Friday, September 10, 2021 - link
Absolute hogwash. Not sure what you think you're doing by posting biased garbage like this, but it does the opposite of making your case for you. Please don't.
Ian Cutress - Sunday, September 12, 2021 - link
Ian Cutress STILL gets interviews with AMD's C-level suite and is a primary AMD press partner. If what you believed was truly the case, why would AMD put up with it?
Makste - Tuesday, September 21, 2021 - link
Dr. Ian gets really irritated by these comments, you can tell. On the other hand, I'm personally impressed by the effort put into the explanations. Thank you for this content.
Foeketijn - Friday, September 10, 2021 - link
I wonder if they would try anything new after 16 cores / 32 threads becomes a household thing.
Wouldn't it be more interesting to develop a wider, shorter-pipeline core and stick to the same amount?
Like Intel did going from netburst to Dothan?
I don't think there is a lot to gain, no, going from 32 to 64 threads. I know how dual core was a tremendous uplift in user experience, as were SSDs. But going from dual core with hyperthreading to quad core with hyperthreading was already, meh. Nice the moments you need some horsepower, but useless when just doing office work.
GeoffreyA - Saturday, September 11, 2021 - link
M1 having recalibrating the landscape, I wouldn't be surprised if Intel/AMD drop to lower clocks and higher IPC in the future.
nandnandnand - Saturday, September 11, 2021 - link
The immediate answer for Intel is clearly big/small cores, and lower clocks on the small cores. We'll have to wait and see if AMD follows suit.
GeoffreyA - Saturday, September 11, 2021 - link
Looking forward to seeing Alder Lake in action. Performance is going to be very high, but the power, that is the question.
GeoffreyA - Saturday, September 11, 2021 - link
* recalibrated
ericore - Wednesday, October 13, 2021 - link
That's a true Anand article. But the thing is, as Intel, AMD, and Arm compete, they must produce faster and/or more efficient chips, and since they benchmark their products beyond the enthusiast's wildest dreams to inform future designs, bottlenecks are minimized in each design; even the growth rate of consumer CPU cores follows in perfect time with memory speeds (DDR5). The professional and server side will feel the bottlenecks, not the consumer. The question is, once today's dual core becomes 16 cores, and today's 8-core becomes 64, how then will they improve speed; or is that the magical bye-bye-silicon moment?