
  • nandnandnand - Tuesday, June 1, 2021 - link

    Here's some unrealistic hype: Zen 3+ Rembrandt APUs with 3D L3 cache.
  • del42sa - Tuesday, June 1, 2021 - link

    for a price that nobody is willing to pay, sure :-)
  • nandnandnand - Tuesday, June 1, 2021 - link

    You could get a 5700G OEM system for $550 on sale recently. Zen 3+ with 3D cache and RDNA2 next year? I'll throw another $50 at that.
  • 0ldman79 - Wednesday, June 2, 2021 - link

    I'd expect the cache tax to be about triple that myself.

    It's not just the cost of the cache die; it's the R&D, the labor, etc...

    They might surprise me, but I'm betting all 3D cache products will be $150 over the standard chips.
  • haukionkannel - Wednesday, June 2, 2021 - link

    New tech and not cheap to produce, so $150 extra seems low… but let's see.
    Normal Zen 3 and these will both be produced at the same time, to keep normally priced CPUs available!
  • Hifihedgehog - Tuesday, June 1, 2021 - link

    SteamPal V2, perhaps? If (1) the first-gen SteamPal even materializes and (2) it is a runaway success, that is.
  • nandnandnand - Tuesday, June 1, 2021 - link

    I think SteamPal v1 would hang around on Van Gogh for a couple of years, to be replaced by something newer than Dragon Crest or Rembrandt.
  • Hifihedgehog - Tuesday, June 1, 2021 - link

    Just imagine: Intel's Crystal Well L4 eDRAM provided only 100 GB/s from its 128 MB cache, with far higher latency. Imagine what 2 TB/s of low-latency 128 MB could provide an APU's integrated graphics.
  • Hifihedgehog - Tuesday, June 1, 2021 - link

    *AMD's 3D V-Cache offers 20 times the bandwidth of Crystal Well. ;)
  • Hifihedgehog - Tuesday, June 1, 2021 - link

    Careful observers have also fielded the idea of putting this 3D V-Cache to work in APUs. There it could be used as insanely fast on-package memory for integrated graphics. Intel's Crystal Well, a 128MB eDRAM-based L4 cache as seen in 5th Gen Broadwell, provides a paltry 100 GB/s of bandwidth. Well, AMD's 3D V-Cache offers 20 times the bandwidth of Crystal Well. Even a stack of HBM2E, while more capacity dense, falls massively short in bandwidth at just 480 GB/s. This would be a dream come true in an APU...
  • lightningz71 - Tuesday, June 1, 2021 - link

    This would absolutely be a game changer for both the H series APUs and the U series APUs for different reasons.

    The H series APUs won't be using their integrated GPU often, but, compared to their desktop cousins, they have half the L3 cache. Compared to Intel's 8-core Tiger Lake H series processors, they have only 2/3 the L3 cache. Just adding a 64MB cache module puts them above the current desktop processors in total L3 and gives them nearly 4x the cache of the Tiger Lake 8-core H series. If it bears the same performance uplift indicated in the presentation (it should be higher, due to the greater proportional cache increase), it will easily take back the performance crown in mobile. It would also help in those on-battery situations where the dGPU might be turned off to conserve battery life, giving the iGPU significantly greater performance in some situations.
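
    A rough sanity check on those numbers (a sketch; it assumes 16 MB of L3 on current mobile Zen 3, 24 MB on the 8-core Tiger Lake H, and 32 MB per desktop Zen 3 CCD):

        # hypothetical cache totals in MB; figures assumed from public specs
        mobile_zen3_l3 = 16    # Cezanne H-series (assumed)
        desktop_zen3_l3 = 32   # per Zen 3 CCD
        tiger_lake_h_l3 = 24   # 8-core Tiger Lake H (assumed)

        stacked = mobile_zen3_l3 + 64        # one 64 MB V-Cache die on top
        print(stacked)                       # 80 MB, above the desktop CCD's 32 MB
        print(stacked / tiger_lake_h_l3)     # ~3.3x Tiger Lake H, i.e. "nearly 4x"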

    For the U series APUs, AMD already has the Infinity Cache implementation to pattern this off of. Just dedicating it to that purpose could give the iGPU a major performance uplift while taking pressure off of the memory bus for the processor. If they instead used it as a shared cache for memory controller access, like an L4, it could help both. I don't think that the current Vega 8 implementation would really benefit much from it, though, as memory overclocking beyond spec speeds doesn't yield anywhere near linear performance increases. This is more of a next-gen solution.
  • mode_13h - Tuesday, June 1, 2021 - link

    SRAM is going to burn some non-negligible amount of power. And I don't know that you get much back from intermittently powering it down. It depends a lot on how clever that logic is, and how much a "power virus" background process can keep the L3 active.
  • mode_13h - Tuesday, June 1, 2021 - link

    > For the U series APUs

    Those are a lower-cost, lower-margin product. This won't be cheap. As GPUs are also less latency-sensitive, I think there's less to be gained by stacking the dies.
  • haukionkannel - Wednesday, June 2, 2021 - link

    Upgrading low-end products would not be a good deal, so I agree. Not coming to low-margin products for some time… maybe 3 to 5 years later?
  • mode_13h - Tuesday, June 1, 2021 - link

    > Imagine what 2 TB/s of low-latency 128 MB could provide an APU's integrated graphics.

    128 MB only gets you so far. It's still only cache, and not a complete substitute for having high-bandwidth memory. Just look at their dGPUs with Infinity cache -- their bandwidth needs were reduced, but not eliminated!
  • flazza - Tuesday, June 8, 2021 - link

    'their bandwidth needs were reduced, but not eliminated'
    As long as it gets them to 1080p / 60fps with medium settings, then it's pretty bad news for Nvidia.
  • Santoval - Thursday, June 3, 2021 - link

    Apparently this is why AMD is calling it L3 rather than L4 cache. If it serves as a direct extension of the on-die L3 cache, has massive bandwidth, and has almost the same latency, then it should indeed be considered an L3 cache. It is also SRAM rather than DRAM based.

    Intel did not have the tech at the time to pull off something like that, but they do now, with Foveros (a Greek word meaning "fearful" or, more likely in this context, "formidable"). However, if Foveros is considerably less power efficient than TSMC's CoW stacking, it will be limited to very low power SoCs like Lakefield.
  • whatthe123 - Saturday, June 5, 2021 - link

    The Foveros Intel showed off uses solder bumps to stack logic. Its performance wouldn't really be comparable to what TSMC is doing here with just a TSV layer, since this is only stacking SRAM, though for some reason AMD not so subtly compared the two. Intel's Foveros packaging isn't really good enough to stack logic (Lakefield in general wasn't that performant), but it was weird to compare two clearly different technologies.
  • mode_13h - Sunday, June 6, 2021 - link

    > its performance wouldn't really be comparable to what TSMC is doing here with just a TSV layer

    TSVs are only for power. Signals don't travel through the TSVs.
  • twotwotwo - Tuesday, June 1, 2021 - link

    Yup, iGPUs are memory bandwidth hogs, and back in the day Intel put 128MB of eDRAM (higher latency than this) off-chip for the GPU's benefit (Crystalwell). Thermal and cost questions are real, and it probably means something that they didn't demo it on an APU here. If they do pull it off someday, it could be nifty, and you get a faster CPU at the same time.

    Also, know who else is working with TSMC on high-end chips? Apple.
  • lmcd - Tuesday, June 1, 2021 - link

    Wild that Intel is now behind in three dimensions (core design, planar manufacturing, and now soon stacking, barring a Lakefield successor actually worth using).
  • cheshirster - Tuesday, June 1, 2021 - link

    Looks like that.
  • shabby - Tuesday, June 1, 2021 - link

    They're ahead in one thing though... moar netburst mhz!
  • Kevin G - Tuesday, June 1, 2021 - link

    I wonder if Zen 3 with the extra stacked SRAM is destined for the various supercomputers due out at the end of the year. This could be part of the custom stuff AMD developed just for them, and the timelines align.

    Bringing it to the desktop at the end of the year would be interesting on AM4: it'd be the one last upgrade for the platform, as Zen 4 appears destined for AM5. This could be the fabled but never really confirmed Zen 3+ that circulated in various rumors. There would be an IPC increase here due to the massive cache size increase, and it would make the 'XT' moniker meaningful for once.

    As for the L3 cache not being uniform, with TSVs it wouldn't be too radical of a change, as I don't think each core has full uniform access to it anyway. It would behave as if there were more cores in the CCX. There would most certainly be a latency hit, but the trade-off in capacity at the L3 stage would be worth it.

    Recycling the 7 nm stacked SRAM for 5 nm based Zen 4 logic dies would be an efficient use of fab capacity and design resources. The real question is how large the 5 nm Zen 4 CCD would be in comparison to the 7 nm SRAM die. There is enough room on the 7 nm Zen 3 die for two of these SRAM dies, but it looks like AMD has avoided putting the stacks on top of the actual core logic. Granted, the pictures were designed for a presentation, which can play fast and loose with implementation details. Still, that placement would make sense for thermal reasons.
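
    As a rough area check (a sketch; the ~80.7 mm² CCD figure is an assumption from public die measurements, while the 36 mm² SRAM die comes from the presentation):

        zen3_ccd_mm2 = 80.7   # approximate Zen 3 CCD area (assumption)
        sram_die_mm2 = 36     # 64 MB V-Cache die, per AMD's presentation

        print(2 * sram_die_mm2)                 # 72 mm^2 of stacked SRAM...
        print(2 * sram_die_mm2 / zen3_ccd_mm2)  # ...covering ~89% of the CCD footprint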
  • Ian Cutress - Tuesday, June 1, 2021 - link

    I was thinking about mentioning Trento, but AMD confirmed that Trento is Milan with a custom IO die, not custom chiplets.
  • Kevin G - Tuesday, June 1, 2021 - link

    That is the big question: how custom is the Zen 3 CCD on the bottom? Is this the same design they've been shipping for the past ~7 months now? Was this additional cache always part of the plan? Given the relatively small die size of the Zen 3 CCD, using a bit of extra area for the TSV holes wouldn't have added much in terms of overall area.
  • haukionkannel - Wednesday, June 2, 2021 - link

    Same…
  • Kamen Rider Blade - Tuesday, June 1, 2021 - link

    I concur, not putting it on the actual core logic makes sense for thermal transfer reasons. But is there another material that could take the place of silicon to transfer heat away from the logic die area?
  • Wereweeb - Tuesday, June 1, 2021 - link

    Not if you don't want the CPUs to crack as they shrink and expand. The best thing to do to improve cooling would be pulling an Intel and shaving down the die.
  • SaturnusDK - Tuesday, June 1, 2021 - link

    This was a prototype. I suspect that in an actual implementation the chip would be flipped so the cores are nearest to the IHS, and the "silicon area" would be used to transfer the power and signals.

    That would allow AMD to make versions of the same chiplet with and without the V-Cache. Or potentially stacking even more V-Cache.
  • sgeocla - Tuesday, June 1, 2021 - link

    If you look closely at the Zen 3 core chiplet die, the TSVs have always been there, but nobody knew what their role was. This is as close to the final design as it gets. On top of that, there's a newly leaked stepping with 5.0 GHz for the top-end SKUs of the rumored XT refresh. So Intel's Alder Lake will be competing against this new Zen 3 design with even higher clocks.
  • psychobriggsy - Tuesday, June 1, 2021 - link

    This will be a new CCD design, because the TSV area would have had to be added to the die - and certainly nobody ever saw that on the original Zen 3 chiplet die photos.

    Rembrandt is strongly rumoured to be using a Zen 3+ core, so it makes sense that this new CCD design also uses that core, even if it only provides minor bugfixes and a minor performance increase.

    Warhol was likely going to be based around this new CCD with stacked SRAM, like this demo. AM4 could really do with more cache for high core-count SKUs, as DDR4 memory bandwidth is the limiting factor. It seems likely the industry capacity shortages have killed Warhol; as indicated in the article, it's not an insignificant amount of extra silicon. All the dies will go into Milan-X instead, the server product using this technology, which will be very popular for cache-hungry applications.

    Question to AT: Could you make this text box larger vertically, or add in the resize gadget?
  • mode_13h - Tuesday, June 1, 2021 - link

    > Could you make this text box larger vertically, or add in the resize gadget?

    ProTip: write your posts in another editor, then copy-and-paste. I'll admit that I don't always do it, though.
  • cheshirster - Wednesday, June 2, 2021 - link

    "and certainly nobody ever saw that on the original Zen 3"
    It was there all the time.
  • fallaha56 - Tuesday, June 1, 2021 - link

    agree
  • fallaha56 - Tuesday, June 1, 2021 - link

    great summary of likely outcomes

    also, the fact that Intel is now delaying Sapphire Rapids doesn't bode well
  • mode_13h - Tuesday, June 1, 2021 - link

    > The real question is how large the 5 nm Zen 4 CCD would be
    > in comparison to the 7 nm SRAM die.

    LOL. It'll have AVX-512, right? So, plenty big.
  • Kamen Rider Blade - Tuesday, June 1, 2021 - link

    16C/32T with 192 MiB of L3$:

    So that comes out to 12 MiB of L3$ per Core

    What a time to be alive =D
  • kbastomi - Monday, June 7, 2021 - link

    Zen 3 Threadripper hasn't been released yet; could this roll into it? 32 cores with 384 MiB of L3$ would be a blast.
  • outsideloop - Tuesday, June 1, 2021 - link

    I don't know what's more impressive: AMD announcing the V-Cache or Ian just throwing together this analysis in the last hour?!
  • JayNor - Tuesday, June 1, 2021 - link

    Intel showed a prototype stacked SRAM done with hybrid bonding last year. Was it built at TSMC?

    https://www.anandtech.com/show/15980/intel-next-ge...
  • TanjB - Tuesday, June 1, 2021 - link

    The cost of 7 nm may be much reduced, keeping in mind the chip is dominated by SRAM, which needs only a subset of all possible processing steps. So long as they stay within those limits for the IO circuits, the wafer cost may be much lower than the same wafer on N7 with an ASIC.
  • Wereweeb - Tuesday, June 1, 2021 - link

    Hmmmm.... I was really worried about the costs of this (+15% performance -in gaming- for +45% die size -and- the costs of 3D stacking) but if that is true, what is stopping them from just moving the entire L3 cache out into a separate stacked die?
  • ET - Tuesday, June 1, 2021 - link

    Probably not much. I think it's quite reasonable for a future die to be produced with the L3 cache on top instead of inside (and not as an addition). This seems quite beneficial, as it could potentially mix processes (5nm die + 7nm cache), and even allow for a powerful APU, where cache is typically compromised (and on the flip side, better scaling for mobile, from chips with little to no L3 cache to chips with a massive cache).
  • Hifihedgehog - Tuesday, June 1, 2021 - link

    > This seems quite beneficial (...) even allow for a powerful APU, where cache is typically compromised (and on the flip side, better scaling for mobile, from chips with little to no L3 cache to chips with a massive cache).

    Exactly! And to think, that is at a sheer 20 times the bandwidth of Intel's "Crystal Well" eDRAM: 2 TB/s versus 100 GB/s.
  • schujj07 - Tuesday, June 1, 2021 - link

    While Crystal Well only had 100GB/s of bandwidth, that was about equal to an AMD HD 7790 (a mid-range GPU in 2013) when it was released. Overall it wasn't bandwidth limited, it was capacity limited. In an APU setup this would act as a Z-buffer like Crystal Well did. Will the extra bandwidth help more than extra capacity? Maybe. I wonder if the extra bandwidth is more helpful for other applications, specifically server applications. Think of an EPYC CPU that right now has 256MB of L3; all of a sudden it has a massive 768MB L3 cache.
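
    The EPYC math, sketched out (assuming Milan's 8 CCDs at 32 MB of L3 each, plus one hypothetical 64 MB V-Cache die per CCD):

        ccds = 8
        base_l3_per_ccd = 32   # MB per Zen 3 CCD
        vcache_per_ccd = 64    # MB per stacked die (assumed)

        print(ccds * base_l3_per_ccd)                     # 256 MB today
        print(ccds * (base_l3_per_ccd + vcache_per_ccd))  # 768 MB with V-Cache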
  • lightningz71 - Tuesday, June 1, 2021 - link

    Perhaps allowing twice as many cores in the CCD, for a product such as a 128-core EPYC with 512MB of total L3 cache (64MB x 8 CCDs), with a later iteration using improved 5nm capacity to produce 128MB cache dies, giving a full 1GB of L3 cache for the whole processor?
  • mode_13h - Tuesday, June 1, 2021 - link

    With EPYC, maybe you could get away with stacking the cache on top of cores, since their clock speeds are already reduced.
  • psychobriggsy - Tuesday, June 1, 2021 - link

    That would be the logical thing to do, if heat transfer from the CPU cores did not affect the SRAM stacked above or the bonding between the dies, and heat is still efficiently removed from the package. However it seems that you can segment your product line easier by removing the stacked SRAM from non-premium SKUs, which means you need some L3 on the base die anyway.
  • Fataliity - Tuesday, June 1, 2021 - link

    Making a chip with just SRAM is cheaper than making a full CPU with logic, SRAM, signaling, etc. The 45% die size should not be a 45% increase in cost.
  • mode_13h - Tuesday, June 1, 2021 - link

    > The 45% die size should not be a 45% increase in cost.

    But you still need the same amount of substrate as for cores, right? With the looming substrate shortage, that's going to be an issue.
  • mode_13h - Tuesday, June 1, 2021 - link

    > what is stopping them from just moving the entire L3 cache out into a separate stacked die?

    They want to stack SRAM on SRAM -- not cores -- for thermal reasons. Also, they probably want the ability to sell cheaper versions without the extra layer, in which case the base die would need *some* L3 cache.
  • yeeeeman - Tuesday, June 1, 2021 - link

    Zen 4 will come in late 2022. Before that we will see Zen 3+ on 6nm, probably with this 3D cache tech.
  • Wereweeb - Tuesday, June 1, 2021 - link

    Source: of the Nile
  • mode_13h - Tuesday, June 1, 2021 - link

    Lake Victoria?
  • Matthias B V - Tuesday, June 1, 2021 - link

    This is a massive but expected step forward. Not just in performance but also efficiency! Moving data off the chip / SoC costs lots of performance and energy, so more cache is always welcome.

    So far it has just been too expensive in monolithic designs, as cache costs lots of space and scales badly with newer nodes. With chiplets and stacking we might even see gigabytes of cache...

    Looking forward to the puzzle coming together and showing the big picture:

    Chiplet(s) for CPU + chiplet(s) for GPU + I/O chiplet + cache chiplet + application-specific chiplet. This might not just overcome current limits but allow for cheap customized solutions and more differentiation.

    I would expect it to apply to Meteor Lake/Lunar Lake too!
  • Alistair - Tuesday, June 1, 2021 - link

    I mean, Intel is trying to charge an extra $200 for the i9 over the i7, even though they are basically the same. I'd gladly pay AMD $200 extra to add 15 percent more game performance to my 5800x.
  • ET - Tuesday, June 1, 2021 - link

    Having this available this year sounds quite interesting. Since it is, as you say, mainly beneficial for games, I assume that "high end" would mean high end for the standard platform, that is, a way to compete with Alder Lake for gaming. It might show up as new 5900X and 5950X variants, priced $100-150 higher (and called Ryzen 6000? Who knows).

    By the way, I now suspect that the reported B2 stepping for Ryzen 5000 was made to change routing to support this stacking.
  • del42sa - Tuesday, June 1, 2021 - link

    I wonder if they add some thermal compound to the structural silicon to avoid overheating and to achieve better thermal distribution to the IHS.
  • edzieba - Tuesday, June 1, 2021 - link

    The 'locked at 4GHz' (i.e. underclocked by ~1GHz) demo certainly raises a red flag around thermals. If the CPUs were already otherwise identical beyond the extra cache dies, why underclock them instead of just locking them to their normal operating frequency (or not locking them at all and letting them throttle up and down for a real-world comparison)?
    If insulating the compute die by stacking an extra layer of silicon on top of it prevents reaching the same turbo rates as a standard die, that could easily eat away any performance gain in the vast majority of workloads that do not benefit from enormous cache sizes.
  • psychobriggsy - Tuesday, June 1, 2021 - link

    It is a prototype, and locking the clock does allow direct comparison. Apart from silicon shortage, it is possible that heat transfer issues at high clock speeds killed off the rumoured Warhol consumer chip design that probably used this.

    I suspect this will form the basis of Milan-X, where clock speeds are less critical over sheer quantity of cores and cache.
  • Kevin G - Tuesday, June 1, 2021 - link

    The difficult thing is controlling for turbo speeds, which are inherently dynamic as the system runs. They could still potentially have the same max turbo, but how long they sustain it and when they boost upward will vary. AMD's testing methodology here is correct, as it removes that variable for apples-to-apples testing. The downside is that this sets expectations higher, as the real-world turbo may not be sustained as long or clock as high as on existing chips. There will almost certainly be a benefit to the additional cache here, but it will likely be a bit less than the 15% they're claiming in the apples-to-apples comparison.
  • mode_13h - Tuesday, June 1, 2021 - link

    > The 'locked at 4GHz' demo certainly raises a red flag around thermals

    True. If the SRAM burns a lot of power, or the inactive part of its die does meaningfully compromise heat dissipation by the cores, then the results of an unlocked comparison could end up a lot closer.

    That makes me think this could be more of a server-oriented thing.
  • AntonErtl - Tuesday, June 1, 2021 - link

    Hmm 64MB in 36mm² in 7nm is about the same density as Intel's 128MB in 77mm² in 22nm; but IIRC Intel used DRAM, vs. AMD's SRAM.

    It seems to me that AMD could easily (and at lower cost) have made chiplets with the extra cache, so this looks more like a proof of concept than a preview of a product.
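
    Working that density comparison out explicitly (using the figures quoted above):

        amd_density = 64 / 36      # MB per mm^2 on 7nm SRAM -> ~1.78
        intel_density = 128 / 77   # MB per mm^2 on 22nm eDRAM -> ~1.66
        print(amd_density, intel_density)   # roughly the same, as noted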
  • Kevin G - Tuesday, June 1, 2021 - link

    DRAM, though, is a one-transistor-plus-capacitor design, whereas SRAM is typically six transistors. Given the different designs and processes, this isn't terribly surprising.

    The main benefit of going SRAM is that it extends the existing SRAM used as the L3 cache on the Zen 3 chiplets. Due to the latency differences and packaging technologies, Intel's usage of external eDRAM made sense as an L4 cache. Only IBM, with their SOI-based manufacturing nodes years ago, attempted to leverage eDRAM as an L3 cache.
  • konbala - Tuesday, June 1, 2021 - link

    Looks like a pirate...
  • Silver5urfer - Tuesday, June 1, 2021 - link

    This is a big surprise, but a highly welcome and great move from AMD. Intel on the other hand acts like L4 cache eDRAM designs do not even exist. The 4980HQ BGA processor got that L4 cache, and with a BGA-to-rPGA socket conversion that CPU beats a stock 6700K and comes close to a 7700K. Magic of on-package L4. Even the 5775C Broadwell did that, but the idiotic iGPU BS is needed, right? Because of wafer yields and sharing the manufacturing line with those crappy BGA parts and the thin-and-light market.

    I commend AMD for making such a move. BUT there are problems / questions with this move. First of all, why a 4GHz fixed speed? Yeah, it's a prototype design, but that suggests AMD might have a temperature problem: they stacked the cache instead of sharing the PCB space, so this will definitely put more heat on the chip. And as mentioned in the article, the logic is below the cache, meaning more heat gets trapped. So there will be a clockspeed problem, and Zen 3 cannot afford to lose clocks, especially on AM4; for BGA and thin-and-light garbage, maybe.

    Ultimately this is probably going to be that Zen 3+ APU/BGA refresh for 2022 to combat Alder Lake mobile. As for desktop, I do not know what AMD is planning. If this lands on Zen 4 with TSMC 5nm, and that helps the heat and balances out the clockspeed with IPC, Zen 4 is going to shred ADL to pieces. Much better than those stupid crappy x86 phone cores in a desktop socket.

    A Zen 3+ XT refresh is plausible, but my doubts on heat / clockspeed make that rumor look unviable in reality.

    Also, where's Threadripper? No damn news on that Genesis Peak TR4000. People really wanted to see what Zen 3 / Milan looks like when unleashed for pure performance. Damn it, AMD. They are delaying it... they could have launched it, so maybe Milan is still not shipping in high enough volume to bin the dies for TR4000. It would have been amazing to see them in action.
  • nandnandnand - Tuesday, June 1, 2021 - link

    A surprise to be sure, but a welcome one.

    If there were major heat issues, I would expect even lower than 4 GHz and no talk of it being in shipping products. I think fixed 4 GHz was chosen to make the comparison purely about IPC from more cache hits and not any slight variations in the silicon quality.
  • msroadkill612 - Tuesday, June 1, 2021 - link

    AMD are not very heat challenged with their 7nm CPUs, and don't forget their smart power management.

    The present APUs skipped PCIe 4, perhaps for heat reasons. Perhaps playing with IO power usage is another option?
  • Silver5urfer - Tuesday, June 1, 2021 - link

    "This technology will be productized with 7nm Zen 3-based Ryzen processors. Nothing was said about EPYC."

    Woah, so that means this is going to be that Zen 3+ refresh? I would buy the damn AMD chip, but the idiotic USB problems still show up... now waiting for that B2 stepping and X570S or whatever AGESA fixes it.
  • davidefreeman - Wednesday, June 2, 2021 - link

    The SRAM chiplet is stacked above the +existing+ L3 cache, specifically to avoid thermal issues. While there is a passive silicon spacer above the logic, it will not add additional heat. Both dies have been thinned, so total thickness hasn't changed, and I'd expect thermal transfer hasn't significantly worsened. They could probably make up for any difference by specifying a higher TDP and a better cooling solution over stock parts, assuming there is even a noticeable difference.
  • zodiacfml - Tuesday, June 1, 2021 - link

    Brings back memories of when they were first with the integrated memory controller. If I'm correct, that brought a 5-8% improvement. This is a different beast though, a very pricey part.
  • msroadkill612 - Tuesday, June 1, 2021 - link

    Which probably means you can bump the fabric clock speed, which means you can bump DDR 1:1 RAM clock speeds.

    I have a problem with the automatic assumption that APUs are for peasants.

    I suspect there are many well-heeled buyers who see their advantages too, and won't mind a significant premium for advanced examples.
  • Mil0 - Thursday, June 3, 2021 - link

    Ever since Kaby Lake-G I've been wondering why AMD didn't do something like that with pure AMD parts. Perhaps they were waiting for this combined with DDR5. I'd say RDNA on 5nm + 256MB of Infinity Cache would give amazing performance in the ultrabook format. And although my heels are dressed like a peasant's, my laptop preferences are not.
  • msroadkill612 - Tuesday, June 1, 2021 - link

    I hope I am not reposting this: AMD are making similar caches for their GPU cards' Infinity Fabric bus.

    Their vision seems to be a much tighter integration between system & GPU resources.

    Taken together:

    recent advances in system RAM speed & capacity
    the doubled bandwidth of PCIe 4
    the L3 cache discussed here on both of the linked fabric buses

    add up to quite a paradigm shift potential.

    e.g. apps like MS Flight Simulator doing a reasonable simulation of a 48GB GPU cache on a 16GB GPU card.
  • TanjB - Tuesday, June 1, 2021 - link

    Your analysis is confused. The direct bonding and the TSVs are different things. Only a fraction of the bonds go to the TSVs. The bonds for the L3 data and command signals go directly to the circuits in face-to-face position, with the direct bonding (aka hybrid bonding) placing a minimal burden on the circuit for best efficiency and low latency.

    The fraction of the bonds which go to TSVs will be carrying power and also IO signals. This is because the cache chip will be on the bottom, and the compute chip reaches through those TSVs to reach the substrate. The compute chip is on top, not on the bottom, so that it does not have any TSVs through itself and also so it remains closest to the heat sink.

    The TSVs are etched through the thickness of the cache chip. This is likely only 15 microns or so, but these TSVs are the size and load of microbumps. You cannot have nearly as many TSVs as you can have of the signal bonds. When you stack more than two chips, you cannot carry anything like the bandwidth, nor the low energy per bit, that is available to the direct signal bonds on the face-to-face pair.

    It is great technology and we will see more of it, but keep a clear view of what it offers.
  • Rudde - Tuesday, June 1, 2021 - link

    In the demo, the cache chip is on top, which can easily be seen from the smaller square-shaped bump on one of the chiplets. The slide is even more explicit about it, as it shows the cache in the middle, with structural silicon on top of the cores at the sides. Simply put, AMD has the compute die on the bottom with cache on top.
  • davidefreeman - Wednesday, June 2, 2021 - link

    The L3 die is above the compute die. Compute uses the most power, so keeping it on the bottom avoids transferring power a longer distance through TSVs. The L3 die is stacked above the +existing+ L3 SRAM, avoiding the thermal issues that would occur if it were stacked over compute logic.
  • SarahKerrigan - Tuesday, June 1, 2021 - link

    Maybe I'm missing something, but are tags stored on the main CPU die like in a lot of off-chip cache implementations?
  • Rudde - Tuesday, June 1, 2021 - link

    I've understood that AMD has the tags between memory banks. You can think of it like stripes of memory banks and tags.
  • name99 - Tuesday, June 1, 2021 - link

    "We’ve seen with other TSV stacked hardware, like HBM, that SRAM/memory/cache is the perfect vehicle for this as it doesn’t add that much to the thermal requirements of the processor. The downside is that the cache you stack on top is little more than just cache."

    Yes and no. We've heard about PiM (Processor in Memory) for years, to little practical effect.
    A more feasible route to the same end (not perfect, but "good enough", at least to get the ball rolling) might be PiC3 (Processor in L3 Cache), once 3D allows for enough cache that you can start thinking beyond just moar moar capacity?
  • mode_13h - Tuesday, June 1, 2021 - link

    > We've heard about PiM (Processor in Memory) for years, to little practical effect.

    Didn't Samsung recently announce HBM with embedded processing?

    AI accelerators tend to have lots of on-chip memory (usually/always SRAM). For instance, the Grayskull processor, by recently-profiled Tenstorrent, has 120 MB SRAM, and is made on a 12 nm process node.
  • ChaosFenix - Tuesday, June 1, 2021 - link

    So, some things I think would make sense here. Dr. Cutress mentions that TSMC's stacking tech currently goes up to 12 layers. Wouldn't it make more sense to have the layers that generate the most heat up top, with the lower-powered layers on the bottom? That should put them closer to the heat spreader. An effective design in my opinion would have something like layers 1-5 NAND, layers 6-9 DRAM, layer 10 SRAM, layer 11 efficiency cores, layer 12 high-power cores. I obviously have no idea what said stack would cost, but even something as simple as having layer 1 SRAM and layer 2 processing cores would be better than the inverse, right?
  • IntelUser2000 - Tuesday, June 1, 2021 - link

    Please. This is wrong.

    "AMD claims that the total bandwidth of the L3 cache increases to beyond 2 TB/sec, which would technically be faster than the L1 cache on the die (but with higher latency)."

    Zen 3 can sustain 2x 256-bit loads per cycle, which at 4GHz equals 256GB/s of bandwidth. But that's per core. If you multiply that by 16 cores, that's 4TB/s, or double the V-Cache figure.
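
    Spelling that math out (a sketch, assuming 2 x 256-bit loads per cycle at the fixed 4 GHz used in the demo):

        loads_per_cycle = 2
        bytes_per_load = 256 // 8   # 32 bytes
        clock_hz = 4e9

        per_core = loads_per_cycle * bytes_per_load * clock_hz
        print(per_core / 1e9)        # 256 GB/s per core
        print(16 * per_core / 1e12)  # ~4.1 TB/s summed across 16 cores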
  • Otritus - Wednesday, June 2, 2021 - link

    It doesn't work like that. Individual cores don't have access to the L1$ or L2$ of other cores, meaning the individual caches don't have access to each other. 2TB/s of bandwidth is more bandwidth than the L1$ because the CPU doesn't have one L1$; it has 16 (or 32, if you count data and instruction caches separately).
  • serendip - Tuesday, June 1, 2021 - link

    ARM designs need this: huge high-speed caches.
  • del42sa - Wednesday, June 2, 2021 - link

    yes, and 8 ALUs
  • 529th - Wednesday, June 2, 2021 - link

    So are we all thinking this will be for the 5k series desktops, too? Being demonstrated on a 5900X led me to believe so.
  • Santoval - Thursday, June 3, 2021 - link

    "This V-Cache chiplet is 64 MB of additional L3, with no stepped penalty on latency."
    Does this mean it is going to have the same latency as the in-die L3 cache despite being off-die (barely off-die but still off-die)? Is that really possible? What does "stepped" mean in the above context?
  • coffeeandknives - Friday, June 4, 2021 - link

    I'm curious what implications this has in conjunction with the Xilinx merger. Can you stack an FPGA in the loop and what is possible? I recently saw an article regarding an FPGA + CPU that had a continuously changing architecture which made it safe from certain hacks.
  • 529th - Friday, June 4, 2021 - link

    Wait, 192MB L3 cache for 5900X & 5950X or 128MB L3 cache?
  • 529th - Friday, June 4, 2021 - link

    NM, it's an extra 64MB per chiplet, so 192MB is correct.
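
    For anyone else double-checking, the math (assuming 32 MB of base L3 per CCD plus one 64 MB stacked die):

        per_ccd = 32 + 64    # MB per CCD with V-Cache
        print(2 * per_ccd)   # 192 MB for the two-CCD 5900X / 5950X
        print(1 * per_ccd)   # 96 MB for a hypothetical one-CCD part like the 5800X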
  • tamalero - Friday, June 4, 2021 - link

    If the RX 6000 series of video cards had such problems with memory speed, does this mean that increasing the cache via 3D stacking could save space but also boost performance?
