I spoke too soon. Earlier today I outlined AMD’s roadmap for 2010–2011. In 2011 AMD will introduce two next-generation microarchitectures: Bulldozer for the high-end desktop and server space, and Bobcat for the price/power-efficient ultramobile market. I originally said that AMD wasn’t revealing any more about its next-gen architectures, but AMD just proved me wrong by unveiling the first block diagrams of both cores.

First up, Bulldozer. I hinted at the architecture in this afternoon’s article:

“A major focus is going to be improving on one of AMD’s biggest weaknesses today: heavily threaded performance. Intel addresses it with Hyper Threading, AMD is throwing a bit more hardware at the problem. The dual integer clusters you may have heard of are the route AMD is taking...”

And here’s the block diagram:


Bulldozer: AMD's Latest Leap Forward, will it be another K8 to Intel's Sandy Bridge?

This is a single Bulldozer core, but notice that it has two independent integer clusters, each with its own L1 data cache. The single FP cluster shares the L1 cache of the two integer clusters.

Within each integer “core” are four pipelines, presumably half for ALUs and half for memory ops. That’s a narrower width than a single Phenom II core, but there are two integer clusters on a single Bulldozer core.

Bulldozer will also support AVX, hinted at by the two 128-bit FMAC units behind the FP scheduler. AMD is keeping the three-level cache hierarchy of the current Phenom II architecture.
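
As a concrete illustration of what an FMAC does, here's a scalar sketch using C99's fma(); each 128-bit FMAC unit would perform the same operation across a full vector, i.e. four single-precision or two double-precision elements at a time:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.5, b = 2.0, c = 0.25;

        /* Fused multiply-add: a*b + c computed with a single rounding
           step, both faster and more accurate than a separate multiply
           followed by an add. */
        double fused    = fma(a, b, c);
        double separate = a * b + c;

        printf("fused: %g  separate: %g\n", fused, separate);
        return 0; /* compile with -lm on Linux */
    }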

A single Bulldozer core will appear to the OS as two cores, just like a Hyper Threaded Core i7. The difference is that AMD is duplicating more hardware to enable per-core multithreading. The integer resources are all doubled, including the schedulers and d-caches; it’s only the FP resources that are shared between the threads. The benefit is much better multithreaded integer performance; the downside is a larger core.
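
For what it's worth, you can already see how an OS enumerates this kind of resource sharing today. Here's a minimal sketch that reads Linux's sysfs topology files (Linux-only, and purely illustrative, since there's no Bulldozer silicon to run it on yet):

    #include <stdio.h>

    /* Print which logical CPUs share a physical core with CPU 0.
       On a Hyper-Threaded Core i7 this lists two logical CPUs per
       core; the open question is how Bulldozer's two integer cores
       per module will be reported. */
    int main(void)
    {
        char buf[256];
        FILE *f = fopen(
            "/sys/devices/system/cpu/cpu0/topology/thread_siblings_list",
            "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fgets(buf, sizeof buf, f))
            printf("cpu0 shares a core with logical CPUs: %s", buf);
        fclose(f);
        return 0;
    }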

Doubling the integer resources but not the FP resources works even better when you look at AMD’s whole motivation behind Fusion. Much heavy FP work is expected to move to the GPU anyway, so there’s little sense in duplicating FP hardware on the Bulldozer core when it will eventually have a fully capable GPU sitting on the same piece of silicon. While the first incarnation of Bulldozer, the Zambezi CPU, won't have an on-die GPU, presumably future APUs will use the new core. In those designs the Bulldozer cores and the GPU will most likely even share the L3 cache. It’s really a very elegant design and the basis for what AMD, Intel and NVIDIA have been talking about for years now: the CPU will do what it does best while the GPU does what it is good at.

Fascinating.

AMD’s Next-Generation Ultramobile Core: Bobcat

Next up is Bobcat:

AMD says that a single Bobcat core is capable of scaling down to less than one watt of power. Typically a single microarchitecture can scale efficiently across about an order of magnitude of TDP. If Bobcat can go as low as 0.5W, the high end would be around 5W. If it’s closer to 1W at the low end, then 10W would be the upper bound. Either way, it’s too low to compete in current mainstream notebooks, meaning that Bobcat is strictly a netbook/ultraportable core, as AMD indicated in its slides. Eventually Bulldozer will probably scale down to take care of the mainstream mobile market.

AMD provided very little detail here other than that it delivers 90% of today’s mainstream performance in less than half the silicon area. If AMD views mainstream as an Athlon II X2, then Bobcat would deliver 90% of that performance in a die area of less than 60 mm^2.
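
Putting rough numbers on both of those estimates (the ~117 mm^2 figure for the 45nm Athlon II X2 die is my assumption, not an AMD number):

    #include <stdio.h>

    int main(void)
    {
        /* A single microarchitecture typically spans about one order
           of magnitude of TDP, so the low end implies the high end. */
        double low_tdp[] = { 0.5, 1.0 };
        for (int i = 0; i < 2; i++)
            printf("%.1fW low end -> ~%.0fW high end\n",
                   low_tdp[i], low_tdp[i] * 10.0);

        /* "Less than half the silicon area" of a mainstream CPU;
           assume a 45nm Athlon II X2 (Regor) die of ~117 mm^2. */
        double regor_mm2 = 117.0;
        printf("Bobcat die estimate: under %.0f mm^2\n", regor_mm2 / 2.0);
        return 0;
    }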

That would make it bigger than Atom, but that’s just a guess. Either way, the performance targets sound impressive. SSE1-3 are supported, as well as hardware virtualization.
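
Both feature sets are discoverable at runtime through CPUID. A minimal sketch using GCC's cpuid.h, with bit positions taken from the published Intel/AMD CPUID documentation:

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Standard leaf 1: SSE is EDX bit 25, SSE2 is EDX bit 26,
           SSE3 is ECX bit 0. */
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            printf("SSE:  %s\n", (edx >> 25) & 1 ? "yes" : "no");
            printf("SSE2: %s\n", (edx >> 26) & 1 ? "yes" : "no");
            printf("SSE3: %s\n", (ecx >>  0) & 1 ? "yes" : "no");
        }

        /* Extended leaf 0x80000001: AMD's SVM (hardware
           virtualization) flag is ECX bit 2. */
        if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            printf("SVM:  %s\n", (ecx >> 2) & 1 ? "yes" : "no");

        return 0;
    }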

AMD wouldn’t tell me what process it would be made on, but they did hint that Bobcat would be easily synthesizable. I take that to mean it will be built on a bulk 28nm process at Globalfoundries and not on 32nm SOI.

Both of these cores will be out in 2011. We just need to make it through 2010 first.

Comments

  • smilingcrow - Friday, November 13, 2009 - link

    There seems to be a lot of confusion over what the Bulldozer block diagram represents. Fudzilla seemed to suggest it is of a complete octo-core CPU made up of two quad-core blocks. Techreport seemed to suggest it represents a ‘true’ dual-core CPU block and that actual CPUs will consist of one or more of these, which seems wrong, as that suggests AMD would release single/dual-thread versions of Bulldozer.
    Anandtech’s interpretation seems more logical, and the big surprise is that it will seemingly be a 4/8 core/thread design, although performance may well be much closer to that of a true 8-core. But if it’s up against an Intel 8/16 design, surely it will struggle unless the other enhancements can close the IPC gap.

    As for those expecting to see a very high-performance GPU (relative to whatever current discrete cards offer) on the same die as the CPU, just forget it. The socket would be enormous and the TDP ridiculous; you’d be looking at up to 350W, which sounds impractical.
  • JFAMD - Thursday, November 19, 2009 - link

    Each Bulldozer die will be made up of multiple modules. Each module has 2 integer cores and a shared FPU.

    An 8-core Bulldozer die has 4 modules and 8 total integer cores.
    The product will be marketed as an 8-core processor.
    The system (hardware) will see 8 cores.
    The OS will see 8 cores.
    The applications will see 8 cores.

    Don't get hung up on modules. They only exist to the designers and in PowerPoint slides. They will not be visible to the system or to the software.
  • smilingcrow - Friday, November 20, 2009 - link

    A single Bulldozer "module" looks to the OS like a single processor core with simultaneous multithreading (SMT) enabled.
  • JFAMD - Friday, November 20, 2009 - link

    No, a single Bulldozer module looks like two individual cores; it does not look like a hyperthreaded core. The OS cannot see the module, it only sees integer cores.

    With HyperThreading, the job scheduler needs to know how to deal with "full cores" and "shared cores."

    For instance, if you have 4 cores with HT and you are only running 4 threads, you want them each spread out onto individual cores; you don't want all 4 of them sharing 2 cores while other cores sit idle. That is a relatively low level of scheduler complexity (relatively speaking).

    But when 2 more threads come through and you start sharing cores, you want to try to optimize where the threads execute. Let's say you have all 4 cores active with threads: 2 "heavy" threads and 2 "light" threads. If you need to "double up" the 2 new threads, you'd rather put them on the 2 cores that have light loads right now (sketched in code at the end of this comment). So this is a much higher level of complexity.

    So, with HyperThreading, you have an added level of complexity. You have to think about WHERE to put threads vs. just looking for the next available resource.

    Both could be considered the same if you are contending that HyperThreading is unoptimized and there would be shared cores with other cores fully idle. But I do not believe this is the case.

    With Bulldozer, each new thread has the same shot at resources and each resource acts the same, because your schedulers are dedicated.
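
    Roughly, the preference order looks like this (a toy sketch of HT-aware placement, not any real OS scheduler):

        #include <stdio.h>

        #define NCORES 4

        /* Toy model of HT-aware placement on a 4-core/8-thread chip:
           prefer a fully idle physical core; only when every core is
           busy, double up on the one with the lightest load. */
        static int place_thread(const int load[NCORES])
        {
            int lightest = 0;
            for (int c = 0; c < NCORES; c++)
                if (load[c] == 0)
                    return c;                  /* idle core wins */
            for (int c = 1; c < NCORES; c++)
                if (load[c] < load[lightest])  /* otherwise share */
                    lightest = c;              /* the lightest core */
            return lightest;
        }

        int main(void)
        {
            /* per-core load: heavy, light, light, heavy */
            int load[NCORES] = { 2, 1, 1, 2 };
            printf("next thread -> core %d\n", place_thread(load));
            return 0;
        }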
  • Zool - Monday, November 23, 2009 - link

    I don't know where you get this Bulldozer module thing; from the picture it's quite unclear. If there were more modules, they would be in the picture. The slides say "two tightly linked cores," each with its own L1 cache. But the L2 and L3 are all shared, so adding new modules looks like adding new L2 and L3, and that's stupid.
    To me, the picture looks more like AMD hit the nail right on the head.
    Maybe the pipelines they refer to are the execution units of several cores tied together to one L1 cache. If you think about it, why would you need 8 complete CPU cores, with all their transistors, just to have 8 times the execution units? So why not put as much of the main core logic as you can into one core, connect it to one L1 cache, and share the rest? Those 8 cores would actually be 2 super-wide cores, without the need for the OS or programmers to take care of each physical core.

  • Zool - Monday, November 23, 2009 - link

    It would also reduce the die size for a similar 8-core CPU, and with it the cost, of course. It would also be much faster than 8 cores, each with its own L1 and L2 cache, communicating through an external bus (QPI or a HyperTransport link). Maybe AMD learned something from ATI and the GPUs. Why would you increase the gap between cores as you increase the number of them, when you could hide them from software and put them closer, right next to each other? Like anyone needs a 16-core Nehalem when the majority of programs can't even use 4 cores.
  • dew111 - Sunday, May 30, 2010 - link

    This diagram represents one block that acts as two cores. Zool, modern x86 cores need multiple functional units to support efficient out-of-order execution and keep IPC high. Modern x86 architectures translate x86 instructions into multiple 'micro-ops', which are simple, RISC-like instructions that can be executed in parallel much more easily. The 'integer scheduler' will only hold micro-ops for one thread; thus, the pipelines in the diagram are not individual cores. Superscalar design (multiple parallel pipelines) has been common in CPUs for over 15 years now.

    Therefore, as stated in the article, this block will be replicated on-die to create 4, 6, 8+ core processors. The processor would almost certainly have a monolithic L3 cache, and would probably have separate L2 caches for each '2-core' block.
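
    A toy illustration of those parallel pipelines at work (the hardware extracts this parallelism automatically; it isn't something the OS or programmer arranges):

        #include <stdio.h>

        int main(void)
        {
            int a = 1, b = 2, c = 3, d = 4;

            /* After decode into micro-ops, these four additions have
               no data dependences on one another, so a superscalar
               core can issue them down separate integer pipelines in
               the same cycle. */
            a += 10;
            b += 20;
            c += 30;
            d += 40;

            printf("%d %d %d %d\n", a, b, c, d);
            return 0;
        }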
  • Zool - Monday, November 16, 2009 - link

    There aren't many programs that can use more than 2 cores if you don't count graphics rendering and encoding. In benchmarks on a single computer, the only things that have scaled near-linearly for quite some time are rendering and encoding. For an average user running average programs, an Intel 8/16 design is overkill. There aren't many people running 8 actively working programs at once. And for virtualization, double the integer performance is much better.
    Also, a high-performance GPU on the CPU won't happen until they solve the bandwidth problem (something over 150 GB/s right now for the high end).
    Current CPU memory bandwidth will only be enough for a low-end GPU and a fast coprocessor.
  • Risforrocket - Monday, November 16, 2009 - link

    I don't think anyone expects this first APU to provide strong graphics. However, it might surprise us if the CPU cores are involved sufficiently, though that might lead to other disappointments. I am hoping the GPU core will be used not so much for graphics but for the kinds of computing tasks it will do well. I will certainly be using a discrete graphics card, as I always have. And I'm sure it will come in 4-core and 8-core versions; anything less would be a step back.
  • qcmadness - Friday, November 13, 2009 - link

    I would expect IPC to be enhanced.

    The "good old" K7 integer and floating point units will be redesigned.
