Microprocessor architectures these days are largely limited, and thus defined, by power consumption. When designing an architecture around a power envelope, the rule of thumb is that any given microprocessor architecture can scale to cover roughly an order of magnitude of TDPs. For example, Intel's Core architectures (Sandy/Ivy Bridge) effectively target the 13W - 130W range. They can certainly be used in parts that consume less or more power, but at those extremes it's more efficient to build another microarchitecture to target those TDPs instead.

Both AMD and Intel feel similarly about this order of magnitude rule, and thus both have two independent microprocessor architectures that they leverage to build chips for the computing continuum. From Intel we have Atom for low power, and Core for high performance. In 2010 AMD gave us Bobcat for its low power roadmap, and Bulldozer for high performance.

Both the Bobcat and Bulldozer lines would see annual updates. In 2011 we saw Bobcat used in the Ontario and Zacate SoCs as part of the Brazos platform. Last year AMD announced Brazos 2.0, using slightly updated versions of those same Bobcat-based SoCs. Today AMD officially launches Kabini and Temash, APUs based on the first major architectural update to Bobcat: the Jaguar core.


Jaguar: Improved 2-wide Out-of-Order
 

At the core level, Jaguar still looks a lot like Bobcat. The same dual-issue, out-of-order architecture that AMD introduced in 2010 remains intact with Jaguar. The same L1 cache, front end and execution blocks are all still here. Given ARM's transition from a dual-issue, out-of-order core (Cortex A9) to a three-issue, out-of-order design (Cortex A15), I expected something similar from AMD. Despite moving to a smaller manufacturing process (28nm), AMD was very focused on increasing performance within the same TDP or lower with Jaguar. The driving motivator? While Bobcat ended up in netbooks, nettops and other low-cost but thick machines, Jaguar needed to go into even thinner form factors: tablets. AMD still has no intention of getting into the smartphone SoC space, but the Windows 8 (and Android?) tablet market is fair game. Cellular connectivity isn't a requirement there, particularly at the lower price points, and AMD can easily be a second-source alternative to Intel Atom based designs.

The average number of instructions executed per clock (IPC) is still below 1 for most client workloads. There's a certain amount of burst traffic to be expected, but given the types of dependencies you see in most use cases, AMD felt the gain from making the machine wider wasn't worth the power tradeoff. There's also the danger of making the cat cores too powerful. While simply making them 3-issue wouldn't dramatically close the gap between the cat cores and the Bulldozer family, there's still a desire for clear separation between the two microarchitectures.
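To make the dependency argument concrete, here's a minimal, hypothetical C sketch (not AMD's example) of the kind of serial dependency chain that keeps client IPC low: every iteration needs the result of the previous one, so a wider front end would mostly sit idle on code like this.

#include <stddef.h>

struct node { struct node *next; };

/* Pointer chasing: each load depends on the previous load's result, so the
 * core can't overlap iterations no matter how many instructions per cycle
 * the front end can issue - sustained IPC stays low. */
size_t list_length(const struct node *n)
{
    size_t len = 0;
    while (n) {          /* loop-carried dependency through 'n' */
        n = n->next;     /* the next load can't start until this one completes */
        len++;
    }
    return len;
}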

The move to a three-issue design would certainly increase performance, but AMD’s tablet ambitions and power sensitivity meant it would save that transition for another day. I should point out that ARM is increasingly looking like the odd-man-out here, with both Jaguar and Intel’s Silvermont retaining the dual-issue design of their predecessors. Part of this has to do with the fact that while AMD and Intel are very focused on driving power down, ARM has aspirations of moving up in the performance/power chain.

The width of the front end is only one lever AMD could have used to increase performance. It's a pretty big lever that AMD chose not to pull, but there are other, smaller levers that Jaguar does exercise.

There's now a 4 x 32-byte loop buffer for the instruction cache. Whenever a loop is detected, instead of fetching the loop's instructions from the L1 instruction cache over and over again, they're serviced from this small loop buffer. If this sounds like a trace cache or a decoded micro-op cache, don't get too excited: Jaguar's loop buffer is neither of those things. There are no pipeline savings, and no fetch/decode units are powered down. The only benefit is that the instruction cache doesn't have to be fired up on every iteration of a buffered loop. In other words, this is a very specific play to reduce power consumption - not to improve performance.
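As a hypothetical illustration (assuming a typical compiler and x86 encoding, not a figure from AMD), a tight loop like the one below compiles to only a handful of instructions - comfortably under the 4 x 32-byte capacity - so once the hardware detects the loop, its instructions can be replayed from the buffer and the L1 instruction cache can stay idle for the remaining iterations.

#include <stddef.h>
#include <stdint.h>

/* The loop body here is a few x86 instructions (load, multiply, store,
 * increment, compare, branch) - well within 128 bytes - so it is the sort
 * of loop Jaguar could service from its loop buffer instead of re-reading
 * the instruction cache on every pass. */
void scale_array(uint32_t *dst, const uint32_t *src, size_t n, uint32_t k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}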

All microprocessors see tons of simulation work before they’re ever brought to market. Even once a design is done, additional profiling is used to identify bottlenecks, which are then prioritized for addressing in future designs. All bottleneck removal has to be vetted against power, cost and schedule constraints. Given an infinite budget across all vectors you could eliminate all bottlenecks, but you’d likely take an infinite amount of time to complete the design. Taking all of those realities into account usually means making tradeoffs, even when improving a design.

We saw the first example of a clear tradeoff when AMD stuck with a 2-issue front end for Jaguar. Not including a decoded micro-op cache and opting for a simpler loop buffer instead is another. AMD likely noticed a lot of power being wasted during loops, and the addition of a loop buffer was probably the best balance of complexity, power savings and cost.

AMD also improved the instruction cache prefetcher, not because of an overabundance of bandwidth, but simply by revisiting the Bobcat design and spending more time on the implementation in Jaguar. The IC prefetcher improvements are a case of AMD doing things better in Jaguar, freed from the pressure of having to introduce a brand new architecture as it did with Bobcat.

The instruction buffer between the instruction cache and decoders grew in size with Jaguar, a sort of half step towards the more heavily decoupled fetch/decode stages in Bulldozer.

Jaguar adds support for new instructions (SSE4.1/4.2, AES, CLMUL, MOVBE, AVX, F16C, BMI1) as well as 40-bit physical addressing.
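For what it's worth, these additions are all discoverable through the standard x86 CPUID leaves. The sketch below (using GCC/Clang's <cpuid.h> and generic CPUID bit positions - nothing Jaguar-specific) shows how software might check for them and read back the physical address width at runtime.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* Leaf 1, ECX: PCLMULQDQ (bit 1), SSE4.1 (19), SSE4.2 (20),
     * MOVBE (22), AES (25), AVX (28), F16C (29). */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("SSE4.1 %d  SSE4.2 %d  AES %d  CLMUL %d\n",
               !!(ecx & (1u << 19)), !!(ecx & (1u << 20)),
               !!(ecx & (1u << 25)), !!(ecx & (1u << 1)));
        printf("MOVBE %d  AVX %d  F16C %d\n",
               !!(ecx & (1u << 22)), !!(ecx & (1u << 28)),
               !!(ecx & (1u << 29)));
    }

    /* Leaf 7, sub-leaf 0, EBX bit 3: BMI1. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        printf("BMI1 %d\n", !!(ebx & (1u << 3)));

    /* Leaf 0x80000008, EAX[7:0]: supported physical address bits
     * (40 on Jaguar). */
    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx))
        printf("Physical address bits: %u\n", eax & 0xff);

    return 0;
}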

The final change to the front end of Jaguar was the addition of another decode stage, purely for frequency gains. It turns out that in Bobcat the decoder was one of the critical paths limiting maximum frequency. Adding another decode stage gave AMD enough wiggle room to hit its frequency targets for Jaguar at 28nm.

Comments

  • Spunjji - Friday, May 24, 2013 - link

    Every CPU manufacturer does that... why would they include numbers they have no control over?
  • araczynski - Thursday, May 23, 2013 - link

    Can anyone clue me in as to how AMD got the rights to make 'PU's for both of the consoles? Was it just bang/$ vs Intel/IBM/etc? Not a fanboy of either camp (amd/intel), just curious.
  • Despoiler - Thursday, May 23, 2013 - link

    It was purely the fact that they have an APU with a high end GPU on it. Intel is nowhere near AMD in terms of top tier graphics power. Nvidia doesn't have x86. The total package price for an APU vs CPU/GPU made it impossible for an Intel/Nvidia solution to compete. The complexity is also much less on an APU system than CPU/GPU. The GPU needs a slot on the mobo. You have to cool it as well as the CPU. Less complexity = less cost.
  • araczynski - Thursday, May 23, 2013 - link

    thanks.
  • tipoo - Thursday, May 23, 2013 - link

    Multiple reasons. AMD has historically been better with console contracts than Nvidia or Intel, though; those two want too much control over their own chips, while AMD licenses them out and lets Sony or MS do whatever with them. They're probably also cheaper, and no one else has an all-in-one APU solution with this much GPU grunt yet.
  • araczynski - Thursday, May 23, 2013 - link

    thanks.
  • WaltC - Thursday, May 23, 2013 - link

    Good article!--as usual, it's mainly Anand's conclusions that I find wanting...;) Nobody's "handing" AMD anything, as far as I can see. AMD is far, far ahead of Intel on the gpu front and has been for years--no accident there. AMD earned whatever position it now enjoys--and it's the only company in existence to go head-to-head with Intel and beat them, and not just "once," as Anand points out. Indeed, we can thank AMD for Core 2 and x86-64; had it been Intel's decision to make we'd all have been puttering happily away on dog-slow, ultra-expensive Itanium derivatives of one kind or another. (What a nightmare!) Intel invested billions in a worldwide retool for RDRAM while AMD pushed the superior market alternative, DDR SDRAM. AMD won out there, too. There are many examples of AMD's hard work, ingenuity, common sense and lack of greed besting Intel--far more than just two. It's no accident that AMD is far ahead of Intel here: as usual, AMD's been headed in one direction and Intel in another, and AMD gets there first.

    But I think I know what Anand means, and that's that AMD cannot afford to rest on its laurels. There's nothing here to milk--AMD needs to keep the R&D pedal to the metal if the company wants to stay ahead--absolutely. Had the company done that pre-Core 2, while Intel was telling us all that we didn't "need" 64-bits on the desktop, AMD might have remained out front. The company is under completely different management now, so we can hope for the best, as always. Competition is the wheel that keeps everything turning, etc.
  • Sabresiberian - Thursday, May 23, 2013 - link

    The point Anandtech was trying to make is that no one is stepping up to compete with AMD's Jaguar, and so they are handing that part of the business to AMD - just as AMD handed their desktop CPU business to Intel by deciding not to step up on that front. If you don't do what it takes to compete, you are "handing" the business to those who do. This is a compliment to AMD and something of a slam on the other guys, not a suggestion that AMD needed some kind of charity to stay in business here.

    I want to suggest that you are letting a bit of fanboyism color your reaction to what others say. :)

    Perhaps if AMD had been a bit more "greedy" like Intel is in your eyes, they wouldn't have come so close to crashing permanently. Whatever, it has been very good to see them get some key people back, and that inspires hope in me for the company and the competition it brings to bear. We absolutely need someone to kick Intel in the pants!

    Good to see them capture the console market (the two biggest, anyway). Unfortunately, as a PC gamer that hates the fact that many games are made at console levels, I don't see the new generation catching up like they did back when the PS3 and Xbox 360 were released. It looks to me like we will still have weaker consoles to deal with - better than the previous gen, but still not up to mainstream PC standards, never mind high-end. The fact that many developers have been making full-blown PC versions from the start instead of tacking on rather weak ports a year later is more hopeful than the new console hardware, in my opinion.
  • blacks329 - Thursday, May 23, 2013 - link

    I honestly expect the fact that both PS4 and X1 are x86 will benefit PC games quite significantly as well. Last gen, devs initially developed for the 360 and ported over to the PS3 and PC, and later in the gen shifted to the PS3 as the lead platform, with some using PCs. I expect now, since porting to PS4 and X1 will be significantly easier, PC will eventually become the lead platform and games will scale down accordingly for the PS4 and X1.

    As someone who games more on consoles than PCs, I'm really excited for both platforms as devs can spend less time tweaking on a per platform basis and spend more time elsewhere.
  • FearfulSPARTAN - Thursday, May 23, 2013 - link

    Actually I'm pretty sure 90% still made the games on Xbox first then ported to other platforms; however, with all of them (excluding the Wii U) being x86, the idea of them porting down from PC is quite possible, and I didn't think about that. It would probably start to happen mid to late in the gen though.
