Integer & FP Execution

On the integer execution side, units and pipelines look largely unchanged from Bobcat. The big performance addition here is the use of Llano’s hardware divider. Bobcat had a microcoded integer divider capable of one bit per cycle, while Jaguar moves to a 2-bits-per-cycle divider. The hardware is all clock gated, so when it’s not in use there’s no power penalty.

The schedulers and re-order buffer are incrementally bigger in Jaguar. Some scheduling changes and other out of order resource increases are at work here as well.

Integer performance wasn’t a huge problem with Bobcat to begin with, but floating point performance was a different issue entirely. In our original Brazos review we found that heavily threaded FP workloads were barely faster on Bobcat than they were on Atom. A big part of that had to do with Atom’s support for Hyper Threading. AMD addressed both issues by beefing up FP execution and doubling up the maximum number of CPU cores with Jaguar (more on this later).

Bobcat’s FP execution units were 64-bits wide. Any 128-bit FP operations had to be chunked up and worked on in stages. In Jaguar, AMD moved all of its units to 128-bits wide. AVX operations complete as 2 x 128-bit operations, while all other 128-bit operations can execute without multiple passes through the pipeline. The increase in vector width is responsible for the gains in FP performance.

The move to 128-bit vectors in the FPU forced AMD to add another pipeline stage here as well. The increase in FPU size meant that some signals needed a little extra time to get from one location to the next, hence the extra stage.

Load/Store

The out-of-order load/store unit in Bobcat was the first one AMD had ever done (Bobcat beat Bulldozer to market, so it gets the claim to fame there). As such there was a good amount of room for improvement, which AMD capitalized on in Jaguar. The second gen OoO load/store unit is responsible for a good amount of the ~15% gains in IPC that AMD promises with Jaguar.

Jaguar: Improved 2-wide Out-of-Order The Jaguar Compute Unit & Physical Layout/Synthesis
Comments Locked

78 Comments

View All Comments

  • Wolfpup - Wednesday, June 12, 2013 - link

    Ironically when I see an Intel sticker on a tablet (unless it's a Core i part), I avoid it like the plague. Bobcat would have been perfect for tablets, and a BIG selling point.
  • Wolfpup - Wednesday, June 12, 2013 - link

    Yeah, I really have no interest in an Atom tablet, partially even just because of the horrible video.

    I've got an 11.6" AMD c50 (lowest end Bobcat) based notebook, and while it's slow, it's still impressive how it runs anything, and in a pinch can even function as a main PC. AMD's got an even lower power Bobcat part with the exact same performance for tablets, but I don't know of shipping computers that used it, and it really would have been perfect. These new ones of course will be even better.

    I wonder if the companies building these understand that using AMD would be a selling point... I see "Atom" and my eyes glaze over....
  • codedivine - Thursday, May 23, 2013 - link

    4 DP FMAs per 16 cycles? Why even bother putting them in :|
  • Tuna-Fish - Thursday, May 23, 2013 - link

    Because it's expected by the spec, and some compute loads use it for very rarely used things.
  • Exophase - Thursday, May 23, 2013 - link

    "I should point out that ARM is increasingly looking like the odd-man-out here, with both Jaguar and Intel’s Silvermont retaining the dual-issue design of their predecessors."

    It's not just ARM, it's three different current gen ARM cores.. if you're going to pose it as ISA shouldn't it then just be ARM vs x86 and not ARM vs Silvermont and Jaguar?

    Besides, MIPS is 3-way in its CPUs targeting this power budget too (proAptiv), and so is PowerPC (e600 for instance). The reason why Silvermont and Jaguar is 2-way is really undeniable: x86 decoders are substantially more expensive than those for any of these ISAs, even Thumb-2. There's some validity to the argument that x86 instructions are more powerful (after first negating where they aren't - most critically, lack of three-way addressing adds a lot of extra move instructions for non-AVX processors) but nowhere close to 50% more powerful.
  • lmcd - Thursday, May 23, 2013 - link

    Isn't Qualcomm Krait 2-way?
  • Exophase - Thursday, May 23, 2013 - link

    Qualcomm hasn't said an awful lot about the internals of the uarch but several sources report 3-way decode and I haven't seen any say 2-way. It's possible that isn't fully symmetric or limited in some other way, we don't really know.
  • Krysto - Friday, May 24, 2013 - link

    Pretty sure it's 3-way.
  • tiquio - Thursday, May 23, 2013 - link

    I don't really understand the point about unique macros. What are macros in reference to CPU architecture.
  • quasi_accurate - Thursday, May 23, 2013 - link

    Don't worry, I had no idea either until I started working in the industry :) It just means custom circuits that are hand crafted by a human. This is as opposed to "synthesis", in which the RTL code (written in a hardware description language such as Verilog) are "synthesized" by design software into circuits.

Log in

Don't have an account? Sign up now