Integer & FP Execution

On the integer execution side, the units and pipelines look largely unchanged from Bobcat. The big performance addition here is Llano’s hardware integer divider. Bobcat had a microcoded integer divider that resolved one bit per cycle, while Jaguar moves to a divider that resolves two bits per cycle. The hardware is fully clock-gated, so there’s no power penalty when it’s not in use.
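To make the one-bit versus two-bit distinction concrete, here is a minimal sketch of a restoring divider that retires a configurable number of quotient bits per iteration. It is a toy model, not AMD's implementation: the function name `divide`, the 32-bit operand width, and the idea that one loop iteration equals one cycle are all assumptions made for illustration, and real latencies also depend on setup overhead and any early-out logic.

```python
def divide(dividend, divisor, width=32, bits_per_cycle=1):
    """Toy restoring divider (hypothetical model, not AMD's design).

    Retires `bits_per_cycle` quotient bits per loop iteration and
    returns (quotient, remainder, iterations). One iteration stands in
    for one cycle, which is an assumption of this sketch.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % bits_per_cycle != 0:
        raise ValueError("width must be a multiple of bits_per_cycle")

    group_mask = (1 << bits_per_cycle) - 1
    quotient = 0
    remainder = 0
    iterations = 0

    # Walk the dividend from the most significant group of bits down.
    for shift in range(width - bits_per_cycle, -1, -bits_per_cycle):
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_cycle) | ((dividend >> shift) & group_mask)

        # Pick the quotient digit by repeated compare-and-subtract.
        # The digit never exceeds 2**bits_per_cycle - 1.
        digit = 0
        while remainder >= divisor:
            remainder -= divisor
            digit += 1

        quotient = (quotient << bits_per_cycle) | digit
        iterations += 1

    return quotient, remainder, iterations


# Same inputs, one bit per iteration versus two bits per iteration.
one_bit = divide(100, 7, width=32, bits_per_cycle=1)
two_bit = divide(100, 7, width=32, bits_per_cycle=2)
print(one_bit, two_bit)
```

In this model, doubling the bits retired per iteration halves the iteration count for a fixed operand width, which is the effect the move from one to two bits per cycle is aiming for.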

The schedulers and reorder buffer are incrementally bigger in Jaguar, alongside some scheduling changes and increases in other out-of-order resources.

Integer performance wasn’t a huge problem with Bobcat to begin with, but floating point performance was a different issue entirely. In our original Brazos review we found that heavily threaded FP workloads were barely faster on Bobcat than they were on Atom. A big part of that had to do with Atom’s support for Hyper-Threading. AMD addressed both issues by beefing up FP execution and doubling the maximum number of CPU cores with Jaguar (more on this later).

Bobcat’s FP execution units were 64 bits wide. Any 128-bit FP operation had to be split up and worked on in stages. In Jaguar, AMD moved all of its units to 128 bits wide. AVX operations complete as 2 x 128-bit operations, while all other 128-bit operations can execute without multiple passes through the pipeline. The increase in vector width is responsible for the gains in FP performance.
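The pass counts described above can be sketched with a small model that splits a vector operation into chunks no wider than the execution unit. This is only an illustration under stated assumptions: `vector_add`, the 32-bit lane size, and treating each chunk as exactly one pass are hypothetical choices, not a description of the actual pipeline.

```python
def vector_add(a, b, lane_bits=32, unit_bits=128):
    """Add two equal-length vectors on a unit that is `unit_bits` wide.

    Returns (result, passes). Each pass handles as many lanes as fit in
    the unit; one chunk per pass is an assumption of this sketch.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")

    lanes_per_pass = unit_bits // lane_bits
    if lanes_per_pass < 1:
        raise ValueError("unit must be at least one lane wide")

    result = []
    passes = 0
    for start in range(0, len(a), lanes_per_pass):
        chunk_a = a[start:start + lanes_per_pass]
        chunk_b = b[start:start + lanes_per_pass]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        passes += 1
    return result, passes


x128 = [1.0, 2.0, 3.0, 4.0]                  # 4 x 32-bit lanes = 128 bits
y128 = [10.0, 20.0, 30.0, 40.0]
x256 = x128 + [5.0, 6.0, 7.0, 8.0]           # 8 x 32-bit lanes = 256 bits
y256 = y128 + [50.0, 60.0, 70.0, 80.0]

print(vector_add(x128, y128, unit_bits=64)[1])    # 64-bit units, 128-bit op
print(vector_add(x128, y128, unit_bits=128)[1])   # 128-bit units, 128-bit op
print(vector_add(x256, y256, unit_bits=128)[1])   # 128-bit units, 256-bit op
```

With these inputs the model takes two passes for a 128-bit operation on 64-bit units, one pass on 128-bit units, and two passes for a 256-bit operation on 128-bit units, matching the behavior the text describes for Bobcat, Jaguar, and AVX on Jaguar respectively.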

The move to 128-bit vectors in the FPU forced AMD to add another pipeline stage here as well. The increase in FPU size meant that some signals needed a little extra time to get from one location to the next, hence the extra stage.

Load/Store

The out-of-order load/store unit in Bobcat was the first one AMD had ever done (Bobcat beat Bulldozer to market, so it gets the claim to fame there). As such there was plenty of room for improvement, which AMD capitalized on in Jaguar. The second-generation OoO load/store unit is responsible for a sizable share of the ~15% IPC gain that AMD promises with Jaguar.

78 Comments

  • skatendo - Friday, May 24, 2013 - link

    Not entirely true. The Wii U CPU is highly customized and has enhancements not found in typical PowerPC processors. It's been completely tailored for gaming. I'm not saying it's the power of the newer Jaguar chipsets, but the beauty of custom silicon is that you can do much more with less (Tegra 3's quad-core, 12-core GPU vs. Apple's A5 dual core CPU/GPU anyone? yeah A5 kicked its arse for games) that's why Nintendo didn't release tech specs because they tailored a system for games and performance will manifest with upcoming games (not these sloppy ports we've seen so far).
  • tipoo - Friday, May 24, 2013 - link

    I'm aware it would be highly customized, but a plethora of developers have also come out and said the CPU sucks.
  • skatendo - Saturday, May 25, 2013 - link

    Also the "plethora" of developers that said it sucked (namely the Metro: Last Light dev) said they had an early build of the Wii U SDK and said it was "slow". Having worked for a developer, they base their opinions on how fast/efficient they can port over their game. The Wii U is a totally different infrastructure that lazy devs don't want to take the time to learn, especially with a newer GPGPU.
  • Kevin G - Sunday, May 26, 2013 - link

    If a developer wants to do GPGPU, the PS4 and Xbox One will be highly preferable due to unified virtual memory space. If GPGPU was Nintendo's strategy, they shouldn't have picked a GPU from the Radeon 6000 generation. Sure, it can do GPU but there are far more compromises to hand off the workload.
  • Simen1 - Thursday, May 23, 2013 - link

    What is the TDP and die size of the APUs in X-Box One and Playstation 4?
  • haukionkannel - Thursday, May 23, 2013 - link

    Double the 1.6 GHz 4 core version and you are near. The wider memory controller eats some extra energy too, so maybe you have to add 0.2 to 0.3 calculation...
  • fellix - Thursday, May 23, 2013 - link

    "The L2 cache is also inclusive, a first in AMD’s history."

    Not exactly correct. The very first Athlon (K7) on Slot A with off-die L2 used an inclusive cache hierarchy. All models after that moved to an exclusive design.
  • Exophase - Thursday, May 23, 2013 - link

    Bulldozer is also mostly inclusive. Not strictly inclusive, but certainly not exclusive (you really wouldn't get such a thing from a writethrough L1 cache)
  • whyso - Thursday, May 23, 2013 - link

    Ahh AMD, I love your marketing slides. Let's compare battery life and EXCLUDE the screen. Never mind that the screen consumes a large amount of power and that when you add it to the total, battery life savings go down tremendously. (That's why Sandy -> Ivy Bridge didn't improve battery life that much on mobile.) Let's also leave out the rest-of-system power and SoC power for Brazos. It also looks like the system is using an SSD to generate these numbers, which, looking at the target market, almost no OEM will do.
  • extide - Thursday, May 23, 2013 - link

    It's a perfectly valid comparison to make. All laptops will include a screen and the screen has nothing to do with AMD (or Intel).
