Sensible Scaling: OoO Atom Remains Dual-Issue

The architectural progression from Apple, ARM and Qualcomm have all been towards wider, out-of-order cores, to varying degrees. With Swift and Krait, Apple and Qualcomm both went wider. From Cortex A8 to A9 ARM went OoO and then from A9 to A15 ARM introduced a significantly wider architecture. Intel bucks the trend a bit by keeping the overall machine width unchanged with Silvermont. This is still a 2-wide architecture.

At the risk of oversimplifying the decision here, Intel had to weigh die area, power consumption as well as the risk of making Atom too good when it made the decision to keep Silvermont’s design width the same as Bonnell. A wider front end would require a wider execution engine, and Intel believed it didn’t need to go that far (yet) in order to deliver really good performance.

Keeping in mind that Intel’s Bonnell core is already faster than ARM’s Cortex A9 and Qualcomm’s Krait 200, if Intel could get significant gains out of Silvermont without going wider - why not? And that’s exactly what’s happened here.

If I had to describe Intel’s design philosophy with Silvermont it would be sensible scaling. We’ve seen this from Apple with Swift, and from Qualcomm with the Krait 200 to Krait 300 transition. Remember the design rule put in place back with the original Atom: for every 2% increase in performance, the Atom architects could at most increase power by 1%. In other words, performance can go up, but performance per watt cannot go down. Silvermont maintains that design philosophy, and I think I have some idea of how.

Previous versions of Atom used Hyper Threading to get good utilization of execution resources. Hyper Threading had a power penalty associated with it, but the performance uplift was enough to justify it. At 22nm, Intel had enough die area (thanks to transistor scaling) to just add in more cores rather than rely on HT for better threaded performance so Hyper Threading was out. The power savings Intel got from getting rid of Hyper Threading were then allocated to making Silvermont an out-of-order design, which in turn helped drive up efficient use of the execution resources without HT. It turns out that at 22nm the die area Intel would’ve spent on enabling HT was roughly the same as Silvermont’s re-order buffer and OoO logic, so there wasn’t even an area penalty for the move.

The Original Atom microarchitecture

Remaining a 2-wide architecture is a bit misleading as the combination of the x86 ISA and treating many x86 ops as single operations down the pipe made Atom physically wider than its block diagram would otherwise lead you to believe. Remember that with the first version of Atom, Intel enabled the treatment of load-op-store and load-op-execute instructions as single operations post decode. Instead of these instruction combinations decoding into multiple micro-ops, they are handled like single operations throughout the entire pipeline. This continues to be true in Silvermont, so the advantage remains (it also helps explain why Intel’s 2-wide architecture can deliver comparable IPC to ARM’s 3-wide Cortex A15).

While Silvermont still only has two x86 decoders at the front end of the pipeline, the decoders are more capable. While many x86 instructions will decode directly into a single micro-op, some more complex instructions require microcode assist and can’t go through the simple decode paths. With Silvermont, Intel beefed up the simple decoders to be able to handle more (not all) microcoded instructions.

Silvermont includes a loop stream buffer that can be used to clock gate fetch and decode logic in the event that the processor detects it’s executing the same instructions in a loop.

Execution

Silvermont’s execution core looks similar to Bonnell before it, but obviously now the design supports out-of-order execution. Silvermont’s execution units have been redesigned to be lower latency. Some FP operations are now quicker, as well as integer multiplies.

Loads can execute out of order. Don’t be fooled by the block diagram, Silvermont can issue one load and one store in parallel.

 

OoOE & The Pipeline ISA, IPC & Frequency
Comments Locked

174 Comments

View All Comments

  • xTRICKYxx - Tuesday, May 7, 2013 - link

    You're right. Intel has nothing to show at all.... Its not like they have the most powerful mobile and desktop consumer processors available.
  • R0H1T - Tuesday, May 7, 2013 - link

    Yeah, now sit & watch that market(x86) die a slow death at the hands of mobile/tablets that are powered by "good enough" ARM which doesn't need teraflop level of performance to sell their stuff unlike Intel !
  • misaki - Monday, May 6, 2013 - link

    Wow, clearly you are a new reader. This is an architecture overview, not a performance article, which means the information HAS to come from Intel. They have done these type of articles with every architecture redesign since the 90's.

    When chips are available to test that is when the real world performance articles will come out.
  • Ortanon - Monday, May 6, 2013 - link

    SERIOUSLY.
  • kyuu - Monday, May 6, 2013 - link

    Yes, but a lot of performance claims are being made in the article, and Anand really seemed to just be taking Intel's marketing speak for gospel. That's how it read, at least.
  • xTRICKYxx - Tuesday, May 7, 2013 - link

    Not really. He clearly states to take the graphs cautiously. Also Intel may be slightly misleading, but nothing in the graphs are lies. They chose the best possible scenario for the greatest advantage.
  • R0H1T - Tuesday, May 7, 2013 - link

    Like how they(AT) claimed Intel's "SDP" was superior after stress testing an Exynos Octa, yup loved that fairytale !
  • Kevin G - Monday, May 6, 2013 - link

    The article mentions that the IDI is similar to internal bus found on the Nehalem and later desktop processor. IDI here is mentioned as a point-to-point interconnect where as everything is linked via a ring bus in recent Core processors. Of course you can loop multipe point-to-point interfaces into a loop but the article's wording allows for other topologies.

    For example, each Silvermont module could have its down dedicated point-to-point link to the system agent. In Nehalem, the system against logically appears as another hop in the internal ring bus.
  • Exophase - Monday, May 6, 2013 - link

    Small correction:

    "Remember that with the first version of Atom, Intel enabled the fusion of load-op-store and load-op-execute instructions. Instead of these instruction combinations decoding into three and two micro-ops respectively, they would be fused post-fetch and treated like single operations throughout the entire pipeline."

    Atom (the current one anyway) doesn't have instruction (macro-op) fusion. It does handle load + op and load + op + store are one issue down the pipeline but they still came from single instructions that are a natural part of the x86 ISA. While these may be considered fused micro-ops in Intel's other CPUs that terminology doesn't fit Atom.

    These operations do need multiple instructions on most more RISCy ISAs like ARM. But the same is true the other way around (notably, 3 address arithmetic). I very much doubt you'll find typical x86 programs only need 2 instructions for every 3 ARM instructions on average, or at least any papers I've seen that measure micro-ops vs instructions on high-end CPUs are nowhere close to 1.5 (and a uop isn't generally more powerful than an ARM instruction, sometimes less when you consider two are needed for a store). But there are lots of other places that caused stalls on Atom that weren't related to decode, that it's easy to see how you could still gain a lot of perf/MHz without increasing it - as Bobcat and Jaguar have shown. All the details here do seem to point to a Bobcat-like design only with a much lower L2 cache latency and branch mispredict penalty which can only help more.
  • Anand Lal Shimpi - Monday, May 6, 2013 - link

    Er you're very correct. Atom doesn't break these instructions down further, they're treated like single ops throughout the pipeline. I've updated the section. Thank you!

Log in

Don't have an account? Sign up now