Sensible Scaling: OoO Atom Remains Dual-Issue

The architectural progression from Apple, ARM and Qualcomm have all been towards wider, out-of-order cores, to varying degrees. With Swift and Krait, Apple and Qualcomm both went wider. From Cortex A8 to A9 ARM went OoO and then from A9 to A15 ARM introduced a significantly wider architecture. Intel bucks the trend a bit by keeping the overall machine width unchanged with Silvermont. This is still a 2-wide architecture.

At the risk of oversimplifying the decision here, Intel had to weigh die area, power consumption as well as the risk of making Atom too good when it made the decision to keep Silvermont’s design width the same as Bonnell. A wider front end would require a wider execution engine, and Intel believed it didn’t need to go that far (yet) in order to deliver really good performance.

Keeping in mind that Intel’s Bonnell core is already faster than ARM’s Cortex A9 and Qualcomm’s Krait 200, if Intel could get significant gains out of Silvermont without going wider - why not? And that’s exactly what’s happened here.

If I had to describe Intel’s design philosophy with Silvermont it would be sensible scaling. We’ve seen this from Apple with Swift, and from Qualcomm with the Krait 200 to Krait 300 transition. Remember the design rule put in place back with the original Atom: for every 2% increase in performance, the Atom architects could at most increase power by 1%. In other words, performance can go up, but performance per watt cannot go down. Silvermont maintains that design philosophy, and I think I have some idea of how.

Previous versions of Atom used Hyper Threading to get good utilization of execution resources. Hyper Threading had a power penalty associated with it, but the performance uplift was enough to justify it. At 22nm, Intel had enough die area (thanks to transistor scaling) to just add in more cores rather than rely on HT for better threaded performance so Hyper Threading was out. The power savings Intel got from getting rid of Hyper Threading were then allocated to making Silvermont an out-of-order design, which in turn helped drive up efficient use of the execution resources without HT. It turns out that at 22nm the die area Intel would’ve spent on enabling HT was roughly the same as Silvermont’s re-order buffer and OoO logic, so there wasn’t even an area penalty for the move.

The Original Atom microarchitecture

Remaining a 2-wide architecture is a bit misleading as the combination of the x86 ISA and treating many x86 ops as single operations down the pipe made Atom physically wider than its block diagram would otherwise lead you to believe. Remember that with the first version of Atom, Intel enabled the treatment of load-op-store and load-op-execute instructions as single operations post decode. Instead of these instruction combinations decoding into multiple micro-ops, they are handled like single operations throughout the entire pipeline. This continues to be true in Silvermont, so the advantage remains (it also helps explain why Intel’s 2-wide architecture can deliver comparable IPC to ARM’s 3-wide Cortex A15).

While Silvermont still only has two x86 decoders at the front end of the pipeline, the decoders are more capable. While many x86 instructions will decode directly into a single micro-op, some more complex instructions require microcode assist and can’t go through the simple decode paths. With Silvermont, Intel beefed up the simple decoders to be able to handle more (not all) microcoded instructions.

Silvermont includes a loop stream buffer that can be used to clock gate fetch and decode logic in the event that the processor detects it’s executing the same instructions in a loop.

Execution

Silvermont’s execution core looks similar to Bonnell before it, but obviously now the design supports out-of-order execution. Silvermont’s execution units have been redesigned to be lower latency. Some FP operations are now quicker, as well as integer multiplies.

Loads can execute out of order. Don’t be fooled by the block diagram, Silvermont can issue one load and one store in parallel.

 

OoOE & The Pipeline ISA, IPC & Frequency
Comments Locked

174 Comments

View All Comments

  • t.s. - Tuesday, May 7, 2013 - link

    If Intel play fair few years back, maybe now we have competitive offerings from AMD. That practice Intel's doing hurt AMD alot. Until now.
  • Homeles - Monday, May 6, 2013 - link

    I'm sure Anand would be drawing plenty of comparisons if he had a Temash tablet in hand.
  • Bob Todd - Monday, May 6, 2013 - link

    As an owner of two Bobcat systems (laptop/mini-itx), I don't think a 25% boost from Jaguar is going to get us into the realm of "good enough" cpu performance for general computing in Windows. The same goes for Intel unless Silvermont is significantly faster than Jaguar. I'm excited that Intel is finally bringing something interesting to the table, even if we end up two to three generations away from a good experience in Windows with their (and AMD's) mobile offerings. This sounds like it will make for a beastly dual core Android phone though, even at lower clocks.
  • jjj - Monday, May 6, 2013 - link

    Hilarious difference in attitude when it comes to Intel.
    Tegra 4 gets into phones by "aggressively limiting frequency." while Intel " Max clock speeds should be lower than what’s possible in a tablet, but not by all that much thanks to good power management. "
    Objectivity at it's best.
  • Homeles - Monday, May 6, 2013 - link

    Your scenario is a false equivalency.
  • Krysto - Monday, May 6, 2013 - link

    Is it? I wouldn't accuse Anand of "objectivity" when it comes to Intel, whether it's on purpose, or involuntary.
  • nunomoreira10 - Monday, May 6, 2013 - link

    The point is tegra 4 was not exactly made for phones while Intel was, for that you have tegra4i

    its not exacly nvidia fault, everybody complained that tegra 3 was lacking, now tegra 4 which is competitive consumes to much, atleast there is a choice.
  • Homeles - Monday, May 6, 2013 - link

    A15s are big cores in relation to its relatives. The only way to fit not 2, not 4, but *5* of them in a phone on 28nm is to downclock them agressively. Just like the only way to fit Ivy Bridge in a tablet is to downclock it agressively.

    Anand did point out that the "the only A15 SoCs we've seen have been very leaky designs optimized for high frequency," and that if power consumption were prioritized instead (which I believe Tegra 4i is supposed to be), it would be less of a blowout.

    It's silly getting defensive about stock ARM cores anyways. It's not an attack on Nvidia by saying their stock ARM cores aren't all too spectacular -- it's not like they poured blood, sweat and tears into making their A15s the best thing ever.

    Finally, Tegra 4 is on a process that is rather significantly inferior to Intel's 22nm process. You think Nvidia would have to downclock agressively if they were on a level playing field and using Intel's 22nm process? I sure don't. But jjj and others here feel the need to get defensive whenever songs of praise are being sung about Intel, even when it's well deserved.
  • extide - Tuesday, May 7, 2013 - link

    I am in agreeance with what you said, but I do believe Tegra 4i is Cortex A9, not A15 like Tegra 4.
  • Wilco1 - Tuesday, May 7, 2013 - link

    The Korean Galaxy S4 has a 1.8GHz Exynos Octa, Tegra 4 does 1.9GHz. In what way are these "aggressively downclocked"? They run at their maximum frequency!

Log in

Don't have an account? Sign up now