Sensible Scaling: OoO Atom Remains Dual-Issue

The architectural progressions at Apple, ARM and Qualcomm have all been towards wider, out-of-order cores, to varying degrees. With Swift and Krait, Apple and Qualcomm both went wider. ARM went out-of-order in moving from the Cortex A8 to the A9, and then introduced a significantly wider architecture with the A15. Intel bucks the trend a bit by keeping the overall machine width unchanged with Silvermont: this is still a 2-wide architecture.

At the risk of oversimplifying the decision, Intel had to weigh die area and power consumption, as well as the risk of making Atom too good, when it chose to keep Silvermont’s design width the same as Bonnell’s. A wider front end would require a wider execution engine, and Intel believed it didn’t need to go that far (yet) in order to deliver really good performance.

Keeping in mind that Intel’s Bonnell core is already faster than ARM’s Cortex A9 and Qualcomm’s Krait 200, if Intel could get significant gains out of Silvermont without going wider - why not? And that’s exactly what’s happened here.

If I had to describe Intel’s design philosophy with Silvermont, it would be sensible scaling. We’ve seen this from Apple with Swift, and from Qualcomm with the Krait 200 to Krait 300 transition. Remember the design rule put in place back with the original Atom: for every 2% increase in performance, the Atom architects could increase power by at most 1%. In other words, performance can go up, but performance per watt cannot go down. Silvermont maintains that design philosophy, and I think I have some idea of how.
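The 2:1 rule can be sketched as a simple budget check. This is purely a toy illustration of the rule as described, not Intel’s actual design methodology:

```python
# Toy model of the Atom design rule: a proposed feature may raise power
# by at most 1% for every 2% of performance it adds, which guarantees
# that performance per watt never goes down.

def passes_design_rule(perf_gain_pct: float, power_cost_pct: float) -> bool:
    """True if the feature stays within the 2:1 performance:power budget."""
    return power_cost_pct * 2 <= perf_gain_pct

# 2% performance for 1% power is exactly on budget; the reverse is not.
print(passes_design_rule(2.0, 1.0))  # True
print(passes_design_rule(1.0, 2.0))  # False
```

Under this rule a feature like out-of-order execution only makes the cut if its performance uplift is at least double its power cost, which is why the HT-for-OoO trade described below was attractive.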

Previous versions of Atom used Hyper Threading to get good utilization of execution resources. Hyper Threading carried a power penalty, but the performance uplift was enough to justify it. At 22nm, Intel had enough die area (thanks to transistor scaling) to simply add more cores rather than rely on HT for better threaded performance, so Hyper Threading was out. The power savings from dropping Hyper Threading were then allocated to making Silvermont an out-of-order design, which in turn drove up efficient use of the execution resources without HT. It turns out that at 22nm the die area Intel would’ve spent on enabling HT was roughly the same as Silvermont’s re-order buffer and OoO logic, so there wasn’t even an area penalty for the move.

The Original Atom microarchitecture

Calling Silvermont a 2-wide architecture is a bit misleading, as the combination of the x86 ISA and the treatment of many x86 ops as single operations down the pipe makes Atom physically wider than its block diagram would lead you to believe. Remember that with the first version of Atom, Intel enabled the treatment of load-op-store and load-op-execute instructions as single operations post decode. Instead of decoding into multiple micro-ops, these instruction combinations are handled as single operations throughout the entire pipeline. This continues to be true in Silvermont, so the advantage remains (it also helps explain why Intel’s 2-wide architecture can deliver IPC comparable to ARM’s 3-wide Cortex A15).
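The effective-width argument can be made concrete with a toy model. The numbers below are illustrative only (real instruction mixes contain many ops that don’t fuse, which is why the result in practice is comparable IPC rather than a clear win):

```python
# Hypothetical model: cycles needed to issue a stream of read-modify-write
# ("load-op-store") operations. A machine that keeps each one fused issues
# one op per instruction; a machine that splits it issues load + ALU + store.

def cycles(instr_count: int, ops_per_instr: int, width: int) -> float:
    """Cycles to issue instr_count instructions, each decoding into
    ops_per_instr pipeline operations, on a width-wide machine."""
    return instr_count * ops_per_instr / width

fused = cycles(100, 1, 2)   # 2-wide, load-op-store kept as one operation
split = cycles(100, 3, 3)   # 3-wide, same work split into three operations

print(fused, split)  # 50.0 100.0
```

On this (deliberately extreme) workload the fused 2-wide machine actually finishes first; with a realistic mix of fusible and non-fusible instructions the two designs land much closer together.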

Silvermont still has only two x86 decoders at the front end of the pipeline, but the decoders are more capable. Many x86 instructions decode directly into a single micro-op, while some more complex instructions require microcode assist and can’t go through the simple decode paths. With Silvermont, Intel beefed up the simple decoders to handle more (though not all) microcoded instructions.

Silvermont includes a loop stream buffer that can be used to clock gate fetch and decode logic in the event that the processor detects it’s executing the same instructions in a loop.
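The loop stream buffer idea can be sketched as follows. This is a hypothetical model (the buffer size and replacement policy are assumptions, not Silvermont specifics): once a loop body has been decoded and captured, subsequent iterations are served from the buffer, so fetch and decode can be clock gated.

```python
# Hypothetical model of a loop stream buffer: instructions whose decoded
# form is already captured skip fetch/decode entirely on later iterations.

BUFFER_SIZE = 32  # assumed capacity, in instructions

def count_decodes(trace):
    """Count how many instructions in an address trace hit the decoders."""
    buffer = set()  # addresses whose decoded form is already captured
    decodes = 0
    for addr in trace:
        if addr in buffer:
            continue            # served from the loop buffer, decoder idle
        decodes += 1            # full fetch + decode required
        if len(buffer) < BUFFER_SIZE:
            buffer.add(addr)
    return decodes

# A 4-instruction loop executed 10 times decodes each instruction once.
print(count_decodes([0, 1, 2, 3] * 10))  # 4
```

The power win comes from the 36 skipped decodes: while the buffer is supplying the loop body, the front end can be clock gated.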

Execution

Silvermont’s execution core looks similar to Bonnell’s before it, but the design now obviously supports out-of-order execution. Silvermont’s execution units have also been redesigned for lower latency: some FP operations are now quicker, as are integer multiplies.

Loads can execute out of order. And don’t be fooled by the block diagram: Silvermont can issue one load and one store in parallel.

 

Comments

  • tech4real - Tuesday, May 7, 2013 - link

    "Absolute performance"? Do we consider power constraint here at all? Atom is optimized for power-efficiency. All the current information I've seen so far suggest Silvermont will outperform A15 by a large margin in terms of power efficiency. If we throw away power constraint, Intel has Core to take care of that.
  • Wilco1 - Tuesday, May 7, 2013 - link

    I was talking about peak performance, but yes, power consumption matters too. What we've seen so far is Intel marketing suggesting that in 6-9 months' time Silvermont will be more efficient than the A15 was 12 months earlier. However, that's not what Silvermont will have to compete with. By the end of this year the A15 will have had two process shrinks down to 20nm in addition to a lot of tuning, so it will be far more efficient than it was 12 months ago. And the A15 is just one example; Apple, QC and ARM will have new cores as well. It's reasonable to say that with Silvermont Intel will finally be able to compete, but it is far from clear that it will be the overall winner their marketing claims.
  • tech4real - Wednesday, May 8, 2013 - link

    TSMC's 20nm process is still in the works; your Q4'13 volume production estimate seems way too optimistic, especially considering TSMC's painful 28nm ramp. Also, a 28nm->20nm shrink without FinFETs significantly reduces the benefit.
  • Wilco1 - Wednesday, May 8, 2013 - link

    TSMC have learnt from the 28nm problems. They appear very aggressive this time, and so far the reports are that they are 2 months ahead of schedule. Even if it ends up delayed to Q2'14, that's still around the same time Intel is planning to come out with Silvermont phones. The gains are not as large as with FinFETs, but enough to reduce power significantly.
  • tech4real - Wednesday, May 8, 2013 - link

    My understanding is that Q2'14 volume production with high yield is almost TSMC's 20nm best-case scenario. Of course, the term "high yield" is such a subjective thing vendors love to manipulate with almost infinite freedom...
  • zeo - Wednesday, May 8, 2013 - link

    TSMC's 20nm isn't set up for such optimization, but rather focused on cost reduction... The number of nodes, variations supported, etc. will be fewer than with 28nm, as they want to avoid the problems that caused the 28nm delays, and that has resulted in a much more streamlined setup.

    Meanwhile, leakage increases as feature sizes shrink... So without a solution like FinFET, power efficiency becomes increasingly hard to keep where it is, let alone improve...

    It's one of the reasons ARM is pushing options like big.LITTLE to boost operational efficiency rather than relying as much on fab improvements.

    It's also why not all ARM SoCs have moved to 28nm yet; for many, the leakage was still too much of an issue for their designs to make the switch right away, so there could be additional delays for 20nm releases.

    ARM should get FinFETs in time for the 64-bit release... but by that time Intel will be on its way to 14nm...
  • Jaybus - Wednesday, May 8, 2013 - link

    Think of it as 2-wide x86 vs. 3-wide RISC. Rather than translating a microcoded x86 instruction into 2 or 3 RISC-like instructions, Intel's decode keeps it a single operation down the pipeline. The RISC architecture has to decode more instructions, so it needs to be 3-wide to keep up with the 2-wide x86.

    The point about the frequency scaling is this. The tri-gate design has a gate on top of two vertical gates, giving it roughly 3x the conducting surface area of a planar transistor. The greater surface area allows more current to flow within a given area of the die, and that allows a greater range of voltages and/or frequencies at which it can operate efficiently.
  • Wilco1 - Thursday, May 9, 2013 - link

    Even though macro-ops help decode, they need to be expanded before they are executed, so in terms of execution macro-ops don't help. Also, as I mentioned in an earlier post, most ARM cores also support macro-ops, allowing a 2-way ARM design to behave like a 4-way one. So macro-ops don't give x86 an advantage over RISC.
  • jemima puddle-duck - Monday, May 6, 2013 - link

    Without wishing to be overly cynical, Anandtech has a history of 'NOW Intel will win the mobile war' articles, which get recycled then forgotten in time for the next launch. It's all very clever stuff, but curiously underwhelming also.
  • Roffles12 - Monday, May 6, 2013 - link

    I don't remember reading any 'NOW Intel will win the mobile war' articles on Anandtech. Perhaps your perception is skewed. Intel articles here are typically technical in nature, discussing the inner workings of the architecture and fab process, or discussing benchmarks. Intel is really the only company this open about how its technology works, so why not make it a point of discussion on a website dedicated to the subject? If your head is clouded by FUD from competing companies and the constantly humming rumor mill, maybe you need to back off for a while. At the end of the day, it's up to you to digest this information and form an opinion.
