Yesterday AMD revealed that in 2014 it would begin production of its first ARMv8 based 64-bit Opteron CPUs. At the time we didn't know what core AMD would use, however today ARM helped fill in that blank for us with two new 64-bit core announcements: the ARM Cortex-A57 and Cortex-A53.

You may have heard of ARM's Cortex-A57 under the codename Atlas, while A53 was referred to internally as Apollo. The two are 64-bit successors to the Cortex A15 and A7, respectively. Similar to their 32-bit counterparts, the A57 and A53 can be used independently or in a big.LITTLE configuration. As a recap, big.LITTLE uses a combination of big (read: power hungry, high performance) and little (read: low power, lower performance) ARM cores on a single SoC. 

By ensuring that both the big and little cores support the same ISA, the OS can dynamically swap the cores in and out of the scheduling pool depending on the workload. For example, when playing a game or browsing the web on a smartphone, a pair of A57s could be active, delivering great performance at a high power penalty. On the other hand, while just navigating through your phone's UI or checking email a pair of A53s could deliver adequate performance while saving a lot of power. A hypothetical SoC with two Cortex A57s and two Cortex A53s would still only appear to the OS as a dual-core system, but it would alternate between performance levels depending on workload.
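
To make the core-swapping idea a bit more concrete, here's a minimal sketch in C of the sort of load-based migration policy a big.LITTLE system could use. It is purely illustrative rather than ARM's (or any OS's) actual scheduler logic; the function names, thresholds, and simulated load values are all made up.

    /*
     * Hypothetical cluster-migration policy, loosely modeled on the big.LITTLE
     * idea described above. Not real scheduler code; names and thresholds are
     * invented for illustration.
     */
    #include <stdio.h>

    enum cluster { LITTLE_A53, BIG_A57 };

    /* Pick a cluster for the next interval based on recent CPU load.
     * A hysteresis band keeps the system from ping-ponging between clusters. */
    static enum cluster pick_cluster(double load, enum cluster current)
    {
        const double up_threshold   = 0.75;  /* migrate up when busy          */
        const double down_threshold = 0.35;  /* migrate down when mostly idle */

        if (current == LITTLE_A53 && load > up_threshold)
            return BIG_A57;
        if (current == BIG_A57 && load < down_threshold)
            return LITTLE_A53;
        return current;  /* stay put inside the hysteresis band */
    }

    int main(void)
    {
        /* Simulated load samples: idle UI, email, a game, then idle again. */
        double samples[] = { 0.10, 0.20, 0.90, 0.95, 0.80, 0.25, 0.05 };
        enum cluster cur = LITTLE_A53;

        for (int i = 0; i < (int)(sizeof samples / sizeof samples[0]); i++) {
            cur = pick_cluster(samples[i], cur);
            printf("load %.2f -> %s\n", samples[i],
                   cur == BIG_A57 ? "Cortex A57 (big)" : "Cortex A53 (LITTLE)");
        }
        return 0;
    }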

ARM's Cortex A57

Architecturally, the Cortex A57 is much like a tweaked Cortex A15 with 64-bit support. The CPU is still a 3-wide/3-issue machine with a 15+ stage pipeline. ARM has increased the width of the NEON execution units in the Cortex A57 (likely to 128 bits) and enabled support for IEEE-754 double precision floating point, along with some other minor pipeline enhancements. The end result is up to a 20 - 30% increase in performance over the Cortex A15 while running 32-bit code. Running 64-bit code you'll see an additional performance advantage, as the 64-bit register file is far simpler than its 32-bit counterpart.
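
To give a rough sense of what double precision support in a 128-bit NEON unit buys you, here's a short AArch64 C sketch using ACLE intrinsics to process two doubles per vector operation. Treat it as an illustration of the instruction set feature, not of the A57's internal pipeline; the intrinsics shown (vld1q_f64, vfmaq_f64, vst1q_f64) are only available when targeting 64-bit ARM.

    /* Double-precision NEON comes with AArch64: each 128-bit vector holds two
     * doubles. Build with an AArch64 compiler, e.g. aarch64-linux-gnu-gcc -O2. */
    #include <arm_neon.h>
    #include <stdio.h>

    /* y[i] += a * x[i], two doubles per iteration (a DAXPY-style loop). */
    static void daxpy(double a, const double *x, double *y, int n)
    {
        float64x2_t va = vdupq_n_f64(a);
        int i;
        for (i = 0; i + 2 <= n; i += 2) {
            float64x2_t vx = vld1q_f64(&x[i]);
            float64x2_t vy = vld1q_f64(&y[i]);
            vy = vfmaq_f64(vy, va, vx);   /* fused multiply-add, double precision */
            vst1q_f64(&y[i], vy);
        }
        for (; i < n; i++)                /* scalar tail for odd n */
            y[i] += a * x[i];
    }

    int main(void)
    {
        double x[4] = { 1.0, 2.0, 3.0, 4.0 };
        double y[4] = { 0.5, 0.5, 0.5, 0.5 };
        daxpy(2.0, x, y, 4);
        for (int i = 0; i < 4; i++)
            printf("y[%d] = %f\n", i, y[i]);
        return 0;
    }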

The Cortex A57 will support configurations of up to (and beyond) 16 cores for use in server environments. Based on ARM's presentation it looks like groups of four A57 cores will share a single L2 cache.


ARM's Cortex A53

Similarly, the Cortex A53 is a tweaked version of the Cortex A7 with 64-bit support. ARM didn't provide as many details here other than to confirm that we're still looking at a simple, in-order architecture with an 8 stage pipeline. The A53 can be used in server environments as well since it's ISA compatible with the A57.

ARM claims that on the same process node (32nm) the Cortex A53 is able to deliver the same performance as a Cortex A9, but at roughly 60% of the die area. The performance claims apply to both integer and floating point workloads. ARM tells me that it simply reduced a lot of the buffering and data structure sizes, finding more area-efficient ways to deliver the same performance. From looking at Apple's Swift it's very obvious that a lot can be done simply by improving the memory interface of ARM's Cortex A9. It's possible that ARM addressed that shortcoming while balancing out the gains by removing other performance-enhancing elements of the core.

Both CPU cores are able to run 32-bit and 64-bit ARM code, as well as a mix of both so long as the OS is 64-bit.
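
As a quick sketch of what that mix looks like from a developer's perspective, the example below is one C source file built twice, once as 32-bit and once as 64-bit ARM code, the way a 64-bit OS could run both kinds of binaries side by side. The toolchain names are just examples; __aarch64__ and __arm__ are the standard GCC/Clang predefined macros for the two execution states.

    /* Build the same file for both execution states (toolchain names are only
     * examples):
     *   aarch64-linux-gnu-gcc   -o hello64 hello.c    (64-bit: A64 instructions)
     *   arm-linux-gnueabihf-gcc -o hello32 hello.c    (32-bit: A32/T32 instructions)
     */
    #include <stdio.h>

    int main(void)
    {
    #if defined(__aarch64__)
        printf("Built as 64-bit ARM code (%zu-bit pointers)\n", sizeof(void *) * 8);
    #elif defined(__arm__)
        printf("Built as 32-bit ARM code (%zu-bit pointers)\n", sizeof(void *) * 8);
    #else
        printf("Not built for an ARM target\n");
    #endif
        return 0;
    }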

Completed Cortex A57 and A53 core designs will be delivered to partners (including AMD and Samsung) by the middle of next year. Silicon based on these cores should be ready by late 2013/early 2014, with production following 6 - 12 months after that. AMD claimed it would have an ARMv8 based Opteron in production in 2014, which seems possible (although aggressive) based on what ARM told me.

ARM expects the first designs to appear at 28nm and 20nm. There's an obvious path to 14nm as well.

It's interesting to note ARM's commitment to big.LITTLE as a strategy for pushing mobile SoC performance forward. I'm curious to see how the first A15/A7 designs work out. It's also good to see ARM not letting up on pushing its architectures forward.

117 Comments

  • mayankleoboy1 - Tuesday, October 30, 2012

    Great for ARM that it has so many performance-improving avenues open. Unlike x86, which has basically stagnated....
  • A5 - Tuesday, October 30, 2012

    Yeah, a 10% IPC improvement every year is "stagnation".

    ARM is just going through the same super-growth period that x86 went through in the 90s.
  • kylewat - Tuesday, October 30, 2012

    Someone will have to tell me how much Intel's manufacturing accounts for x86 performance vs. ARM. Does Intel's fab and design provide an edge over those who are not as specialized? Does this account for some of their superiority and contribute to monopolistic behavior in regards to x86?
  • extide - Tuesday, October 30, 2012

    In a nutshell, yes.
  • dcollins - Tuesday, October 30, 2012

    Yes and no. Yes, Intel's fab prowess can be harnessed to either improve performance or reduce power consumption.

    No, it is not the reason for the vastly superior performance of x86 chips versus ARM. Intel has been developing CPU architectures for a long time and their designs are extremely good, regardless of fab improvements. Intel is starting to see diminishing returns in IPC improvement, but that's because they're so far ahead of ARM and making architectural improvements gets more and more difficult.
  • Klimax - Wednesday, October 31, 2012

    Just a small reminder that a high percentage change on a small base can be comparable to a small percentage change on a large base. (But high numbers often look more "impressive" to those who don't know...)
  • Kogies - Saturday, November 17, 2012

    Ten years ago it was all about transistor count; nowadays the adage is, "it's not the size of your transistor count, but it's what you do with it that counts!"

    With up to and over 1 billion transistors to play with, optimisation of an x86 core could continue ad infinitum, except that any architecture needs to change in order to keep up with modern applications/usage scenarios. Intel's experience and resources aid them with this, but let's not forget that AMD too has a lot of both. The bottom line, however, is that Intel are able to do more with their transistors; better efficiency makes for a better chip, and better efficiency comes from any of the thousand decisions that make a core.

    How much do Intel's manufacturing and design help? I doubt there is a simple answer to this question. Because there is no end to the possible efficiency, there is always room for one architecture to trump another at any given time (because of clever people and clever decisions); it doesn't take much to be caught napping, have your Net-gun Burst, and become entangled in it yourself (if you follow!). This is why the silicon arms race (or Arm's race to some) is so darn interesting!
  • ssj4Gogeta - Wednesday, October 31, 2012

    It continues to amaze me how Intel keep increasing the IPC every year while AMD desperately try to tackle the problem by throwing more cores at it.
  • Pipperox - Wednesday, October 31, 2012

    This is not entirely true.
    What you say is correct for the six-core Thuban, but not for the Bulldozer and Piledriver/Trinity architectures.
    They didn't simply "throw more cores at it"; they designed new cores which are smaller and simpler and share resources, so that they could increase the number of cores without creating a monster chip in terms of size (which would have been difficult and expensive to produce, and would have dissipated a lot more power).
    So in the end Intel and AMD took two different strategies to increase performance: Intel increased the efficiency of its existing cores, while AMD changed architecture, creating cores which are individually slower but easier to pack onto a chip in greater numbers.

    Also, with Trinity/Piledriver AMD achieved a solid 15% IPC improvement without increasing core count (sort of like an Intel "tock").
  • maroon1 - Friday, November 02, 2012

    "without creating a monster chip in terms of size"

    The die size of Bulldozer/Piledriver is 315mm^2

    Ivy Bridge with the HD 4000 iGPU is only 160mm^2

    That's almost twice as big as Ivy Bridge.
