The ARM Diaries, Part 2: Understanding the Cortex A12

Author: Anand Lal Shimpi

by Anand Lal Shimpi on July 17, 2013 12:30 PM EST



Introduction to Cortex A12 & The Front End

At a high level, ARM’s Cortex A12 is a dual-issue, out-of-order microarchitecture with an integrated L2 cache and multi-core support.

The members of the Cortex A12 team all previously worked on Cortex A9. ARM views the resulting design not as a derivative of Cortex A9, but as clearly inspired by it. Cortex A12 features a 10 - 12 stage integer pipeline, a lengthening of Cortex A9’s 8 - 11 stage pipeline. The architecture is still 2-wide and out-of-order, but unlike Cortex A9 the new tweener is fully out-of-order, including load/store (within reason) and FP/NEON.

Cortex A12 retains feature and ISA compatibility with ARM’s Cortex A7 and A15, making it the new middle child in the updated microprocessor family. All three parts support 40-bit physical addressing, the same 128-bit AXI4 bus interface and the same 32-bit ARMv7-A instruction set (NEON is standard on Cortex A12). The Cortex A12 is so compatible with A7 and A15 that it’ll eventually be offered in a big.LITTLE configuration with a cluster of Cortex A7 cores (initial versions lack the coherent interface required for big.LITTLE).
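
To put the 40-bit physical addressing in perspective, the arithmetic below compares it with a plain 32-bit physical address space. This is only a back-of-the-envelope sketch: the function name is mine, not anything from ARM, and it assumes byte-addressed memory.

```python
def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a byte-addressed physical address of this width."""
    return 2 ** address_bits


GIB = 2 ** 30  # one gibibyte

for bits in (32, 40):
    print(f"{bits}-bit physical addressing: {addressable_bytes(bits) // GIB} GiB")
```

In other words, 40 bits of physical address covers 256 times the memory that 32 bits does, even though each individual 32-bit process still sees a 32-bit virtual address space.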

In the Cortex A9, ARM had a decoupled L2 cache that required some OS awareness. The Cortex A12 design moves to a fully integrated L2, similar to the A7/A15. The L2 cache operates on its own voltage and frequency planes, although the latter can be in sync with the CPU cores if desired. The L2 cache is shared among up to four cores. Larger core count configurations are supported through replication of quad-core clusters.

The L1 instruction cache is 4-way set associative and configurable in size (32KB or 64KB). The cache line size in Cortex A12 was increased to 64 bytes (from 32B in Cortex A9) to better align with DDR memory controllers as well as the Cortex A7 and A15 designs. Similar to Cortex A9 there’s a fully associative instruction micro TLB and unified main TLB, although I’m not sure if/how the sizes of those two structures have changed.
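
The figures above are enough to work out the L1 instruction cache’s geometry. The sketch below applies the textbook set-associative arithmetic (sets = size / (ways x line size)) to both configurable sizes. The helper name is hypothetical, and the assumption of a conventionally indexed cache with power-of-two dimensions is mine rather than something ARM has confirmed.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Set count and address-bit split for a set-associative cache.

    Assumes size, ways and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


for size_kb in (32, 64):
    geo = cache_geometry(size_kb * 1024, ways=4, line_bytes=64)
    print(f"{size_kb}KB, 4-way, 64B lines: {geo['sets']} sets, "
          f"{geo['index_bits']} index bits, {geo['offset_bits']} offset bits")
```

Under those assumptions the 32KB configuration has 128 sets and the 64KB configuration has 256, with the low 6 address bits selecting a byte within the 64-byte line in both cases.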

The branch predictor was significantly improved over Cortex A9. Apparently, in the design of the Cortex A12, ARM underestimated the core’s overall performance and ended up speccing it with too weak a branch predictor. About three months ago ARM realized its mistake and was left with the difficult choice of either shipping a less efficient design or quickly finding a suitable branch predictor. The Cortex A12 team went through the company looking for an already-designed predictor it could use, eventually finding one in the Cortex A53. The A53’s predictor got pulled into the Cortex A12 and, with some minor modifications, will be what the design ships with. Improved branch prediction obviously improves power efficiency as well as performance.


65 Comments


haukionkannel - Wednesday, July 17, 2013 - link
I am interested in how this A12 compares to A53 speed-wise... A53 has wider registers (64 bit) but A12 runs at higher frequencies and has more cores?
Wilco1 - Wednesday, July 17, 2013 - link
A12 will beat A53 by about 40%: A53 delivers performance comparable to A9 and A12 is 40% faster than A9. Note that 64-bit is not relevant and certainly doesn't provide a big speedup - even on x86 almost all software is still 32-bit as there is little to gain from going to 64-bit.
jwcalla - Wednesday, July 17, 2013 - link
Well this isn't entirely accurate as some of us are running almost completely pure 64-bit systems. And it does appear that video encoding and decoding are common operations that can benefit from 64-bit software from some of the benchmarks I've seen, but that might actually have been from compiler options so maybe not. But otherwise yes, the 64-bit ARM chips are only really important for server type workloads where people already have 64-bit software that they don't want to rework.
Wilco1 - Wednesday, July 17, 2013 - link
64-bit has pros and cons. x64 provides more registers so some applications run faster when built for 64-bit - that may well be the video codecs you mention. However there are downsides as well, as all your pointers double in size, which slows down pointer heavy code. On 64-bit ARM things will be similar.

Note the main reason for 64-bit is allowing more than 4GB of memory. The latest 32-bit ARMs already support this, so for mobiles/tablets etc there is no need to change to 64-bit.
wumpus - Thursday, July 18, 2013 - link
Er, no. Just no.
From what I can tell, 32 bit ARM chips use (from a developer's view) the exact same mechanism as x86 and PAE. This might use 4G of RAM efficiently (and for those OSs that like to leave all apps in RAM, it might work well for a bit more).
Trying to address more memory than an integer register can map to is always going to be an unholy kludge (although I would personally recommend all computer architects design such a system into the architecture because it *will* happen if the architecture succeeds at all). Since ARM chips tend to go into machines that rarely allow memory to be upgraded, no vendor really should be selling machines with >4G RAM and 32 bit accessing. The size/power/cost tradeoff isn't worth it.

Google "Support for ARM LPAE". All my links go straight to the pdfs and I wind up with all the Google code embedded in my links.
Calinou__ - Thursday, July 18, 2013 - link
PAE doesn't allow for more than 3 GB per process. ;)
Wilco1 - Thursday, July 18, 2013 - link
A15 based servers will have 8-32GB. So yes it does go well over 4GB, that's the whole point of PAE. Mobiles will end up with 4GB RAM soon and because of PAE there is no need to use 64-bit CPUs (which would be way overkill for a mobile).

Yes I know about ARM LPAE and that it is supported in Linux.
fteoath64 - Friday, July 19, 2013 - link
"PAE there is no need to use 64-bit CPUs". Agreed. More effort to be spend on optimizing for speed and power efficiency.
jensend - Friday, July 19, 2013 - link
"Cortex A12 retains the two integer pipelines of the Cortex A9, but adds support for integer divides (like the A7 and A15, other A-series architectures generally lacked support for hardware int divides)"

The parenthetical material is very unclear. It sounds like it's saying "The A12 has hardware divides, the A15 and A7 don't, and other A-series archs likewise don't." A simple edit makes the sentence much more clear and slightly more concise:

"Cortex A12 retains the two integer pipelines of the Cortex A9; however, like the A7 and A15, it adds support for hardware integer divides (which previous A-series architectures generally lacked)."
phoenix_rizzen - Friday, July 19, 2013 - link
An even simpler edit is to just change the parenthetical comma into a semi-colon:

"Cortex A12 retains the two integer pipelines of the Cortex A9, but adds support for integer divides (like the A7 and A15; other A-series architectures generally lacked support for hardware int divides)"
