The Move to 64-bit

Prior to the iPhone 5s launch, I heard a rumor that Apple would move to a 64-bit architecture with its A7 SoC. I initially discounted the rumor given the pain of moving to 64-bit from a validation standpoint and the upside not being worth it. Obviously, I was wrong.

In the PC world, most users are familiar with the 64-bit transition as something AMD started in the mid-2000s. The primary motivation back then was to enable greater memory addressability by moving from 32-bit addresses (2^32 or 4GB) to 64-bit addresses (2^64 or 16EB). Supporting up to 16 exabytes of memory from the get go seemed a little unnecessary, so AMD’s x86-64 ISA only uses 48-bits for unique memory addresses (256TB of memory). Along with the move from x86 to x86-64 came some small performance enhancements thanks to more available general purpose registers in 64-bit mode.

In the ARM world, the move to 64-bit is motivated primarily by the same factor: a desire for more memory. Remember that ARM and its partners have high hopes of eating into Intel’s high margin server business, and you really can’t play there without 64-bit support. ARM has already announced its first two 64-bit architectures: the Cortex A57 and Cortex A53. The ISA itself is referred to as ARMv8, a logical successor to the present day 32-bit ARMv7.

Unlike the 64-bit x86 transition, ARM’s move to 64-bit comes with a new ISA rather than an extension of the old one. The new instruction set is referred to as A64, while a largely backwards compatible 32-bit format is called A32. Both ISAs can be supported by a single microprocessor design, as ARMv8 features two architectural states: AArch32 and AArch64. Designs that implement both states can switch/interleave between the two states on exception boundaries. In other words, despite A64 being a new ISA you’ll still be able to run old code alongside it. As always, in order to support both you need an OS with support for A64. You can’t run A64 code on an A32 OS. It is also possible to do an A64/AArch64-only design, which is something some server players are considering where backwards compatibility isn’t such a big deal.

Cyclone is a full implementation of ARMv8 with both AArch32 and AArch64 states. Given Apple’s desire to maintain backwards compatibility with existing iOS apps and not unnecessarily fragment the ARM ecosystem, simply embracing ARMv8 makes a lot of sense.

The motivation for Apple to go 64-bit isn’t necessarily one of needing more address space immediately. A look at Apple’s historical scaling of memory capacity tells us everything we need to know:

At best Apple doubled memory capacity between generations, and at worst it took two generations before doubling. The iPhone 5s ships with 1GB of LPDDR3, keeping memory capacity the same as the iPhone 5, iPad 3 and iPad 4. It’s pretty safe to assume that Apple will go to 2GB with the iPhone 6 (and perhaps iPad 5), and then either stay there for the 6s or double again to 4GB. The soonest Apple would need 64-bit from a memory addressability standpoint in an iOS device would be 2015, and the latest would be 2016. Moving to 64-bit now preempts Apple’s hardware needs by 2 full years.

The more I think about it, the more the timing actually makes a lot of sense. The latest Xcode beta and LLVM compiler are both ARMv8 aware. Presumably all apps built starting with the official iOS 7 release and going forward could be built 64-bit aware. By the time 2015/2016 rolls around and Apple starts bumping into 32-bit addressability concerns, not only will it have navigated the OS transition but a huge number of apps will already be built for 64-bit. Apple tends to do well with these sorts of transitions, so starting early like this isn’t unusual. The rest of the ARM ecosystem is expected to begin moving to ARMv8 next year.

Apple isn’t very focused on delivering a larger memory address space today however. As A64 is a brand new ISA, there are other benefits that come along with the move. Similar to the x86-64 transition, the move to A64 comes with an increase in the number of general purpose registers. ARMv7 had 15 general purpose registers (and 1 register for the program counter), while ARMv8/A64 now has 31 that are each 64-bits wide. All 31 registers are accessible at all times. Increasing the number of architectural registers decreases register pressure and can directly impact performance. The doubling of the register space with x86-64 was responsible for up to a 10% increase in performance.

The original ARM architecture made all instructions conditional, which had a huge impact on the instruction space. The number of conditional instructions is far more limited in ARMv8/A64.

The move to ARMv8 also doubles the number of FP/NEON registers (from 16 to 32) as well as widens all of them registers to 128-bits (up from 64-bits). Support for 128-bit registers can go a long way in improving SIMD performance. Whereas simply doubling register count can provide moderate increases in performance, doubling the size of each register can be far more significant given the right workload. There are also new advanced SIMD instructions that are a part of ARMv8. Double precision SIMD FP math is now supported among other things.

ARMv8 also adds some new cryptographic instructions for hardware acceleration of AES and SHA1/SHA256 algorithms. These hardware AES/SHA instructions have the potential for huge increases in performance, just like we saw with the introduction of AES-NI on Intel CPUs a few years back. Both the new advanced SIMD instructions and AES/SHA instructions are really designed to enable a new wave of iOS apps.

Many A64 instructions mode can also work with 32-bit operands, with properly implemented designs simply power gating unused bits. The A32 implementation in ARMv8 also adds some new instructions, so it’s possible to compile AArch32 apps in ARMv8 that aren’t backwards compatible. All existing ARMv7 and 32-bit Thumb code should work just fine however.

On the software side, iOS 7 as well as all first party apps ship already compiled for AArch64 operation. In fact, at boot, there isn’t a single AArch32 process running on the iPhone 5s:

Safari, Mail, everything all made the move to 64-bit right away. Given the popularity of these first party apps, it’s not just the hardware that’s 64-bit ready but much of the software is as well. The industry often speaks about Apple’s vertically integrated advantage, this is quite possibly the best example of that advantage. In many ways it reminds me of the Retina Display transition on OS X.

Running A32 and A64 applications in parallel is seamless. On the phone itself, it’s impossible to tell when you’re running in a mixed environment or when everything you’re running is 64-bit. It all just works.

I didn’t run into any backwards compatibility issues with existing 32-bit ARMv7 apps either. From an end user perspective, navigating the 64-bit transition is as simple as buying an iPhone 5s.

64-bit Performance Gains

Geekbench 3 was among the first apps to be updated with ARMv8 support. There are some minor changes between the new version of Geekbench 3 and its predecessor (3.1/3.0), however the tests themselves (except for the memory benchmarks) haven't changed. What this allows us to do is look at the impact of the new ARMv8 A64 instructions and larger register space. We'll start with a look at integer performance:

Apple A7 - AArch64 vs. AArch32 Performance Comparison
  32-bit A32 64-bit A64 % Advantage
AES 91.5 MB/s 846.2 MB/s 825%
AES MT 180.2 MB/s 1640.0 MB/s 810%
Twofish 59.9 MB/s 55.6 MB/s -8%
Twofish MT 119.1 MB/s 110.2 MB/s -8%
SHA1 138.0 MB/s 477.3 MB/s 245%
SHA1 MT 275.7 MB/s 948.9 MB/s 244%
SHA2 86.1 MB/s 102.2 MB/s 18%
SHA2 MT 171.3 MB/s 203.7 MB/s 18%
BZip2 Compress 4.36 MB/s 4.52 MB/s 3%
BZip2 Compress MT 8.57 MB/s 8.86 MB/s 3%
BZip2 Decompress 5.94 MB/s 7.56 MB/s 27%
BZip2 Decompress MT 11.7 MB/s 15.0 MB/s 28%
JPEG Compress 15.5 MPixels/s 16.8 MPixels/s 8%
JPEG Compress MT 30.8 MPixels/s 33.3 MPixels/s 8%
JPEG Decompress 36.0 MPixels/s 40.3 MPixels/s 11%
JPEG Decompress MT 71.3 MPixels/s 78.1 MPixels/s 9%
PNG Compress 0.84 MPixels/s 1.14 MPixels/s 35%
PNG Compress MT 1.67 MPixels/s 2.26 MPixels/s 35%
PNG Decompress 13.9 MPixels/s 15.2 MPixels/s 9%
PNG Decompress MT 27.4 MPixels/s 29.8 MPixels/s 8%
Sobel 59.3 MPixels/s 58.0 MPixels/s -3%
Sobel MT 116.6 MPixels/s 114.6 MPixels/s -2%
Lua 1.25 MB/s 1.33 MB/s 6%
Lua MT 2.47 MB/s 2.49 MB/s 0%
Dijkstra 5.35 MPairs/s 4.05 MPairs/s -25%
Dijkstra MT 9.67 MPairs/s 7.26 MPairs/s -25%

The AES and SHA1 gains are a direct result of the new cryptographic instructions that are a part of ARMv8. The AES test in particular shows nearly an order of magnitude performance improvement. This is similar to what we saw in the PC space with the introduction of Intel's AES-NI support in Westmere. The Dijkstra workload is the only real regression. That test in particular appears to be very pointer heavy, and the increase in pointer size from 32 to 64-bit increases cache pressure and causes the reduction in performance. The rest of the gains are much smaller, but still fairly significant if you take into account the fact that we're just looking at what you get from a recompile. Add these gains to the ones you're about to see over Apple's A6 SoC and A7 is looking really good from a performance standpoint.

If the integer results looked good, the FP results are even better:

Apple A7 - AArch64 vs. AArch32 Performance Comparison
  32-bit A32 64-bit A64 % Advantage
BlackScholes 4.73 MNodes/s 5.92 MNodes/s 25%
BlackScholes MT 9.57 MNodes/s 12.0 MNodes/s 25%
Mandelbrot 930.2 MFLOPS 929.9 MFLOPS 0%
Mandelbrot 1840 MFLOPS 1850 MFLOPS 0%
Sharpen Filter 805.1 MFLOPS 857 MFLOPS 6%
Sharpen Filter MT 1610 MFLOPS 1710 MFLOPS 6%
Blur Filter 1.08 GFLOPS 1.26 GFLOPS 16%
Blur Filter MT 2.15 GFLOPS 2.47 GFLOPS 14%
SGEMM 3.09 GFLOPS 3.34 GFLOPS 8%
SGEMM MT 6.08 GFLOPS 6.56 GFLOPS 7%
DGEMM 0.56 GFLOPS 1.66 GFLOPS 195%
DGEMM MT 1.11 GFLOPS 3.24 GFLOPS 191%
SFFT 0.72 GFLOPS 1.59 GFLOPS 119%
SFFT MT 1.44 GFLOPS 3.17 GFLOPS 120%
DFFT 1.41 GFLOPS 1.47 GFLOPS 4%
DFFT MT 2.78 GFLOPS 2.91 GFLOPS 4%
N-Body 460.8 KPairs/s 582.6 KPairs/s 26%
N-Body MT 917.6 KPairs/s 1160.0 KPairs/s 26%
Ray Trace 1.52 MPixels/s 2.31 MPixels/s 51%
Ray Trace MT 3.04 MPixels/s 4.64 MPixels/s 52%

The DGEMM operations aren't vectorized under ARMv7, but they are under ARMv8 thanks to DP SIMD support so you get huge speedups there from the recompile. The SFFT workload benefits handsomely from the increased register space, significantly reducing the number of loads and stores (there's something like a 30% reduction in instructions for the A64 codepath compared to the A32 codepath here). The conclusion? There are definitely reasons outside of needing more memory to go 64-bit.

A7 and OS X

Before I spent time with the A7 I assumed the only reason Apple would go 64-bit in mobile is to prepare for eventually deploying these chips into larger machines. A couple of years ago, when the Apple/Intel relationship was at its rockiest I would've definitely said that's what was going on. Today, I'm far less convinced. 

Apple continues to build its own SoCs and invest in them because honestly, no one else seems up to the job. Only recently do we have GPUs competitive with what Apple has been shipping, and with the A7 Apple nearly equals Intel's performance with Bay Trail on the CPU side. As far as Macs go though, there's still a big gap between the A7 and where Intel is at with Haswell. The deficiency that Intel had in the ultra mobile space simply doesn't translate to its position with the big Core chips. I don't see Apple bridging that gap anytime soon. On top of that, the Apple/Intel relationship is very good at this point.

Although Apple could conceivably keep innovating to the point where an A-series chip ends up powering a Mac, I don't think that's in the cards today.

After Swift Comes Cyclone CPU Performance
POST A COMMENT

466 Comments

View All Comments

  • MatthiasP - Tuesday, September 17, 2013 - link

    Wow, first real review on the web AND deep as always, a very nice job from Anand. :) Reply
  • sfaerew - Wednesday, September 18, 2013 - link

    Benchmarks(GFXBench 2.7,3DMark.Basemark X.etc.) are AArch64 version?
    There are 30~40% performance gap between v32geekbench and v64geekbench.
    INT(ST)1471 vs 1065.
    FP(ST)1339 vs 983
    Reply
  • Wilco1 - Wednesday, September 18, 2013 - link

    And Bay Trail Geekbench at 2.4GHz: 1063 (INT), 866 (FP)

    So A7 has beaten BT already by a huge margin despite BT not even being for sale yet...
    Reply
  • TraderHorn - Wednesday, September 18, 2013 - link

    You're comparing 64bit A7 vs 32bit BT. The 32bit #s are dead even. It'll be interesting to see if BT gets a similar performance boost when Win8 64bit versions are released in 1h 2014. Reply
  • Wilco1 - Wednesday, September 18, 2013 - link

    BT's 32-bit result includes hardware accelerated AES, which skews its score (without it, its score is ~936). The 64-bit A7 result does also use hardware acceleration, so it is more comparable.

    Yes BT will get a speedup from 64-bit as well, but won't be nearly as much as A7 gets: its 32-bit result already has the AES acceleration, and x64 nearly isn't as different from x86 as A64 is from A32.

    However the interesting things is that not even in 32-bit A7 wins by a good margin, but that it wins despite running at almost half the frequency of Bay Trail... Forget about Bay Trail, this is Haswell territory - the MacBook Air with the 15W 3.3GHz i7-4650U scores 3024 INT and 3003 FP.

    Now imagine a quad core tablet/laptop version of the A7 running at 2GHz on TSMC 20nm next year.
    Reply
  • smartypnt4 - Wednesday, September 18, 2013 - link

    Why does the frequency matter? If the TDP of the chips are similar (Bay Trail was tested and verified by Anand as using 2.5W at the SoC level under load), who gives a flip about the frequency?

    If Apple wanted to double the frequency of the chip, they'd need something on the order of 4x the amount of power it already consumes (assuming a back-of-the-napkin quadratic relationship, which is approximately correct), putting it at ~6-8W or so at full load. That's assuming such a scaling could even be done, which is unlikely given that Apple built the thing to run at 1.3GHz max. You can't just say "oh, I want these to switch faster, so let's up the voltage." There's more that goes in to the ability to scale voltage than just the process node you're on.

    Now, I will agree that this does prove that if Apple really wanted to, they could build something to compete with Haswell in terms of raw throughput. Next year's A8 or whatever probably will compete directly with Haswell in raw theoretical integer and FP throughput, if Apple manages to double performance again. That's not a given since they had to use ~50% more transistors to get a performance doubling from the A6 to the A7, and building a 1.5B transistor chip is nontrivial since yields are inversely proportional to the number of transistors you're using.

    Next year will be really interesting, though. What with Apple's next stuff, Broadwell, the first A57 designs, Airmont, and whatever Qualcomm puts out (haven't seen anything on that, which is odd for Qualcomm.)
    Reply
  • Wilco1 - Wednesday, September 18, 2013 - link

    Frequency & process matters. Current phones use about 2W at max load without the screen (see recent Nexus 7 test), so the claimed 2.5W just for BT is way too much for a phone. That means (as you explained) it must run at a lower frequency and voltage to get into phones - my guess we won't see anything faster than the Z3740 with a max clock of 1.8GHz. Therefore the A7 will extend its lead even further.

    According to TSMC 20nm will give a 30% frequency boost at the same power. So I'd expect that a 2GHz A7 would be possible on 20nm using only 35% more power. That means the A7 would get 75% more performance at a small cost in power consumption. This is without adding any extra transistors.

    Add some tweaks (like faster memory) and such a 2GHz A7 would be similar in performance as the 15W Haswell in MacBook Air. So my point is that with a die shrink and a slight increase in power they already have a Haswell competitor.
    Reply
  • smartypnt4 - Wednesday, September 18, 2013 - link

    Frequency and process matter in that they affect power consumption. If Intel can get Bay Trail to do 2.4GHz on something like 1.0V, then the power should be fine. Current Haswell stuff tops out its voltage around 1.1V or so in laptops (if memory serves), so that's not unreasonable.

    All of this assumes Geekbench is valid for comparing HSW on Win8 to ARMv8/Cyclone on iOS, which I have serious reservations about attempting to do.

    The other issue I have is this: you're talking about a 50% clock boost giving a 100% increase in performance if we look at the Geekbench scores. That's simply not possible. Had you said "raise the clock to 1.6-1.7GHz and give it 4 cores," I'd be right behind you in a 2x theoretical performance increase. But a 50% clock boost will never yield a 100% increase with the same core, even if you change the memory controller.

    Also, somehow your math doesn't add up for power... Are you hypothesizing that a 2GHz A7 (with 75% of the performance of Haswell 15W, not the same - as per Geekbench) can pull 2.6W while Haswell needs 15W to run that test? Granted, Haswell integrates things that the A7 doesn't. Namely, more advanced I/O (PCIe, SATA, USB, etc.), and the PCH. Using very fuzzy math, you can claim all of that uses 1/2 the power of the chip.

    That brings Haswell's power for compute down to 7-8W, more or less. And you're going to tell me that Apple has figured out how to get 75% of the performance of a 7W part in 2.6W, and Intel hasn't? Both companies have ~100k employees. One is working on a ton of different stuff, and one makes processors, basically exclusively (SSDs and WiFi stuff too, but processors is their main drive). You're telling me that a (relatively) small cadre of guys at Apple have figured out how to do it, and Intel hasn't done it yet on a part that costs ~6x as much after trying to get deep into the mobile space for years. I find that very hard to believe.

    Even with the 14nm shrink next year, you're talking about a 30% power savings for Intel's stuff. That brings the 15W total down to 10.5W, and the (again, super, ridiculously fuzzy) computing power to ~5-6W. On a full node smaller than what Apple has access to. And you're saying they'd hypothetically compete in throughput with a 2.6W part. I'm not sure I believe that.

    Then again, I suppose theoretical bandwidth could be competitive. That's simply a factor of your peak IPC, not your average IPC while the device is running. I don't know enough about the low level architecture of the A7 (no one does), so I'll just leave it here I guess.

    I'm gonna go now... I'm starting to reason in circles.
    Reply
  • Wilco1 - Wednesday, September 18, 2013 - link

    The sort of "simple" tweaks I was thinking of are: an improved memory controller and prefetcher, doubling of L2, larger branch predictor tables. Assuming a 30% gain due to those tweaks, the result is a 100% speedup at 2GHz (1.3 to 2.0 GHz is a 54% speedup, so you get 1.54 * 1.3 = 2.0x perf). The 30% gain due to tweaks is pure speculation of course, however NVidia claims 15-30% IPC gain for similar tweaks in Tegra 4i, so it's not entirely implausible. As you say a much simpler alternative would be just to double the cores, but then your single threaded performance is still well below that of Haswell.

    You can certainly argue some reduction in the 15W TDP of Haswell due to IO, however with Turbo it will try to use most of that 15W if it can (the Air goes up to 3.3GHz after all).

    Yes I am saying that a relative newcomer like Apple can compete with Intel. Intel may be large, but they are not infallible, after all they made the P4, Itanium and Atom. A key reason AMD cited for moving into ARM servers was that designing an ARM CPU takes far less effort than an equivalent performing x86 one. So the ISA does still matter despite some claiming it no longer does.
    Reply
  • smartypnt4 - Wednesday, September 18, 2013 - link

    My point wasn't that Apple can't compete; far from it. If anything, the A7 shows they can compete for the most part. However, what you suggest is that Apple could theoretically have the same performance as Intel on a full node process larger at half the power. I

    have no illusions that Intel is infallible. Stuff like Larrabee and the underwhelming GPU in Bay Trail prove that they aren't. I just seriously doubt that Apple could beat Intel at its own game. Specifically, in CPU performance, which is an area it's dominated for years. It's possible, but I find it relatively unlikely, especially this early in Apple's lifetime as a chip designer.

    On a different note, after looking at the Geekbench results more, I feel like it's improperly weighted. The massive performance improvement in AES and SHA encryption may be skewing the overall result... I need to dig more in to Geekbench before coming to an actual conclusion. I'm also still not convinced that comparing cross-platform results is actually valid. I'd like to believe it is, but I've always had reservations about it.
    Reply

Log in

Don't have an account? Sign up now