Every fall, a new iPhone. Apple’s mobile hardware release schedule has become quite predictable by now, but the company more than makes up for that predictability in timing with the sheer unpredictability of what it introduces each year, and this fall it outdid itself. Between Touch ID and the M7 coprocessor, the iPhone 5S had plenty of surprises, but what most intrigued and excited the development community was its new, 64-bit processor. At the same time, many have openly wondered what the point of this feature actually was, including some of the very people excited by it (there is no contradiction in that). So I set out to collect answers, and here is what I found.
Before we begin, I strongly recommend you read Friday Q&A 2013-09-27: ARM64 and You, then get the ARMv8-A Reference Manual (don’t worry about the “registered ARM customers” part: you only need to create an account and agree to a license to download the PDF); keep it on hand to refer to whenever I mention an instruction or architectural feature.
iOS devices have always been based on ARM architecture processors. So far, ARM processors have been strictly 32-bit machines: 32-bit general registers, addresses, most calculations, etc. But in 2011, ARM Holdings announced ARMv8, the first version of the ARM architecture to allow native 64-bit processing, and in particular 64-bit addresses in programs. This was clearly, and by ARM’s own admission, done well ahead of when it would actually be needed, so that the whole ecosystem (board vendors, debug tools, ISVs, open-source projects, etc.) would have time to adopt it. In fact, ARM did not announce at the time any of their own processor implementations of the new architecture (something they also do ahead of need), leaving to some of their partners, server processor designers in particular, the honor of releasing the first ARMv8 processor implementations. I’m not sure any device using an ARMv8 processor design from ARM Holdings has even shipped yet.
All that to say that while many people had known about 64-bit ARM for some time, almost no one expected Apple to release an ARMv8-based consumer device so far ahead of when such a thing was thought to be actually needed; Apple was not merely first to market with an ARMv8 handheld, it lapped every other handset maker and their processor suppliers. That naturally raises the question of what exactly Apple gets from having a 64-bit ARM processor in the iPhone 5S, since none of its competitors or their suppliers saw a benefit that would justify rushing to ARMv8 as Apple did. This is a legitimate question, so let us see what benefits we can identify.
First Candidate: a Larger Address Space
Let us start with the obvious: the ability for a single program to address more than 4 GB of address space. Or rather, let us start by killing the notion that this is only about devices with more than 4 GB of physical RAM. You do not, absolutely not, need a 64-bit processor to build a device with more than 4 GB of RAM; for instance the Cortex-A15, which implements ARMv7-A with a few extensions, can address up to 1 TB of physical RAM, even though it is a decidedly 32-bit processor. What is true is that, with a 32-bit processor, no single program can use more than 4 GB of that RAM at once; so while such an arrangement is very useful in server or desktop multitasking scenarios, its benefits are more limited on a mobile device, where you don’t typically expect background programs to keep using a lot of memory. Handsets and tablets will therefore likely need to move to 64-bit processors when they start packing more than 4 GB of RAM, so that the frontmost program can actually make use of that RAM.
However, that does not mean the benefits of a large virtual address space are limited to that situation. Indeed, using virtual memory, an iOS program can very well use large datasets that do not actually fit in RAM, by calling mmap(2) to map the files containing the data into virtual memory and letting the virtual memory subsystem handle the RAM as a cache. Currently, it is problematic on iOS to map files more than a few hundred megabytes in size, because the program address space is limited to 4 GB (and what is left of it is not necessarily one big contiguous chunk you can map a single file into).
That being said, in my opinion the usefulness of mapping gigabyte-scale files on 64-bit iOS devices will for now be limited to a few niche applications, if only because, while these files won’t necessarily have to fit in RAM, they will have to fit on the device’s Flash storage in the first place; with the largest capacity available on an iOS device at the time of this writing being 128 GB, and most people settling for less, you’d better not have too many such applications installed at once. Still, for those applications which do need it, likely mostly in vertical markets, the larger address space alone makes 64-bit ARM a godsend.
One sometimes-heard benefit of Apple already pushing ordinary applications (those which don’t map big files) to adopt 64-bit ARM is that, the day Apple releases a device featuring more than 4 GB of RAM, these applications will benefit without needing to be updated. I don’t buy it. It is in the best interest of iOS applications not to spontaneously occupy too much memory, for instance so that they do not get killed first when they are in the background; so in order to take advantage of an iOS device with more RAM than you can shake a stick at, they would have to change behavior and be updated anyway, so…
Verdict: inconclusive for most apps, a godsend for some niche apps.
Second Candidate: the ARMv8 AArch64 A64 ARM64 Instruction Set
The ability to do 64-bit processing on ARM comes as part of a new instruction set, called…
- Well, it’s not called ARMv8: not only does ARMv8 also bring improvements to the ARM and Thumb instruction sets (more on that later), which for the occasion have been renamed A32 and T32, respectively; but ARMv9, whenever it comes out, will also feature 64-bit processing, so we can’t refer to 64-bit ARM as ARMv8.
- AArch64 is the name of the execution mode of the processor where you can perform 64-bit processing, and it’s a mouthful so I won’t be using the term.
- A64 is the name ARM Holdings gives to the new instruction set, a peer to A32 and T32; but this term does not make it clear we’re talking about ARM, so I won’t be using it.
- Apple in Xcode uses ARM64 to designate the new instruction set, so that’s the name we will be using.
The instruction set is quite a change from ARM/A32: for one, the instruction encodings are completely different, and a number of features, most notably pervasive conditional execution, have been dropped. On the other hand, you get access to 31 general-purpose registers that are of course 64 bits wide (as opposed to the 14 general-purpose registers ARM/A32 gives you once you remove the PC and SP); some instructions can do more; you get much better support for 64-bit math; and what does remain has the same semantics as in ARM/A32 (for instance the classic NZCV flags are still here and behave the exact same way), so it’s not completely unfamiliar territory.
ARM64 could be the subject of an entire blog post, so let us stick to some highlights:
Ah, let’s get that one out of the way first. You might have noticed this little guy in the ARMv8-A Reference Manual (page 449). Bad news, folks: support for this instruction is optional, and the iPhone 5S processor does not in fact support it (trust me, I tried). Maybe next time.
ARM64 features 31 general-purpose registers, as opposed to 14, which helps the compiler avoid running out of registers and having to “spill” variables to memory, which can be a costly operation in the middle of a loop. If you remember, the expansion from 8 to 16 registers in the x86 64-bit transition on the Mac back in the day measurably improved performance; here, however, the impact will be smaller, as 14 registers were already enough for most tasks, and ARM already had a register-based parameter-passing convention in 32-bit mode. Your mileage will vary.
While you could do some 64-bit operations in previous versions of the ARM architecture, this was limited and slow; in ARM64, 64-bit math is natively and efficiently supported, so programs using 64-bit math will definitely benefit from ARM64. But which programs are those? After all, most of the amounts a program ever needs to track do not go over a billion, and therefore fit in a 32-bit integer.
Besides some specialized tasks, two notable examples of tasks that do make use of 64-bit math are MP3 processing and most cryptographic calculations. So those tasks will benefit from ARM64 on the iPhone 5S.
But on the other hand, all iPhones have always had dedicated MP3 processing hardware (which is relatively straightforward to use with Audio Queues and CoreAudio), which in particular is more power efficient to use than the main processor for this task, and ARMv8 also introduces dedicated AES and SHA accelerator instructions, which are theoretically available from ARM/A32 mode, so ARM64 was not necessary to improve those either.
But on the other other hand, there are other tasks that use 64-bit math and do not have dedicated accelerators. Moreover, standards evolve: some get phased out and others appear, and a perfect example is the recently announced SHA-3 standard, based on Keccak. Such new standards generally take time to make their way into dedicated accelerators, and obviously such accelerators cannot be built into devices released before the standard existed. Software has no such limitation, and it just so happens that Keccak, for instance, benefits from a 64-bit processor. So 64-bit math matters for emerging and future standards that can take advantage of it, even for standards specialized enough to eventually warrant their own dedicated hardware, since software will always be necessary to deploy them after the fact.
ARMv8 also brings improvements to NEON, and while some of the new instructions, such as VRINT, are also theoretically available in 32-bit mode, surprisingly some improvements are exclusive to ARM64: for instance the ability to operate on double-precision floating-point data, and interesting new instructions to accumulate unsigned values to an accumulator which will saturate as a signed amount, and conversely. (Contrast with x86, where all vector extensions so far, including future ones like AVX-512, are equally available in 32-bit and 64-bit mode, even though in 32-bit mode this requires incredible contortions, given how saturated the 32-bit x86 instruction encoding map is.) Moreover, in ARM64 the number of 128-bit vector registers increases from 16 to 32, which is much more useful than the similar increase in the number of general-purpose registers, as SIMD calculations typically involve many vectors. I will talk about this some more in a future update to my NEON post.
Pointer size doubling
It has to be mentioned: a tradeoff of ARM64 is that pointers are twice as large, taking more space in memory and in the caches. Still, iOS programs are more media-heavy than pointer-heavy, so it shouldn’t be too bad (just make sure to monitor the effect when you start building for ARM64).
Verdict: a nice win on NEON, native 64-bit math is a plus for some specialized tasks and for the future, and other factors are inconclusive: in my limited testing I have not observed performance changes from merely switching (non-NEON, non-64-bit-math) code from ARMv7 compilation to ARM64 compilation and running it on the same hardware (iPhone 5S).
Non-Candidate: Unrelated Processor Implementation Improvements
Speaking of which. Among the many dubious “evaluations” on the web of the iPhone 5S 64-bit feature, I saw some try to isolate the effect of ARM64 by comparing a run of an iOS App Store benchmark app (updated to have an ARM64 slice) on an iPhone 5S to a run of that same app… on an iPhone 5. Facepalm. As if the processor and SoC designers at Apple had been unable to work on anything other than implementing ARMv8 in the past year. As a result, what was reported as ARM64 improvements was in fact mostly the effect of unrelated improvements: better caches, faster memory, improved microarchitecture, etc. Honestly, I’ve been running micro-benchmarks of my own devising on my iPhone 5S, and as far as I can tell the “Cyclone” processor core of the A7 is smoking hot (so to speak), including when running 32-bit ARMv7 code, so completely independently of ARMv8 and ARM64. The Swift core of the A6 was already impressive for a first public release, but Cyclone knocks my socks off; my hat is off to the Apple semiconductor architecture guys.
Third Candidate: Opportunistic Apple Changes
Mike Ash talked about those, and I have not attempted to measure their effect, so I will defer to him. I will just comment that, to an extent, these improvements can also be seen as Apple being stuck on ARM/A32 with inefficiencies they cannot get rid of for application-compatibility reasons, and having found a way to avoid those inefficiencies in the first place, but only on ARM64 (and therefore, ARM64 is better ;). I mean, we in the Apple development community are quick to point and laugh at Windows, saddled with a huge number of application-compatibility hacks and a culture of fear of breaking anything, which caused that OS to become prematurely fossilized (and I’m guilty as charged too), and I think it’s only fair that we don’t blindly give Apple a pass on the same things.
So, remember the non-fragile Objective-C ABI? The iPhone has always had it (though we did not necessarily realize it, as the Simulator did not have it at first), so why can’t Apple use it to add an inline retain count to NSObject in 32-bit ARM? I’m willing to bet that for such a fundamental Cocoa object, indirect effects start playing a role in even such an ostensibly simple change; I’m sure, for instance, that some shipping iOS apps allocate an enormous number of small objects, and would therefore run out of memory if Apple added 4 bytes of inline retain count to each NSObject, and therefore to each such object. Mind you, the non-fragile ABI has likely been useful elsewhere, on derivable Apple classes that are under less app-compatibility pressure, but it was not enough to solve the inline retain count problem by itself.
Verdict: certainly a win, but should we give credit to ARM64 itself?
Fourth Candidate: the Floating-Point ABI
This one is a fun one. It could also be considered an inefficiency Apple is stuck with on 32-bit ARM, but since ARM touts the hard-float ABI as a feature of ARM64, I’m willing to cut Apple some slack here.
When the iPhone SDK was released, all iOS devices had an ARM11 processor, which supported floating-point in hardware; however, Apple allowed, and even set as the default, Thumb for iOS apps, and Thumb on the ARM11 could not access the floating-point hardware, not even to put a floating-point value in a floating-point register. So that Thumb code could call the APIs, the APIs had to take their floating-point parameters in general-purpose registers and return their floating-point result in a general-purpose register; and in fact all function calls, including those between ARM functions, had to behave that way, because any function could potentially be called from Thumb code. This is called the soft-float ABI. And when, with the iPhone 3GS, Thumb gained the ability to use the floating-point hardware, it had to remain compatible with existing code, and therefore still had to pass floating-point parameters in general-purpose registers. Today on 32-bit ARM, parameters are still passed that way for compatibility with the original usages.
This can represent a performance penalty, as transferring from the floating-point register file to the general-purpose register file is typically expensive (and sometimes the converse, too). The cost is often small compared to the time the called function takes to execute, but not always. ARM64 does not allow such a soft-float ABI, and I wanted to see if I could make the overhead visible, and then whether switching to ARM64 would eliminate it.
I created a small function that adds successively 1.0f, 2.0f, 3.0f, etc., up to (1<<20)*1.0f, to a single-precision floating-point accumulator that starts at 0.0, and another function which does the same thing except it calls another function to perform the addition, through a function pointer to decrease the risk of it being inlined. Then I compiled the code for ARMv7, ran it on the iPhone 5S, and measured and compared the time taken by each function; then did the same with the code compiled for ARM64. Here are the results:
[Table: time taken by direct addition vs. addition by function call, for the ARMv7 and ARM64 builds]
Yup, we have managed to make the effect observable and to evaluate the penalty, and when switching to ARM64 the penalty is decimated; there is some overhead left, probably that of the function call itself, which would be negligible in a real situation.
Of course, this was a contrived example designed to isolate the effect, where the least possible work is done for each forced function call, so the effect won’t be so obvious in real code, but it’s worth trying to build for ARM64 and see if improvements can be seen in code that passes around floating-point values.
Verdict: A win, possibly a nice win, in the right situations.
Non-Candidate: Apple-Enforced Limitations
Ah, yes, I have to mention this before I go. Remember when I mentioned that some ARMv8 features are theoretically available in 32-bit mode? That’s because Apple won’t let you use them in practice: you cannot create a program slice for 32-bit ARM that will only run on Cyclone and later (with the other devices using another slice). It is simply impossible (and I have even tried dirty tricks to get there, to no avail). If you want to take advantage of ARMv8 features like the AES and SHA accelerator instructions, you have to port your code to ARM64. Period. So sayeth Apple.
That means, for instance, that Anand’s comparison on the same hardware of two versions of Geekbench (the one just before and the one just after the addition of ARM64 support), while clever and the best he could do, is not really a fair comparison of ARM64v8 and ARM32v8, but in fact a comparison between ARM64v8 and ARMv7. When you remove the impressive AES and SHA1 advantages from the comparison table, you end up with something that may be fair, though it’s still hard to know for sure.
So these Apple-enforced limitations end up making us conflate the improvement from ARM32v8 to ARM64v8 with the jump from ARMv7 to ARM64v8. Mind you, Apple (and ARM, to an extent) gets full credit for both, but it is important to realize which one we are talking about.
Plus, just a few paragraphs earlier I was chastising Apple for its constraining legacy, and what Apple did here, by shipping not only a full 64-bit ARMv8 processor but also, immediately, a full 64-bit OS, is to say: No. Stop trying to split hairs and use these new processor features while staying 32-bit. Go ARM64, or bust. You have no excuse: the 64-bit environment was available at the same time as the new processor. And that way, there is one less “ARM32v8” slice type in the wild, so one less legacy to worry about.
Well… no. I’m not going to conclude one way or the other: neither that the 64-bit aspect of the iPhone 5S is a marketing gimmick, nor that it is everything Apple implied it would be. I won’t enter that game. Because what I do see here is the result of awesome work on the processor side, the OS side, and the toolchain side at Apple to seamlessly get us a full 64-bit ARM environment (with 32-bit compatibility) all at once, without us having to second-guess the next increment. The shorter the transition, the better off we’ll all be, and Apple couldn’t have made it shorter. For comparison, on the Mac the first 64-bit machine, the Power Mac G5, shipped in 2003; and while Leopard ostensibly added support for 64-bit graphical applications, iTunes did not actually run as a 64-bit process until Lion, released in 2011. So overall the same transition took 8 years on the Mac. ARM64 is the future, and the iPhone 5S + iOS 7 is a clear investment in that future. Plus, with the iPhone 5S I was able to tinker on ARMv8 way ahead of when I thought I would be able to, so I must thank Apple for that.