I, for one, welcome our new, more inclusive Apple

In case you have not been following closely, at this year’s WWDC Apple introduced a number of technologies that reverse many long-standing policies on what iOS apps were, or to be more accurate, were not allowed to do: technologies such as app extensions, third-party keyboards, Touch ID APIs, manual camera controls, CloudKit, or simply the ability to sell apps in bundles on the iOS App Store. I would be remiss if I did not mention a few of my pet peeves that apparently remain unaddressed, such as searching inside third-party apps from the iOS Springboard, real support for trials on the iOS App Store and the Mac App Store (more on that in a later post), any way to distribute iOS apps outside the iOS App Store, or the fact that many of the changes in Mac OS X Yosemite are either better integration with iOS, or Lion and Mountain Lion-style “iOS-ification”, both of which would be better solved by transitioning the Mac to iOS, etc.

But in the end, the attitude change from Apple matters more than the specifics of what will come in iOS 8. And it was (as Matt Drance wrote) not just the announcements themselves: for instance with the video shown at the start of the keynote, where iPhone and iPad users praise apps and the developers who made them, Apple wants us to know that they care for us developers and want us to succeed, which is a welcome change from the lack of visible consideration shown to developers so far (with the limitation that this video frames the situation as developers directly providing their wares to users: don’t expect any change to how Apple sees middleware suppliers).

So I welcome this attitude change from Apple, and like Matt Drance, I am glad this seems to be coming from a place of confidence rather than concession (indeed, while the Google Play Store is much more inclusive1, the limited willingness of Android users to pay for apps means Apple probably does not feel much pressure in this area), which means that it’s likely only the first step: what we did not get at this WWDC, we can always hope to get in iOS 9, and at least the situation evolves in the right direction. I do not know where this change of heart comes from; I do not think any obvious event triggered it; I am just thankful that the Powers That Be at Apple decided to be pragmatic and cling less tightly to principles that, while potentially justified five years ago, were these days holding back the platform.

A caveat, though, is that I see one case where a new iOS 8 functionality, rather than giving me hope for the future, will actually hamper future improvements: iCloud Drive. While that feature may appear to address one of my longstanding pet peeves, anyone who thinks we were clamoring for merely a return to the traditional files and folders organization hasn’t really read what I or others have written on the matter; but this is exactly what iCloud Drive proposes (even if only documents are present in there, and even if it only concerns the files shared between different iOS apps, we expected better than that). Besides not improving on the current desktop status quo, the issue is that shipping it as such will create compatibility constraints (both from a user interface and API standpoint) which will make it hard for Apple to improve on it in the future, whereas Apple could have taken advantage of its experience and of the hindsight coming from having been without that feature for all this time to propose a better fundamental organization paradigm.

For instance, off the top of my head I can think of two ways to improve the experience of working on the same document from different apps:

  • Instead of (or on top of) “open in…”, have “also open in…”, which would also work by selecting an app among the ones supporting that document type. After that command, the document would appear in a specific section of the document picker of the first app, section which would be marked with the icon of the second app: in other words, this section would contain all documents shared between the first and second app. The same would go for the second app: the shared document would appear in a section marked with the icon of the first app. That way some sort of intuitive organization would be automatically provided. A document shared between more than two apps could appear in two sections at the same time, or could be put in the area where documents are available to all apps.
  • Introduce see-through folders. A paradox of hierarchical filing is that, as you start creating folders to organize your documents so as to more easily find something in the future, you may make documents harder to locate because they become “hidden” behind a folder. With see-through folders, any folder you create would start out as just a roundrect drawn around the documents it contains (say up to 4 contained documents), with the documents still being visible in their full size from the top level view, except there would be this roundrect around them. Then as the folder starts containing more and more documents, these documents would appear smaller and smaller from the top level view, so in practice you would have to “focus” on the folder by tapping on the folder name, so as to list only the documents contained in that folder, in full size, in order to select one document. When you have more than one level of folders, this would allow quickly scanning subfolders that contain only a few documents, since these documents would appear at full size when browsing the parent of these subfolders, so the document could either quickly be found in there, or else we would know it is in the “big” subfolder.

There are of course many other ways this could be improved, such as document tagging, or other metadata-based innovations. There are so many ways hierarchical document storage could be improved that Apple announcing they would merely go with pretty much the status quo for multi-app document collaboration tells me that in all likelihood no one who matters at Apple really cares about document management, which I find sad: even if not all such concocted improvements are actually viable, there is bound to be some that are and that they could have used instead.

(As for Swift, it is a subject with a very different scope that is deserving of its own post.)

But overall, these new developments seen at WWDC 2014 make me optimistic for the future of the Apple platforms and Apple in general. Even if it is not necessarily everything we wanted, change always starts with first steps like these.


  1. “Open” implies a binary situation, where a platform would be either “open” or “closed”; but situations are clearly more nuanced, with a whole continuum of “openness” between different cases such as game consoles, the iOS platform, the Android platform, or Windows. So I refer to platforms as being “more inclusive” or “less inclusive”, which allows for a range of, well, inclusiveness, rather than use “open” and the absolutes it implies.

Porting to the NEON intrinsics from experience

Hey you. Yes, you. Did you, inspired by my introduction to NEON on iPhone, write ARM NEON code, or are you maintaining ARM NEON code in some way? Is this NEON code written as ARM32 assembly? If you answered yes to both questions, then I hope you realize that any app that has your NEON code as a dependency is currently unable to take advantage of ARM64 on supported hardware (now there may or may not be any real benefit for the app from doing so, but that is beside the point). ARM64, at the very least, is the future, so you will have to do something about that code so that it can run in ARM64 mode, but porting it to ARM64 assembly is not going to be straightforward, as the structure of the NEON register file has changed in ARM64 mode. Rather, I propose here porting your NEON ARM32 assembly algorithms to NEON intrinsics which can compile to both ARM32 and ARM64, and present here the outcome of my experience doing such a port, so that you can learn from it.

An introduction to the ARM NEON intrinsic support

The good thing about ARM NEON intrinsics is that they apply equally well in ARM32 and ARM64 mode; in fact, you don’t have to follow any specific rule to support both with the same intrinsics source file: correct NEON intrinsics code that works on ARM32 will also work on ARM64 for free. At the most fundamental level, NEON intrinsics code is simply a C source file that includes <arm_neon.h> and uses a number of specific functions and types. The documentation for the ARM NEON intrinsics can be found here, on the ARM Information Center. This documentation ostensibly covers ARM DS-5, but in fact for iOS clang implements the same support; if you target other platforms in addition to or instead of iOS, you will have to check your toolchain compiler documentation, but if it supports any ARM NEON intrinsics at all it ought to have the same support as ARM DS-5.
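To make this concrete, here is what a minimal intrinsics source file can look like (a sketch of my own, not code from the port discussed below):

#include <arm_neon.h>

/* Element-wise x*x + y*y for four 2-D vectors at once; the exact same source
   compiles unchanged for both ARM32 and ARM64. */
static inline float32x4_t squared_magnitude(float32x4_t x, float32x4_t y)
{
    return vmlaq_f32(vmulq_f32(x, x), y, y);   /* (x*x) + y*y */
}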

Unfortunately, this document pretty much only documents the intrinsic function names and the types: for documentation on the operations these functions perform, it is still necessary to refer to the NEON instructions descriptions in the ARM instruction set document (don’t worry about the “registered ARM customers” part, you only need to create an account and agree to a license in order to download the PDF); furthermore, most material online (including my introduction to NEON on iPhone, if you need to get up to speed with NEON) will discuss NEON in terms of the instruction names rather than in terms of the C intrinsics, so it is a good idea to get used to locating the intrinsic function that corresponds to a given instruction; the most straightforward way is to open arm_neon.h in Xcode (add it as an include, compile once to refresh the index, then open it as one of this file’s includes in the “Related Files” menu), and just do a search for the instruction name: this will turn up the different intrinsic function variants that implement the instruction’s functionality, as the intrinsic function name is based on the instruction name. There is a tricky situation, however: for some instructions there is no matching intrinsic; these cases are documented here, along with what you should do to get the equivalent functionality.

The converse also exists, where some intrinsics provide a functionality not provided by a particular instruction, or where the name does not match any instruction, such as:

In particular, the last two are what you will use in place of the parts of your ARM32 NEON algorithm where you would put results in, say, d6 and d7, and then the next operation would use q3, which is aliased to these two D registers. Indeed, it is important to realize (in particular if you are coming from NEON assembly coding) that these intrinsics work functionally, rather than procedurally over a register file; notably, the input variables are never modified. So stop worrying about placement and just write your NEON intrinsic code in functional fashion: factor_vec = vrsqrteq_f32(vmlaq_f32(vmulq_f32(x_vec, x_vec), y_vec, y_vec)); (assuming the initial reciprocal square root estimate is enough for your purposes). Things should come naturally once you integrate this way of thinking.

Variables should be reserved for results that you want to use more than once. Those need to be typed correctly, as the whole system is typed, with such fun variable type names as uint8x16_t; this explains the various vcombine_tnn variants, from vcombine_s8 to vcombine_p16, which in fact all come down to the same thing: the sole purpose of the variants is to preserve the correct element typing between the inputs and the output. I personally welcome the discipline: even if you think you know what you are doing, it’s so easy to get it subtly wrong in the middle of your algorithm, and you are left wondering at the end where you wrongly took a left turn (it was at Albuquerque. It is always at Albuquerque).
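To illustrate with a sketch of my own (not code from the port): where the ARM32 assembly would write two results to d6 and d7 and then read them back as q3, the intrinsics version names the two halves and combines them explicitly, with the types keeping the element format consistent throughout:

#include <arm_neon.h>

static inline uint16x8_t average_pairs(uint16x4_t a_lo, uint16x4_t b_lo,
                                       uint16x4_t a_hi, uint16x4_t b_hi)
{
    uint16x4_t lo_avg = vrhadd_u16(a_lo, b_lo);  /* rounding halving add, D-sized result */
    uint16x4_t hi_avg = vrhadd_u16(a_hi, b_hi);
    return vcombine_u16(lo_avg, hi_avg);         /* the “results in d6 and d7, then use q3” of assembly */
}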

Less pleasant to use are the types that represent an array of vectors, of the form uint8x16x4_t for instance. Indeed, some intrinsics rely on these types, such as the transposition ones, but also the deinterleaving loads and stores vld#/vst# (I presented them in my introduction to NEON on iPhone), which are just as indispensable when using intrinsics as they are when programming in assembly. So when using these intrinsics you have to contend with variables that represent multiple vectors at once (and that you of course cannot directly use as the input of another intrinsic); fortunately, taking the individual vectors out of those (for further calculations) is done using normal C array subscripting: coords_vec_arr.val[1], but this makes expressions less straightforward and elegant than they could otherwise have been.
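For instance (a sketch of my own, assuming an RGBA byte layout), deinterleaving 16 pixels with vld4q_u8 gives a uint8x16x4_t whose individual planes are reached with the .val[] subscript:

#include <arm_neon.h>
#include <stdint.h>

static void swap_red_blue(uint8_t *pixels)   /* 16 RGBA pixels, 64 bytes */
{
    uint8x16x4_t planes = vld4q_u8(pixels);  /* .val[0]=R, .val[1]=G, .val[2]=B, .val[3]=A */
    uint8x16_t red = planes.val[0];
    planes.val[0] = planes.val[2];           /* swap the R and B planes */
    planes.val[2] = red;
    vst4q_u8(pixels, planes);                /* re-interleave on the way out */
}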

Note that loading and storing vectors to memory without deinterleaving is not performed with an intrinsic, but simply by casting the pointer (typically one to your input/output element arrays) to a pointer to the correct vector type, and dereferencing that; this will result in the correct vector loads and stores being generated.
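For example (again a sketch of my own), scaling an array of floats in place needs no load or store intrinsic at all:

#include <arm_neon.h>
#include <stddef.h>

static void scale_in_place(float *data, size_t count)   /* count assumed to be a multiple of 4 */
{
    const float32x4_t factor = vdupq_n_f32(0.5f);
    for (size_t i = 0; i < count; i += 4)
    {
        float32x4_t v = *(float32x4_t *)(data + i);         /* plain vector load */
        *(float32x4_t *)(data + i) = vmulq_f32(v, factor);  /* plain vector store */
    }
}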

In practice

I am not going to share the code I ported or the actual benchmark results, but I can share the experience of porting a non-trivial NEON algorithm from ARM32 assembly to NEON intrinsics.

First, if the assembly code is competently commented (in particular with a clear register map), porting it is just a matter of following the algorithm’s main sequence and is rather straightforward, translating instructions one by one, with the addition of the occasional vcombine when two D vectors become a Q vector; your activity will mostly consist of finding the correct name for the intrinsic function for the given input element type, and finding variable names for these previously unnamed intermediate results (again, for intermediate results which are only used once, save yourself the trouble of defining a variable and directly use the intrinsic output as the input for the next intrinsic). This was completed quickly.

But this is only the start. The next order of business is running the original algorithm and the new one on test inputs, and comparing the results. For integer-only algorithms such as the one I ported, the results must match bit for bit between the original algorithm, the new one compiled as ARM32, and the new one compiled as ARM64; in my case they did. For algorithms that involve floating-point calculations they might not match bit for bit because of the different rounding control in ARM64, so compare within a tolerance that is appropriate for your purposes.

Once this check is done, you might wish to take a look at the assembly code generated from your intrinsics. In my case I discovered the ARM32 compiled version needed more vector storage than there are available registers, and as a result was performing various extra vector loads and stores from memory at various points in the algorithm. The reason for this is that the automatic register allocation clang performed (at least in this case) just could not compare with my elaborate work in the original ARM32 NEON assembly code to tightly squeeze the necessary work data so as to never take more than 12 Q vectors at any given time (even avoiding the use of q4-q7, saving the trouble of having to preserve and restore them); also, it appears that, with clang, the intrinsics that use a scalar as one input do not actually generate the scalar-using instruction, but instead require the scalar to be splat onto a vector register, harming register usage.

I have not been able to improve the situation by changing the way the intrinsic code was written; it seems it is the compiler which is going to have to improve. However, the ARM64 compiled version had no need for temporary storage beyond the NEON registers: twice as many vector registers are available in this mode, easing the pressure on the compiler register allocator.

But in the end what really matters is the actual performance of the code, so even if you take a look at the compiled code it is only by benchmarking the code (again, comparing between the original algorithm, the new version compiled as ARM32, and the new version compiled as ARM64) that you can reasonably decide which improvements are necessary. Don’t skimp on that part, you could be surprised. In my case, it turned out that the “inefficient”, ARM32 compiled version of the ported algorithm performed just as well as the original NEON ARM32 assembly. The probable reason is that my algorithm (and likely yours too) is in fact memory bandwidth constrained, and taking more time to perform the computations does not really matter when you then have to wait for the memory transfers to or from the level 3 cache or main memory to complete anyway.

As a result, in my case I could just replace the original algorithm by the new one without any performance regression. But that might not always be the case; if replacing it would result in a performance regression, one course of action would be to keep using the original NEON assembly version in ARM32 mode, and use the new intrinsic-based algorithm only in ARM64 mode, using conditional compilation to select which code is used in each mode (I have a preprocessor macro defined for this purpose in the Xcode build settings, whose value depends on an architecture-dependent build setting). Fortunately, given the number of NEON registers available in ARM64, you should never see a performance regression on ARM64 capable hardware between the original ARM32 NEON assembly algorithm and the new one compiled as ARM64.
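As a sketch of that conditional compilation (the macro and function names here are hypothetical; in my setup the macro value actually comes from an architecture-dependent Xcode build setting, but testing the compiler-defined __arm64__ achieves the same selection):

#include <stddef.h>
#include <stdint.h>

#if defined(__arm64__)
    #define USE_NEON_INTRINSICS 1   /* ARM64: use the intrinsics-based port */
#else
    #define USE_NEON_INTRINSICS 0   /* ARM32: keep the hand-written NEON assembly */
#endif

extern void MyAlgorithmIntrinsics(const uint8_t *input, uint8_t *output, size_t length);
extern void MyAlgorithmARM32Assembly(const uint8_t *input, uint8_t *output, size_t length);

void MyAlgorithm(const uint8_t *input, uint8_t *output, size_t length)
{
#if USE_NEON_INTRINSICS
    MyAlgorithmIntrinsics(input, output, length);
#else
    MyAlgorithmARM32Assembly(input, output, length);
#endif
}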

It worked

So your mileage may vary, certainly. But in my experience porting a NEON algorithm from ARM32 assembly to C intrinsics gave an adequate result, and was a quick and straightforward process, while writing an ARM64 assembly version would have been much more time consuming and would have required maintaining both versions in the future. And remember, no app that depends on your NEON algorithms can ship as a 64-bit capable app as long as you only have an ARM32 assembly version of these algorithms; if they haven’t been ported already, by now you’d better get started.

By the way, I should mention that today I also updated Introduction to NEON on iPhone and A few things iOS developers ought to know about the ARM architecture to take into account ARM64 and the changes it introduces; please have a look.

Besides fused multiply-add, what is the point of ARMv7s?

This post is part of a series on why an iOS developer would want to add a new architecture target to his iOS app project. The posts in this series so far cover ARMv7, ARMv7s (this one), ARM64 (soon!).

You probably remember the kerfuffle when, at the same time the iPhone 5 was announced (it was not even shipping yet), Apple added ARMv7s as a default architecture in Xcode without warning. But just what is it that ARMv7s brings, and why would you want to take advantage of it?

One thing that ARMv7s definitely brings is support for the VFPv4 and VFPv3 half-precision extensions, which consist of the following: fused floating-point multiply-add, and half-precision floating-point values (only for converting to and from single precision, no other operation supports the half-precision format), as well as the vector versions of these operations. Both of these have potential applications, even if they are not universally useful, and therefore it was indispensable for Apple to define an ARM architecture version so that apps could make use of them in practice if they desired: had Apple not defined ARMv7s, even though the iPhone 5 hardware would have been able to run these instructions, no one could have used them in practice, as there would have been no way to safely include them in an iOS app (that is, in a way that does not cause the app to crash when run on earlier devices).

So we have determined that it was necessary for Apple to define ARMv7s, or this new functionality of the iPhone 5 processor would have been added for nothing, got it. But what if you are not taking advantage of these new floating-point instructions? It is important to realize that you are not taking advantage of these new floating-point instructions unless you know full well that you do: indeed, the compiler will never generate these instructions, so the only way to benefit from this functionality is if your project includes algorithms that were specifically developed to take advantage of these instructions. And if that is not actually the case, then as far as I can tell using ARMv7s… is simply pointless. That is, there is no tangible benefit.
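For reference, deliberately taking advantage of fused multiply-add looks something like this (a sketch of my own; vfmaq_f32 only exists when building for an architecture with VFPv4, so the guard keeps a plain ARMv7 build compiling with the older, separately-rounded multiply-add):

#include <arm_neon.h>

static inline float32x4_t multiply_add(float32x4_t acc, float32x4_t a, float32x4_t b)
{
#if defined(__ARM_FEATURE_FMA)
    return vfmaq_f32(acc, a, b);   /* fused: acc + a*b with a single rounding */
#else
    return vmlaq_f32(acc, a, b);   /* multiply then add: two roundings */
#endif
}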

First, let us remember that adding an ARMv7s slice will almost double the executable binary size compared to shipping a binary with only ARMv7, which may or may not be a significant cost depending on whether other data (art assets, other resources) already dominates the overall app size, but remains something to pay attention to. So already the decision to include an ARMv7s slice starts in the red.

Go forth and divide

So let us see what other benefits we can find. The other major improvement of ARMv7s is integer division in hardware. So let us try and see how much it improves things.

int ZPDivisions(void* context)
{
    uint32_t i, accum = 0;
    uint32_t iterations = *((uint32_t*)context);
    
    for (i = 0; i < 4*iterations; i+=1)
    {
        accum += (4*iterations)/(i+1);
    }
    
    return accum;
}

OK, let us measure how much time it takes to execute (iterations equals 1000000, running on an iPhone 5S, averaged over three runs):

              ARMv7       ARMv7s
divisions     24.951 ms   25.028 ms

…No difference. That can’t be?! The ARMv7 version includes this call to __udivsi3, which should be slower; let us see in the debugger what happens when it is called:

libcompiler_rt.dylib`__udivsi3:
0x3b767740:  tst    r1, r1
0x3b767744:  beq    0x3b767750                ; __udivsi3 + 16
0x3b767748:  udiv   r0, r0, r1
0x3b76774c:  bx     lr
0x3b767750:  mov    r0, #0
0x3b767754:  bx     lr

D’oh! When run on an ARMv7s device, this runtime function simply uses the hardware instruction to perform the division. Indeed, on other platforms such a function may be provided in a library statically linked to your app, whose code is then frozen. But that is not how Apple rolls. On iOS, even such a seemingly trivial functionality is actually provided by the OS, and the call to __udivsi3 is a dynamic library call which is actually resolved at runtime, and uses the most efficient implementation for the hardware, even if your binary only has ARMv7. In other words, on devices which would run your ARMv7s slice, you already benefit from the hardware integer division even without providing an ARMv7s slice.

Take 2

But wait: surely this dynamic library function call has some overhead compared to directly using the instruction; could we not reveal it by improving the test? We need to go deeper. Let’s find out, by performing four divisions during each loop iteration, which will reduce the looping overhead:

int ZPUnrolledDivisions(void* context)
{
    uint32_t i, accum = 0;
    uint32_t iterations = *((uint32_t*)context);
    
    for (i = 0; i < 4*iterations; i+=4)
    {
        accum += (4*iterations)/(i+1) + (4*iterations)/(i+2) + (4*iterations)/(i+3) + (4*iterations)/(i+4);
    }
    
    return accum;
}

And the results are… drum roll… (iterations equals 1000000, running on an iPhone 5S, averaged over three runs):

                      ARMv7       ARMv7s
unrolled divisions    24.832 ms   24.930 ms

“Come on now, you’re messing with me here, right?” Nope. There is actually a simple explanation for this: even in hardware, integer division is very expensive. A good rule of thumb for the respective costs of execution for the elementary mathematical operations on integers is this:

operation (on 32-bit integers)    +    -    ×         ÷
cost in cycles                    1    1    3 or 4    20+

This approximation remains valid across many different processors. And the amortized cost of a dynamic library function call is pretty low (only slightly more than a regular function call), so it is dwarfed by the execution time of the division instruction itself.

Take 3

I had one last idea of where we could actually look to observe a penalty when function calls are involved. We need to go even deeper: having these calls to __udivsi3 forces the compiler to put the input variables into the same hardware registers before each division, so the processor is not going to be able to run the divisions in parallel; so let us modify the code so that, in ARMv7s, the divisions could actually run in parallel:

int ZPParallelDivisions(void* context)
{
    uint32_t i, accum1 = 0, accum2 = 0, accum3 = 0, accum4 = 0;
    uint32_t iterations = *((uint32_t*)context);
    
    for (i = 0; i < 4*iterations; i+=4)
    {
        accum1 += (4*iterations)/(i+1);
        accum2 += (4*iterations)/(i+2);
        accum3 += (4*iterations)/(i+3);
        accum4 += (4*iterations)/(i+4);
    }
    
    return accum1 + accum2 + accum3 + accum4;
}

(iterations equals 1000000, running on an iPhone 5S, averaged over three runs):

                      ARMv7       ARMv7s
parallel divisions    25.353 ms   24.977 ms

…I give up (the difference has no statistical significance).

There might be other benefits to avoiding a function call for each integer division, such as the compiler not needing to consider the values stored in caller-saved registers as being lost across the call, but honestly I do not see these effects as having any measurable impact on real-world code.

If you want to reproduce my results, you can get the source for these tests on Bitbucket.

What else?

We have already looked pretty far in trying to find benefits in directly using the new integer division instruction, what if we set that aside for now and try and see what else ARMv7s brings? Technically, nothing else: ARMv7s brings VFPv4 and VFPv3-HP, their vector counterparts, integer division in hardware, and that’s it for unprivileged instructions as far as anyone can tell.

However, when compiling an ARMv7s slice, Clang will apparently take advantage of this to optimize the code specifically for the Swift core (according to these patches, via Stack Overflow). These optimizations are of the tuning variety, so do not expect that much from them, but the main limitation with those is that not that many iOS devices run on Swift, in the grand scheme of things. If you check the awesome iOS Support Matrix (ARMv7s column), you will see for instance that no iPod Touch model runs it, and that the iPad mini skipped it entirely (going directly from ARMv7 to ARM64). So is it worth optimizing specifically for the 4th generation iPad, the iPhone 5, and the iPhone 5C? Maybe not.

What compiling for ARMv7s won’t bring you

And now it’s time for our regular segment, “let us dispel some misconceptions about what a new ARM architecture version really brings”. Adding an ARMv7s slice will not:

  • make your code run more efficiently on ARMv7 devices, since those will still be running the ARMv7 compiled code; this means it could potentially improve your code only on devices where your app already runs faster.
  • improve performance of the Apple frameworks and libraries: those are already optimized for the device they are running on, even if your code is compiled only for ARMv7 (we saw this effect earlier with __udivsi3).
  • change how your code behaves in the few cases where ARMv7s devices run code less efficiently than ARMv7 ones; this will happen on these devices even if you only compile for ARMv7, so adding (or replacing by) an ARMv7s slice will not help or hurt this in any way.
  • make third-party dependencies more efficient: if a dependency ships libraries that provide only an ARMv7 slice (you can check with otool -vf <library name>), the code of this dependency won’t become more efficient if you compile for ARMv7s (if they do provide an ARMv7s slice, compiling for ARMv7s will allow you to use it, maybe making it more efficient).

I need you

Seems clear-cut, right? Not so fast. Sure, we explored some places where we thought direct use of hardware integer division could have improved things, but maybe there are actual improvements in places I did not explore, places with a more complex mix between integer division and other operations. Maybe tuning for Swift does improve things for Cyclone too (which is represented by the ARM64 column devices in iOS Support Matrix), and maybe it could be worth it. Maybe I am wrong and Clang can take advantage of fused multiply-add without you needing to do a thing about it. Maybe I completely missed some other instructions that ARMv7s brings.

And most of all, I have not actually run any real benchmark here, for one good reason: I have little idea of the kind of algorithms iOS apps spend significant CPU time on (outside of the frameworks), so I do not know what kind of benchmark to run in the first place (as for Geekbench, I do not think it really represents tasks commonly done in iOS apps, and in addition I am wary of CPU benchmarks I cannot see the source code of). A good benchmark would keep us from missing the forest for the trees, in case that is what is happening here.

So I need you. I need you to run your app with, and without, an ARMv7s slice on a Swift device (as well as a Cyclone device, if you are so inclined), and report the outcome (such as increased performance or decreased processor usage; the latter is important for battery life). Failing that, I need you to tell me the improvements you remember seeing on Swift devices when you added an ARMv7s slice, or what the conclusions of your evaluation of adding an ARMv7s slice were, at least what you can share of it. I need you to tell me if I missed something.

And that is why I am exceptionally going to allow comments on this post. In fact, they should appear immediately without having to go through moderation, in order to facilitate the conversation. But first, be wary of Akismet: if your comment is flagged as spam, try and rework it a bit and post again. Second, comments with nothing to do with the matter at hand will be subject to instant vaporization.

So have at it:

What benefits does the iPhone 5S get from being 64-bit?

Every fall, a new iPhone. The schedule of Apple’s mobile hardware releases has become pretty predictable by now, but they more than make up for this timing predictability by the sheer unpredictability of what they introduce each year, and this fall they outdid themselves. Between Touch ID and the M7 coprocessor, the iPhone 5S had plenty of surprises, but what most intrigued and excited many people in the development community was its new, 64-bit processor. But many have openly wondered what the point of this feature was, exactly, including some of the same people excited by it; there is no contradiction in that. So I set out to collect answers, and here is what I found.

Before we begin, I strongly recommend you read Friday Q&A 2013-09-27: ARM64 and You, then get the ARMv8-A Reference Manual (don’t worry about the “registered ARM customers” part, you only need to create an account and agree to a license in order to download the PDF): have it on hand to refer to whenever I mention an instruction or architectural feature.

Some Context

iOS devices have always been based on ARM architecture processors. So far, ARM processors have been strictly 32-bit machines: 32-bit general registers, addresses, most calculations, etc.; but in 2011, ARM Holdings announced ARMv8, the first version of the ARM architecture to allow ARM native 64-bit processing and in particular 64-bit addresses in programs. It was clearly, and by their own admission, done quite ahead of the time when it would actually be needed, so that the whole ecosystem would have time to adopt it (board vendors, debug tools, ISVs, open-source projects, etc.), and in fact ARM did not announce at the time any of their own processor implementations of the new architecture (which they also do ahead of time), leaving to some of their partners, server processor ones in particular, the honor of releasing the first ARMv8 processor implementations. I’m not sure any device using an ARMv8 processor design from ARM Holdings has even shipped yet.

All that to say that while many people knew about 64-bit ARM for some time, almost no one expected Apple to release an ARMv8-based consumer device so ahead of when everyone thought such a thing would be actually needed; Apple was not merely first to market with an ARMv8 handheld, but lapped every other handset maker and their processor suppliers in that regard. But that naturally raises the question of what exactly Apple gets from having a 64-bit ARM processor in the iPhone 5S, since after all none of its competitors or the suppliers of these saw what benefit would justify rushing to ARMv8 as Apple did. And this is a legitimate question, so let us see what benefits we can identify.

First Candidate: a Larger Address Space

Let us start with the obvious: the ability for a single program to address more than 4 GB of address space. Or rather, let us start by killing the notion that this is only about devices with more than 4 GB of physical RAM. You do not, absolutely not, need a 64-bit processor to build a device with more than 4 GB of RAM; for instance, the Cortex A15, which implements ARMv7-A with a few extensions, is able to use up to 1 TB of physical RAM, even though it is a decidedly 32-bit processor. What is true is that, with a 32-bit processor, no single program is able to use more than 4 GB of that RAM at once, so while such an arrangement is very useful for server or desktop multitasking scenarios, its benefits are more limited on a mobile device, where you don’t typically expect background programs to keep using a lot of memory. So handsets and tablets will likely need to go with a 64-bit processor when they start packing more than 4 GB of RAM, so that the frontmost program can actually make use of that RAM.

However, that does not mean the benefits of a large virtual address space are limited to that situation. Indeed, using virtual memory an iOS program can very well use large datasets that do not actually fit in RAM, using mmap(2) to map the files containing this data into virtual memory, and letting the virtual memory subsystem handle the RAM as a cache. Currently, it is problematic on iOS to map files more than a few hundred megabytes in size, because the program address space is limited to 4 GB (and what is left is not necessarily in a big contiguous chunk you can map a single file in).
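As a sketch of that technique (my own illustration, with error handling trimmed to the essentials):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const void *MapDataset(const char *path, size_t *outSize)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    
    struct stat info;
    fstat(fd, &info);
    *outSize = (size_t)info.st_size;
    
    /* the kernel pages the data in on demand and uses RAM as a cache */
    void *base = mmap(NULL, *outSize, PROT_READ, MAP_FILE | MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping keeps its own reference to the file */
    
    return (base == MAP_FAILED) ? NULL : base;
}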

That being said, in my opinion the usefulness of being able to map gigabyte-scale files on 64-bit iOS devices will currently be limited to a few niche applications, if only because, while these files won’t necessarily have to fit in RAM, they will have to fit on the device Flash storage in the first place, and with the largest capacity you can get on an iOS device at the time of this writing being 128 GB, and most people settling for less, you’d better not have too many such applications installed at once. Still, for those applications which need it, likely in vertical markets mostly, the sole feature of a larger address space means 64-bit ARM is a godsend.

One sometimes heard benefit of Apple already pushing ordinary applications (those which don’t map big files) to adopt 64-bit ARM is that, the day Apple releases a device featuring more than 4 GB of RAM, these applications will benefit without needing to be updated. However, I don’t buy it. It is in the best interest of iOS applications not to spontaneously occupy too much memory, for instance so that they do not get killed first when they are in the background, so in order to take advantage of an iOS device with more RAM than you can shake a stick at, they would have to change behavior and be updated anyway, so…

Verdict: inconclusive for most apps, a godsend for some niche apps.

Second Candidate: the ARMv8 AArch64 A64 ARM64 Instruction Set

The ability of doing 64-bit processing on ARM comes as part of a new instruction set, called…

  • Well it’s not called ARMv8: not only does ARMv8 also bring improvements to the ARM and Thumb instruction sets (more on that later), which for the occasion have been renamed to A32 and T32, respectively; but also, ARMv9, whenever it comes out, will also feature 64-bit processing, so we can’t refer to 64-bit ARM as ARMv8.
  • AArch64 is the name of the execution mode of the processor where you can perform 64-bit processing, and it’s a mouthful so I won’t be using the term.
  • A64 is the name ARM Holdings gives to the new instruction set, a peer to A32 and T32; but this term does not make it clear we’re talking about ARM so I won’t be using it.
  • Apple in Xcode uses ARM64 to designate the new instruction set, so that’s the name we will be using.

The instruction set is quite a change from ARM/A32; for one, the instruction encodings are completely different, and a number of features, most notably pervasive conditional execution, have been dropped. On the other hand, you get access to 31 general purpose registers that are of course 64-bit wide (as opposed to the 14 general purpose registers which you get from ARM/A32, once you remove the PC and SP), some instructions can do more, you get access to much more 64-bit math, and what does remain supported has the same semantics as in ARM/A32 (for instance the classic NZCV flags are still here and behave the exact same way), so it’s not completely unfamiliar territory.

ARM64 could be the subject of an entire blog post, so let us stick to some highlights:

CRC32

Ah, let’s get that one out of the way first. You might have noticed this little guy in the ARMv8-A Reference Manual (page 449). Bad news, folks: support for this instruction is optional, and the iPhone 5S processor does not in fact support it (trust me, I tried). Maybe next time.

More Registers

ARM64 features 31 general purpose registers, as opposed to 14, which helps the compiler avoid running out of registers and having to “spill” variables to memory, which can be a costly operation in the middle of a loop. If you remember, the expansion from 8 to 16 registers for the x86 64-bit transition on the Mac back in the day measurably improved performance; however here the impact will be less, as 14 was already enough for most tasks, and ARM already had a register-based parameter passing convention in 32-bit mode. Your mileage will vary.

64-bit Math

While you could do some 64-bit operations in previous versions of the ARM architecture, this was limited and slow; in ARM64, 64-bit math is natively and efficiently supported, so programs using 64-bit math will definitely benefit from ARM64. But which programs are those? After all, most of the amounts a program ever needs to track do not go over a billion, and therefore fit in a 32-bit integer.

Besides some specialized tasks, two notable examples of tasks that do make use of 64-bit math are MP3 processing and most cryptography calculations. So those tasks will benefit from ARM64 on the iPhone 5S.

But on the other hand, all iPhones have always had dedicated MP3 processing hardware (which is relatively straightforward to use with Audio Queues and CoreAudio), which in particular is more power efficient to use than the main processor for this task, and ARMv8 also introduces dedicated AES and SHA accelerator instructions, which are theoretically available from ARM/A32 mode, so ARM64 was not necessary to improve those either.

But on the other other hand, there are other tasks that use 64-bit math and do not have dedicated accelerators. Moreover, standards evolve. Some will be phased out and others will appear, and a perfect example is the recently announced SHA-3 standard, based on Keccak. Such new standards generally take time to make their way into dedicated accelerators, and obviously such accelerators cannot be introduced to devices released before the standardization. But software has no such limitations, and it just so happens that Keccak benefits from a 64-bit processor, for instance. 64-bit math matters for emerging and future standards which could benefit from it, even if they are specialized enough to warrant their own dedicated hardware, as software will always be necessary to deploy them after the fact.
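As an example of the kind of operation involved (a sketch, not meant as a complete Keccak implementation), the 64-bit rotation at the heart of Keccak is a single instruction’s worth of work on ARM64, whereas on 32-bit ARM it has to juggle two 32-bit registers:

#include <stdint.h>

static inline uint64_t rotate_left_64(uint64_t x, unsigned n)
{
    n &= 63;                                  /* keep the shift amounts well-defined */
    return (x << n) | (x >> ((64 - n) & 63));
}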

NEON Improvements

ARMv8 also brings improvements to NEON, and while some of the new instructions are also theoretically available in 32-bit mode, such as VRINT, surprisingly some improvements are exclusive to ARM64: for instance the ability to operate on double-precision floating point data, and interesting new instructions to accumulate unsigned values to an accumulator which will saturate as a signed amount, and conversely (contrast with x86, where all vector extensions so far, including future ones like AVX-512, are equally available in 32-bit and 64-bit mode, even though for 32-bit mode this requires incredible contortions, given how saturated the 32-bit x86 instruction encoding map is). Moreover, in ARM64 the number of 128-bit vector registers increases from 16 to 32, which is much more useful than the similar increase in the number of general-purpose registers, as SIMD calculations typically involve many vectors. I will talk about this some more in a future update to my NEON post.
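As a small illustration of the double-precision point (a sketch of my own; more in the updated NEON post):

#include <arm_neon.h>

#if defined(__arm64__)
/* float64x2_t and the _f64 intrinsics only exist when building for ARM64 */
static inline float64x2_t accumulate_products(float64x2_t acc, float64x2_t a, float64x2_t b)
{
    return vfmaq_f64(acc, a, b);   /* fused multiply-add on two doubles at once */
}
#endif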

Pointer size doubling

It has to be mentioned: a tradeoff of ARM64 is that pointers are twice as large, taking more space in memory and the caches. Still, iOS programs are more media-heavy than pointer-heavy, so it shouldn’t be too bad (just make sure to monitor the effect when you start building for ARM64).

Verdict: a nice win on NEON; native 64-bit math is a plus for some specialized tasks and the future; other factors are inconclusive: in my limited testing I have not observed performance changes from just switching (non-NEON, non-64-bit math) code from ARMv7 compilation to ARM64 compilation and running it on the same hardware (iPhone 5S).

Non-Candidate: Unrelated Processor Implementation Improvements

Speaking of which. Among the many dubious “evaluations” on the web of the iPhone 5S 64-bit feature, I saw some try to isolate the effect of ARM64 by comparing the run of an iOS App Store benchmark app (that had been updated to have an ARM64 slice) on an iPhone 5S to a run of that same app… on an iPhone 5. Facepalm. As if the processor and SoC designers at Apple had not been able to work on anything other than implementing ARMv8 in the past year. As a result, what was reported as ARM64 improvements was in fact mostly the effect of unrelated improvements such as better caches, faster memory, improved microarchitecture, etc. Honestly, I’ve been running micro benchmarks of my own devising on my iPhone 5S, and as far as I can tell the “Cyclone” processor core of the A7 is smoking hot (so to speak), including when running 32-bit ARMv7 code, so completely independently of ARMv8 and ARM64. The Swift core of the A6 was already impressive for a first public release, but here Cyclone knocks my socks off; my hat is off to the Apple semiconductor architecture guys.

Third Candidate: Opportunistic Apple Changes

Mike Ash talked about those, and I have not attempted to measure their effect, so I will defer to him. I will just comment that to an extent, these improvements can also be seen as Apple being stuck with inefficiencies they cannot get rid of for application compatibility reasons on ARM/A32, and they found a solution to not have these inefficiencies in the first place, but only for ARM64 (and therefore, ARM64 is better ;). I mean, we in the Apple development community are quick to point and laugh at Windows being saddled with a huge number of application compatibility hacks and a culture of fear of breaking anything that caused this OS to become prematurely fossilized (and I mean, I’m guilty as charged too), and I think it’s only fair we don’t blindly give Apple a pass on these things.

So, remember the non-fragile Objective-C ABI? The iPhone has always had it (though we did not necessarily realize it, as the Simulator did not have it at first), so why can’t Apple use it to add an inline retain count to NSObject in 32-bit ARM? I’m willing to bet that for such a fundamental Cocoa object, non-direct effects start playing a role when attempting to make even such an ostensibly simple change; I’m sure for instance that some shipping iOS apps allocate an enormous amount of small objects, and would therefore run out of memory if Apple added 4 bytes of inline retain count to each NSObject, and therefore to each such object. Mind you, the non-fragile ABI has likely been useful elsewhere, on derivable Apple classes that are under less app compatibility pressure, but it was not enough to solve the inline retain count problem by itself.

Verdict: certainly a win, but should we give credit to ARM64 itself?

Fourth Candidate: the Floating-Point ABI

This one is a fun one. It could also be considered an inefficiency Apple is stuck with on 32-bit ARM, but since ARM touts that a hard float ABI is a feature of ARM64, I’m willing to cut Apple some slack here.

When the iPhone SDK was released, all iOS devices had an ARM11 processor which supported floating-point in hardware; however, Apple allowed, and even set by default, Thumb to be used for iOS apps, and Thumb on the ARM11 could not access the floating-point hardware, not even to put a floating-point value in a floating-point register. And in order to allow Thumb code to call the APIs, the APIs had to take their floating-point parameters from the general-purpose registers, and return their floating-point result to a general-purpose register, and in fact all function calls, including between ARM functions, had to behave that way, because they could always potentially be called by Thumb code: this is called the soft-float ABI. And when, with the iPhone 3GS, Thumb gained the ability to use the floating-point hardware, it had to remain compatible with existing code, and therefore had to keep forwarding floating-point parameters in general-purpose registers. Today, on 32-bit ARM, floating-point parameters are still passed that way for compatibility with the original usages.

This can represent a performance penalty, as typically transferring from the floating-point register file to the general-purpose register file is often expensive (and sometimes the converse too). It is often small in comparison to the time needed for the called function to execute, but not always. ARM64 does not allow such a soft-float ABI, and I wanted to see if I could make the overhead visible, and then if switching to ARM64 would eliminate the overhead.

I created a small function that adds successively 1.0f, 2.0f, 3.0f, etc. up to (1<<20)*1.0f to a single-precision floating-point accumulator that starts at 0.0, and another function which does the same thing except it calls another function to perform the addition, through a function pointer to decrease the risk of it being inlined. Then I compiled the code to ARMv7, ran it on the iPhone 5S, and measured and compared the time taken by each function; then the same process, except the code was also compiled for ARM64. Here are the results:

                            ARMv7       ARM64
Inlined addition            6.103 ms    5.885 ms
Addition by function call   12.999 ms   6.920 ms

Yup, we have managed to make the effect observable and evaluate the penalty, and when trying on ARM64 the penalty is decimated; there is some overhead left, probably the overhead of the function call itself, which would be negligible in a real situation.

Of course, this was a contrived example designed to isolate the effect, where the least possible work is done for each forced function call, so the effect won’t be so obvious in real code, but it’s worth trying to build for ARM64 and see if improvements can be seen in code that passes around floating-point values.
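For reference, the two contrived test functions as described could look something like this (a sketch; the names are mine and the original source was not published here):

#include <stdint.h>

static float ZPAddInlined(void)
{
    float accum = 0.0f;
    for (uint32_t i = 1; i <= (1u << 20); i++)
        accum += (float)i;
    return accum;
}

static float AddOperands(float a, float b) { return a + b; }
static float (* volatile addFunction)(float, float) = AddOperands;  /* volatile pointer: discourage inlining */

static float ZPAddByFunctionCall(void)
{
    float accum = 0.0f;
    for (uint32_t i = 1; i <= (1u << 20); i++)
        accum = addFunction(accum, (float)i);  /* floats cross the calling convention on every call */
    return accum;
}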

Verdict: A win, possibly a nice win, in the right situations.

Non-Candidate: Apple-Enforced Limitations

Ah, yes, I have to mention that before I go. Remember when I mentioned that some ARMv8 features were theoretically available in 32-bit mode? That’s because Apple won’t let you use them in practice: you cannot create a program slice for 32-bit ARM that will only run on Cyclone and later (with the other devices using another slice). It is simply impossible (and I have even tried dirty tricks to do so, to no avail). If you want to take advantage of ARMv8 features like the AES and SHA accelerator instructions, you have to port your code to ARM64. Period. So Sayeth Apple.

That means, for instance, that Anand’s comparison on the same hardware of two versions of Geekbench, the one just before and the one just after the addition of ARM64 support, while clever and the best he could do, is not really a fair comparison of ARM64v8 and ARM32v8, but in fact a comparison between ARM64v8 and ARMv7. When you remove the impressive AES and SHA1 advantages from the comparison table, you end up with something that may be fair, though it’s still hard to know for sure.

So these Apple-enforced limitations end up making us conflate the difference between ARM32v8 and ARM64v8 with the jump from ARMv7 to ARM64v8. Now mind you, Apple (and ARM, to an extent) gets full credit for both of them, but it is important to realize what we are talking about.

Plus, just a few paragraphs earlier I was chastising Apple for their constraining legacy, and what they did here, by not only shipping a full 64-bit ARMv8 processor, but also immediately a full 64-bit OS, is say: No. Stop trying to split hairs and try and use these new processor features while staying in 32-bit. Go ARM64, or bust. You have no excuse, the 64-bit environment was available at the same time as the new processor. And that way, it’s one less “ARM32v8” slice type in the wild, so one less legacy to worry about.

Conclusion

Well… No. I’m not going to conclude one way or the other: neither that the 64-bit aspect of the iPhone 5S is a marketing gimmick, nor that it is everything Apple implied it would be. I won’t enter this game. Because what I do see here is the result of awesome work from the processor side, the OS side, and the toolchain side at Apple alike to seamlessly get us a full 64-bit ARM environment (with 32-bit compatibility) all at once, without us having to second-guess the next increment. The shorter the transition, the better we’ll all be, and Apple couldn’t do shorter than that. For comparison, on the Mac the first 64-bit machine, the Power Mac G5, shipped in 2003; and while Leopard ostensibly added support for 64-bit graphical applications, iTunes actually ended up requiring Lion, released in 2011, to run as a 64-bit process. So overall on the Mac the same transition took 8 years. ARM64 is the future, and the iPhone 5S + iOS 7 is a clear investment in that future. Plus, with the iPhone 5S I was able to tinker on ARMv8 way ahead of when I thought I would be able to do so, so I must thank Apple for that.

Let me put something while you’re waiting…

A short note to let you know that I am currently working on a copiously researched post about the big new thing of the iPhone 5S processor: 64-bit ARM, as well as on related updates to keep “Introduction to NEON on iPhone” and of course “A few things iOS developers ought to know about the ARM architecture” up to date. I already have a lot of interesting info about that aspect of the iPhone 5S, either from experimenting on it or found online, and while searching I happened to stumble upon this post, which is so true and so exactly what I’ve been practicing for so long that I wish I had thought of posting it myself, so I’m sharing it with you here.

In fact, rather than compile then disassemble as in that post, I go one step further and compile to assembly (either use clang -S, or use the view assembly feature of Xcode):

[screenshot of part of the Xcode window, with the menu item for the view assembly feature highlighted]

Indeed, sometimes I don’t merely need to figure out the instructions for a particular task, but also the assembler directives for it, for instance in order to generate the proper relocation when referencing a global variable, or sometimes simply to know what my ARM64 assembly files need to start with in order to work…
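A trivial example of the kind of file I feed it (the file and function names are made up; only the -S flag matters here):

/* scratch.c -- compile to assembly with something like:
       clang -S -arch arm64 -O2 scratch.c -o scratch.s
   then read scratch.s for the instructions *and* the assembler directives. */
#include <stdint.h>

uint32_t global_counter;

uint32_t increment_counter(void)
{
    return ++global_counter;   /* shows the relocations used to address a global */
}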

A highly recommended technique. Don’t leave home without it:

ARM Processors: How do I do <x> in assembler?

How does ETC work? A sad story of application compatibility

This post is, in fact, not quite like the others. It is a parody of the Old New Thing I wrote for parody week, so take it with a big grain of salt…

Commenter Contoso asked: “What’s the deal with ETC? Why is it so complicated?”

First, I will note that ETC (which stands for Coordinated Eternal Time) is in fact an international standard, having been adopted by ISO as well as national and industrial standard bodies. The specification is also documented on MSDN, but that’s more for historical reasons than anything else at this point, really. But okay, let’s discuss ETC, seeing as that’s what you want me to do.

ETC is not complicated at all if you follow from the problem to its logical conclusion. The youngest among you might not realize it, but the year 2000 bug was a Big Deal. When it began to be discussed in the public sphere starting in 1996 or so, most people laughed it off, but we knew, and always knew, that if nothing was done, computers, and all civilization in fact, would be headed for a disaster of biblical proportions. Real wrath of God type stuff. The dead rising from the grave! Human sacrifice! Dogs and cats living together… mass hysteria!

The problem originated years before that, when some bright software developers could not be bothered to keep track of the whole year, and instead only kept track of the last two digits; so for instance, 1996 would be stored as just 96 in memory, and when reading it back it was implicitly considered to have had the “19” before it, and so would be restored as “1996” for display, processing, etc. But it just happened to work because the years they saw started in “19”, and things would go wrong as soon as years no longer did so, starting with 2000.

What happened (or rather, would have happened had we let it happen; this was run under controlled experiment conditions in our labs) in this case was that, for starters, these programs would print the year in the date as “19100”. You might think that would not be too bad, even though that would have some regulatory and other consequences, and would result in customers blaming us, and not the faulty program.

But that would in fact be if they even got as far as printing the date.

Most of them just fell over and died from some “impossible” situation long before that: some would take the date given by the API, convert it to text, blindly take the last two digits without checking the first two, and when comparing with the date in their records to see how old the last save was, would end up with a negative age, since it did 0 – 99 as far as the year was concerned, and the program would crash on a logic error; others would try and behave better by computing the difference between the year 1900 and the one returned by our API, but when they tried to process their “two-digit” year, which was now “100”, for display, they would take up one more byte than expected and end up corrupting whatever data was after it, which quickly led them to a crash.

And that was if you were lucky: some programs would appear to work correctly, but in fact have subtle yet devastating problems, such as computing interest backwards or outputting the wrong ages for people.

We could not ignore the problem: starting about noon, 31st of December 1999 UTC, when the first parts of the world would start being in 2000, we would have been inundated with support requests for these defective products, never mind that the problem was not with us.

And we could not just block the faulty software: even if we did not already suspect that was the case, a survey showed every single one of our (important) customers was using at least one program which we knew would exhibit issues come year 2000, with some customers using hundreds of such programs! And that’s without accounting for software developed internally by the customer; after requesting some samples we found out most of this software would be affected as well. Most of the problematic software was considered mission-critical and could not just be abandoned; it had to keep working past 1999, come hell or high water.

Couldn’t the programs be fixed and customers get an updated version? Well, for one, in the usual case the company selling the program would be happy to do so, provided customers would pay for the upgrade to the updated version of the software, and customers reacted badly to that scenario.

And that assumes the company that developed the software was still in business.

In any case, the program might have been written in an obsolete programming language like Object Pascal, using the 16-bit APIs, and could no longer be built for lack of a surviving install of the compiler, or even for lack of a machine capable of running the compiler. Some of these programs could not be fixed without fixing the programming language they used or a library they relied on, repeating the problem recursively on the suppliers of these, which may have gone out of business. Even if the program could technically be rebuilt, maybe its original developer was long gone from the company and no one else could have managed to do it.

But a more common case was that the source code for the program was in fact simply lost to the ages.

Meanwhile, we were of course working on the solution. We came up with an elegant compatibility mechanism by which any application or other program which did not explicitly declare itself to support the years 2000 and after would get dates from the API in ETC instead of UTC. ETC was designed so that 1999 is the last year to ever happen. It simply never ends. You should really read the specification if you want the details, but basically how it works is that in the first half of 1999, one ETC second is worth two UTC seconds, so it can represent one UTC year; then in the first half of what is left of 1999, which is a quarter year, one ETC second is worth four UTC seconds, so again in total one UTC year, and in the first half of what is left after that, one ETC second is worth eight UTC seconds, etc. So we can fit an arbitrary number of UTC years into what seems to be one year in ETC, and therefore from the point of view of the legacy programs. Clever, huh? Of course, this means the resolution of legacy programs decreases as time goes on, but these programs only had a limited number of seconds they could ever account for in the future anyway, so it is making best use of the limited resource they have left. Things start becoming a bit more complex when we start dividing 1999 into amounts that are no longer integer amounts of seconds, but the general principle remains.
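
To make the arithmetic concrete, here is a minimal sketch of that mapping (my own illustration, not anything from an actual specification): an instant some fraction of the way into UTC year 1999+n lands at the following fraction of the never-ending ETC year 1999.

#include <math.h>

// UTC year 1999+n occupies the ETC interval [1 - 2^-n, 1 - 2^-(n+1)) of 1999:
// each successive UTC year is squeezed into half of whatever remains of ETC 1999.
static double ETCFractionOf1999(int utcYearsSince1999, double fractionIntoUTCYear)
{
   double start = 1.0 - pow(2.0, -utcYearsSince1999);
   double width = pow(2.0, -(utcYearsSince1999 + 1));
   return start + fractionIntoUTCYear * width;
}

For instance, the middle of UTC year 2001 (n = 2, fraction 0.5) maps to 0.8125 of the way through ETC 1999, somewhere in late October; and as described above, each further UTC year gets squeezed into an ever thinner slice, hence the ever decreasing resolution for legacy programs.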

Of course, something might seem off in the preceding description, and you might guess that things did not exactly come to be that way. And indeed, when we deployed the solution in our usability labs, we quickly realized people would confuse ETC dates coming from legacy apps with UTC dates, for instance copying an ETC date and pasting it where a UTC date was expected, etc., causing the system to be unusable in practice. That was when we realized the folly of having two calendar systems in use at the same time. Something had to be done.

Oh, there was some resistance, of course. Some countries in particular dragged their feet. But in the end, when faced with the prospect of a digital apocalypse, everyone complied eventually, and by 1998 ETC was universally adopted as the basis for official timekeeping, just in time for it to be deployed. Because remember: application compatibility is paramount.

And besides, aren’t you glad it’s right now the 31st of December, 23:04:06.09375? Rather than whatever it would be right now had we kept “years”, which would be something in “2013” I guess, or another thing equally ridiculous.

Creating discoverable user defaults

John C. Welch recently lashed out at Firefox and Chrome for using non-standard means of storing user settings, and conversely praised Safari for using the property list format, the standard on Mac OS X, to do so. A particular point of contention with the Chrome way was the fact that there is no way to change plug-in settings by manipulating the preferences file unless the user has changed at least one such setting in the application beforehand, while this is not the case with Safari.

Or is it? As mentioned in the comments, there is at least one preference in Safari that behaves in a similar way, remaining hidden and absent from the preferences file until explicitly changed. What gives?

In this post I am going to try and explain what is going on here, and provide a way for Mac (and iOS — you never know when that might turn out to be useful) application developers to allow the user defaults that their applications use to be discoverable and changeable in the application property list preferences file.

The first thing is that it is an essential feature of the Mac OS X user defaults system (of which the property list format preference files are a part) that preferences files need not contain every single setting the application relies upon: if an application asks the user defaults system for a setting which is absent from the preference plist for this application, the user defaults system will merely answer that there is no value for that setting, and the application can then take an appropriate action. This is essential because it allows, for instance, seamlessly updating apps while preserving their settings, even when the update features new settings, but also just as seamlessly downgrading such apps, and even switching versions in other ways: suppose you were using the Mac App Store version of an app, and you switch to the non-Mac App Store one, which uses Sparkle for updates; using the Mac OS X user defaults system, the non-Mac App Store version will seamlessly pick up the user settings, while when Sparkle asks for its settings (it has a few) in the application preferences it will find they are not set, and will act as if there had been no previous Sparkle activity for this application, in other words behave as expected. This is much more flexible than, for instance, a script to upgrade the preference file at each app update.

So, OK, the preference file need not contain every single setting the application relies upon before the application is run; what about after the application is run? By default, unfortunately, the Mac OS X user defaults system is not made aware of the decision taken by the application when the latter gets an empty value, so the user defaults system simply leaves the setting unset, and the preference file remains as-is. There is, however, a mechanism by which the application can declare to the user defaults system the values to use when the search for given settings turns up empty, by registering them in the registration domain (-[NSUserDefaults registerDefaults:]); but as it turns out, these default settings do not get written to the preferences file either, they simply get used as the default for each setting.
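
For reference, here is roughly what that recommended mechanism looks like in code (a minimal sketch; the setting name and default value are the same hypothetical ones used in the example further down): the registered value is used as a fallback when the setting is read, but never gets written out.

// typically done early, e.g. when the application finishes launching
[[NSUserDefaults standardUserDefaults] registerDefaults:
   [NSDictionary dictionaryWithObject:@"burgers" forKey:@"KindOfObjectWeServe"]];

// this now returns @"burgers" if the user never changed the setting,
// yet the preferences file on disk still contains no trace of it
NSString* kind = [[NSUserDefaults standardUserDefaults] objectForKey:@"KindOfObjectWeServe"];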

So in other words, when using techniques recommended by Apple, no setting will ever get written out to the preferences file until explicitly changed by the user. It is still possible to change the setting without using the application, but this requires knowing exactly the name of the setting, its location, and the kind of value it can take, which are hard to guess without it being present in the preference file. This will not do, so what can we do?

What I have done so far in my (not widely released) applications is the following (for my fellow Mac developers lucky enough to be maintaining Carbon applications, adapt using CFPreferences as appropriate): to begin with I always use -[NSUserDefaults objectForKey:] rather than the convenience methods like -[NSUserDefaults integerForKey:], at the very least as the first call, so that I can know whether the value is unset or actually set to 0 (which is impossible to tell with -[NSUserDefaults integerForKey:]); then if the setting was not found in the user defaults system, I explicitly write the default value to the user defaults system before returning it as the requested setting; for convenience I wrap the whole thing into a function, one for each setting, looking more or less like this:

NSString* StuffWeServe(void)
{
   NSString* result = [[NSUserDefaults standardUserDefaults] objectForKey:@"KindOfObjectWeServe"];
   if (result != nil) // elided: checks that result actually is a string
      return result;
   
   // not found: insert the default so it appears in the preferences file
   result = @"burgers";
   [[NSUserDefaults standardUserDefaults] setObject:result forKey:@"KindOfObjectWeServe"];
   
   return result;
}

It is important to never directly call NSUserDefaults and always go through the function whenever the setting value is read (writing the setting can be done directly through NSUserDefaults). I only needed to do this for a handful of settings; if you have many settings, it should be possible to implement this in a systematic fashion by subclassing NSUserDefaults and overriding objectForKey:, to avoid writing a bunch of similar functions.
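
For what it is worth, here is a rough sketch of what that systematic variant could look like (untested, with a hypothetical class name, and keeping the default values in a dictionary held by the subclass rather than in the registration domain):

@interface WCDiscoverableUserDefaults : NSUserDefaults
@property (nonatomic, copy) NSDictionary* fallbackValues; // default value for each known setting
@end

@implementation WCDiscoverableUserDefaults
@synthesize fallbackValues;

- (id)objectForKey:(NSString*)key
{
   id result = [super objectForKey:key];
   if (result == nil && [self.fallbackValues objectForKey:key] != nil)
   {
      // first read of an unset setting: write the default out so that it
      // becomes visible (and editable) in the preferences file
      result = [self.fallbackValues objectForKey:key];
      [self setObject:result forKey:key];
   }
   return result;
}
@end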

With either approach, after a new setting is requested for the first time it is enough for the user defaults to be synchronized (which should happen automatically during the application run loop) for the setting to appear in the preference file, where it can more easily be discovered and changed by end users or, more typically, system administrators.

Patents

Patents and their application to software have been in the news lately: Lodsys and other entities that seem to have been created from whole cloth for that sole purpose are suing various software companies for patent infringement; Android is attacked (directly or indirectly) by established operating system actors Apple, Microsoft and Oracle (as owner of Sun) for infringing their patents; web video is standardized but the codec is left unspecified, as the W3C will only standardize on freely-licensable technologies while any remotely modern video compression technique is patented (even the ostensibly patent-free WebM codec is heading towards having a patent pool formed around it).

Many in the software industry consider it obvious not only that reform is needed, but that software patents should be banned entirely given their unacceptable effects; however I haven’t seen much of a justification of why they should be banned, as often the article/blog post/editorial defending this position treats it as obvious. Well, it is certainly obvious to the author as a practitioner of software, and obvious to me as the same, but it is not to others, and I wouldn’t want engineers of other trades to see software developers as prima donnas who think they should be exempted from the obligations related to patents for no reason other than the fact it inconveniences them. So here I am going to lay out why I consider that software patents actually discourage innovation, and in fact discourage any activity, in the software industry.

Why the current situation is untenable

Let’s start with the basics. A patent is an exclusive right over an invention that an inventor is granted in exchange for registering that invention with a public office (which includes a fee). Of course, he can share that right by licensing the patent to others, or he can sell the patent altogether. Anyone else using the invention (and that includes an end user) is said to be infringing the patent and is in the wrong, even if he came up with it independently. That seems quite outlandish, but it’s a tradeoff that we as a society have made: we penalize parallel inventors acting in good faith in order to better protect the original inventor (e.g. to avoid copyists getting away with their copying by pretending they were unaware of the original invention). Of course, if the parallel inventor is not found to have been aware of the original patent, he is less penalized than if he were, but he is penalized nonetheless. The aim is to give practitioners in a given domain an incentive to keep abreast of the state of the art in various ways, including by reading the patents published by the patent office in their domain. In fields where the conditions are right, I hear it works pretty well.

And it is here we see the first issue with software patents: the notorious incompetence of the USPTO (United States Patent and Trademark Office)1, which has been very lax and inconsistent when it comes to software patents, and has granted a number of dubious ones; and I hear it’s not much better in other countries where software patents are granted (European countries thankfully do not grant patents on software, for the most part). One of the criteria when deciding whether an invention can be patented is whether it is obvious to a practitioner aware of the state of the art, and for reasonably competent software developers the patents at the center of some lawsuits are downright obvious inventions. The result is that staying current with the software patents that are granted is such a waste of time that it would sink most software companies faster than any patent suit.

Now, it is entirely possible that the USPTO is overworked with a flood of patent claims which they’re doing their best to evaluate given their means, and that the bogus patents which end up being granted are rare exceptions. I personally believe the ones we’ve seen so far are but the tip of the iceberg (most are probably resting along with more valid patents in the patent portfolios of big companies), but even if we accept they are an exception, it doesn’t matter because of a compounding issue with software patents: litigation is very expensive. To be more specific, the U.S. patent litigation system seems calibrated for traditional brick and mortar companies that produce physical goods at the industrial scale; calibrated in the sense of how much scrutiny is given to the patents and the potential infringement, the number of technicalities that have to be dealt with before the court gets to the core of the matter, how long the various stages of litigation last, etc. Remember that in the meantime, the lawyers and patent attorneys gotta get paid. Litigation expenses that are high but sustainable for these companies simply put most software companies, which operate at a smaller scale, out of business.

Worse yet, even getting to the point where the patent and the infringement are looked at seriously is too expensive for most companies. As a result, attackers only need to have the beginning of a case to start threatening software developers with a patent infringement lawsuit if they don’t take a license; it doesn’t matter if the attacker’s case is weak and likely to lose in court eventually, as these attackers know that the companies they’re threatening do not have the means to fight to get to that point. And there is no provision for the loser to have to pay for the legal fees of the winner. So the choice for these companies is either to back off and pay up, or spend at least an arm and a leg that they will never recover defending themselves. This is extortion, plain and simple.

So even if bogus patents are the exception, it is more than enough for a few of them to end up in the wild and used as bludgeons to waylay software companies pretty much at will, so the impact is disproportionate with the number of bogus patents. Especially when you consider the assailants cannot be targeted back since they do not produce products.

But at the very least, these issues appear to be fixable. The patent litigation system could be scaled back (possibly only for software patents), and, who knows, the USPTO could change and do a correct job of evaluating software patents, especially if there are disincentives in place (like a higher patent submission fee) to curb the number of submissions and allow the USPTO to do a better job. And one could even postulate a world where software developers “get with the program”, follow patent activity, and avoid patented techniques (or license them, as appropriate) such that software development is no longer a minefield. But I am convinced this will not work, especially the latter, and that software (with possible exceptions) should not be patentable, for reasons I am going to lay out.

Why software patents themselves are not sustainable

The first reason is that contrary to, say, mechanical engineers, or biologists, or even chip designers, the software development community is dispersed, heterogeneous, and loosely connected, if at all. An employee in IT writing network management scripts is a software practitioner; an iOS application developer is a software practitioner; a web front-end developer writing HTML and JavaScript is a software practitioner; a Java programmer writing line of business applications internal to the company is a software practitioner; an embedded programmer writing the control program for a washing machine is a software practitioner; a video game designer scripting a dialog tree is a software practitioner; a Linux kernel programmer is a software practitioner; an embedded programmer writing critical avionics software is a software practitioner; an expert writing weather simulation algorithms is a software practitioner; a security researcher writing cryptographic algorithms is a software practitioner. More significantly, every company past a certain size, regardless of its field, will employ software practitioners, if only in IT, and probably to write internal software related to its field. Software development is not limited to companies in one or a few fields, software practitioners are employed by companies from all industry and non-industry sectors. So I don’t see software developers ever getting into a coherent enough “community” for patents to work as intended.

The second reason, which compounds the first, is that software patents cannot be reliably indexed, contrary to, say, chemical patents used in the pharmaceutical industry2. If an engineer working in pharmacology wants to know whether the molecule he intends to work on is patented already, there are databases that, based on the formal description of the molecule, make it possible to find any and all patents covering that molecule, or to know with a reasonably high degree of confidence that the molecule is not patented yet if the search turns up no result. No such thing exists (and likely no such thing can exist) for software patents, where there is at best keyword search; this is less accurate, and in particular cannot give confidence that an algorithm we want to clear is not patented, as a keyword search may miss patents that would apply. It appears that the only way to ensure a piece of software does not infringe patents is to read all software patents (every single one!) as they are issued to see if one of them wouldn’t cover the piece of software we want to clear; given that every company that produces software would need to do so, and remember the compounding factor that this includes every company past a certain size, this raises some scalability challenges, to put it lightly.

This is itself compounded by the fact that you do not need a lot of resources, or a lot of time, to develop and validate a software invention. To figure out whether a drug is worth patenting (to say nothing of producing it in the first place), you need a lab, in which you run experiments taking time and money to pay for the biological materials, the qualified technicians tending to the experiments, etc. Which may not work, in which case you have to start over; one success has to bear the cost of likely an order of magnitude more failures. To figure out whether a mechanical invention is worth patenting, you need to build it, spending a lot of materials (those making up the machine itself when it breaks catastrophically, or those the machine is supposed to process, like wood or plastic granules) iterating on the invention until it runs, and even then it may not pan out in the end. But validating a software invention only requires running it on a computer that can be had for $500, eating a handful of kilojoules (kilojoules! Not kWhs, kilojoules, or said another way, kilowatt-seconds) of electrical power, and no worker time at all except waiting for the outcome, since everything in running software is automated. With current hardware and compilers, the outcome of whether a software invention works or not can be had in mere seconds, so there is little cost to failure of an invention. As a result, developing a software invention comparable in complexity to an invention described in a non-software patent has a much, much lower barrier to entry and requires multiple orders of magnitude fewer resources; everyone can be a software inventor. Now there is still of course the patent filing fee, but still: in software you’ve got inventions that are easier to come up with, so many more of them will be filed, while they impact many more companies… Hmm…

Of course, don’t get me wrong, I do not mean here that software development is easy or cheap, because software development is about creating products, not inventions per se; developing a product involves a lot more (like user interface design, getting code by different people to run together, figuring out how the product should behave and what users want in the product, and many other things, to say nothing of non-programming work like art assets, etc.) than simply the inventions contained inside, and developing that takes a lot of time and resources.

Now let us add the possibility of a company getting a software patent so universal and unavoidable that the company is thus granted a monopoly on a whole class of software. This has historically happened in other domains, perhaps most famously with Xerox, which long held a monopoly on copying machines by having the patent on the only viable technique for doing so at the time. But granting Xerox a monopoly on the only viable copying technique did not impact other markets, as this invention was unavoidable for making copying machines and… well, maybe integrated fax/printer/copying machine gizmos which are not much more than the sum of their parts, but that was it. On the other hand, a software invention is always a building block for more complex software, so an algorithmic patent could have an unpredictable reach. Let us take the hash table, for instance. It is a container that makes it possible to quickly (in a formally defined sense) determine whether it already contains an object with a given name, and where, while still allowing new objects to be added quickly; something computer memories by themselves are not able to do. Its performance advantages do not merely make programs that use it faster, they allow many programs, which otherwise would be unfeasibly slow, to exist. The hash table enables a staggering amount of software; for instance, using a hash table you can figure out in a reasonable time, from survey results, the list of different answers given in a free-form field of that survey, and for each such answer the average age of the respondents who gave it. Most often the various hash table uses are even further removed from user functionality, but are no less useful, each one providing its services to another software component which itself provides services to another, etc. in order to provide the desired user functionality. Thanks to the universal and infinitely composable nature of software there is no telling where else, in the immensity of software, a software invention could be useful.
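
As a concrete illustration (my own, with made-up survey data; NSMutableDictionary is backed by a hash table), here is the kind of processing the hash table makes practical:

NSArray* answers = @[@"blue", @"green", @"blue"]; // free-form answers from the survey
NSArray* ages    = @[@34, @51, @27];              // age of the respondent who gave each answer
NSMutableDictionary* totalAge = [NSMutableDictionary dictionary];
NSMutableDictionary* count    = [NSMutableDictionary dictionary];
for (NSUInteger i = 0; i < answers.count; i++)
{
   NSString* answer = answers[i];
   // one quick lookup and one quick insertion per respondent, however many
   // distinct answers have already been seen
   totalAge[answer] = @([totalAge[answer] doubleValue] + [ages[i] doubleValue]);
   count[answer]    = @([count[answer] unsignedIntegerValue] + 1);
}
for (NSString* answer in count)
   NSLog(@"%@: average age %.1f", answer, [totalAge[answer] doubleValue] / [count[answer] doubleValue]);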

Back when it was invented, the hash table was hardly obvious; had it been patented, everyone would have had to find alternative ways to accomplish more or less the same purpose (given the universal usefulness it has), such as trees, but those would themselves have become patented until there was no solution left, as there are only so many ways to accomplish that goal (given that in software development you cannot endlessly vary materials, chemical formulas, or environmental conditions); at that point software development would have become frozen in an oligopoly of patent-holding companies, which would have taken advantage of being the only ones able to develop software to file more patents and indefinitely maintain that advantage.

Even today, software development is still very young compared to other engineering fields, even to what they were around the start of the nineteenth century when patent systems were introduced. And its fundamentals, such as the hardware it runs on and its capabilities, change all the time, such that there is always a need to reinvent some of its building blocks; therefore patenting techniques currently being developed risks having enormous impact on future software.

But what if algorithmic inventions that are not complex by software standards were not allowed patent protection, and only complex (by software standards) algorithms were, to compensate for the relative cheapness of developing an algorithmic invention of complexity comparable to a non-algorithmic invention, and avoid the issue of simple inventions with too important a reach? The issue is, with rare exceptions complex software does not constitute an invention bigger than the sum of individual inventions. Indeed, complex software is developed to solve user needs, which are not one big technical problem, but rather a collection of technical problems the software needs to solve, such that the complex software is more than the sum of its parts only to the extent these parts work together to solve a more broadly defined, non-technical problem (that is, the user needs). However this complex software is not a more complex invention solving a new technical problem its individual inventions do not already solve, so patenting this complex software would be pointless.

Exceptions (if they are possible)

This does leave open the possibility of some algorithmic techniques for which I would support making an exception and allowing them patent protection while denying it to algorithms in general, contingent on a caveat I will get into afterwards.

First of these are audio and video compression techniques: while they come down to algorithms in the end, they operate on real-world data (music, live action footage, voice, etc.) and have been shown to be effective at compressing this real-world data, so they have more than just mathematical properties. But more importantly, these techniques compress data by discarding information that will end up not being noticed as missing by the consumer of the media once uncompressed, and this has to be determined by experimentation, trial and error, large user trials, etc. that take resources comparable to a non-algorithmic invention. As a result, the economics of developing these techniques are not at all similar to those of software, and application of these techniques is limited to some, and not all, software, so it is worth considering keeping patent protection for these techniques.

Other techniques which are worth patenting, in my opinion, even though they are mostly implemented in software, are some encryption/security systems. I am not necessarily talking here of encryption building blocks like AES or SHA, but rather of setups such as PGP. Indeed these setups have provable properties as a whole, so they are more than just the sum of their parts; furthermore, as with all security software, the validation that such techniques work cannot be done by merely running the code3, but only by proving (a non-trivial job) that they are secure, again bringing the economics more in line with those of non-algorithm patents; therefore, having these techniques in the patent system should be beneficial.

So it could be worthwhile to try and carve out an exception and allow patents for these techniques and others sharing the same patent-system-friendly characteristics, but if attempted, extreme care will have to be taken when specifying such an exception. Indeed, even in the U.S.A. algorithm patents are formally banned, but accumulated litigation ended up with court decisions that progressively eroded this ban, first allowing algorithms on condition they were intimately connected to some physical process, then easing that qualification more and more until it became meaningless; software patents must still pretend to be about something other than software or algorithms, typically being titled some variation of “method and apparatus”, but in practice the ban on algorithm patents is well and truly gone, having been loopholed to death. So it is a safe bet that any exception granted to an otherwise general ban on software patents, should it happen in the future, will be subject to numerous attempts to exploit it for loopholes to allow software in general to be patented again, especially given the strong pressure from big software companies to keep software patents valid.

So if there is any doubt as to the validity and solidity of a proposed exception to a general ban on software patents, then it is better to avoid general software patents coming back through a back door, and therefore better to forego the exception. Sometimes we can’t have nice things.

Other Proposals

Nilay Patel argues that software patents should be allowed, officially even. He mentions mechanical systems, and a tap patent in particular, arguing that since the system can be entirely modeled using physical equations, fluid mechanics in particular, the entire invention comes down to math in the end just like software, so why should software patents be treated differently and banned? But the key difference here, to take again the example of the tap patent he mentions, is that the part of the math which is an external constraint, the fluid mechanics, is an immutable constant of nature. On the other hand, with algorithm patents all the algorithms involved are the work of man; even if there are external constraining algorithms in a given patent, due to legacy constraints for instance, those were the work of man too. In fact, granting a patent because an invention is remarkable due to the legacy constraints it has to work with and how it solves them would indirectly encourage the development and diffusion of such constraining legacy! We certainly don’t want the patent system encouraging that.

The EFF proposes, among other things, allowing independent invention as a valid defense against software patent infringement liability. If this is allowed, we might as well save costs and abolish software patents in the first place: a patent system relies on independent infringement being an infringement nonetheless in order to avoid abuses rendering the whole system meaningless, and I do not see software being any different in that regard.

I cannot remember where, but I heard the idea, especially with regard to media compression patents, of allowing software implementations to use patented algorithm inventions without infringing, so that software publishers would not have to get a license, while hardware implementations would require getting a license. But an issue is that “hardware” implementations are sometimes in fact DSPs which run code actually implementing the codec algorithms, so with this scheme the invention could be argued to be implemented in software; therefore OEMs would just have to switch to such a scheme if they weren’t already, qualify the implementation as software, and not have to pay for any license, so it would be equivalent to abolishing algorithm patents entirely.


  1. I do not comment on the internal affairs of foreign countries in this blog, but I have to make an exception in the case of the software patent situation in the U.S.A., which is so appalling that it ought to be considered a trade impediment.

  2. I learned that indexability was a very useful property that some patent domains, in contrast to software patents, do have, with the pharmaceutical industry as the specific example of such a domain, from an article on the web which I unfortunately cannot find at the moment; a web search did not turn it up, but did turn up other references for this fact.

  3. It’s like a lock: you do not determine that a lock you built is fit for fulfilling its purpose by checking that it closes and that using the key opens it; you determine it by making sure there is no other way to open it.

PSA: Do not release ARMv7s code until you have tested it

If you are using a third-party SDK in your iOS app, you may encounter a problem when linking with the current Xcode release: in that case the linker errors with the following line in the build log:

ld: file is universal (2 slices) but does not contain a(n) armv7s slice: libexample.a for architecture armv7s

(with some libraries, the linker will instead output the following when it errors, but it’s the same general problem:)

ld: warning: ignoring file libexample.a, file was built for archive which is not the architecture being linked (armv7s): libexample.a
Undefined symbols for architecture armv7s:

One solution for this issue is to get an updated version of the SDK that has a library with an ARMv7s slice (provided such an update exists, of course; otherwise you have no choice but to apply the second solution). However, you should do so only if you have an iPhone 5 to test on; otherwise, I strongly recommend you apply the second solution: go to your project settings (or target settings, if they are overridden at the target level), and edit the Architectures setting from “armv7 armv7s” to just “armv7”.

Why? Well, only the iPhone 5 can run the variant of your app code (called a slice) compiled for ARMv7s, so if you build and eventually release an update to your app that includes ARMv7s support, you would be releasing code you have not tested yourself, which is a big no-no. Don’t do it. In fact even if you can test on an iPhone 5, there is likely no need to rush and add ARMv7s support in your app as the benefits are incremental at best, as far as I can tell (but do not take my word for it, measure!); I really can’t understand why Apple added ARMv7s support in such a way that existing projects start using it right away by default.
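
Incidentally, for those who manage their build settings through xcconfig files, the second solution described above corresponds to the ARCHS build setting; a sketch, assuming nothing else in your project overrides it:

ARCHS = armv7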

Look forward to a post detailing the benefits (if any) to adding an ARMv7s slice to your app, as well as updates to my existing posts, in the coming weeks.

This post was initially published with somewhat different contents as at the time the iPhone 5 was not actually available. In the interest of historical preservation, the original content has been moved here.

Developer command-line tools setup

After my previous post, Gregory Pakosz wondered why I was using xcrun, as he did not need to (it turns out he had enabled the installation of the command-line developer tools in the Xcode prefs, so otool and friends were in /usr/bin). That got me thinking a bit about the setup for accessing command-line tools we can assume another developer has.

Starting from the olden days of Mac OS X and up until recently, the Developer Tools were a system-spanning install, most definitely not self-contained. In particular, either as standard or as an option offered at install, you could install command-line tools like the compiler and more in /usr/bin, directly accessible to your shell; and even if some disabled the option, they would add /Developer/usr/bin to their $PATH. As a result, you could just assume otool, libtool, gcc, etc. would be directly accessible in the environment of a fellow developer, and just give command lines directly using them when conversing with them by email, Twitter, blog posts, etc.

Then the iPhone SDK happened, and our numbers grew immensely (waving at you guys!), but habits were generally kept. And then Xcode 4 happened and emphasized being a one stop shop for your entire development process. And with Xcode 4.3 Xcode truly became self-contained, with no longer any installation per se; these days it is possible, and I suspect common, to develop and submit iPhone apps without needing to be aware of the Terminal at all (not that this is a bad thing, mind you).

In my case I initially missed installing the command-line tools: instead of going through the customization step of the old install process and enabling everything except WebObjects, I now simply had Xcode as an application in /Applications, and never thought to look for the installs in the Xcode preferences.

I have now enabled that preference and installed the tools in /usr/bin, but still I missed it initially, so who knows how many others do. Also, it occurs to me more and more build systems and other scripts in the Mac/iOS development world are now aware of the Xcode hierarchy, and rely on xcode-select -print-path and/or an environment variable to locate the developer directory and use the tools in there directly; as a result, the Command Line Tools install is now mostly for the benefit of open source/Unix stuff which simply assumes cc/gcc is in the path, and maybe we should start thinking about it as being for that specific purpose.

So my question is, should we simply assume command-line developer tools are in the path and write our blog posts, emails, and Twitter messages as usual (maybe with a reminder to “make sure you have the Command Line Tools installed in the Xcode prefs.”)? Or should we start considering that these tools do not necessarily belong in /usr/bin and instead instruct our readers to use xcode-select -switch then xcrun, or even instruct them to directly add the Developer/usr/bin folder inside Xcode to the $PATH (which is likely to break from time to time as the paths change)? What do you think? As usual, write me at wanderingcoder@sfr.fr.

Okay, feedback is not unanimous in either direction. For now I will keep putting xcrun, with the assumption that it will help those who have not installed the Command Line Tools, while those who have will know enough to remove it from the command line before use. — August 8, 2012