Developer ID might not seem restrictive, but it is

I need to talk about Gatekeeper and Developer ID.

In short, I am very uncomfortable with this previewed security feature of Mountain Lion. Apple is trying to assure that users are only going to be safer and that developers are still going to be able to do business as usual, but the Mac ecosystem is not limited to these two parties and this ignores pretty much everyone else: for these people Gatekeeper is going to be a problem. Enough so to make me consider switching.

I don’t mean to say it’s all bad, as Apple is set to allow more at the same time as it allows less. Indeed, with Developer ID Apple is clearly undertaking better support of apps from outside the Mac App Store, if only because they will have to maintain this system going forward, and I can only hope this support will improve in other areas (such as distribution: disk images are long past cutting edge). But while Apple gives with one hand, it takes away with the other, as Mountain Lion will by default (at least as of the current betas, though that seems unlikely to change) reject unsigned apps and apps signed by certificates other than Mac App Store and Developer ID ones; of course most people will not change that default, and so you will have trouble getting these people to run your code unless you get at least a Developer ID from Apple, and while better than requiring you to go through the Mac App Store this requirement is quite restrictive too.

The matter is not that with Developer ID apps will now be exposed to being blacklisted by Apple; honestly, speaking as a developer I personally do not mind this bit of accountability. Maybe there are going to be issues with this power now entrusted to the hands of Apple, such as the possibility of authorities (through executive branch bullying, or with a proper court order) asking Apple to neutralize an app perceived as illegal, but if this ever happens I believe the first incidents will cause this eventuality to be properly restricted by law.

No, the matter, as I wrote to Wil Shipley in an email after his proposal, is that many people who are important to the Mac platform are going to be inconvenienced with this, as getting a Developer ID requires a Mac Developer Program membership.

  • Sysadmins/IT people, to begin with, often need to deploy scripts, and either those don’t need to be signed, and they become the new malware vectors, or they do (Apple could define an xattr that would store the signature for a script) and then any company deploying Macs needs to enter the Mac Developer Program and manage a Developer ID that IT needs to be able to access day to day (that is, not just for releases, like in a software company) and so could leak, just so that the company can service its own Macs internally.

  • Then we have people using open-source tools that Apple doesn’t provide, such as lynx, ffmpeg, httrack, mercurial, etc., and who likely get them from projects like MacPorts; maybe we have an exception for executables that were built on the same machine, but how is it enforced then?

  • Student developers have historically been very important to the Mac platform, if only because many current Mac developers started out as such. If entering the Mac Developer Program is required to distribute Mac apps in the future, it’s a threshold that many will not clear, and as a result they will not get precious feedback from other people using their code, or worse they will not choose Mac development as a career as they could have if they had been encouraged to do so by people using their software (for instance, Jeff Vogel wasn’t planning on making Mac games as a career, but he quit grad school when Exile started becoming popular). At 99$ (per year), it seems silly to consider the cost of the Mac Developer Program as an obstacle, especially when compared to the cost of a Mac, but you have to consider the Mac likely benefitted from a student discount and was possibly entirely paid by the family; not so for the Mac Developer Program. Regardless, any extra expense will, rationally or not, cause it not to be taken by a significant portion of the people who would have otherwise tried it, even if it would have paid for itself eventually.

  • Many users will tinker with their apps for perfectly legitimate reasons, for instance in order to localize it and then submit the localization to the author, or in the case of games to create alternate scenarios or complete mods. It’s something that I am particularly sensitive to, as for a long time I have both enjoyed other’s people’s mods and conversely tinkered myself and shared with others: I have created mods, documented the formats to help others create mods, extracted data from the game files, gave tips and tricks and development feedback on other people’s in-progress mods, I was even at some point in charge of approving mods for a game to the official mods repository, and I created tools to help develop mods (more on that later). The user modding tradition is very strong in the Ambrosia Software games community, going back to Maelstrom nearly 20 years ago, and that’s merely the one community I am most familiar with. However, tinkering in such ways typically breaks the app signature; an app with an invalid signature will currently run on Lion (I know it if only because my Dock currently has an invalid signature), but it will likely change with Mountain Lion as otherwise Gatekeeper would be pointless (an important attack to protect against is legitimate apps that have been modified to insert a malicious payload and then redistributed). So we will have to rely on developers excluding files that could be desirable for users to tinker with from the signature seal… well, except that the developer will then need to make sure the app cannot be compromised if the files outside the seal are, and I’m pretty sure it’s impossible to do so for nibs for instance, so app developers will not be able to simply leave the nibs out of the seal so that users may localize them; they will need to roll out systems like the one Wil Shipley developed for localizations, completely out of the realm of Apple-provided tooling.

  • Power users/budding developers will often create small programs whose sole purpose is to help users of a main program (typically a game, but not always), for instance by interpreting some files and/or performing some useful calculations; they typically develop it for themselves, and share it for free with other users in the community of the main program. It’s something I have done myself, again for Ambrosia games, and it’s a very instructive experience: you start with an already determined problem, file format, etc., so you don’t have to invent everything from scratch which often intimidates budding developers. However, if it is required to register in the Mac Developer Program to distribute those then power users will keep those to themselves and they won’t benefit from the feedback, and other users won’t benefit from these tools.

(Note that Gatekeeper is currently tied to the quarantine system, and so in that configuration some of the problems I mentioned do not currently apply, but let’s be realistic: it won’t remain the case forever, if only so that Apple can have the possibility of neutralizing rogue apps even after they have been launched once.)

In fact, a common theme here is that of future developers. Focusing solely on users and app developers ignores the fact that Mac application developers don’t become so overnight, but instead typically start experimenting on their spare time, in an important intermediate step before going fully professional; it is possible to become an app developer without this step, but then the developer won’t have had the practice he could have gotten by experimenting before he goes pro. Or worse, he will have experimented on Windows, Linux, or the web, and gotten exactly the wrong lessons for making Mac applications—if he decides he wants to target the Mac at all in the end.

Because of my history, I care a lot about this matter, especially the last two examples I gave, and so I swore that if Apple were to require code to be signed by an authority that ultimately derives from Apple in order to run on the Mac, such that one would have to pay Apple for the privilege to distribute one’s own Mac software (as would be the case with Developer ID), then I would switch away from the Mac. But here Apple threw me a curveball, as it is the case by default, but users can choose to allow everything, but should that matter, since the default is what most people will ever know? Argh! I don’t know what to think.

In fact, at the same time I am worrying about the security of the whole system and wish for it to be properly secure: I know that any system that allows unsigned code to run is subject to the dancing bunnies problem; and maybe the two are in fact irreconcilable and it is reality itself I am having a problem with. I don’t know. Maybe Apple could allow some unsigned apps to run by default, on condition they have practically all the sandboxing restrictions to limit their impact. The only thing is, in order to be able to do anything interesting, these apps would at least have to have access to files given to them, and even that, combined with some social engineering, would be enough for malware to do harm, as users likely won’t treat these unsigned apps differently from regular desktop apps, which they consider “safe”. Maybe the only viable solution for distribution of tinkerer apps are as web apps (I hear there is work going on to allow those to access user files); I don’t like that very much (e.g. JavaScript is not very good to parse arbitrary files), but at the same time users do tend to take web apps with more caution than they take desktop apps (at least as far as giving them files goes, I hope), and any alternate “hyper sandboxed” system that would be introduced would have to compensate the 15+ years head start the web has in setting user expectations.

The same way, the very same cost of the Mac Developer Program which is a problematic threshold for many is also the speed bump that will make it economically unviable for a malware distributor who just had its certificate revoked to get a new one again and again.

This is why, paradoxically, I wish for iOS to take over the desktop, as by then iOS will likely have gained the possibility to run unsigned apps, and users having had their expectations set by years of being able to use only (relatively) safe iOS App Store apps will see these unsigned apps differently than they do apps from the store.

Anyway, nothing that has been presented about Mountain Lion so far is final, and important details could change before release, so it’s no use getting too worked up based on the information we know today. But I am worried, very worried.

Goodbye, NXP Software

For the last four years, starting before this blog even began, I have been working as a contractor programmer for NXP Software. Or rather had been, as the mission has now ended, effective 1st of January 2012. It was a difficult decision to take, and I will miss among other things the excellent office ambience, but I felt it was time for me to try other things, to see what’s out there, so to speak. After all, am I not the wandering coder?

I’ll always be thankful for everything I learned, and for the opportunities that have been offered to me while working there. Working at NXP Software was my first real job, and I couldn’t have asked for a better place to start at as people there have been understanding in the beginning when I clumsily transitioned to being a full-blown professional. I am also particularly thankful (among many other things) for the opportunity to go to WWDC 2010, where I learned a ton and which allowed me to meet people from the Apple community (not to mention visiting San Francisco and the bay area, even for a spell).

There are countless memories I’ll forever keep of the place, but the moment I’m most proud of would be the release of CineXplayer, and in particular its getting covered on Macworld. Proud because it’s Macworld (and Dan Moren), of course, but also because of something unassumingly mentioned in the article. You see, in the CineXplayer project I was responsible for all engine development work (others handled the UI development), including a few things at the boundary such as video display and subtitle rendering; we did of course start out from an existing player engine, and we got AVI/XviD support from ongoing development on that player (though we got a few finger cuts from that as we pretty much ended up doing the QA testing of the feature…), but interestingly when we started out this player engine had no support for scrubbing. None at all. It only supported asynchronous jumping, which couldn’t readily be used for scrubbing. And I thought: “This will not do.” and set out to implement scrubbing; some time later, it was done, and we shipped with it.

And so I am particularly proud of scrubbing in CineXplayer and its mention in Dan Moren’s article, not because it was particularly noticed but on the contrary because of the so modest mention it got: this means it did its job without being noticed. Indeed, rather than try and seek fifteen pixels of fame, programmers should take pride in doing things that Just Work™.

As I said, I wanted a change of scenery, and that is why I am still employed by SII and I have started a new mission in Cassidian to work on developing professional mobile radio systems (think the kind of private mobile network used by public safety agencies like police and firefighters). Don’t worry, I am certainly not done developing for iOS or dispensing iOS knowledge and opinions here, as I will keep doing iOS stuff at home; I can’t promise anything will come out of it on the iOS App Store, but you’ll certainly be seeing blog posts about it.

And I know some people in NXP Software read this blog, so I say farewell to all my peeps at NXP Software, and don’t worry, I’ll drop by from time to time so you’ll be seeing me again, most likely…

GCC is dead, long live the young LLVM

(Before I get flamed, I’m talking of course of GCC in the context of the toolchains provided by Apple for Mac and iOS development; the GCC project is still going strong, of course.)

You have no doubt noticed that GCC disappeared from the Mac OS X developer tools install starting with Lion; if you do gcc --version, you’ll see LLVM-GCC has been given the task of handling compilation duties for build systems that directly reference gcc. And now with the release of the iOS 5 SDK, GCC has been removed for iOS development too, leaving only LLVM-based compilers there as well.

Overall I’m going to say it’s a good thing: LLVM, especially with the Clang front end has already accomplished a lot, and yet has so much potential ahead of it; while GCC was not a liability, I guess this very customized fork was a bit high maintenance. Still, after 20 years of faithful service for Cocoa development at NeXT then Apple, it seems a bit cavalier for GCC to be expelled in mere months between the explicit announcement and it actually being removed. Ah well.

But while I have no worry with LLVM when doing desktop development (that is, when targeting x86 and x86-64), however LLVM targeting iOS (and thus ARM) is young. Very young. LLVM has only been deemed production quality when targeting ARM in summer 2010, merely one year ago and change. Since then I have heard of (and seen acknowledged by Chris Lattner) a fatal issue (since fixed) with LLVM for ARM, and it seems another has cropped up in Xcode 4.2 (hat tip to @chockenberry). So I think the decision to remove GCC as an option for iOS development was slightly premature on Apple’s part: a compiler is supposed to be something you can trust, as it has the potential to introduce bugs anywhere in your code; it has to be more reliable and trustworthy than the libraries, or even the kernel, as Peter Hosey quipped.

Now don’t get me wrong, I have no problem with using Clang or LLVM-GCC for iOS development, in fact at work we switched to Clang on a trial basis (I guess it’s now no longer on a trial basis anymore, certainly not after the iOS 5 SDK) about one year ago, and we’ve not had any issue ourselves nor looked back since. Indeed, for its relative lack of maturity and the incidents I mentioned, LLVM has one redeeming quality, and it’s overwhelming: Apple is itself using LLVM to compile iOS; Cocoa libraries, built-in apps, Apple iOS App Store apps, etc., millions upon millions of lines of code ensure that if a bug crops up in LLVM, Apple will see it before you do… provided, that is, that you don’t do things Apple doesn’t do. For instance, Apple has stopped targeting ARMv6 devices starting with iOS 4.3 in March 2011, and it is no coincidence that the two incidents I mentioned were confined to ARMv6 and did not affect ARMv7 compilation.

So I recommend a period of regency, where we allow LLVM to rule, but carefully oversee it, and in particular prevent it from doing anything it wouldn’t do at Apple, so that we remain squarely in the use cases where Apple shields us from trouble. This means:

  • foregoing ARMv6 development from now on. In this day and age it’s not outlandish to have new projects be ARMv7-only, so do so. If you need to maintain an existing app that has ARMv6 compatibility, then develop and build it for release with Xcode 4.1 and GCC, or better yet, on a Snow Leopard machine with Xcode 3.2.6 (or if you don’t mind Snow Leopard Server, it seems to be possible to use a virtual machine to do so).
  • avoiding unaligned accesses, especially for floating-point variables. It is always a good idea anyway, but doubly so now; doing otherwise is just asking for trouble.
  • ensuring your code is correct. That sounds like evident advice, but I’ve seen in some cases incorrect code which would run OK with GCC, but was broken by LLVM’s optimizations.
  • I’d even be wary of advanced C++ features; as anyone who has spent enough time in the iOS debugger can attest from the call stacks featuring C++ functions from the system, Apple uses quite a bit of C++ in the implementation of some frameworks, like Core Animation, however C++ is so vast that I’m not sure they make use of every nook and cranny of the C++98 specification, so be careful.
  • avoiding anything else you can think of that affects code generation and is unusual enough that Apple likely does not use it internally.

Now there’s no need to be paranoid either; for instance to the best of my knowledge Apple compiles most of its code for Thumb, but some is in ARM mode, so you shouldn’t have any problem coming from using one or the other.

With this regency in place until LLVM matures, there should be no problems ahead and only success with your iOS development (as far as compiling is concerned, of course…)

Benefits (and drawback) to compiling your iOS app for ARMv7

In “A few things iOS developers ought to know about the ARM architecture”, I talked about ARMv6 and ARMv7, the two ARM architecture versions that iOS supports, but I didn’t touch on an important point: why you would want to compile for one or the other, or even both (thanks to Jasconius at Stack Overflow for asking that question).

The first thing you need to know is that you never need to compile for ARMv7: after all, apps last updated at the time of the iPhone 3G (and thus compiled for ARMv6) still run on the iPad 2 (provided they didn’t use private APIs…).

Scratch that, you may have to compile for ARMv7 in some circumstances: I have heard reports that if your app requires iOS 5, then Xcode won’t let you build the app ARMv6 only. – May 22, 2012

So you could keep compiling your app for ARMv6, but is it what you should do? It depends on your situation.

If your app is an iPad-only app, or if it requires a device feature (like video recording or magnetometer) that no ARMv6 device ever had, then do not hesitate and compile only for ARMv7. There are only benefits and no drawback to doing so (just make sure to add armv7 in the Required Device Capabilities (UIRequiredDeviceCapabilities) key in the project’s Info.plist, otherwise you will get a validation error from iTunes Connect when uploading the binary, such as: “iPhone/iPod Touch: application executable is missing a required architecture. At least one of the following architecture(s) must be present: armv6”).

If you still want your app to run on ARMv6 devices, however, you can’t go ARMv7-only, so your only choices are to compile only for ARMv6, or for both ARMv6 and ARMv7, which generates a fat binary which will still run on ARMv6 devices while taking advantage of the new instructions on ARMv7 devices1. Doing the latter will almost double the executable binary size compared to the former; executable binary size is typically dwarfed by the art assets and other resources in your application package, so it typically doesn’t matter, but make sure to check this increase. In exchange, you will get the following:

  • ability to use NEON (note that you will not automatically get NEON-optimized code from the compiler, you must explicitly write that code)
  • Thumb that doesn’t suck: if you follow my advice and disable Thumb for ARMv6 but enable it for ARMv7, this means your code on ARMv7 will be smaller than on ARMv6, helping with RAM and instruction cache usage
  • slightly more efficient compiler-generated code (ARMv7 brings a few new instructions besides NEON).

Given the tradeoff, even if you don’t take advantage of NEON it’s almost always a good idea to compile for both ARMv6 and ARMv7 rather than just ARMv6, but again make sure to check the size increase of the application package isn’t a problem.

Now I think it is important to mention what compiling for ARMv7 will not bring you.

  • It will not make your code run more efficiently on ARMv6 devices, since those will still be running the ARMv6 compiled code; this means it will only improve your code on devices where your app already runs faster. That being said, you could take advantage of these improvements to, say, enable more effects on ARMv7 devices.
  • It will not improve performance of the Apple frameworks and libraries: those are already optimized for the device they are running on, even if your code is compiled only for ARMv6.
  • There are a few cases where ARMv7 devices run code less efficiently than ARMv6 ones (double-precision floating-point code comes to mind); this will happen on these devices even if you only compile for ARMv6, so adding (or replacing by) an ARMv7 slice will not help or hurt this in any way.
  • If you have third-party dependencies with libraries that provide only an ARMv6 slice (you can check with otool -vf <library name>), the code of this dependency won’t become more efficient if you compile for ARMv7 (if they do provide an ARMv7 slice, compiling for ARMv7 will allow you to use it, likely making it more efficient).

So to sum it up: you should likely compile for both ARMv6 and ARMv7, which will improve your code somewhat (or significantly if you take advantage of NEON) but only when running on ARMv7 devices, while increasing your application download to a likely small extent; unless, that is, if you only target ARMv7 devices, in which case you can drop compiling for ARMv6 and eliminate that drawback.


  1. Apple would very much like you to optimize for ARMv7 while keeping ARMv6 compatibility: at the time of this writing, the default “Standard” architecture setting in Xcode compiles for both ARMv6 and ARMv7.

April’s Fools 2011

So, if you read my previous post before today… April’s fools! And not in the way you might think. This behavior of the iPad 2 is real, I did not make it up, I did indeed verify it this week. The joke is that I claimed to be surprised, hoping to make people believe this unexpected behavior was an April’s fools. Posting strange-sounding yet true information on April the first—now that is the real prank.

It’s hard to tell how successful I was in tricking people into believing this was a joke; I did however get a few emails explaining (as I pretended to request) how such a thing was possible. Congratulations guys, you did not fall for it!

I completely expected this behavior of the iPad 2, I knew about ARM having a weakly ordered memory model, and have known for some time (this test code was prepared over the last few weeks, for instance). By pretending to be surprised, I attempted to raise awareness of this behavior, which many people are completely unaware of; indeed, programmers have rarely been exposed to weakly ordered memory systems so far: x86 is strongly ordered, and even if these programmers have worked on ARM they have only worked on single-core systems so far (the only consumer hardware I know of that exposed a weakly ordered memory model are the various bi-pro PowerPC PowerMacs, which are not very common and back then Mac code was mostly single-threaded). I’ve been thinking about ways to raise this awareness for some time, but it was hard to find out how since it was pretty much a theoretical concern as long as no mainstream multi-core ARM hardware was shipping. But now that the iPad 2, the Xoom, and other multi-core ARM tablets and handsets have shipped I can show everyone that this indeed occurs.

Later today or tomorrow, I will replace the contents of that post with a more in-depth description and a few references, in other words the post I intended to write in the first place, before I realized I could turn it into a small April’s fools prank. It will be at the same URL, in fact you might have noticed the slug did not really match the title, I intended this as a small hint that something was off…

Whether you thought the iPad 2 behavior was a joke, you knew this behavior was real but believed I was genuinely surprised, or you saw right through my feigned surprise, thank you for reading!

(On that note, I should mention I have been sloppy in checking my spam filters residue so far, and my ISP deletes them automatically after one week. So if you ever wrote me and I never answered, this may be why. My apologies if this happened to you, please send the email again if you feel like doing so.)

ARM multicore systems such as the iPad 2 feature a weakly ordered memory model

At the time of this writing, numerous multicore ARM devices are either shipping or set to ship; handsets, of course, but more interestingly this wave of tablets, in particular the iPad 2 (but not only it), seems to be generally based around multicore ARM chips, be it the Tegra 2 from nVidia, or the OMAP 4 from TI, etc. ARM multicore systems did exist before, as the ARM11 was MP-capable, but I’m not aware of it being used in many or any device open for third-party development; this seems to be really exploding now with the Cortex A9.

These devices will also introduce a surprising system behavior to many programmers for the first time, a behavior which if it isn’t understood will cause crashes, or worse.

Let me show what I’m talking about:

BOOL PostItem(FIFO* cont, uint32_t item) /* Bad code, do not use in production */
{ /* Bad code, do not use in production */
#error This is bad code, do not use!
    size_t newWriteIndex = (cont->writeIndex+1)%FIFO_CAPACITY; /* Bad code, do not use in production */
    /* see why at http://wanderingcoder.net/2011/04/01/arm-memory-ordering/ */
    if (newWriteIndex == cont->readIndex) /* Bad code, do not use in production */
        return NO; /* notice that we could still fit one more item,
                    but then readIndex would be equal to writeIndex
                    and it would be impossible to tell from an empty
                    FIFO. */
                    
    cont->buffer[cont->writeIndex] = item; /* Bad code, do not use in production */
    cont->writeIndex = newWriteIndex; /* Bad code, do not use in production */
    
    return YES; /* Bad code, do not use in production */
}

BOOL GetNewItem(FIFO* cont, uint32_t* pItem) /* Bad code, do not use in production */
{
#error This is bad code, do not use!
    if (cont->readIndex == cont->writeIndex) /* Bad code, do not use in production */
        return NO; /* nothing to get. */
        
    *pItem = cont->buffer[cont->readIndex]; /* Bad code, do not use in production */
    /* see why at http://wanderingcoder.net/2011/04/01/arm-memory-ordering/ */
    cont->readIndex = (cont->readIndex+1)%FIFO_CAPACITY; /* Bad code, do not use in production */
    
    return YES; /* Bad code, do not use in production */
}

(This code is taken from the full project, which you can download from Bitbucket in order to reproduce my results.)

This is a lockless FIFO; it looks innocent enough. I tested it in the following setup: a first thread posts consecutive integers slightly more slowly (so that the FIFO is often empty) than a second thread, which gets them and checks that it gets consecutive integers. When this setup was run on the iPad 2, in every run the second thread very quickly (after about 100,000 transfers) got an integer that wasn’t consecutive with the previous one received; instead, it was the expected value minus FIFO_CAPACITY, in other words a leftover value from the buffer.

What happens is that the system allows writes performed by one core (the one which runs the first thread) to be seen out of order from another core. So the second core, running the second thread, first sees that writeIndex was updated, goes on to read the buffer at offset readIndex, and only after that sees the write in the buffer to that location, so it read what was there before that write.

A processor architecture which, like ARM, allows this to happen is referred to as weakly ordered. This behavior might seem scandalous, but remember your code is run on two processing units which, while they share the same memory, are not tightly synchronized, so you cannot expect everything to behave exactly the same way as in the single core case, this is what allows two cores to be faster than one. Many processor architectures permit writes to be reordered (PowerPC for instance), among other things permitting this allows an important reduction in cache synchronization traffic. While it also allows more freedom when designing out of order execution in the processor core, it is not necessary: a system made of in-order processors may reorder writes because of the caches, and it is possible to design a system with out of order processors that does not reorder writes.

Speaking of which, on the other hand x86 guarantees that writes won’t be reordered, that architecture is referred to as strongly ordered. This is not to say it doesn’t do any reordering, for instance reads are allowed to happen ahead of writes that come “before” them; this breaks a few algorithms like Peterson’s algorithm. Since this architecture dominates the desktop, and common mobile systems have only featured a single core so far and thus don’t display memory ordering issues, programmers as a result have gotten used to a strongly ordered world and are generally unaware of these issues. But now that the iPad 2 and other mainstream multicore ARM devices are shipping, exposing for the first time a large number of programmers to a weakly ordered memory model, they can no longer afford to remain ignorant—and going from a strongly ordered memory model to a weakly ordered one breaks far more, and much more common, algorithms, like the double-checked lock and this naive FIFO, than going from single processor to a strongly ordered multiprocessor system ever did.

Note that this can in fact cause regressions on already shipping iOS App Store apps (it is unclear whether existing apps are confined to a single core for compatibility or not) since, while very few iOS apps do really take advantage of more than one core yet, some nevertheless will from time to time since they are threaded for other reasons (e.g. to have tasks run in real-time for games or audio/video playback). However, Apple certainly tested existing iOS App Store apps on the iPad 2 hardware and they would have noticed if it caused many issues, so this probably only affects a limited number of apps and/or it occurs rarely. Still, it is important to raise awareness of this behavior, as an unprecedented number of weakly ordered memory devices are going to be in the wild now, and programmers are expected to make use of these two cores.

What now?

So what if you have a memory ordering issue? Well, first you don’t necessarily know that it is one, just like for threading bugs; the only thing you know is that you have an intermittent issue, you won’t know it is memory ordering related until you find the root cause. And if you thought threading bugs were fun, wait until you investigate a memory ordering issue. Like threading issues, scenarios in which memory ordering issues manifest themselves occur rarely, which makes them just as hard (if not harder) to track down.

To add to the fun, the fact your code runs fine on a multicore x86 system (which practically all Intel Macs, and therefore practically all iOS development machines, are) does not prove at all that it will run correctly on a multicore ARM system, since x86, as we’ve seen, is strongly ordered. So these memory ordering issues will manifest themselves only on device, never on the Simulator. You have to debug on device.

Once you find a plausible culprit code, how do you fix it (since often the only way to show the root cause is where you suspect it is, is to fix the code anyway and see if the symptoms disappear)? I advise against memory barriers; at least with threading bugs, you can reason in terms of a sequence of events (instructions of one thread happening, one thread interrupting another, etc.); with memory ordering bugs there is no longer any such thing as a single sequence, each core has its own; as in Einstein’s relativity, simultaneity in different reference frames is now meaningless. This makes memory ordering issues extremely hard to reason about, and the last thing you want is to leave it incorrectly resolved: it’s neither done nor to be done.

Instead, what I do is lock the code with a mutex, as it should have been in the first place. On top of its traditional role, the mutex ensures that a thread that took it sees the writes made before it was previously released elsewhere, taking care of the problem. Your code won’t be called often enough for the mutex to have any performance impact (unless you’re one of the few to be working on the fundamental primitives of the operating system or of a game engine, in which case you don’t need my advice).

For new iOS code, especially for code meant to run on more than one core at the same time, I suggest using Grand Central Dispatch, and using it in place of any other explicit or implicit thread communication mechanism. Even if you don’t want to tie yourself too much to iOS, coding in this way will make the various tasks and their dynamic relationships clear, making any future port easier. If you’re writing code for another platform, try to use similar task management mechanisms, if they exist they’re very likely to be better than what you could come up with.

But the important thing is to be aware of this behavior, and spread the awareness in the organization. Once you’re aware of it, you’re much better equipped to deal with it. As we say in France, “Un homme averti en vaut deux” (a warned man is worth two).

Here are a few references, for further reading:

This post was initially published with entirely different contents as an April’s fools. In the interest of historical preservation, the original content has been moved here.

First Impressions of the Mac App Store

I try to be original in the subjects I tackle, but if you are a Mac user, there is no escaping the Mac App Store, which is probably the most important thing to happen to the Macintosh platform since Mac OS X, at least. It remains to be seen whether it will be in a good or a bad way, but for now, I’ve given it a test drive.

Trial Run

After uneventfully updating to 10.6.6 and launching the Mac App Store application, I decided to buy Delicious Library to catalog my growing collection of webcomic books (it’s not as big as the one Wil Shipley pimps in the Delicious Library 2 screenshots, but I’m getting there), and of course to get a feel of how the Mac App Store works for a paid application download, not just a free one. This was when I encountered the first issue:

Screen Capture of a Mac App Store dialog in French, with text being cut off in the middle

“effehargements”? I’m afraid I don’t know that word

Gee, are things starting well… I mean, did localizers get so little time to give feedback on the size of user interface elements that this couldn’t be fixed for release? Any other explanation I can think of would, in fact, be worse. I’m not going to focus on that too much since it’s likely to be fixed soon, but it’s a bad first impression to make.

After logging in with my Apple ID as instructed, I was unsurprisingly told I had new terms to accept. Less expected is the fact these terms are an extension of the iTunes Store terms and conditions; apparently the commercial relationship users of the Mac App Store have is an extension of the one most of us already have with iTunes, not an entirely new one or an extension of the Apple Online Store ones. The main reason, I guess, is that they can use the credit card already associated with your iTunes account, and any iTune Store credit you may have; plus, that way the Mac App Store benefits from the iTunes Store infrastructure (servers and stuff).

Of course, by the time I was done reading the terms, my session had expired; it’s as if they weren’t expecting you to read them… I’m noticing this everywhere, though, it’s not just Apple. So I logged in again, accepted the terms, and bought Delicious Library. As widely reported, the application then moved to the Dock with a nice, if slightly overdone, animation (sure, have an animation, but they may have used a simpler one), where it showed a progress bar while it downloaded, up until the download was complete, at which point it jumped once, and stayed in the Dock (while, behind the scenes, it had been put in the Applications folder). This may seem gratuitous, but to me this is indispensable for the buying/downloading experience, as opposed to the disconnected experience of downloading software on the Web.

I then tried out Delicious Library, entering a few books, etc. (unfortunately, I do not have a webcam attached to my Mac pro, so I had to enter the ISBNs by hand). I’m not going to get into a review of Delicious Library here, I just checked that the application was working correctly.

Then, I checked something I had been wondering about. Even though the Mac App Store will only work on Mac OS X 10.6.6 onwards, this is not necessarily the baseline for the apps bought on the Mac App Store themselves: apparently, they can support earlier releases of Mac OS X, including Leopard. Obviously, they cannot be bought from there, they have to be transferred from a Snow Leopard Mac where you bought them. But I was wondering how the computer authorization process (documented in various articles, like Macworld’s hands on, read just above “Work in progress”) would work on a Leopard machine where the Mac App Store cannot be installed.

So I took the Delicious Library application, and moved it to my original MacBook, which remains on 10.5 for a variety of reasons (I don’t have another Mac with Snow Leopard on hand, to test on pre-Mac App Store 10.6.5, unfortunately). When I connected my MacBook to the network (for the first time in the year), there was no update, which would have been necessary to add such support. And when I tried to run Delicious Library, this is what I got:

Delicious Library crash report, listing an “unknown required load command 0x80000022”

Uh oh, Wil

This error is, in fact, not related to the Mac App Store at all, it seems instead that the application relies on some other Snow Leopard-only feature, probably by mistake. Apparently, this build was never tested on Leopard. I double-checked, and the application does declare it can run on Leopard in the Mac App Store application, as well as in its property list (from which the Mac App Store information was probably generated). So, I went looking for a free app that would run on Leopard; Evernote fit the bill, so I downloaded it and transferred it. It ran without problem, however being a free app, it did not need to validate its license on the MacBook or anything of the sort, I would have to test with a paying app. Osmos declares it runs on Leopard (as early as Tiger, in fact, though it’s Intel-only, so not on a PowerPC machine), so I bought it (the things I’ll do for you people) and transferred it. But it didn’t run any better than Delicious Library, though for a different reason (it required a version of libcurl more recent than the one found in Leopard). So, it’s another app that hasn’t actually been tested on the baseline Mac OS X version it claims it supports, great. I stopped the expense there.

Note that a large majority of paid apps actually require Snow Leopard, if their Mac App Store listings are to be believed. I’d wager that none of the paid apps that declare otherwise were actually tested on Leopard, and that all paid apps actually require Snow Leopard and probably 10.6.6 to run correctly; anyone care to confirm otherwise? I have no intent on spending a bunch of money to test that theory.

General Criticism

Besides the events of this run, I want to make more general observations on the Mac App Store. Contrary to the music, movie, book, comic book, etc. industries, where digital distribution is a relatively new phenomenon, people have been selling computer software over the network (not even necessarily the Internet back in those days, think Compuserve, AoL, the numerous BBS…) – and making a living out of it – since the beginning of the nineties, if not earlier. And yet, even after 20 years, for the majority of Mac users the act of buying software still means the brick and mortar store, or at best, a mail-order store like Amazon. There is no household-name software that’s distributed mostly digitally, except for some open-source applications like VLC or Firefox, expander/viewer/reader companion apps, and rare successes like… uhh… I’m sure I’ll think of one eventually.

Welp, while my questioning of whether such software existed was rhetorical at the time, it turns out there is a piece of software that in fact qualifies: Skype; their unusual business model is irrelevant: indisputably, it is commercial software mostly distributed digitally. This goes to show that when you solve a very practical problem with killer tech, you can overcome the barrier between digital distribution and the mainstream Mac market; that being said, it remains a hard problem for anyone else. – May 22, 2012

I’ve said the Mac App Store is probably the most important thing to happen to the Macintosh platform since Mac OS X, and that’s because it promises to provide at last a way to distribute software outside of the brick and mortar stores, that the rest of us will actually use; this, in turn, will allow developers who do not have the means to distribute their products in stores to access the majority of Mac users; of course, virtual or physical, shelf space and attention remain limited, but now we can avoid a hugely inefficient step in the middle.

Since the Mac App Store will set the expectations of Mac users for years to come, how it works, what it allows users, the kind of software found on it, etc., are extremely important, not just for Apple in the short run, but for the health of the platform in many years down the line. To me, the Mac App Store delivers in the main area it was supposed to: provide a great, integrated, end-to-end buying/downloading experience. However, it falls short to a smaller or greater extent in all other areas.

Let’s begin by the design. It’s a straight port of the “App Store” iPad app. Really, couldn’t they have done better? Surely, they could have made better use of the space afforded by the desktop, instead of using the strict iPhone/iPad tabbed design. Why have one entire tab to the updates, couldn’t this be put in a notification area in all modes? And breadcrumbs? Are they forbidden now? But the worst is surely that weird window title bar, with no title, and the stoplight window controls in the center left of the bar; I mean, is space at such a premium that they couldn’t have gone with a traditional unified title and toolbar design? It would have worked very well with the Panic toolbar design! To add insult to injury, these “toolbar tabs that go to the top of the title bar” are actually click-through! Argh! Now not even the top center of a window is safe for clicking (the worst thing is, I was already instinctively avoiding them when clicking to bring the Mac App Store window to the foreground, showing how thoroughly pervasive click-through has already damaged my computer habits).

As I’ve said, the Mac App Store provides a good, more importantly connected experience, from the buying intent to the moment the app is ready to use in the Dock. I’ve heard some complain about this automatic Dock placement, but to me this is not a problem, or to be more accurate this is not the problem with it. If you think about it, this policy actually is the most efficient: as it stands if, in the minority of cases, you don’t want to regularly use the application you just bought, you can just drag it out of the Dock; otherwise, you do nothing. The alternatives are not putting it automatically, in which cases in the majority of cases you are going to fetch the newly bought app from the Applications folder to put it in the Dock (which is more work than dragging an application out of the Dock), or asking you each time, in which case every time you have to read a dialog, choose on the spot, and click; and let’s not mention having a preference. The same goes for automatic placement in the Applications folder. Yes, I know browsers have these kind of options (Downloads folder/Desktop/ask each time), but that’s mostly because of historical reasons, your not necessarily expecting to be downloading something, and the wide variety of things you could be downloading from a browser.

But that’s from the perspective of an experienced user. For user actions, and doubly so when actions are made to happen “automatically” like that, the way to undo the action should be obvious from the way it was done (or shown, animations are very useful for that); I don’t mean the way to reverse immediately if the action was a mistake (the undo command is here for that), but the way the action can be reversed later if so desired. Here the way to “undo” the Dock placement remains reasonably obvious (drag it out of there), but users are going to think it gets rid of the application. Besides the fact this not the case, users will be reluctant to move applications out of the Dock for fear of not being able to find them again, and will keep them all in there. Yeah, I know, Mac OS X Lion and the Launchpad are supposed to solve that, but they’re not there yet, and in the meantime, the Mac App Store is here and users will use it. People do not get confused by however complex the system is underneath (do many even suspect that applications are in fact folders containing the binary and support files? No.), but by “innovations” that purportedly simplify some aspect of the task while leaving some or most of the complexity still visible elsewhere.

Besides the way the Mac App Store application currently works, there are issues with what the Mac App Store enables, or to be more accurate, does not enable.

For long, developers have asked for a way to update their applications as part of the (Apple Menu)→Software Update command, or to get access to the crash reports from their applications that were sent to Apple (though the issue was more that the “send the crash report to Apple” feature gave the expectation that the developer could do something about it or was notified of the issue). But I’ve always felt that this could not be done without Apple and the developer being in a tighter relationship, because of the potential spoofing and security issues that could occur; and now the Mac App Store is that relationship. However, there are Mac App Store improvements that could be given to non-Mac App Store applications (and it’s in the long-term best interest of the Mac platform that this option remains viable); for instance, it’s been a long time since distribution through a disk image was cutting edge, and Installer packages are too interaction-heavy. It’s not possible to have one-click download and installation from the web for obvious security reasons, but couldn’t Apple make available an application packaging method that can be downloaded, then, when double-clicked, would ask something close to the quarantine question, possibly show an EULA (as disk images can do), then install the app in the Application folder, show where it is, and discard the package, without any further interaction? I’m sure plenty other improvements of the sort could be made.

I also take issue with many Apple policies with the Mac App Store. Many of them are the exact same complaints developers have had about the iOS App Store (with the exception, of course, of the inability to distribute applications outside of it; developers can distribute betas and other special versions of the application however they want); while they may seem developer complaints that users shouldn’t care about, most of them disrupt the relationship between users and developers, resulting in a lose-lose situation. These issues are, among others: not inclusive enough (for instance, applications cannot install kernel extensions; why have that facility then?), no customer information whatsoever, user “review” system that gives the expectation developers can give tech support on them even though they can’t, still no “unfiltered Internet access” rating, no support for upgrade pricing, and most of all, no real support for demos/try before you buy.

That last one is the most infuriating. Apple is showing all Mac users a more practical alternative to shrink-wrapped software stores, and the policy is still that you have to buy with your eyes closed, only on the basis of a description, a few screenshots, and “reviews” that… could be better? Frak! And don’t tell me this demo business is confusing, people get to try before they buy in real life all the time: with TVs, consoles, audio systems, etc. in the electronics store; with clothes, shoes, etc.; with cars in their auto dealer, etc, etc, etc. Do I need to go on? I’ve always been suspicious of the “experience optimized for impulse buying” argument for the absence of real demos on the iOS App Store (it seems to me there are already plenty of apps at impulse buy prices, so it would be a good idea to encourage non-impulse buy prices), but here on the Mac App Store it makes no sense at all. Oh, sure, developers can distribute a demo from their web site, but it feels about as disconnected as a broken wire. This, alone, will ensure that I will rarely, if ever, buy again from the Mac App Store; not because I’m going to go out of my way to avoid using it, but because I’ll always be afraid of wasting my money on something useless, as I never buy on impulse. Practically all the downloaded software I own, I bought it after trying it, and I’m not going to start changing that now; that would be going… backwards, back at the time of brick and mortar stores, precisely those the Mac App Store is supposed to obsolete.

By the way, in case you have an iOS device and want to encourage “try before you buy”, there is a simple way: go to the “try before you buy” iTunes badge, to mark this as an iTunes store link featured group, download 10-20 of them that seem interesting to you, and try them out. That’s it, that’s all I’m asking of you: there is bound to be a few that you will like and where you will buy the complete version; this, in turn, will send Apple the message that yes, we do want to try apps before we buy them.

I’m deeply torn about the Mac App Store; not just how it currently is, but the whole principle of it. As it currently is, it works and will be used without a doubt, while having a number of issues and setting a number of bad expectations. While I have no doubt many issues will be fixed, Apple has been pretty stubborn about some of them (I mean, for how long have we been asking for a trial system on the iOS App Store?). And there needs to be life outside the Mac App Store, but Apple seems utterly uninterested in improving anything there.

It’s stunning what you learn about your iPhone while in holidays

I should go in holidays more often. You find out interesting things about your iPhone when you go outside the urban environment it was mostly designed for. I’m not ready to talk about the tests I mentioned earlier, but here’s a few other observations I can share…

First, the iPhone, or at the very least the iPhone 3GS which is the model I own, seems to have trouble getting a GPS fix in cloudy weather. If obtaining a location regardless of weather is of any real importance, a dedicated GPS device is still required.

Second, if you’re not going to have access to a power outlet for, say, one week, switch your device to airplane mode, it’s just as efficient and more practical than turning it off. Let me explain. I was for a week in remote parts of the Alps, trekking from mountain refuge to mountain refuge; even those that have some electricity from a generator are unlikely to have power outlets. Furthermore, in that environment you often don’t get any signal, or if you do it’s very weak, so a lot of battery is going to be wasted looking for a signal or maintaining a weak connection. So last year, I simply turned off my iPhone so as to conserve battery for the whole week. However, it was not very practical as I had to wait for it to boot each time I wanted to use it, which wasn’t very fast… So this year, I put it in airplane mode instead of turning it off. This turns off the radios and leaves it consuming apparently very little power; but it was ready much faster when I wanted to check whether there was any signal, or test my application, or show off something, and it still had juice at the end of the week.

Lastly, and this is more for developers: even if your app requires location, you may not always get true north if there is no data connection, so be sure to always handle the lack of true heading and fall back to the magnetic heading. As you know, a compass does not point exactly to the North Pole (in this context also referred to as geographic north), but to a point called the magnetic north pole, located somewhere in Greenland (in fact it is even more complicated than that, but let’s leave it at that). The iPhone 3GS and iPhone 4, both featuring a magnetometer, cannot give you true north using only that sensor any more than a compass can; however, if the device also has your location, it can apply the correction between magnetic north and true north directions at that point and give you the true north; this is well documented in the API documentation. However, while your location is necessary, it is not sufficient! I have observed that if I don’t have any data connection, I only get magnetic north, even if I just got a GPS fix; the device probably needs to query a remote database to get the magnetic north correction for a given point. So if your application uses heading in any way, never assume you can have true north, even if you know you have location; always handle the lack of true north and fall back to magnetic north in that case, mountain trekkers everywhere will thank you.

A few things iOS developers ought to know about the ARM architecture

When I wrote my Introduction to NEON on iPhone, I considered some knowledge about the iOS devices’ processors as assumed to be known by the reader. However, from some discussions online I have realized some of this knowledge was not universal; my bad. Furthermore these are things I think are useful for iPhone programming in general (not just if you’re interested in NEON), even if you program in high-level Objective-C. You could live without them, but knowing them will make you a better iPhone programmer.

The Basics

All iOS devices released so far are powered by processors based on the ARM architecture; as you’ll see, this architecture is a bit unlike what you may be used to on the desktop with x86 or even PowerPC. However, it is not a “special” or “niche” architecture: nearly all mobile phones (not just smartphones) are based on ARM; practically all iPods were based on ARM, as well as nearly all mp3 players; PDAs and Pocket PCs were generally ARM-based; Nintendo portable consoles are based on ARM since the GBA; it is now invading graphic calculators with some TI and HP models using it; and if you want pedigree, know the Newton was based on ARM as well (in fact, Apple was an early investor in ARM). And that’s only mentioning gadgets; countless ARM processors have shipped in unassuming embedded roles.

ARM processors are renowned for their small size on the silicon die, low power usage, and of course their performance (within their power class). The ARM architecture (at least as used in the iOS platform) is little-endian, just like x86. It is a RISC architecture, like MIPS, PowerPC, etc., and for a long time was only 32-bit, but now has a 64-bit extension called ARM64. Notice the simulator does not execute ARM code, when building for the simulator your app is compiled for x86 and executes natively, so none of the following applies when running on the simulator, you need to be running on target.

ARMv7, ARM11, Cortex A8 and A4, oh my!

The ARM architecture comes in a few different versions developed over time; each one added some new instructions and other improvements, while being backwards compatible with the previous versions. The first iPhone had a processor that implements ARMv6 (short for ARM version 6), while the latest devices have processors that can support ARMv7. So when you compile code, you specify the architecture version you’re targeting, and the compiler will restrict the instructions it generates to those available in that architecture version; the same goes for the assembler, which will check that the instructions used in the code are present in the specified architecture version. In the end, you have object code that targets a specific architecture variant, ARMv6 or ARMv7 (or ARMv5 or ARMv4, but given that ARMv6 is the baseline in iOS development, you’re very unlikely to target these); the object and executable files are in fact marked with the architecture they target, run otool -vh foo.o on one of your object or executable files sometime.

Do not confuse ARMv6 and ARMv7 with ARM6 and ARM7: the latter are two old ARM processor models (or rather, model families), while the former are ARM architecture versions. — September 25, 2011

However, it does not make sense to say that the original iPhone had “the ARMv6 processor”: ARMv6 does not designate a particular processor, but the set of instructions a processor can run, which does not imply any particular implementation. The processor core implementation used in the original iPhone was the ARM11 (it was the ARM1176JZF-S, to be really accurate, but it matters very little, just remember it was a member of the ARM11 family); as mentioned earlier, this processor implements ARMv6. Subsequent devices used ARM11 as well, up until the iPhone 3GS which started using the Cortex A8 processor core, used in all iOS devices released since then at the time of this writing (this is not yet certain, but strongly suspected, in the case of the iPhone 4). This core implements the ARMv7 instruction set, or in short, supports ARMv7.

The processor powering the iPad 2 (and likely subsequent devices) supports ARMv7 too, but is not one (or multiple) Cortex A8. It is technically still unknown what processor core it could be as I have not seen any evidence for it, but I am convinced it is two Cortex A9 cores. — September 25, 2011
Subsequent iOS devices have started using even more mysterious processors designed by Apple’s semiconductor architecture group, which are only known by their code names: Swift and Cyclone. At this point if you want to know which device uses which processor it is best to to follow the awesome iOS Support Matrix. — May 16, 2014

Now having said that, DO NOT go around and write code that detects which device your code is executing on and tries to figure out which architecture it supports using the known information on currently released devices. Besides being the most unreliable code you could write, this kind of code will break when run on a device released after your application. So please don’t do it, otherwise I swear I will come to your house and maim you. This information is just so that you have a rough idea of the installed base of devices that can support ARMv7 and the ones that can only run ARMv6; I’ll get to detection in a minute.

But you may be wondering: “I thought the iPad and iPhone 4 had an A4 processor, not a Cortex A8?!” The A4 is in fact the whole applications System on a Chip, which includes not only a Cortex A8 core, but also graphics hardware, as well as video and audio codec accelerators and other digital blocks. The SoC and the processor core on it are very different things; the processor core does not even take the majority of the space on the silicon die.

ARMv7 support on the latest devices would be pretty useless if you couldn’t take advantage of it, so you can do so, but always doing so would prevent your code from running on earlier devices, which may not be what you want. So how do you detect which architecture version a device supports so that you can take advantage of ARMv7 features if and only if they are present? The thing is, you don’t. Instead, your code is compiled twice, once targeting ARMv6, and once targeting ARMv7; the two executables are then put together in a fat binary, and at runtime the device will itself choose the best half it can support. Yes, Mach-O fat binaries are not just for grouping completely different CPU architectures (e.g. PowerPC and Intel, making a Universal Binary), or 32 and 64 bit versions of an architecture, but also two variants (cpu subtypes, in Mach-O parlance) of the same architecture. The outcome is that from the viewpoint of the programmer, everything gets decided at compile time: the code compiled targeting ARMv6 will only ever run on ARMv6 devices, and the code compiled targeting ARMv7 will only ever run on ARMv7 (or better).

If you’ve read my NEON post, you may remember that in that post I also suggested a way to do the detection and selection at runtime. If you check now, you’ll notice I have actually removed that part, and now recommend you do not use that method any more.1 This is because while it does work, it was impossible (or at the very least too tricky to implement to do so without any error) to ensure that the code would keep working the day it is run on a future ARMv8 processor. The fact the documented status of that API is unclear doesn’t help, either (its man page isn’t in the iOS man pages). You should exclusively use compile-time decision and fat executables if you want to run on ARMv6 and take advantage of ARMv7.

Note that I haven’t covered here why you would want to take advantage of ARMv7; I have now done so in a new post — September 25, 2011

One last note on the subject: in the context of iOS devices, the ARM architecture versions do not strictly mean what they mean for ARM processors in general. For instance, iOS code that requires ARMv6 actually requires support for floating-point instructions as well (VFPv2, to be accurate), which is an optional part of ARMv6, but has been present since the original iPhone, so when ARMv6 is mentioned in iOS development (e.g. as a compiler -arch setting or as the cpu subtype of an executable) hardware floating-point support is implied. The same goes for ARMv7 and NEON: NEON is actually an optional part of the ARMv7-A profile, but NEON has been present in all iOS devices supporting ARMv7, so when developing for iOS NEON is considered part of ARMv7.

So to sum it up:

  • the first devices had an ARM11 processor, which implements ARMv6,
  • devices from the iPhone 3GS onwards have a Cortex A8 processor, which implements ARMv7,
  • some of these feature the so-called A4 “processor”, in fact a SoC,
  • the iPad 2 does not have a Cortex A8, but likely two Cortex A9, which too implement ARMv7

— September 25, 2011

ARMv7s update:
The iPhone 5 introduced a new architecture variant called ARMv7s; ARMv7s is to ARMv7 what ARMv7 was to ARMv6. ARM does not define ARMv7s, it is purely an Apple term and means what we know as ARMv7 plus VFPv4, Advanced SIMDv2 and integer division in hardware (and that’s it for unprivileged instructions as far as anyone can tell). The newly introduced processor core which implements ARMv7s is currently a complete mystery.

Also, it did not even take ARMv8, as ARMv7s indeed did undo my runtime detection method in an old internal project where I was still using it: the code acted as if the device did not support ARMv7 and NEON… when run on an iPhone 5. — December 11, 2012

ARM64 update:
With the iPhone 5S Apple has now introduced ARMv8, the first ARM architecture version that supports a new 64-bit instruction set, called ARM64. It is hard to compare it to the upgrade from ARMv6 to ARMv7: while the latter brought NEON, ARM64 is a whole instruction set which is in some aspects completely new and in others quite familiar, as I explored in this post, and I will mention these differences whenever appropriate — May 16, 2014

Conditional Execution

A nifty feature of the ARM architecture is that most instructions can be conditionally executed: if a condition is false, the instruction will have no effect. This allows short if blocks to be implemented more efficiently: while the usual method is to jump to after the block if the condition is false, instead the instructions in the block are conditionalized, saving one branch.

Now I wouldn’t mention this if it was just a feature the compiler uses to make the code more efficient; while it is that, I’m mentioning it because it can be surprising when debugging. Indeed, you can sometimes see the debugger go inside if blocks whose conditions you know are false (e.g. early error returns), or go in both sides of an if-else! This is because the processor actually goes through that code, but some parts of it aren’t actually executed, because they are conditionalized. Moreover, if you put a breakpoint inside such an if block, it may be hit even if the condition is false!

That being said, it seems (in my limited testing) that the compiler avoids generating conditionally executed instructions in the debug configuration, so it should only occur when debugging optimized code; unfortunately sometimes you have no choice but to do so.

ARM64 all but eliminated conditional execution (only a few simple instructions can still behave conditionally); I would be surprised to see any if block, even a trivial one, being executed as conditional instructions in ARM64 code. — May 16, 2014

Thumb

The Thumb instruction set is a subset of the ARM instruction set, compressed so that instructions take only 16 bits (all ARM instructions are 32 bits in size; Thumb is still a 32-bit architecture, just the instructions take less space). It is not a different architecture, rather is should be seen as a shorthand for the most common ARM instructions and functionality. The advantage, of course, is that it allows an important reduction in code size, saving memory, cache, and code bandwidth; while this is more useful in microcontroller type applications where it allows hardware savings in the memory used, it is still useful on iOS devices, and as such it is enabled by default in Xcode iOS projects. The code size reduction is nice, but never actually reaches 50% as sometimes two Thumb instructions are required for one equivalent ARM instruction. ARM and Thumb instructions cannot be freely intermixed, the processor needs to switch mode when going from one to the other; this can only occur when calling or returning from a function. So a function has to either be Thumb or ARM as a whole; in practice you do not control whether code is compiled for Thumb or ARM at the function granularity but rather at the source file granularity.

When targeting ARMv6, compiling for Thumb is a big tradeoff. ARMv6 Thumb code has access to fewer registers, does not have conditional instructions, and in particular cannot use the floating-point hardware. This means for every single floating-point addition, subtraction, multiplication, etc., floating-point Thumb code must call a system function to do it. Yes, this is as slow as it sounds. For this reason, I recommend disabling Thumb mode when targeting ARMv6. If you do leave it on, make sure you profile your code, and if some parts are slow you should first try disabling Thumb at least for that part (easy with file-specific compiler flags in Xcode, use -mno-thumb). Remember that floating-point calculations are pretty common on iOS since Quartz and Core Animation use a floating-point coordinate system.

When targeting ARMv7, however, all these drawbacks disappear: ARMv7 contains Thumb-2, an extension of the Thumb instruction set which adds support for conditional execution and 32-bit Thumb instructions that allow access to all ARM registers as well as hardware floating-point and NEON. It’s pretty much a free reduction in code size, so it should be left on (or reenabled if you disabled it); use conditional build settings in Xcode to have it enabled for ARMv7 but disabled for ARMv6.

To add a conditional build setting, follow Jeff Lamarche’s instructions in this post (last paragraph) for Xcode 3. For Xcode 4, you should follow Apple’s instructions, with the caveat that you can only add a conditional setting once you have selected the setting in a particular configuration (the menu item will be greyed out otherwise), so yes, you have to do it once for Debug, once for Release, and once for Distribution. — September 25, 2011

In ARM documentation and discussions on the Internet you may find mentions that code needs to be “interworking” to use Thumb; unless you write your own assembly code, you don’t have to worry about it as all code is interworking in the iOS platform (interworking refers to a set of rules that allow functions compiled for ARM to directly call functions compiled for Thumb, and the converse, without any problem and transparently as far as the C programmer is concerned). When displaying assembly, Shark or the Time Profile instrument may have trouble figuring out whether the function is ARM or Thumb, if you see invalid or nonsensical instructions, you may need to switch from one to the other.

ARM64 is a new instruction set where all instructions are also 32 bits in size, and which does not have the equivalent of Thumb mode; said another way, there is no 64-bit Thumb. — May 16, 2014

Alignment

In the iOS platform unaligned accesses are supported; however, they are slower than aligned accesses, so try and avoid them. In some particular cases (those involving load/store multiple instructions, if you’re interested), unaligned accesses can be a hundred times slower than aligned accesses, because the processor cannot handle these and has to ask the OS for assistance (read this article, it’s the same phenomenon that on PowerPC causes unaligned doubles to be so much slower). So be careful, alignment still matters.

Division

This one always surprises everyone. Open the ARM architecture manual (if you don’t already have it, see Introduction to NEON on iPhone, section Architecture Overview) and try and find an integer division instruction. Go ahead, I’ll wait. Can’t find it? It’s normal. There is none. Yes, the ARM architecture has no hardware support for integer division, it must be performed in software. If you compile the following code:

int ThousandDividedBy(int divisor)
{
    return 1000/divisor;
}

to assembly, you’ll see the compiler has inserted a call to a function called “___divsi3”; it’s a system function that implements the division in software (notice the divisor should be non-constant, as otherwise the division is likely to be turned into a multiplication). This means that on iOS, integer division is actually an operating system benchmark!

“But!” you may say, having finally returned victorious from the ARM manual “You’re wrong! There is an ARM division instruction, even two! Here, sdiv and udiv!” Sorry to rain on your parade, but these instructions are only available in the ARMv7-R and ARMv7-M profiles (real-time and embedded, respectively – think motor microcontrollers and wristwatches), not in ARMv7-A which is the profile that the iOS devices that have ARMv7 do support. Sorry!

Hey, what’s this? If you compile that code for architecture ARMv7s… a sdiv instruction gets generated! Yep, ARMv7s does include support for integer division in hardware, but that only applies for new devices, existing devices are still unable to do so and will keep using the ARMv7 code and divide in software, so don’t just go division-happy yet. — December 11, 2012
ARM64 also has integer division in hardware; in fact now the ARM architecture document does list the sdiv and udiv instructions as an (optional) part of the ARMv7-A profile. — May 16, 2014

GCC

It’s not a secret that the ARM code generated by GCC is considered to be crap. On other ARM-based platforms professional developers use the toolchain provided by ARM itself, RVDS, but this is not an option on the iOS platform as RVDS doesn’t support the Mach-O runtime used by OS X, only the ELF runtime. But there is at least an alternative to GCC, as now LLVM can be used for iOS development. While I’ve not tested it much, I have at least seen nice improvements on 64-bit integer code in ARM32 (a particularly weak point of GCC on ARM) when using LLVM. Hopefully, LLVM will prove to be an improvement over GCC in all domains.

Apple no longer provides GCC for iOS development (or Mac development, for that matter), so this is a moot point today. — December 11, 2012
~

There, you’re now a better iOS developer!


  1. While I’m at it, I should mention I also added information about __ARM_NEON__, and fixed a few typos. I’ll need to set a nice way to let people know when my posts are updated, as I’d like to keep them current, or at the very least problem-free. For now, this ad hoc system will do.

Introduction to NEON on iPhone

A sometimes overlooked addition to the iPhone platform that debuted with the iPhone 3GS is the presence of an SIMD engine called NEON. Just like AltiVec for PowerPC and MMX/SSE for x86, this allows multiple computations to be performed at once on ARM, giving an important speedup to some algorithms, on condition that the developer specifically codes for it.

Given that, among the iPhone OS devices, only the iPhone 3GS and the third-gen iPod Touch featured NEON up until recently, it typically wasn’t worth the effort to work on a NEON optimization since it would benefit only these devices, unless the application could require running on one of them (e.g. because its purpose is to processes videos recorded by the iPhone 3GS). However, with the arrival of the iPad it makes much more sense to develop NEON optimizations. In this post I’ll try to give you a primer on NEON.

As of this writing, the latest release of iOS, 4.3, only supports NEON-powered devices, so it’s no longer a stretch to support only NEON devices and forget the iPhone 3G. — September 25, 2011

Before we begin, a bit of setup

This instruction set extension is actually called Advanced SIMD, it is an optional extension to the ARMv7-A profile; however, when ARMv7 is mentioned the presence of the Advanced SIMD extensions is generally implied, unless mentioned otherwise. “NEON” actually refers to the implementation of this extension in ARM processors, but this name is commonly used to designate these SIMD instructions as well as the processor feature.

So far, Apple has only shipped one processor core featuring NEON: the Cortex A8, in three devices: the iPhone 3GS, the third-gen iPod Touch, and the iPad. This means that besides the general details of NEON that are valid whichever the implementation, the details of NEON as implemented in the Cortex A8 are of high interest to us since it’s the only processor core which will run iPhone OS NEON-optimized code at the moment.

With the iPad 2 this is no longer the case: it supports NEON but does not have a Cortex A8; however the Cortex A8 timings should still be considered the reference, at least on iOS. — September 25, 2011
At this point I am only keeping the Cortex A8 information in this post for historical interest, as it is no longer the reference four years later — May 16, 2014

Do You Need It?

iPhone OS devices (even the iPad) are pretty focused. The user is unlikely to want to perform scientific calculations, simulation, offline rendering, etc. on these devices, so the main applications for NEON here are multimedia and gaming. Then, see if the work hasn’t already been done for you in the Accelerate framework, new in iPhone OS 4.0, for you to use when the update lands. However, iPhone OS 4.0 won’t make it to the iPad until this fall, and of course if the algorithm you need isn’t in Accelerate, it’s not going to be of any help to you; in these cases, then it makes sense to write your own NEON code.

“this fall” was fall 2010, of course; it’s no longer an issue today. — September 25, 2011

As always when optimizing, make sure you don’t optimize the wrong thing. When doing multimedia work, hunches about what is taking up processor time are less likely to be wrong than for general applicative work, but you should still first write the regular C version of the algorithm, then profile your code using Shark, in order to know the one part which is taking up the most time, and consequently is to be tackled first (and then the second most, etc.). Then benchmark your improvements to make sure you did improve things (It’s unlikely for the NEON version to be slower than scalar code, but regressions can happen when trying to improve the NEON code).

That would be the Time Profile Instrument, today (rest well, Shark, my friend). — September 25, 2011

Detection and Selection

Obviously, devices earlier than the iPhone 3GS are not going to be able to run NEON code, so there needs to be a way to select a different code path depending on which device is executing the code; even if you only target, say, the iPad, there needs to be a way to let it be know the code requires NEON, so that it does not accidentally run on a non-NEON device.

This way is to build the application for both ARMv6 and ARMv7, and, using #ifdefs, disable at compile time the undesired code (the NEON code when building for ARMv6, the non-NEON fallback when building for ARMv7; __ARM_NEON__ will be defined if NEON is available); if you target only ARMv7 devices, then only build for ARMv7 (in that case, don’t forget to add armv7 in the UIRequiredDeviceCapabilities of your Info.plist). When your application is run, the device will automatically pick the most capable half it can: the ARMv7/NEON one for ARMv7-capable devices, the ARMv6 half, if present, otherwise. The drawback is that, unless you target only ARMv7 devices, the whole application code size (not just the code for which there is a NEON version) will end up twice in the final distribution: once compiled for ARMv6, and once for ARMv7. As executable code is typically a small portion of the application size, this is not a problem for most applications, but it is a problem for some of them.1

Don’t forget that you’ll need to disable the NEON code at compile time when building for the simulator, as your application is compiled for x86 when targeting the simulator, and NEON code will cause build errors in this context. This means you always need to also write a generic C version of the algorithm, even if you only target the iPad, or you won’t be able to run your application in the simulator.

Development

There are actually different ways you can create NEON code. You can program directly in assembly language; it does require good knowledge of ARM assembly programming, so it’s not for everyone, and (even if NEON programming isn’t for everyone, either) now given the differences ARM64 brings I do not recommend writing new NEON code in assembly going forward. It is better to use C and compiler intrinsics (which you get by including arm_neon.h), leaving the compiler to worry about register allocation, assembly language details, etc.

ARM mentions a third way, which is to let the compiler auto-vectorize, but I don’t know how well it works, or if it is even enabled in the iPhone toolchain, so I’m not going to reference it here.

I’ve only used assembly programming so far, so I will generally describe techniques under assembly terms, but everything should be equally applicable when programming with intrinsics, just with a different syntax. There is one important thing, however, which is applicable only when you program in assembly (you don’t have to worry about that when using intrinsics): make sure you save and restore d8 to d15 if you use them in your ARM32 NEON function, as the ABI specifies that these registers must be preserved by function calls. If you don’t, it may work fine for a while, until you realize with horror that the floating-point variables in the calling functions become corrupted, and all hell breaks loose. So make sure these registers are saved and restored if they are used.

Architecture Overview

Now we get to the meat of the matter. To follow from home you will need the architecture document from ARM describing these instructions; fortunately, you already have it. You see, Shark has this feature where you can get help on a particular assembler instruction, which it provides simply by opening the architecture specification document at the page for this instruction. While for PowerPC and Intel, the document bundled with Shark is a simple list of all the instructions, for ARM it is actually a fairly complete subset of the ARM Architecture Reference Manual (some obsolete information is omitted). Rather than open it through Shark, you can open it directly in Preview by finding the helper application (normally at /Library/Application Support/Shark/Helpers/ARM Help.app), right clicking->Show Package Contents, and locating the PDF in the resources.

Shark is gone in Xcode 4, but this document has merely moved, these days it is in <Xcode 4 install folder>/Library/PrivateFrameworks/DTISAReferenceGuide.framework/Resources/ARMISA.pdf — September 25, 2011
Don’t you just love stuff that keeps changing place? Xcode 4.3 and later should have this document in Xcode.app/Contents/Applications/Instruments.app/Contents/Frameworks/DTISAReferenceGuide.framework/Versions/A/Resources/ARMISA.pdf — May 22, 2012
Of course it has moved again by now, but the version in Xcode 5 has become obsolete anyway (as of this writing it still hasn’t been updated for integer division in hardware, which shipped in iOS devices in September 2012), so you should get it directly from the source from now on (don’t worry about the “registered ARM customers” part, you only need to create an account and agree to a license in order to download the PDF) — May 16, 2014

I will sometimes reference AltiVec and/or SSE, as these are the SIMD instruction set extensions that iPhone developers are most likely to be familiar with, and there is already an important body of information for both architectures online; less so for NEON.

Programmer Model

As you know, SIMD (Single Instruction Multiple Data) architectures apply the same operation (multiplication, shift, etc.) in parallel to a number of elements; the range of elements to which an operation is applied is called a vector. In modern SIMD architectures, vectors have a constant size in number of bits, and contain a different number of elements depending on the element size being operated on: for instance, in both AltiVec and SSE vectors are 128 bits, if the operation is on 8-bit elements for instance, then it operates on 16 elements in parallel (128/8); if the operation is on 16-bit elements, then each vector contains 8 of them (128/16).

Under ARM32 mode in NEON you have access to sixteen 128-bit vector registers, named q0 to q15; this is the same vector size as AltiVec and SSE, and there are twice as many registers as 32-bit SSE, but half as many as AltiVec. However, this register file of 16 Q (for Quadword) vectors can also be seen as thirty-two 64-bit D (Doubleword) vectors, named d0 to d31: each D register is one half, either the low or high one, of a Q register, and conversely each Q register is made up of two D registers; for instance, d12 is the low half of q6, and q3 is at the same location as the d6-d7 pair. Most instructions that do not change the element size between input and output can operate on either D or Q registers; instructions that narrow elements as part of their operation (e.g. a 16-bit element on input becomes a 8-bit element on output) take Q vectors as input and have a D vector as output; instructions that enlarge elements as part of their operation take D vectors as input and have a Q vector as output. There are, however, a few instructions that only work on D vectors: the vector loads, vector stores, vpadd, vpmax, vpmin, and vtbl/vtbx; while for some operations this matters very little (e.g. to load two Q vectors, you use a load multiple instruction with a list containing the four matching D vectors), in the other cases this means the operation must be done in two steps to operate on a Q vector, once on the lower D half and once one the higher D half.

This D/Q duality provides a consistent way to handle narrowing and widening operations, instead of the somewhat ad hoc schemes used by AltiVec and SSE (e.g. for full precision multiplies on AltiVec you multiply the even elements, then the odd elements of a vector; on SSE you obtain the low part of the multiplication, then the high part). It also makes it easier to manage the narrowed vector prior to an operation that uses one, or after an operation that produces one. It does make it a bit tricky to ensure that you don’t overwrite data you want to keep, as you must remember not to use q10 if you want to keep the data in d20.

In ARM64 mode NEON has been extended to thirty-two 128-bit vector registers, named q0 to q31, and most of the limitations above have become irrelevant. However, the relationship with the 64-bit sized registers has changed, and I am not going to expand on it as I now recommend you do not program in NEON with ARM64 assembly, and exclusively use intrinsics instead; I’ll expand on why in a minute. — May 16, 2014

The register file is shared with the floating-point unit, but NEON and floating-point instructions can be freely intermixed (provided, of course, that you share the registers correctly), contrary to MMX/x87.

A reminder: the ARM architecture (at least as used in the iPhone) is little-endian; remember that when permuting or otherwise manipulating the vector elements.

Architectural features

NEON instructions typically have two input and one separate output register, so calculations are generally non-destructive. There are three-operand instructions like multiply-add, but in that case there is no separate output register, in the case of multiply-add for instance the addition input register is also the output. A bit less usual is the fact some ARM32 instructions, like vzip, have two output registers.

Some instructions take a list of registers. In ARM32 mode it must be a list of consecutive D registers, though in some cases there can be a delta of two between registers (e.g. {d13, d15, d17}), in order to support data in Q registers.

Some instructions can take a scalar as one of their inputs. In that case, the scalar is used for all operations instead of having the corresponding elements of the two vectors be matched. For instance, if the vector containing a, b, c, d is multiplied by the scalar k, then the result is k×a, k×d, k×c, k×d. Scalars can also be duplicated to all elements of a vector, in preparation for operations that do not support scalar operands. Scalars are specified by the syntax dn[x], where x is the index of the scalar in the dn vector.

Syntax

All NEON instructions, even loads and stores, begin by “V” in ARM32 mode, which makes them easy to tell apart, and easy to locate in the architecture documentation (in ARM64 they no longer begin by “V”, but are now listed separately in the documentation). Instructions can have one (or more) letter just after the V which acts as a modifier: for instance “Q” means the instruction saturates, “R” that it rounds, and “H” that it halves. Not all combinations are possible (far from it), but this gives you an indication when reading the instruction of what it does.

In ARM32 mode practically all instructions need to take a suffix telling the individual size and type of the elements being operated on, from .u8 (unsigned byte) to .f32 (single-precision floating-point): for instance, vqadd.s16. If the element size changes as part of the operation, the prefix indicates the element size of the narrowest input. Some instructions only need the element type to be partially specified, for instance vadd can operate on .i16, as it only needs to know that the elements are integers (and their sizes), not whether they are signed or unsigned; some instructions even only need the element size (e.g. .32). However, always use the most specific data type you can, for instance if you’re currently operating on u32 data, then specify .u32 even for instructions that would just as well accept .32: the assembler will accept it and it will make your code clearer and easier to type-check. This part of the syntax has changed for ARM64, but ARM64 NEON assembly syntax is not going to be covered in this post.

A little historical note: NEON instructions used to have a different syntax, and some of them changed names. Notably, instructions where the element size changes took two prefixes, with the size after and before (e.g. .u32.u16). This is something you may still see in disassembly, for instance. And, for some reason, the iPhone SDK assembler only accepts instruction mnemonics in lowercase, so while the instructions are uppercase in the documentation, be sure to write them in lowercase.

Instruction Overview

It is way out of the scope of this blog post to provide a full breakdown of the NEON capabilities (like, for instance, this classic Apple document does for SSE). I will just give you a quick rundown of each major area to get you started, after that the documentation should be enough.

Load and Stores

NEON of course has a vector load and a vector store instructions, and even has vector load/store multiple instructions. However, these instructions are typically only used for saving/restoring registers on the stack and loading constants, both places where you can easily guarantee alignment, as these instructions demand word alignment. To load and store the data from the streams you will be operating on, you will typically use vld1/vst1; these instructions handle unaligned access, and the element size they take in fact pretty much acts as an alignment hint: they will expect the address to be aligned to a multiple to the element size.

Much more interesting are the vld#/vst# instructions. These instructions allow you to deinterlace data when loading, and reinterlace it when storing; for instance if you have an array of floating-point xyz structures, then with vld3.f32 you will have a few x data neatly loaded into one vector register, with another vector register containing the y and a third the z. Even for two-element (e.g. stereo audio) or four-element (e.g. 32-bit RGBA pixels) interlaced data, it avoids you the temptation to operate on non-uniform data, instead everything is neatly segregated in its own register (one register holds all left, one register holds all alpha, etc.). Notice that in ARM32 mode (in ARM64 mode they can directly load in the Q vectors) these have the option to operate on non-consecutive registers of the form {<Dd>, <Dd+2>, <Dd+4>, <Dd+6>}, so that you can load/store Q registers using two of these instructions (one filling the low D halves, one the high D halves).

These instructions can also load one data or one structure to a particular element of a register, so scatter loading (not that it should be abused) is even easier than with SSE; you can also load the same data to all elements of the register directly.

In NEON you can’t really do software realignment, as just like SSE there is no real support for this (vext looks tempting, until you realize the amount is an immediate, fixed at compile time). By starting with a few scalar iterations, you may be able to align the output stream to a multiple of the vector size; however the other streams are typically not aligned to a vector boundary at this point, so use unaligned accesses for everything else.

Permutation

NEON does have a real permute with vtbl/vtbx, however it doesn’t come cheap. Loading a Q vector with permuted data from two Q vectors, which is the equivalent of an AltiVec vperm instruction, will require issuing two instructions which will take 6 cycles in total on a Cortex A8, so save this for when it’s really worth it.

For permutations known at compile time, you should be able to combine the various permutation instructions to do your bidding: vext, vzip/vuzp, vtrn and vrev; vswp can be considered a permutation instruction too, it can serve as the equivalent of vtrn.64 and .128 (which don’t exist) for instance. vzip in particular acts a bit like AltiVec merge, though the fact that in ARM32 mode it overwrites its inputs makes it slightly unwieldy. Don’t forget you can use the structured load/store instructions to get the data in the right place right as you load it, instead of permuting it afterwards.

Comparisons and Boolean operators

There are a variety of comparison instructions, including ones which compare directly to 0; as is customary, if the comparison is true, the corresponding element is set to all ones, otherwise to all zeroes. There is the usual array of boolean operators to manipulate these masks, plus some less usual ones such as XOR, AND with one operand complemented, and OR with one operand complemented. Oh, and there is a select instruction (in fact, three, depending on which input you want to overwrite) to make good use of these masks.

Floating-Point

NEON floating-point capabilities are very similar to AltiVec. To wit:

  • just like AltiVec, only single-precision floating-point numbers are available
  • by default, denormals are flushed to 0 and results are rounded to nearest
  • only estimates to reciprocal square root and reciprocal are given, refinement steps are necessary to get the IEEE correct result
  • there are single-instruction multiply-add and multiply-substract

This is not to say there is no difference, however:

  • contrary to AltiVec, in ARM32 mode multiply-add seems to be UNfused, there is apparently an intermediate rounding step
  • probably related to the previous point, refinement step instructions for reciprocal square root and reciprocal are provided.
  • denormal handling cannot be turned on
  • conversion to integer necessarily rounds towards 0
  • there are no log and 2x estimates
  • some horizontal NEON operations are available, as well as absolute differences.
Some iOS devices now do support new fused multiply-add (and multiply-subtract) NEON instructions, as part of an instruction set extension known as Advanced SIMDv2 which is included in what Apple calls ARMv7s. As of this writing, however, ARMv7s-supporting devices are still new and likely not common enough yet to justify writing a second algorithm that would take advantage of fused multiply-add. — December 11, 2012

In ARM64 mode NEON now supports operating on double-precision floating-point numbers, which could be very useful for some people, just as could be the varied improvements, in particular to the rounding modes (which I believe are now fully IEEE compliant), making ARM64 floating-point NEON much closer to SSE than to AltiVec. However, for most developers there is not enough ARM64-supporting hardware to justify taking advantage of these improvements yet.

Also, multiply-add is always fused in ARM64 mode, so those of you who actually need an unfused multiply then add (you know who you are) will need to use a multiply instruction followed by an add instruction. — May 16, 2014

Integer

I’d qualify the integer portions of NEON as very nimble. For instance, you can right shift at the same time as you narrow, allowing you to extract any subpart of the wider element, and not only that, but you can round and saturate at the same time as you extract, including unsigned saturation from signed integer; very useful for extracting the result from an intermediate fixed-point format. The other way round, you can shift left and widen to get to that intermediate format. Simple left shifts can also saturate; without this, extracting bitfields with saturation is really unwieldy. There are also shift right and insert which allow to efficiently pack bit formats, for instance.

The multiplications are okay, I guess, though vq(r)dmulh is pretty much your only choice for a multiplication that does not widen and is useful for fixed-point computations, so better learn to love it.

Miscellaneous

Though not part of NEON, I should mention the pld (preload) instruction, which has been here since ARMv5TE, as such memory hint/prefetch instructions are often closely associated with SIMD engines. Architecturally, the effect of the pld instruction is not specified, the action performed depends on the processor. On the Cortex A8, this instruction causes the cache line containing the address to be loaded in the L2 cache; on the Cortex A8 the NEON unit directly loads from the L2 cache. If you do use pld, make sure to benchmark the performance before and after, as it can slow down things if used incorrectly.

In general, the documentation from ARM is of good quality, however there are not many figures to explain the operation of an instruction, so you should be ready to read accurate but verbose pseudocode if you’re unsure of the operation of an instruction or need to check it does what you think it does.

ARM64 considerations

Besides the particulars mentioned earlier, it is important to remember that in ARM64 mode the NEON register file is significantly different (with resulting changes to some instructions), meaning that existing NEON assembly code will need to be overhauled before it can work in ARM64 mode, and new assembly NEON code will need to be written twice.

There is no avoiding it: q3 is now no longer related to d6, but is now related to d3, while the high half of q3 can now only be accessed in specific ways, so for existing NEON assembly code you will need to revisit all of it for these kind of assumptions when writing the ARM64 version, and for new code I do not believe it is reasonably feasible, even with macros, to write a single NEON assembly source that would work in both modes.

This is why I recommend foregoing NEON assembly from now on, and writing new NEON code exclusively using compiler intrinsics, where you can easily write one version that compiles to both modes, freeing you from having to worry about register allocation in either mode; I have documented my experience porting to the NEON intrinsics (it’s surprisingly straightforward) in this new post. — May 16, 2014

Cortex A8 implementation of NEON

Remember that at this point, the Cortex A8 informations in this post only have, or pretty much only have historical interest, four years later. Unfortunately, to the best of my knowledge ARM does not document the Cortex A9 timings, and Apple documents even less those of Swift and Cyclone, so we have no reference to base ourselves on these days. — May 16, 2014

The Cortex A8 implements NEON and VFP as a coprocessor “located” downstream from the ARM core: that is, NEON and VFP instructions go through the ARM pipeline undisturbed, before being decoded again by the NEON unit and going through its pipeline. This has a few consequences, the main one being that, while moving a value from an ARM register to a NEON/VFP register is fast, moving a value from a NEON/VFP register to an ARM register is very slow, causing a 20 cycle pipeline stall.

On the Cortex A8, most NEON instructions execute with a single cycle throughput; however, latencies are typically 2 or more, so directly using the result of the previous instruction will cause a stall; try to alternate operation on two parallel data to maximize throughput. Some NEON instructions that operate on Q vectors will execute in two cycles while they take only one when operating on D vectors, as if they were split in two instructions that operate on the two D halves (and in fact this is probably what happens); not much you can do about it, just something to know (not really unusual, remember that up until the Core 2 duo, Intel processors could only execute SSE instruction by breaking them in two since key internal data paths were only 64-bit wide, so all 128-bit instructions took two cycles). However, vzip and vuzp on Q vectors actually take 3 cycles instead of 1 for D vectors, since when operating on Q vectors the operation can not be reduced to two operations of D vectors.

The Cortex A8 has a second NEON pipeline for load/store and permute instructions, so these instructions can be issued in parallel with non-permute instructions (provided there is no dependency issue). Remember the Cortex A8 (and its NEON unit) is an in-order core, so it is not going to go fetch farther instructions in the instruction stream to extract parallelism: only instructions next to each other can be issued in parallel. Notice that duplicating a scalar is considered a permute instruction, so provided you do so a bit before the result is needed, the operation is pretty much free.

One last consideration from a recent ARM blog post is that you shouldn’t use “regular” ARM code to handle the first and last scalar iterations, as there is a penalty when writing to the same area of memory from both NEON and ARM code; even scalar iterations should be done with NEON code (which should be easy with single element loads and stores).

Oh, you mean I have to conclude?

My impression of NEON as a whole so far is that it seems a capable SIMD/multimedia architecture, and stands the comparison with AltiVec and SSE. It doesn’t have some things like sum of absolute differences, and there are probably some missing features I haven’t noticed yet, so it still has to grow a bit to reach parity, but it is already very useful.


  1. In an earlier version I proposed a second way, however I have since removed it due to various concerns; for the reasons, please see A few things iOS developers ought to know about the ARM architecture, section “ARMv7, ARM11, Cortex A8 and A4, oh my!”