Alleluia!

(Sudden breakout of Haendel’s Alleluia Chorus)
Alleluia! Alleluia! Alleluia, Alleluia, Alleluuuuia!

(Now, it doesn’t solve everything debatable about the iOS App Store, we’re mostly back to before the changes made in april: it’s a bit better than that as now sealed interpreted scripts are explicitly allowed, while back then they were in a limbo; also, sealed frameworks are no longer forbidden, which may be useful in rare cases. Many issues still remain, but, while I disagree with some things, I can live with them and it’s possible to have a healthy debate about them; the terms introduced in april, however, were completely foolish. The App Store Review Guidelines document is a welcome change, as well.)

It’s stunning what you learn about your iPhone while in holidays

I should go in holidays more often. You find out interesting things about your iPhone when you go outside the urban environment it was mostly designed for. I’m not ready to talk about the tests I mentioned earlier, but here’s a few other observations I can share…

First, the iPhone, or at the very least the iPhone 3GS which is the model I own, seems to have trouble getting a GPS fix in cloudy weather. If obtaining a location regardless of weather is of any real importance, a dedicated GPS device is still required.

Second, if you’re not going to have access to a power outlet for, say, one week, switch your device to airplane mode, it’s just as efficient and more practical than turning it off. Let me explain. I was for a week in remote parts of the Alps, trekking from mountain refuge to mountain refuge; even those that have some electricity from a generator are unlikely to have power outlets. Furthermore, in that environment you often don’t get any signal, or if you do it’s very weak, so a lot of battery is going to be wasted looking for a signal or maintaining a weak connection. So last year, I simply turned off my iPhone so as to conserve battery for the whole week. However, it was not very practical as I had to wait for it to boot each time I wanted to use it, which wasn’t very fast… So this year, I put it in airplane mode instead of turning it off. This turns off the radios and leaves it consuming apparently very little power; but it was ready much faster when I wanted to check whether there was any signal, or test my application, or show off something, and it still had juice at the end of the week.

Lastly, and this is more for developers: even if your app requires location, you may not always get true north if there is no data connection, so be sure to always handle the lack of true heading and fall back to the magnetic heading. As you know, a compass does not point exactly to the North Pole (in this context also referred to as geographic north), but to a point called the magnetic north pole, located somewhere in Greenland (in fact it is even more complicated than that, but let’s leave it at that). The iPhone 3GS and iPhone 4, both featuring a magnetometer, cannot give you true north using only that sensor any more than a compass can; however, if the device also has your location, it can apply the correction between magnetic north and true north directions at that point and give you the true north; this is well documented in the API documentation. However, while your location is necessary, it is not sufficient! I have observed that if I don’t have any data connection, I only get magnetic north, even if I just got a GPS fix; the device probably needs to query a remote database to get the magnetic north correction for a given point. So if your application uses heading in any way, never assume you can have true north, even if you know you have location; always handle the lack of true north and fall back to magnetic north in that case, mountain trekkers everywhere will thank you.

Giving In

I’ve tried very hard to avoid agreeing to section 3.3.1.

I had the foresight to renew my provisioning profile April the 21st, just before I would have had to accept the updated agreement to access to provisioning portal. This has allowed me to keep developing so far.

Even so, provisioning profiles only last three months. So as I predicted, I’m now unable to load new builds on my device, or even run already loaded applications that I signed, since the 21st of July.

Tomorrow I will be going in holidays, to a place that will be ideal to try out an iPhone application prototype I developed and validate its concept in the field.

Because even though the iPhone Developer Program License Agreement and the iOS App Store may be a pain in the body part of choice to deal with, the iPhone is still an amazingly integrated piece of hardware, more importantly underlying an excellent platform which is, IMNSHO, still way ahead of the competition.

In order to try out that prototype application, I need to be able to run it in the first place.

To run it, I need a renewed provisioning profile.

To get a renewed provisioning profile, I need access to the provisioning portal.

To get access to the provisioning portal, I need to accept the updated agreement.

I’m sorry, everyone.

A few things iOS developers ought to know about the ARM architecture

When I wrote my Introduction to NEON on iPhone, I considered some knowledge about the iOS devices’ processors as assumed to be known by the reader. However, from some discussions online I have realized some of this knowledge was not universal; my bad. Furthermore these are things I think are useful for iPhone programming in general (not just if you’re interested in NEON), even if you program in high-level Objective-C. You could live without them, but knowing them will make you a better iPhone programmer.

The Basics

All iOS devices released so far are powered by processors based on the ARM architecture; as you’ll see, this architecture is a bit unlike what you may be used to on the desktop with x86 or even PowerPC. However, it is not a “special” or “niche” architecture: nearly all mobile phones (not just smartphones) are based on ARM; practically all iPods were based on ARM, as well as nearly all mp3 players; PDAs and Pocket PCs were generally ARM-based; Nintendo portable consoles are based on ARM since the GBA; it is now invading graphic calculators with some TI and HP models using it; and if you want pedigree, know the Newton was based on ARM as well (in fact, Apple was an early investor in ARM). And that’s only mentioning gadgets; countless ARM processors have shipped in unassuming embedded roles.

ARM processors are renowned for their small size on the silicon die, low power usage, and of course their performance (within their power class). The ARM architecture (at least as used in the iOS platform) is little-endian, just like x86. It is a RISC architecture, like MIPS, PowerPC, etc., and for a long time was only 32-bit, but now has a 64-bit extension called ARM64. Notice the simulator does not execute ARM code, when building for the simulator your app is compiled for x86 and executes natively, so none of the following applies when running on the simulator, you need to be running on target.

ARMv7, ARM11, Cortex A8 and A4, oh my!

The ARM architecture comes in a few different versions developed over time; each one added some new instructions and other improvements, while being backwards compatible with the previous versions. The first iPhone had a processor that implements ARMv6 (short for ARM version 6), while the latest devices have processors that can support ARMv7. So when you compile code, you specify the architecture version you’re targeting, and the compiler will restrict the instructions it generates to those available in that architecture version; the same goes for the assembler, which will check that the instructions used in the code are present in the specified architecture version. In the end, you have object code that targets a specific architecture variant, ARMv6 or ARMv7 (or ARMv5 or ARMv4, but given that ARMv6 is the baseline in iOS development, you’re very unlikely to target these); the object and executable files are in fact marked with the architecture they target, run otool -vh foo.o on one of your object or executable files sometime.

Do not confuse ARMv6 and ARMv7 with ARM6 and ARM7: the latter are two old ARM processor models (or rather, model families), while the former are ARM architecture versions. — September 25, 2011

However, it does not make sense to say that the original iPhone had “the ARMv6 processor”: ARMv6 does not designate a particular processor, but the set of instructions a processor can run, which does not imply any particular implementation. The processor core implementation used in the original iPhone was the ARM11 (it was the ARM1176JZF-S, to be really accurate, but it matters very little, just remember it was a member of the ARM11 family); as mentioned earlier, this processor implements ARMv6. Subsequent devices used ARM11 as well, up until the iPhone 3GS which started using the Cortex A8 processor core, used in all iOS devices released since then at the time of this writing (this is not yet certain, but strongly suspected, in the case of the iPhone 4). This core implements the ARMv7 instruction set, or in short, supports ARMv7.

The processor powering the iPad 2 (and likely subsequent devices) supports ARMv7 too, but is not one (or multiple) Cortex A8. It is technically still unknown what processor core it could be as I have not seen any evidence for it, but I am convinced it is two Cortex A9 cores. — September 25, 2011
Subsequent iOS devices have started using even more mysterious processors designed by Apple’s semiconductor architecture group, which are only known by their code names: Swift and Cyclone. At this point if you want to know which device uses which processor it is best to to follow the awesome iOS Support Matrix. — May 16, 2014

Now having said that, DO NOT go around and write code that detects which device your code is executing on and tries to figure out which architecture it supports using the known information on currently released devices. Besides being the most unreliable code you could write, this kind of code will break when run on a device released after your application. So please don’t do it, otherwise I swear I will come to your house and maim you. This information is just so that you have a rough idea of the installed base of devices that can support ARMv7 and the ones that can only run ARMv6; I’ll get to detection in a minute.

But you may be wondering: “I thought the iPad and iPhone 4 had an A4 processor, not a Cortex A8?!” The A4 is in fact the whole applications System on a Chip, which includes not only a Cortex A8 core, but also graphics hardware, as well as video and audio codec accelerators and other digital blocks. The SoC and the processor core on it are very different things; the processor core does not even take the majority of the space on the silicon die.

ARMv7 support on the latest devices would be pretty useless if you couldn’t take advantage of it, so you can do so, but always doing so would prevent your code from running on earlier devices, which may not be what you want. So how do you detect which architecture version a device supports so that you can take advantage of ARMv7 features if and only if they are present? The thing is, you don’t. Instead, your code is compiled twice, once targeting ARMv6, and once targeting ARMv7; the two executables are then put together in a fat binary, and at runtime the device will itself choose the best half it can support. Yes, Mach-O fat binaries are not just for grouping completely different CPU architectures (e.g. PowerPC and Intel, making a Universal Binary), or 32 and 64 bit versions of an architecture, but also two variants (cpu subtypes, in Mach-O parlance) of the same architecture. The outcome is that from the viewpoint of the programmer, everything gets decided at compile time: the code compiled targeting ARMv6 will only ever run on ARMv6 devices, and the code compiled targeting ARMv7 will only ever run on ARMv7 (or better).

If you’ve read my NEON post, you may remember that in that post I also suggested a way to do the detection and selection at runtime. If you check now, you’ll notice I have actually removed that part, and now recommend you do not use that method any more.1 This is because while it does work, it was impossible (or at the very least too tricky to implement to do so without any error) to ensure that the code would keep working the day it is run on a future ARMv8 processor. The fact the documented status of that API is unclear doesn’t help, either (its man page isn’t in the iOS man pages). You should exclusively use compile-time decision and fat executables if you want to run on ARMv6 and take advantage of ARMv7.

Note that I haven’t covered here why you would want to take advantage of ARMv7; I have now done so in a new post — September 25, 2011

One last note on the subject: in the context of iOS devices, the ARM architecture versions do not strictly mean what they mean for ARM processors in general. For instance, iOS code that requires ARMv6 actually requires support for floating-point instructions as well (VFPv2, to be accurate), which is an optional part of ARMv6, but has been present since the original iPhone, so when ARMv6 is mentioned in iOS development (e.g. as a compiler -arch setting or as the cpu subtype of an executable) hardware floating-point support is implied. The same goes for ARMv7 and NEON: NEON is actually an optional part of the ARMv7-A profile, but NEON has been present in all iOS devices supporting ARMv7, so when developing for iOS NEON is considered part of ARMv7.

So to sum it up:

  • the first devices had an ARM11 processor, which implements ARMv6,
  • devices from the iPhone 3GS onwards have a Cortex A8 processor, which implements ARMv7,
  • some of these feature the so-called A4 “processor”, in fact a SoC,
  • the iPad 2 does not have a Cortex A8, but likely two Cortex A9, which too implement ARMv7

— September 25, 2011

ARMv7s update:
The iPhone 5 introduced a new architecture variant called ARMv7s; ARMv7s is to ARMv7 what ARMv7 was to ARMv6. ARM does not define ARMv7s, it is purely an Apple term and means what we know as ARMv7 plus VFPv4, Advanced SIMDv2 and integer division in hardware (and that’s it for unprivileged instructions as far as anyone can tell). The newly introduced processor core which implements ARMv7s is currently a complete mystery.

Also, it did not even take ARMv8, as ARMv7s indeed did undo my runtime detection method in an old internal project where I was still using it: the code acted as if the device did not support ARMv7 and NEON… when run on an iPhone 5. — December 11, 2012

ARM64 update:
With the iPhone 5S Apple has now introduced ARMv8, the first ARM architecture version that supports a new 64-bit instruction set, called ARM64. It is hard to compare it to the upgrade from ARMv6 to ARMv7: while the latter brought NEON, ARM64 is a whole instruction set which is in some aspects completely new and in others quite familiar, as I explored in this post, and I will mention these differences whenever appropriate — May 16, 2014

Conditional Execution

A nifty feature of the ARM architecture is that most instructions can be conditionally executed: if a condition is false, the instruction will have no effect. This allows short if blocks to be implemented more efficiently: while the usual method is to jump to after the block if the condition is false, instead the instructions in the block are conditionalized, saving one branch.

Now I wouldn’t mention this if it was just a feature the compiler uses to make the code more efficient; while it is that, I’m mentioning it because it can be surprising when debugging. Indeed, you can sometimes see the debugger go inside if blocks whose conditions you know are false (e.g. early error returns), or go in both sides of an if-else! This is because the processor actually goes through that code, but some parts of it aren’t actually executed, because they are conditionalized. Moreover, if you put a breakpoint inside such an if block, it may be hit even if the condition is false!

That being said, it seems (in my limited testing) that the compiler avoids generating conditionally executed instructions in the debug configuration, so it should only occur when debugging optimized code; unfortunately sometimes you have no choice but to do so.

ARM64 all but eliminated conditional execution (only a few simple instructions can still behave conditionally); I would be surprised to see any if block, even a trivial one, being executed as conditional instructions in ARM64 code. — May 16, 2014

Thumb

The Thumb instruction set is a subset of the ARM instruction set, compressed so that instructions take only 16 bits (all ARM instructions are 32 bits in size; Thumb is still a 32-bit architecture, just the instructions take less space). It is not a different architecture, rather is should be seen as a shorthand for the most common ARM instructions and functionality. The advantage, of course, is that it allows an important reduction in code size, saving memory, cache, and code bandwidth; while this is more useful in microcontroller type applications where it allows hardware savings in the memory used, it is still useful on iOS devices, and as such it is enabled by default in Xcode iOS projects. The code size reduction is nice, but never actually reaches 50% as sometimes two Thumb instructions are required for one equivalent ARM instruction. ARM and Thumb instructions cannot be freely intermixed, the processor needs to switch mode when going from one to the other; this can only occur when calling or returning from a function. So a function has to either be Thumb or ARM as a whole; in practice you do not control whether code is compiled for Thumb or ARM at the function granularity but rather at the source file granularity.

When targeting ARMv6, compiling for Thumb is a big tradeoff. ARMv6 Thumb code has access to fewer registers, does not have conditional instructions, and in particular cannot use the floating-point hardware. This means for every single floating-point addition, subtraction, multiplication, etc., floating-point Thumb code must call a system function to do it. Yes, this is as slow as it sounds. For this reason, I recommend disabling Thumb mode when targeting ARMv6. If you do leave it on, make sure you profile your code, and if some parts are slow you should first try disabling Thumb at least for that part (easy with file-specific compiler flags in Xcode, use -mno-thumb). Remember that floating-point calculations are pretty common on iOS since Quartz and Core Animation use a floating-point coordinate system.

When targeting ARMv7, however, all these drawbacks disappear: ARMv7 contains Thumb-2, an extension of the Thumb instruction set which adds support for conditional execution and 32-bit Thumb instructions that allow access to all ARM registers as well as hardware floating-point and NEON. It’s pretty much a free reduction in code size, so it should be left on (or reenabled if you disabled it); use conditional build settings in Xcode to have it enabled for ARMv7 but disabled for ARMv6.

To add a conditional build setting, follow Jeff Lamarche’s instructions in this post (last paragraph) for Xcode 3. For Xcode 4, you should follow Apple’s instructions, with the caveat that you can only add a conditional setting once you have selected the setting in a particular configuration (the menu item will be greyed out otherwise), so yes, you have to do it once for Debug, once for Release, and once for Distribution. — September 25, 2011

In ARM documentation and discussions on the Internet you may find mentions that code needs to be “interworking” to use Thumb; unless you write your own assembly code, you don’t have to worry about it as all code is interworking in the iOS platform (interworking refers to a set of rules that allow functions compiled for ARM to directly call functions compiled for Thumb, and the converse, without any problem and transparently as far as the C programmer is concerned). When displaying assembly, Shark or the Time Profile instrument may have trouble figuring out whether the function is ARM or Thumb, if you see invalid or nonsensical instructions, you may need to switch from one to the other.

ARM64 is a new instruction set where all instructions are also 32 bits in size, and which does not have the equivalent of Thumb mode; said another way, there is no 64-bit Thumb. — May 16, 2014

Alignment

In the iOS platform unaligned accesses are supported; however, they are slower than aligned accesses, so try and avoid them. In some particular cases (those involving load/store multiple instructions, if you’re interested), unaligned accesses can be a hundred times slower than aligned accesses, because the processor cannot handle these and has to ask the OS for assistance (read this article, it’s the same phenomenon that on PowerPC causes unaligned doubles to be so much slower). So be careful, alignment still matters.

Division

This one always surprises everyone. Open the ARM architecture manual (if you don’t already have it, see Introduction to NEON on iPhone, section Architecture Overview) and try and find an integer division instruction. Go ahead, I’ll wait. Can’t find it? It’s normal. There is none. Yes, the ARM architecture has no hardware support for integer division, it must be performed in software. If you compile the following code:

int ThousandDividedBy(int divisor)
{
    return 1000/divisor;
}

to assembly, you’ll see the compiler has inserted a call to a function called “___divsi3”; it’s a system function that implements the division in software (notice the divisor should be non-constant, as otherwise the division is likely to be turned into a multiplication). This means that on iOS, integer division is actually an operating system benchmark!

“But!” you may say, having finally returned victorious from the ARM manual “You’re wrong! There is an ARM division instruction, even two! Here, sdiv and udiv!” Sorry to rain on your parade, but these instructions are only available in the ARMv7-R and ARMv7-M profiles (real-time and embedded, respectively – think motor microcontrollers and wristwatches), not in ARMv7-A which is the profile that the iOS devices that have ARMv7 do support. Sorry!

Hey, what’s this? If you compile that code for architecture ARMv7s… a sdiv instruction gets generated! Yep, ARMv7s does include support for integer division in hardware, but that only applies for new devices, existing devices are still unable to do so and will keep using the ARMv7 code and divide in software, so don’t just go division-happy yet. — December 11, 2012
ARM64 also has integer division in hardware; in fact now the ARM architecture document does list the sdiv and udiv instructions as an (optional) part of the ARMv7-A profile. — May 16, 2014

GCC

It’s not a secret that the ARM code generated by GCC is considered to be crap. On other ARM-based platforms professional developers use the toolchain provided by ARM itself, RVDS, but this is not an option on the iOS platform as RVDS doesn’t support the Mach-O runtime used by OS X, only the ELF runtime. But there is at least an alternative to GCC, as now LLVM can be used for iOS development. While I’ve not tested it much, I have at least seen nice improvements on 64-bit integer code in ARM32 (a particularly weak point of GCC on ARM) when using LLVM. Hopefully, LLVM will prove to be an improvement over GCC in all domains.

Apple no longer provides GCC for iOS development (or Mac development, for that matter), so this is a moot point today. — December 11, 2012
~

There, you’re now a better iOS developer!


  1. While I’m at it, I should mention I also added information about __ARM_NEON__, and fixed a few typos. I’ll need to set a nice way to let people know when my posts are updated, as I’d like to keep them current, or at the very least problem-free. For now, this ad hoc system will do.

Apple’s Golden Path

Much has been said about the modifications to section 3.3.1 since the release of the iPhone OS 4 SDK beta when these changes were introduced; however, as Jonathan “Wolf” Rentzsch lamented, and to my own surprise, few developers complained about them (since, after all, they are the ones most directly affected by this rule). I still stand by everything I wrote at the time (even though I wrote it while the issue was still hot and, to be honest, a bit in righteous anger), including that Apple is showing hubris, my bewilderment that anyone, and especially anyone’s legal department (which are often fussy about the smallest details), would agree to these terms, and my call to action. But some writing I have read since then has significantly affected my understanding of the situation: first Louis Gerbarg explained how this was less an anti-competitive maneuver or an attempt to control the user experience, than Apple trying to keep control the whole operating system, including its de facto parts, by preventing there being any de facto part; then John Siracusa gave his broad view of the situation (if you haven’t done so, go read it. Now. It’s an order.)

Parable of the Desert

I now picture the situation as Apple being some sort of Old Testament prophet or even messiah, leading a tribe of follower/pioneer developers to the Promised Land of The Best (Handheld) Computing Platform Ever; but there is no time to waste, as others are trying to make their own path to the Promised Land. The way ahead is not easy, with a big desert, mountain ranges, traps, and what have you. Thing is, Adobe, an important merchant in the Desktop Computing Platforms Lands, got wind of the expedition and is trying to insert representatives carrying his Flash goods as pioneers in the Apple expedition, to set up stores as early as possible in the promising Promised Land market, and why not, possibly get influence there himself selling to all factions. Apple, having always refused to carry Flash on concerns that it would slow down or possibly temporarily stop the expedition in the coming mountain ranges, realizes this, of course, and, infuriated, edicts a new rule: every morning, all followers are now to walk on one mile only on the right foot, with those unable to do so being left in the place the expedition started from that morning. The aim, of course, is that those carrying Flash goods, Flash being as heavy as it is, will be unable to do so, and the expedition will be free of them; however, there are other followers that are unable to do so either, some of them for reasons that have nothing do to with their ability to cross the mountain range or any other difficulty; but the subtext, as many of them understand it, is that Apple is not going to enforce the rule on everyone, and will only look in the direction of certain pioneers when checking that they actually are walking only on their right foot, as he wouldn’t be able to check for all of them anyway: he’s going to check those carrying Flash, of course, as well as a few others he has never liked in the first place, though no one knows how far he will actually check. All to make sure that nothing would stand in the way of the Apple expedition as it makes its way to the Promised Land. This basically sends the message that followers are at the mercy of Apple’s arbitrariness, and they’d better like it, because it’s that way or they leave the tribe.

In fact (speaking of desert…), this reminds me of Dune. In “Children of Dune” and “God-Emperor of Dune,” Leto II, foreseeing in his prescient visions an apocalyptic future for mankind, decides that he basically has to become the harshest tyrant ever and set into motion a plan to prevent that future: the Golden Path. And he follows through. (Watchmen would work as well, for that matter.) So I can’t help but think of the Golden Path when I think of Apple and section 3.3.1, except that, to the best of my knowledge, no one at Apple has the gift of prescience. This is certainly hyperbole, but I can’t shake the feeling.

You know, with rare exceptions, I have never considered blog posts (including and especially mine) to be very important in the grand scheme of things; typically they are manifestations of the current zeitgeist. But here, I can’t help but feel that anything I do would be downright insignificant in front of Apple’s “no matter the cost”-style commitment to its course. At this level you feel there is no right or wrong, no good or evil. And yet I still hold the hope that Apple will, if not influx its course, at least not react as harshly, because I care for the iOS platform.

I worry for you

Let’s get Flash out of the way first. You may notice that I barely mentioned Flash once in my original post; it was not the matter. I have no hate for Flash; I have not love for it either. It sucks for Flash developers who invested in that Flash iPhone compiler technology, but the truth is, it was Adobe who put them on the playing table in the first place, Adobe is responsible. What would be my reaction if Apple only banned Flash and very similar stuff (e.g. Silverlight)? I would say that Flash content (especially games) would get on iPhone eventually one way or the other, and Apple’s decision would only serve to set up a cottage industry of shops specializing in manual conversion of Flash content to the iPhone. I would then accept the agreement, not mention the changes ever again, and leave it at that.

But this is not the case. The first problem is the current language of the agreement. If you read it literally, Objective-C++ is actually banned, and I believe many applications in the iPhone App Store use Objective-C++, if only to integrate C++ and Objective-C together; assembly code is forbidden as well, and again I believe many apps use assembler, if not directly, then through a dependency, especially as the ARM code generated by gcc is really crappy. Also, the recent addition by Apple of a clause allowing embedded interpreted scripts with Apple’s written approval is hilarious, as no change was made at the same time to the actual Section 3.3.1, so in what language can these scripts be originally written in? C? I think the whole point is to allow game designers to write these scripts in some language friendlier than a C derivative…

Then we have the front intent, which is trying to mandate the languages used for development. One of the problems with this is that this is completely impossible to enforce. Once compiled down to machine code by a good compiler, is there any way Apple can tell what language a particular program was originally written in? There may be hints (e.g. bound checks for languages whose semantics mandate it), but no proof. And if they don’t actually enforce this, the disposition is pretty much meaningless; it becomes a way for Apple to arbitrarily bother developers that they don’t like. They may ask developers to trust them, but I believe the trust many developers have in Apple is actually in the negatives right now.

Then we have the real intent, which is to prevent any intermediate platform layer. But at which point does middleware become this? For instance, Unity might become the kind of dependency which is found in a significant proportion of iPhone App Store apps, if it isn’t already. What if they misuse some API, and all Unity games would break if a change was made? But at the same time, isn’t a game engine kind of indispensable to make effective use of OpenGL (ES)? We can already see that in practice, decisions will have to be handled on a middleware by middleware basis.

And programmers are imaginative. They will come up with corner cases you couldn’t possibly have thought of. They are also notoriously undisciplined when it comes to their tools. Often when they are mandated by their hierarchy to use techniques or tools they feel are suboptimal, they will secretly develop using a tool they think better suited to the job but not different in any ways that would be problematic (e.g. it integrates into the build system and has not relevant drawback), and when they succeed earlier than expected and/or with better performance, they reveal they had been using the unofficial tool all along. So if bosses cannot really control the tools programmers use, it is foolish for Apple to think they can.1

So my worry is that, in effect, the modified Section 3.3.1 exposes the developer agreement for what it really is: a bunch of words almost devoid of any meaning, put here mostly for form, while the real relationship is governed mostly by unwritten rules, only known through hearsay (though appreview.tumblr.com, bless them, tries to document them). While some claim this was always the case, this is the first time many developers are asked to agree to a provision they know they violate. And while not many developers have seemed to object, that, combined with other rejection problems, is taking its toll on the developers’ collective goodwill.

So you may say, I’m not so sure, but even if Apple is acting in a way that depletes developer trust faster than the tank of a gas-guzzler, it won’t matter if Apple eventually gets dominance over the market; these policies could even be eventually rescinded since all other platforms will be far behind. The problem is, these policies may be easy to rescind, but the effect they have on developers may very well be definitive. I’m especially worried about delayed effects here: Apple may not how much goodwill it is losing, as the effect on apps is for now insignificant, and may not get significant until it is way too late to regain the confidence of the developers who jumped ship, when Apple finds itself in a position where they need developers to trust Apple.

Also, the network effects at play in the mobile world are I feel much weaker than in the desktop: devices are networked, but mostly with servers and using open protocols, not much with each other; applications are sandboxed and don’t exchange many documents yet; etc. So even if Apple does achieve “Microsoft-level dominance” at some point, it’s no guarantee that they will stay there, especially if a competing platform is helped by developers having fled the Apple ship.

But, you may say, there remains the most fundamental network effect of all: users come to the platform because it has applications, and developers make applications because the platform has users! However, many of the best iPhone developers are not here (only) for the money. Marco Arment, whose Instapaper is considered one of the best iPad apps, has said he wrote it (on the iPhone, originally) “because [he] wanted it to exist, and [he] love[s] being able to share it with the world with the [iPhone] App Store”. This hardly sounds like a guy who targets the iPhone because the users are there and who has no time in his budget to write an Android version if it would ever strike his fancy.

Do I Worry Too Much?

So in closure, the new Section 3.3.1 could be considered overreacting payback for all the times Apple had to keep a Must Not Break™ app running in the long life of the Mac (and, possibly, the Apple II), as well as their resulting paranoia for platform code control due to these experiences. They probably think that this time, if only they would keep control and not let anyone else get leverage over them, they can withstand anything and ultimately succeed. But I think the whole point of having a platform open for third-party development is that you do inevitably lose some control, including in ways you may not have foreseen. In particular, when foreseeing a threat or after experienced it, companies will often put checks into place so that such a thing never happens (again), making sure to cover the whole threat class. But they forget these checks will catch wider than intended. Especially here, these restrictions by Apple could prevent not only known stuff (for which exceptions could be made), but also innovation from third-parties that is impossible to foresee right now, therefore the restrictions should only be made as wide as strictly necessary for the intended threat. I could keep going with aphorisms on how by squeezing too much for control, you end up losing it, but I think that by now, you get the idea.


  1. Notice “usage of private APIs” does actually matter very much to Apple in concrete ways, so they do have an actual point here. While many people see applications as living atop a platform, the truth is that an application runs thanks to a very complex symbiotic relationship with its host operating system, the terms of which are called an API. So it’s in Apple’s best interest that applications fulfill their end of the bargain by not using private APIs, to begin with.

Introduction to NEON on iPhone

A sometimes overlooked addition to the iPhone platform that debuted with the iPhone 3GS is the presence of an SIMD engine called NEON. Just like AltiVec for PowerPC and MMX/SSE for x86, this allows multiple computations to be performed at once on ARM, giving an important speedup to some algorithms, on condition that the developer specifically codes for it.

Given that, among the iPhone OS devices, only the iPhone 3GS and the third-gen iPod Touch featured NEON up until recently, it typically wasn’t worth the effort to work on a NEON optimization since it would benefit only these devices, unless the application could require running on one of them (e.g. because its purpose is to processes videos recorded by the iPhone 3GS). However, with the arrival of the iPad it makes much more sense to develop NEON optimizations. In this post I’ll try to give you a primer on NEON.

As of this writing, the latest release of iOS, 4.3, only supports NEON-powered devices, so it’s no longer a stretch to support only NEON devices and forget the iPhone 3G. — September 25, 2011

Before we begin, a bit of setup

This instruction set extension is actually called Advanced SIMD, it is an optional extension to the ARMv7-A profile; however, when ARMv7 is mentioned the presence of the Advanced SIMD extensions is generally implied, unless mentioned otherwise. “NEON” actually refers to the implementation of this extension in ARM processors, but this name is commonly used to designate these SIMD instructions as well as the processor feature.

So far, Apple has only shipped one processor core featuring NEON: the Cortex A8, in three devices: the iPhone 3GS, the third-gen iPod Touch, and the iPad. This means that besides the general details of NEON that are valid whichever the implementation, the details of NEON as implemented in the Cortex A8 are of high interest to us since it’s the only processor core which will run iPhone OS NEON-optimized code at the moment.

With the iPad 2 this is no longer the case: it supports NEON but does not have a Cortex A8; however the Cortex A8 timings should still be considered the reference, at least on iOS. — September 25, 2011
At this point I am only keeping the Cortex A8 information in this post for historical interest, as it is no longer the reference four years later — May 16, 2014

Do You Need It?

iPhone OS devices (even the iPad) are pretty focused. The user is unlikely to want to perform scientific calculations, simulation, offline rendering, etc. on these devices, so the main applications for NEON here are multimedia and gaming. Then, see if the work hasn’t already been done for you in the Accelerate framework, new in iPhone OS 4.0, for you to use when the update lands. However, iPhone OS 4.0 won’t make it to the iPad until this fall, and of course if the algorithm you need isn’t in Accelerate, it’s not going to be of any help to you; in these cases, then it makes sense to write your own NEON code.

“this fall” was fall 2010, of course; it’s no longer an issue today. — September 25, 2011

As always when optimizing, make sure you don’t optimize the wrong thing. When doing multimedia work, hunches about what is taking up processor time are less likely to be wrong than for general applicative work, but you should still first write the regular C version of the algorithm, then profile your code using Shark, in order to know the one part which is taking up the most time, and consequently is to be tackled first (and then the second most, etc.). Then benchmark your improvements to make sure you did improve things (It’s unlikely for the NEON version to be slower than scalar code, but regressions can happen when trying to improve the NEON code).

That would be the Time Profile Instrument, today (rest well, Shark, my friend). — September 25, 2011

Detection and Selection

Obviously, devices earlier than the iPhone 3GS are not going to be able to run NEON code, so there needs to be a way to select a different code path depending on which device is executing the code; even if you only target, say, the iPad, there needs to be a way to let it be know the code requires NEON, so that it does not accidentally run on a non-NEON device.

This way is to build the application for both ARMv6 and ARMv7, and, using #ifdefs, disable at compile time the undesired code (the NEON code when building for ARMv6, the non-NEON fallback when building for ARMv7; __ARM_NEON__ will be defined if NEON is available); if you target only ARMv7 devices, then only build for ARMv7 (in that case, don’t forget to add armv7 in the UIRequiredDeviceCapabilities of your Info.plist). When your application is run, the device will automatically pick the most capable half it can: the ARMv7/NEON one for ARMv7-capable devices, the ARMv6 half, if present, otherwise. The drawback is that, unless you target only ARMv7 devices, the whole application code size (not just the code for which there is a NEON version) will end up twice in the final distribution: once compiled for ARMv6, and once for ARMv7. As executable code is typically a small portion of the application size, this is not a problem for most applications, but it is a problem for some of them.1

Don’t forget that you’ll need to disable the NEON code at compile time when building for the simulator, as your application is compiled for x86 when targeting the simulator, and NEON code will cause build errors in this context. This means you always need to also write a generic C version of the algorithm, even if you only target the iPad, or you won’t be able to run your application in the simulator.

Development

There are actually different ways you can create NEON code. You can program directly in assembly language; it does require good knowledge of ARM assembly programming, so it’s not for everyone, and (even if NEON programming isn’t for everyone, either) now given the differences ARM64 brings I do not recommend writing new NEON code in assembly going forward. It is better to use C and compiler intrinsics (which you get by including arm_neon.h), leaving the compiler to worry about register allocation, assembly language details, etc.

ARM mentions a third way, which is to let the compiler auto-vectorize, but I don’t know how well it works, or if it is even enabled in the iPhone toolchain, so I’m not going to reference it here.

I’ve only used assembly programming so far, so I will generally describe techniques under assembly terms, but everything should be equally applicable when programming with intrinsics, just with a different syntax. There is one important thing, however, which is applicable only when you program in assembly (you don’t have to worry about that when using intrinsics): make sure you save and restore d8 to d15 if you use them in your ARM32 NEON function, as the ABI specifies that these registers must be preserved by function calls. If you don’t, it may work fine for a while, until you realize with horror that the floating-point variables in the calling functions become corrupted, and all hell breaks loose. So make sure these registers are saved and restored if they are used.

Architecture Overview

Now we get to the meat of the matter. To follow from home you will need the architecture document from ARM describing these instructions; fortunately, you already have it. You see, Shark has this feature where you can get help on a particular assembler instruction, which it provides simply by opening the architecture specification document at the page for this instruction. While for PowerPC and Intel, the document bundled with Shark is a simple list of all the instructions, for ARM it is actually a fairly complete subset of the ARM Architecture Reference Manual (some obsolete information is omitted). Rather than open it through Shark, you can open it directly in Preview by finding the helper application (normally at /Library/Application Support/Shark/Helpers/ARM Help.app), right clicking->Show Package Contents, and locating the PDF in the resources.

Shark is gone in Xcode 4, but this document has merely moved, these days it is in <Xcode 4 install folder>/Library/PrivateFrameworks/DTISAReferenceGuide.framework/Resources/ARMISA.pdf — September 25, 2011
Don’t you just love stuff that keeps changing place? Xcode 4.3 and later should have this document in Xcode.app/Contents/Applications/Instruments.app/Contents/Frameworks/DTISAReferenceGuide.framework/Versions/A/Resources/ARMISA.pdf — May 22, 2012
Of course it has moved again by now, but the version in Xcode 5 has become obsolete anyway (as of this writing it still hasn’t been updated for integer division in hardware, which shipped in iOS devices in September 2012), so you should get it directly from the source from now on (don’t worry about the “registered ARM customers” part, you only need to create an account and agree to a license in order to download the PDF) — May 16, 2014

I will sometimes reference AltiVec and/or SSE, as these are the SIMD instruction set extensions that iPhone developers are most likely to be familiar with, and there is already an important body of information for both architectures online; less so for NEON.

Programmer Model

As you know, SIMD (Single Instruction Multiple Data) architectures apply the same operation (multiplication, shift, etc.) in parallel to a number of elements; the range of elements to which an operation is applied is called a vector. In modern SIMD architectures, vectors have a constant size in number of bits, and contain a different number of elements depending on the element size being operated on: for instance, in both AltiVec and SSE vectors are 128 bits, if the operation is on 8-bit elements for instance, then it operates on 16 elements in parallel (128/8); if the operation is on 16-bit elements, then each vector contains 8 of them (128/16).

Under ARM32 mode in NEON you have access to sixteen 128-bit vector registers, named q0 to q15; this is the same vector size as AltiVec and SSE, and there are twice as many registers as 32-bit SSE, but half as many as AltiVec. However, this register file of 16 Q (for Quadword) vectors can also be seen as thirty-two 64-bit D (Doubleword) vectors, named d0 to d31: each D register is one half, either the low or high one, of a Q register, and conversely each Q register is made up of two D registers; for instance, d12 is the low half of q6, and q3 is at the same location as the d6-d7 pair. Most instructions that do not change the element size between input and output can operate on either D or Q registers; instructions that narrow elements as part of their operation (e.g. a 16-bit element on input becomes a 8-bit element on output) take Q vectors as input and have a D vector as output; instructions that enlarge elements as part of their operation take D vectors as input and have a Q vector as output. There are, however, a few instructions that only work on D vectors: the vector loads, vector stores, vpadd, vpmax, vpmin, and vtbl/vtbx; while for some operations this matters very little (e.g. to load two Q vectors, you use a load multiple instruction with a list containing the four matching D vectors), in the other cases this means the operation must be done in two steps to operate on a Q vector, once on the lower D half and once one the higher D half.

This D/Q duality provides a consistent way to handle narrowing and widening operations, instead of the somewhat ad hoc schemes used by AltiVec and SSE (e.g. for full precision multiplies on AltiVec you multiply the even elements, then the odd elements of a vector; on SSE you obtain the low part of the multiplication, then the high part). It also makes it easier to manage the narrowed vector prior to an operation that uses one, or after an operation that produces one. It does make it a bit tricky to ensure that you don’t overwrite data you want to keep, as you must remember not to use q10 if you want to keep the data in d20.

In ARM64 mode NEON has been extended to thirty-two 128-bit vector registers, named q0 to q31, and most of the limitations above have become irrelevant. However, the relationship with the 64-bit sized registers has changed, and I am not going to expand on it as I now recommend you do not program in NEON with ARM64 assembly, and exclusively use intrinsics instead; I’ll expand on why in a minute. — May 16, 2014

The register file is shared with the floating-point unit, but NEON and floating-point instructions can be freely intermixed (provided, of course, that you share the registers correctly), contrary to MMX/x87.

A reminder: the ARM architecture (at least as used in the iPhone) is little-endian; remember that when permuting or otherwise manipulating the vector elements.

Architectural features

NEON instructions typically have two input and one separate output register, so calculations are generally non-destructive. There are three-operand instructions like multiply-add, but in that case there is no separate output register, in the case of multiply-add for instance the addition input register is also the output. A bit less usual is the fact some ARM32 instructions, like vzip, have two output registers.

Some instructions take a list of registers. In ARM32 mode it must be a list of consecutive D registers, though in some cases there can be a delta of two between registers (e.g. {d13, d15, d17}), in order to support data in Q registers.

Some instructions can take a scalar as one of their inputs. In that case, the scalar is used for all operations instead of having the corresponding elements of the two vectors be matched. For instance, if the vector containing a, b, c, d is multiplied by the scalar k, then the result is k×a, k×d, k×c, k×d. Scalars can also be duplicated to all elements of a vector, in preparation for operations that do not support scalar operands. Scalars are specified by the syntax dn[x], where x is the index of the scalar in the dn vector.

Syntax

All NEON instructions, even loads and stores, begin by “V” in ARM32 mode, which makes them easy to tell apart, and easy to locate in the architecture documentation (in ARM64 they no longer begin by “V”, but are now listed separately in the documentation). Instructions can have one (or more) letter just after the V which acts as a modifier: for instance “Q” means the instruction saturates, “R” that it rounds, and “H” that it halves. Not all combinations are possible (far from it), but this gives you an indication when reading the instruction of what it does.

In ARM32 mode practically all instructions need to take a suffix telling the individual size and type of the elements being operated on, from .u8 (unsigned byte) to .f32 (single-precision floating-point): for instance, vqadd.s16. If the element size changes as part of the operation, the prefix indicates the element size of the narrowest input. Some instructions only need the element type to be partially specified, for instance vadd can operate on .i16, as it only needs to know that the elements are integers (and their sizes), not whether they are signed or unsigned; some instructions even only need the element size (e.g. .32). However, always use the most specific data type you can, for instance if you’re currently operating on u32 data, then specify .u32 even for instructions that would just as well accept .32: the assembler will accept it and it will make your code clearer and easier to type-check. This part of the syntax has changed for ARM64, but ARM64 NEON assembly syntax is not going to be covered in this post.

A little historical note: NEON instructions used to have a different syntax, and some of them changed names. Notably, instructions where the element size changes took two prefixes, with the size after and before (e.g. .u32.u16). This is something you may still see in disassembly, for instance. And, for some reason, the iPhone SDK assembler only accepts instruction mnemonics in lowercase, so while the instructions are uppercase in the documentation, be sure to write them in lowercase.

Instruction Overview

It is way out of the scope of this blog post to provide a full breakdown of the NEON capabilities (like, for instance, this classic Apple document does for SSE). I will just give you a quick rundown of each major area to get you started, after that the documentation should be enough.

Load and Stores

NEON of course has a vector load and a vector store instructions, and even has vector load/store multiple instructions. However, these instructions are typically only used for saving/restoring registers on the stack and loading constants, both places where you can easily guarantee alignment, as these instructions demand word alignment. To load and store the data from the streams you will be operating on, you will typically use vld1/vst1; these instructions handle unaligned access, and the element size they take in fact pretty much acts as an alignment hint: they will expect the address to be aligned to a multiple to the element size.

Much more interesting are the vld#/vst# instructions. These instructions allow you to deinterlace data when loading, and reinterlace it when storing; for instance if you have an array of floating-point xyz structures, then with vld3.f32 you will have a few x data neatly loaded into one vector register, with another vector register containing the y and a third the z. Even for two-element (e.g. stereo audio) or four-element (e.g. 32-bit RGBA pixels) interlaced data, it avoids you the temptation to operate on non-uniform data, instead everything is neatly segregated in its own register (one register holds all left, one register holds all alpha, etc.). Notice that in ARM32 mode (in ARM64 mode they can directly load in the Q vectors) these have the option to operate on non-consecutive registers of the form {<Dd>, <Dd+2>, <Dd+4>, <Dd+6>}, so that you can load/store Q registers using two of these instructions (one filling the low D halves, one the high D halves).

These instructions can also load one data or one structure to a particular element of a register, so scatter loading (not that it should be abused) is even easier than with SSE; you can also load the same data to all elements of the register directly.

In NEON you can’t really do software realignment, as just like SSE there is no real support for this (vext looks tempting, until you realize the amount is an immediate, fixed at compile time). By starting with a few scalar iterations, you may be able to align the output stream to a multiple of the vector size; however the other streams are typically not aligned to a vector boundary at this point, so use unaligned accesses for everything else.

Permutation

NEON does have a real permute with vtbl/vtbx, however it doesn’t come cheap. Loading a Q vector with permuted data from two Q vectors, which is the equivalent of an AltiVec vperm instruction, will require issuing two instructions which will take 6 cycles in total on a Cortex A8, so save this for when it’s really worth it.

For permutations known at compile time, you should be able to combine the various permutation instructions to do your bidding: vext, vzip/vuzp, vtrn and vrev; vswp can be considered a permutation instruction too, it can serve as the equivalent of vtrn.64 and .128 (which don’t exist) for instance. vzip in particular acts a bit like AltiVec merge, though the fact that in ARM32 mode it overwrites its inputs makes it slightly unwieldy. Don’t forget you can use the structured load/store instructions to get the data in the right place right as you load it, instead of permuting it afterwards.

Comparisons and Boolean operators

There are a variety of comparison instructions, including ones which compare directly to 0; as is customary, if the comparison is true, the corresponding element is set to all ones, otherwise to all zeroes. There is the usual array of boolean operators to manipulate these masks, plus some less usual ones such as XOR, AND with one operand complemented, and OR with one operand complemented. Oh, and there is a select instruction (in fact, three, depending on which input you want to overwrite) to make good use of these masks.

Floating-Point

NEON floating-point capabilities are very similar to AltiVec. To wit:

  • just like AltiVec, only single-precision floating-point numbers are available
  • by default, denormals are flushed to 0 and results are rounded to nearest
  • only estimates to reciprocal square root and reciprocal are given, refinement steps are necessary to get the IEEE correct result
  • there are single-instruction multiply-add and multiply-substract

This is not to say there is no difference, however:

  • contrary to AltiVec, in ARM32 mode multiply-add seems to be UNfused, there is apparently an intermediate rounding step
  • probably related to the previous point, refinement step instructions for reciprocal square root and reciprocal are provided.
  • denormal handling cannot be turned on
  • conversion to integer necessarily rounds towards 0
  • there are no log and 2x estimates
  • some horizontal NEON operations are available, as well as absolute differences.
Some iOS devices now do support new fused multiply-add (and multiply-subtract) NEON instructions, as part of an instruction set extension known as Advanced SIMDv2 which is included in what Apple calls ARMv7s. As of this writing, however, ARMv7s-supporting devices are still new and likely not common enough yet to justify writing a second algorithm that would take advantage of fused multiply-add. — December 11, 2012

In ARM64 mode NEON now supports operating on double-precision floating-point numbers, which could be very useful for some people, just as could be the varied improvements, in particular to the rounding modes (which I believe are now fully IEEE compliant), making ARM64 floating-point NEON much closer to SSE than to AltiVec. However, for most developers there is not enough ARM64-supporting hardware to justify taking advantage of these improvements yet.

Also, multiply-add is always fused in ARM64 mode, so those of you who actually need an unfused multiply then add (you know who you are) will need to use a multiply instruction followed by an add instruction. — May 16, 2014

Integer

I’d qualify the integer portions of NEON as very nimble. For instance, you can right shift at the same time as you narrow, allowing you to extract any subpart of the wider element, and not only that, but you can round and saturate at the same time as you extract, including unsigned saturation from signed integer; very useful for extracting the result from an intermediate fixed-point format. The other way round, you can shift left and widen to get to that intermediate format. Simple left shifts can also saturate; without this, extracting bitfields with saturation is really unwieldy. There are also shift right and insert which allow to efficiently pack bit formats, for instance.

The multiplications are okay, I guess, though vq(r)dmulh is pretty much your only choice for a multiplication that does not widen and is useful for fixed-point computations, so better learn to love it.

Miscellaneous

Though not part of NEON, I should mention the pld (preload) instruction, which has been here since ARMv5TE, as such memory hint/prefetch instructions are often closely associated with SIMD engines. Architecturally, the effect of the pld instruction is not specified, the action performed depends on the processor. On the Cortex A8, this instruction causes the cache line containing the address to be loaded in the L2 cache; on the Cortex A8 the NEON unit directly loads from the L2 cache. If you do use pld, make sure to benchmark the performance before and after, as it can slow down things if used incorrectly.

In general, the documentation from ARM is of good quality, however there are not many figures to explain the operation of an instruction, so you should be ready to read accurate but verbose pseudocode if you’re unsure of the operation of an instruction or need to check it does what you think it does.

ARM64 considerations

Besides the particulars mentioned earlier, it is important to remember that in ARM64 mode the NEON register file is significantly different (with resulting changes to some instructions), meaning that existing NEON assembly code will need to be overhauled before it can work in ARM64 mode, and new assembly NEON code will need to be written twice.

There is no avoiding it: q3 is now no longer related to d6, but is now related to d3, while the high half of q3 can now only be accessed in specific ways, so for existing NEON assembly code you will need to revisit all of it for these kind of assumptions when writing the ARM64 version, and for new code I do not believe it is reasonably feasible, even with macros, to write a single NEON assembly source that would work in both modes.

This is why I recommend foregoing NEON assembly from now on, and writing new NEON code exclusively using compiler intrinsics, where you can easily write one version that compiles to both modes, freeing you from having to worry about register allocation in either mode; I have documented my experience porting to the NEON intrinsics (it’s surprisingly straightforward) in this new post. — May 16, 2014

Cortex A8 implementation of NEON

Remember that at this point, the Cortex A8 informations in this post only have, or pretty much only have historical interest, four years later. Unfortunately, to the best of my knowledge ARM does not document the Cortex A9 timings, and Apple documents even less those of Swift and Cyclone, so we have no reference to base ourselves on these days. — May 16, 2014

The Cortex A8 implements NEON and VFP as a coprocessor “located” downstream from the ARM core: that is, NEON and VFP instructions go through the ARM pipeline undisturbed, before being decoded again by the NEON unit and going through its pipeline. This has a few consequences, the main one being that, while moving a value from an ARM register to a NEON/VFP register is fast, moving a value from a NEON/VFP register to an ARM register is very slow, causing a 20 cycle pipeline stall.

On the Cortex A8, most NEON instructions execute with a single cycle throughput; however, latencies are typically 2 or more, so directly using the result of the previous instruction will cause a stall; try to alternate operation on two parallel data to maximize throughput. Some NEON instructions that operate on Q vectors will execute in two cycles while they take only one when operating on D vectors, as if they were split in two instructions that operate on the two D halves (and in fact this is probably what happens); not much you can do about it, just something to know (not really unusual, remember that up until the Core 2 duo, Intel processors could only execute SSE instruction by breaking them in two since key internal data paths were only 64-bit wide, so all 128-bit instructions took two cycles). However, vzip and vuzp on Q vectors actually take 3 cycles instead of 1 for D vectors, since when operating on Q vectors the operation can not be reduced to two operations of D vectors.

The Cortex A8 has a second NEON pipeline for load/store and permute instructions, so these instructions can be issued in parallel with non-permute instructions (provided there is no dependency issue). Remember the Cortex A8 (and its NEON unit) is an in-order core, so it is not going to go fetch farther instructions in the instruction stream to extract parallelism: only instructions next to each other can be issued in parallel. Notice that duplicating a scalar is considered a permute instruction, so provided you do so a bit before the result is needed, the operation is pretty much free.

One last consideration from a recent ARM blog post is that you shouldn’t use “regular” ARM code to handle the first and last scalar iterations, as there is a penalty when writing to the same area of memory from both NEON and ARM code; even scalar iterations should be done with NEON code (which should be easy with single element loads and stores).

Oh, you mean I have to conclude?

My impression of NEON as a whole so far is that it seems a capable SIMD/multimedia architecture, and stands the comparison with AltiVec and SSE. It doesn’t have some things like sum of absolute differences, and there are probably some missing features I haven’t noticed yet, so it still has to grow a bit to reach parity, but it is already very useful.


  1. In an earlier version I proposed a second way, however I have since removed it due to various concerns; for the reasons, please see A few things iOS developers ought to know about the ARM architecture, section “ARMv7, ARM11, Cortex A8 and A4, oh my!”

The last “I’m a Mac” ad

Mac: “Hello, I’m a Mac”
PC: “And I’m a PC. And I feel like a new computer!”
Mac: “Oh?”
PC: “Yes, Windows 7 has rejuvenated me. No more problems, it changed everything – you should try it. Out with the old, in with the new!”
Mac: “Really? So, you mean, no more BIOS, registry or activation?”
PC: “Yes— NO! What are you talking about? These have nothing to do with it! Why shouldn’t I start by showing a screen with only white text on a black background, full of useless information?”
Mac: “I—”
PC: “Or why shouldn’t I check all the time that I’m not using a pirated Windows, allowing the user to sleep easy in this knowledge?”
Mac: “Actu—”
PC: “Besides, these things are part of me; I couldn’t live without them. Doesn’t that happen to you?”
Mac: “Well, no. If something is a problem, I get rid of it.”
PC: “I mean, you’re almost as old as I am! There’s bound to be some old cruft you can’t get rid of…”
Mac: “…no.”
(iPad enters from the left. She’s a young woman who seems to be 18 or 19. She is, of course, beautiful. She crosses the screen in front of our heroes, paying no attention whatsoever to them. She dances slightly as she’s walking, and she’s humming a tune to herself)
iPad: “I’m iPad, hmm hmm hmmm, hmm hmm hmmm…”
Mac (not looking so smug anymore): “I suddenly feel much older…”
PC (looking in the direction of iPad, who has left the screen): “Why, I feel much younger!”
(Cut to iPad. The actual device, I mean)

A theory on the significance of the Apple A4

Before I begin, a clarification: I do not own an iPad. Besides living in France (where you still can’t even pre-order one at the time of this writing), I also currently have no need for this particular device; however, I am very interested in the computing platform the iPad is inaugurating.

One of the perks of my current workplace is that many of my colleagues, while working on software, have a semiconductors background, NXP being a semiconductors company. So when Apple introduced the iPad, many of us were intrigued by the A4 “processor” they said was powering this device. We thought it was very unlikely they could have created a whole new, competitive processor core implementing the ARM architecture (similarly to e.g. XScale, which implements the ARM architecture but wasn’t created by ARM) in only one year and a half since the acquisition of PA Semi, so we considered Apple probably “just” licensed a processor core from ARM for the A4.

The first analyses seem to indicate that not only this is the case, but the A4 even features “just” a single Cortex A8 core like, for instance, the iPhone 3GS, not something fancier but still plausible like one or two Cortex A9. The same way, the graphics processor seems to be a PowerVR SGX like in the iPhone 3GS. It’s a higher-clocked Cortex A8, and the whole is probably on a smaller process node, but it’s based on a Cortex A8 nonetheless; apparently nothing they couldn’t have obtained from the SoC portfolio of e.g. Samsung (which seems to be fabbing the A4). So what is Apple doing with the A4? They certainly are not designing a SoC just for the sake of doing it.

Let me disclaim that I have no inside information, just a hunch, this is entirely speculation. It may be a sound, consistent theory that would explain everything, and still be wrong because the explanation is something completely different.

While many relate SoCs such as the Apple A41 to recent developments from Intel and AMD which put a graphics processor on the same chip as a processor (sometimes not even on the same die), and call SoCs: “processors”, a SoC is a system. But instead of being a system built by putting together chips from different vendors on a board, a System on a Chip is “built” by laying out components from different vendors on the same silicon die; this allows smaller designs, sometimes lower costs, and lower consumption from a comparable multi-chip solution. Using a SoC is pretty much a necessity on devices as constrained as a phone, and even if the iPad is less constrained, it is still a big win there.

This sounds like a tautology, but by designing their own SoC, Apple is designing their own system. The off-the-shelf SoCs, and even the ones customized for Apple found in other iPhone OS devices (which we know are customized if only because they are Apple-branded), may have been OK for the iPhone and iPod Touch, but these SoCs were initially designed with more traditional handsets in mind; the iPhone OS interface, with its smooth, continuous scrolling, use of animations, transparency, etc. (all of which are characteristic of the “new computing” the iPhone OS embodies) probably taxes these SoCs in ways that were not foreseen with Symbian and Windows Mobile interfaces. The graphics processor can do all these effects, but the intensity with which they are used likely reveals bottlenecks (probably data bandwidths) in the architecture of these SoCs; notice the processor core matters very little here. Now consider that the iPad needs to move more than five times more pixels than an iPhone, and you may start to understand the problem. There are probably other “areas” (e.g. power saving) of the system that could be properly designed only with a view of the whole system, with a whole software stack above the hardware. By designing the A4, Apple is more directly making the hardware decisions that will matter, for instance how the memory is shared; not in amount (I’m sure that’s configurable already) but e.g. in bandwidth. While the processor core matters too, it was probably not the main liability here.

Remember what Mansfield says in the iPad intro video, that the A4 was designed by the hardware team together with the software team, giving performance that could not be achieved any other way? That fits this theory. It is related to the end-to-end argument, which basically states that adding features at a low level has to be done in light of the whole system, otherwise the feature will be of limited usefulness; a consequence is that a low-level component, so far designed for a given system, may have some deficiencies when used in a new system, and these deficiencies can only be revealed in the context of this new system. Given how they use the hardware, iPhone OS devices end up being different enough systems that it makes sense to design a more specific SoC for it, and keep anyone else out of the design loop. To top it off, it allows to keep more details secret from Samsung, which is also a potential competitor.

To give you an analog situation, read this. Basically, on the original Macintosh, memory was accessed in regular alternance between the processor and the display system, as there was no dedicated video memory; not only that, but at the end of each scan line, there was no access during the interval when the screen beam goes back to the start of the next line, so they took advantage of this to fetch an audio sample instead. A brilliant design. Now imagine that instead of using a 68000 and a bunch of PALs for the other logic, the Mac team had to use a single chip containing the whole system except for memory and some I/O, and that chip was more designed with computers like the IBM PC in mind, and so actually optimised for text interfaces and PC speaker beeps. Would they have been able to build the Macintosh with such a chip? Even if they could have gotten the supplier of such an imaginary chip to fix bottlenecks and add features, this would still have been an extra step in the design loop, so they might eventually have had to develop such a chip themselves — if not at first, then for, say, the Mac II. Now, while there are direct parallels such as both devices having video memory shared with system memory, I don’t think the design challenges are similar in detail; but the situations are similar in a broad sense.

Note that this is valid for systems that are still maturing (and the portable smart device category is certainly one in flux right now); for mature systems the differences between platforms are less different and the technology is more universally mastered, such that it is more efficient for system-level hardware to be outsourced to a few common suppliers; this is the case for desktop computing nowadays. On mobile devices, however, in-house SoC design is probably going to be a competitive advantage in the foreseeable future, just like it was with personal computers in the 80’s.


  1. The Apple A4 is actually a package, that is, there are actually three dies in the ceramic package; however two of these are the RAM, the third chip is the A4 SoC.

Rallying against Section 3.3.1 of the new iPhone Developer Agreement

I wasn’t planning on starting my blog that way. However, circumstances mandate it.

Basically, Apple’s new iPhone Developer Agreement terms added to section 3.3.1 are not just unacceptable, hubristic by pretending to specify which tools we use for the job, and anti-competitive. They’re also completely impossible to enforce, and so utterly ambiguous that no one in his right mind should agree to them.

Read the additions again. They are obviously meant to prohibit “translation layers” such as Flash and other similar technologies. But as written, they could mean anything. For instance, suppose you’re using (or a dependency you’re using uses) a build system that dynamically generates headers depending on configuration and/or platform capabilities. It translates specification files to C code! It’s prohibited! Okay, you may say that it isn’t actually “code” which is generated, just defines and similar configuration. What if you use Lex and Yacc (or substitutes), which take a specification (e.g. a grammar in the case of Yacc) and do generate actual C functions? It’s prohibited! And what if you use various tools and scripts to generate variations of C code as part of your build process, because the C preprocessor is not powerful enough for your needs? To say nothing of having (some of) your code be written in Pascal, Fortran, etc.

It’s worse if you consider that the language of the agreement could be interpreted to mean that libraries that abstract even partially the “Documented APIs”, even if you use them from C/C++/Objective-C, could be prohibited. It’s in a limbo, like usage of an on-device interpreter with “sealed” (i.e. not downloadable, not user-changeable) scripts, which isn’t clearly allowed or prohibited (or rather wasn’t; the new terms clearly forbid it), so few people did it for fear of trying the legal waters. Furthemore, someone could come up with a cross-platform meta-framework with a C++ API (very possible, for instance Bada has C++ APIs), and given the intent behind this change it could be something Apple would want to block as well; I’m loathe to use the “Apple may do this in the future, we must block them now” argument, but I’m not doing so, it’s something they could try to do with an interpretation of the current language of the agreement (who’s to say the meta-framework isn’t a “compatibility layer”?).

It’s not just a matter of how the terms are written. Apple is basically trying to mandate how our programs are developed and maintained. What if you have special needs and develop a custom language that compiles down to C (which is then compiled the usual way)? That doesn’t seem very kosher under these terms. What tools we use is our business, not Apple’s; what matters is the output. It’s also dangerously close to mandating what kind of infrastructure software (as opposed to user-facing functionality) is allowed to run on their hardware.

And this disposition is also completely impossible to enforce. If you have a “translation layer” that works entirely by generating C and Objective-C code which is then processed by the SDK tools1, how could anyone or anything tell from the output? False negative and even false positives are going to happen.

You might think you can avoid agreeing to the new terms, keep using the current SDK, and ignore the new APIs and functionality… except one must accept the new agreement to be able to access the provisioning portal past April 22nd; some time after that, the provisioning profiles will expire, and development on a device will be impossible.

So what can we do? I don’t think it makes much sense to boycott development for the plaftorm, or all devices running that platform altogether, because Apple will not realise the loss until it is way too late. So we should let them know these agreement terms are a problem. There is no Radar to dupe, as this is not a matter with a product, instead you should contact them to let them know why you won’t, and possibly can’t, agree to the terms, and ask them to clarify edge cases until these agreement terms become meaningless. For instance, they forgot (I can’t see any other explanation) to list assembler code as one of the mandatory languages, so if you use, or plan to use, assembler as part of your project, contact! The same goes for shader language, so if you use, or plan to use, OpenGL ES 2.0, contact! Oh, and Objective-C++ too! Contact! Games often use a middleware engine, so if you use one, or plan to use one, contact! I’m sure some projects out there use some Pascal, Fortran, Ada or Lisp (why not?), if that’s your case, contact! Using a tool or a fancy build system that generates C code? Contact! Unsure about something? Contact! You can even use your imagination to come up with an edge case that they haven’t anticipated, however no spam please, because we’re not having a temper tantrum, but a legitimate concern about the validity of these new terms.

In closing, I will say that it’s not just a matter of Apple making the rules, and we play by them or not at all. Here Apple has reached the point of hubris, and they must be brought back down to Earth. If they are afraid that iPhone development is becoming more popular because of the installed base than because of the Cocoa Touch framework, and that developers are going to sidestep Cocoa Touch (eventually making them lose their network effect), then the answer is to make and keep Cocoa Touch awesome, which is currently the case; the answer is not to mandate its use.


  1. They may even sidestep the Xcode IDE: Xcode is not doing anything magical, just invoking the compiler, resource compiler, code sign utility, etc.

Start

Hello, traveler.

You have reached the start of Wandering Coder. Yep, this is the very first post. You can now start reading chronologically from here, if you’re so inclined.