ARM multicore systems such as the iPad 2 feature a weakly ordered memory model

At the time of this writing, numerous multicore ARM devices are either shipping or about to ship: handsets, of course, but more interestingly the current wave of tablets, the iPad 2 in particular (though not only it), is generally built around multicore ARM chips, be it the Tegra 2 from NVIDIA, the OMAP 4 from TI, etc. ARM multicore systems did exist before, as the ARM11 was MP-capable, but I am not aware of it being used in many devices open to third-party development, if any; this really seems to be exploding now with the Cortex A9.

These devices will also introduce a surprising system behavior to many programmers for the first time, a behavior which, if it isn't understood, will cause crashes or worse.

Let me show what I’m talking about:

BOOL PostItem(FIFO* cont, uint32_t item) /* Bad code, do not use in production */
{ /* Bad code, do not use in production */
#error This is bad code, do not use!
    size_t newWriteIndex = (cont->writeIndex+1)%FIFO_CAPACITY; /* Bad code, do not use in production */
    /* see why at http://wanderingcoder.net/2011/04/01/arm-memory-ordering/ */
    if (newWriteIndex == cont->readIndex) /* Bad code, do not use in production */
        return NO; /* notice that we could still fit one more item,
                    but then readIndex would be equal to writeIndex
                    and it would be impossible to tell from an empty
                    FIFO. */
                    
    cont->buffer[cont->writeIndex] = item; /* Bad code, do not use in production */
    cont->writeIndex = newWriteIndex; /* Bad code, do not use in production */
    
    return YES; /* Bad code, do not use in production */
}

BOOL GetNewItem(FIFO* cont, uint32_t* pItem) /* Bad code, do not use in production */
{
#error This is bad code, do not use!
    if (cont->readIndex == cont->writeIndex) /* Bad code, do not use in production */
        return NO; /* nothing to get. */
        
    *pItem = cont->buffer[cont->readIndex]; /* Bad code, do not use in production */
    /* see why at http://wanderingcoder.net/2011/04/01/arm-memory-ordering/ */
    cont->readIndex = (cont->readIndex+1)%FIFO_CAPACITY; /* Bad code, do not use in production */
    
    return YES; /* Bad code, do not use in production */
}

(This code is taken from the full project, which you can download from Bitbucket in order to reproduce my results.)

This is a lockless FIFO; it looks innocent enough. I tested it in the following setup: a first thread posts consecutive integers slightly more slowly than a second thread gets them (so that the FIFO is often empty), and the second thread checks that the integers it receives are consecutive. When this setup was run on the iPad 2, in every run the second thread very quickly (after about 100,000 transfers) got an integer that wasn't consecutive with the previous one received; instead, it was the expected value minus FIFO_CAPACITY, in other words a leftover value from the buffer.
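For reference, here is a minimal sketch of the kind of harness just described. It is a reconstruction of mine, not the actual test code from the project: the FIFO layout, the producer delay, and the use of raw pthreads are all assumptions, and it would be built alongside the listing above (which also needs the project's BOOL/YES/NO definitions, e.g. from <objc/objc.h>).

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define FIFO_CAPACITY 1024 /* assumed value; the real one is in the project */

typedef struct { /* plausible layout for the FIFO used by the listing above */
    uint32_t buffer[FIFO_CAPACITY];
    size_t   readIndex;   /* only ever written by the consumer */
    size_t   writeIndex;  /* only ever written by the producer */
} FIFO;

/* PostItem() and GetNewItem() as in the listing above. */

static FIFO fifo; /* zero-initialized, i.e. empty */

static void* producer(void* unused)
{
    (void)unused;
    for (uint32_t i = 0; ; i++)
    {
        while (!PostItem(&fifo, i))
            ;           /* FIFO full: retry */
        usleep(10);     /* post slightly more slowly than the consumer gets */
    }
    return NULL;
}

static void* consumer(void* unused)
{
    (void)unused;
    uint32_t expected = 0, item;
    for (;;)
    {
        while (!GetNewItem(&fifo, &item))
            ;           /* FIFO empty: retry */
        if (item != expected)
        {
            printf("got %u, expected %u (difference %d)\n",
                   item, expected, (int)(expected - item));
            return NULL;
        }
        expected++;
    }
}

int main(void)
{
    pthread_t prod, cons;
    pthread_create(&cons, NULL, consumer, NULL);
    pthread_create(&prod, NULL, producer, NULL);
    pthread_join(cons, NULL); /* returns once a non-consecutive value shows up */
    return 0;
}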

What happens is that the system allows writes performed by one core (the one which runs the first thread) to become visible to another core out of order. So the second core, running the second thread, first sees that writeIndex was updated, goes on to read the buffer at offset readIndex, and only after that sees the write to that location in the buffer, so it reads what was there before that write.

A processor architecture which, like ARM, allows this to happen is referred to as weakly ordered. This behavior might seem scandalous, but remember that your code runs on two processing units which, while they share the same memory, are not tightly synchronized, so you cannot expect everything to behave exactly as it would in the single-core case; this is what allows two cores to be faster than one. Many processor architectures permit writes to be reordered (PowerPC, for instance); among other things, permitting this allows a significant reduction in cache synchronization traffic. While it also allows more freedom when designing out-of-order execution in the processor core, it is not necessary: a system made of in-order processors may reorder writes because of the caches, and it is possible to design a system with out-of-order processors that does not reorder writes.

Speaking of which, x86, on the other hand, guarantees that writes won't be reordered; that architecture is referred to as strongly ordered. This is not to say it doesn't do any reordering: reads, for instance, are allowed to happen ahead of writes that come "before" them, which breaks a few algorithms such as Peterson's algorithm. Since this architecture dominates the desktop, and since common mobile systems have featured only a single core so far and thus don't display memory ordering issues, programmers have gotten used to a strongly ordered world and are generally unaware of these issues. But now that the iPad 2 and other mainstream multicore ARM devices are shipping, exposing a large number of programmers to a weakly ordered memory model for the first time, they can no longer afford to remain ignorant: going from a strongly ordered memory model to a weakly ordered one breaks far more algorithms, and much more common ones (like the double-checked lock and this naive FIFO), than going from a single processor to a strongly ordered multiprocessor system ever did.
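To make the double-checked lock example concrete, here is a minimal sketch of that pattern; the Engine type, the function name, and the pthread-based locking are mine, for illustration, but the flaw is the classic one.

#include <pthread.h>
#include <stdlib.h>

typedef struct { int configured; /* ... more fields ... */ } Engine;

static Engine* sharedEngine = NULL;
static pthread_mutex_t engineLock = PTHREAD_MUTEX_INITIALIZER;

Engine* GetSharedEngine(void) /* double-checked lock: broken on weakly ordered systems */
{
    if (sharedEngine == NULL)            /* first check, without the lock */
    {
        pthread_mutex_lock(&engineLock);
        if (sharedEngine == NULL)        /* second check, with the lock */
        {
            Engine* engine = malloc(sizeof(Engine));
            engine->configured = 1;      /* initialize the object... */
            sharedEngine = engine;       /* ...then publish the pointer */
        }
        pthread_mutex_unlock(&engineLock);
    }
    /* A thread on another core may see the non-NULL pointer before it sees
       the write to configured, and use a half-initialized object. */
    return sharedEngine;
}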

Note that this can in fact cause regressions in already shipping iOS App Store apps (it is unclear whether existing apps are confined to a single core for compatibility or not): while very few iOS apps really take advantage of more than one core yet, some will nevertheless do so from time to time, since they are threaded for other reasons (e.g. to have tasks run in real time for games or audio/video playback). However, Apple certainly tested existing iOS App Store apps on the iPad 2 hardware and would have noticed if this caused many issues, so it probably affects only a limited number of apps and/or occurs rarely. Still, it is important to raise awareness of this behavior, as an unprecedented number of weakly ordered devices are now going to be in the wild, and programmers are expected to make use of these two cores.

What now?

So what if you have a memory ordering issue? Well, first, you don't necessarily know that it is one, just as with threading bugs; all you know is that you have an intermittent issue, and you won't know it is memory-ordering related until you find the root cause. And if you thought threading bugs were fun, wait until you investigate a memory ordering issue. Like threading issues, the scenarios in which memory ordering issues manifest themselves occur rarely, which makes them just as hard (if not harder) to track down.

To add to the fun, the fact that your code runs fine on a multicore x86 system (which practically all Intel Macs, and therefore practically all iOS development machines, are) does not prove at all that it will run correctly on a multicore ARM system, since x86, as we've seen, is strongly ordered. So these memory ordering issues will manifest themselves only on device, never on the Simulator. You have to debug on device.

Once you find a plausible culprit, how do you fix it (given that often the only way to show that the root cause is where you suspect it is, is to fix the code anyway and see whether the symptoms disappear)? I advise against memory barriers. At least with threading bugs you can reason in terms of a sequence of events (the instructions of one thread executing, one thread interrupting another, etc.); with memory ordering bugs there is no longer any such thing as a single sequence: each core has its own, and, as in Einstein's relativity, simultaneity across different reference frames is now meaningless. This makes memory ordering issues extremely hard to reason about, and the last thing you want is to leave one incorrectly resolved: it would be neither done nor to be done (in other words, a botched job).

Instead, what I do is lock the code with a mutex, as it should have been in the first place. On top of its traditional role, the mutex ensures that a thread that takes it sees all the writes made before the mutex was last released, even on another core, which takes care of the problem. Your code won't be called often enough for the mutex to have any measurable performance impact (unless you're one of the few working on the fundamental primitives of an operating system or a game engine, in which case you don't need my advice).
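For instance, here is a minimal sketch of what the posting side could look like once protected by a mutex; the lock field added to the FIFO structure (and its initialization with PTHREAD_MUTEX_INITIALIZER or pthread_mutex_init()) is my addition for illustration, and GetNewItem() would be protected the same way.

#include <pthread.h>

BOOL PostItem(FIFO* cont, uint32_t item)
{
    BOOL posted = NO;

    pthread_mutex_lock(&cont->lock); /* assumes a lock field added to FIFO */
    size_t newWriteIndex = (cont->writeIndex+1)%FIFO_CAPACITY;
    if (newWriteIndex != cont->readIndex) /* otherwise the FIFO is full */
    {
        cont->buffer[cont->writeIndex] = item;
        cont->writeIndex = newWriteIndex;
        posted = YES;
    }
    pthread_mutex_unlock(&cont->lock);
    /* The unlock/lock pair guarantees that once the consumer has taken the
       mutex, it sees the write to the buffer as well as the write to
       writeIndex, in addition to excluding concurrent updates. */

    return posted;
}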

For new iOS code, especially code meant to run on more than one core at the same time, I suggest using Grand Central Dispatch, and using it in place of any other explicit or implicit thread communication mechanism. Even if you don't want to tie yourself too much to iOS, coding in this way will make the various tasks and their dynamic relationships clear, making any future port easier. If you're writing code for another platform, try to use similar task management mechanisms; if they exist, they're very likely to be better than what you could come up with yourself.
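As an illustration, here is a minimal sketch (mine, not taken from the original project) of passing the same consecutive integers with Grand Central Dispatch instead of a hand-rolled FIFO; the queue names are made up, and the blocks dispatched onto the serial consumer queue replace both the FIFO and the mutex, with GCD taking care of the memory ordering between cores.

#include <dispatch/dispatch.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* a NULL attribute creates a serial queue */
    dispatch_queue_t producerQueue = dispatch_queue_create("net.example.producer", NULL);
    dispatch_queue_t consumerQueue = dispatch_queue_create("net.example.consumer", NULL);
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    __block uint32_t expected = 0; /* only ever touched on consumerQueue */

    dispatch_async(producerQueue, ^{
        for (uint32_t i = 0; i < 100000; i++)
        {
            uint32_t item = i; /* the block captures a copy of the value */
            dispatch_async(consumerQueue, ^{
                /* the block is guaranteed to see every write made before
                   it was enqueued, on any core */
                if (item != expected)
                    printf("got %u, expected %u\n", item, expected);
                expected = item + 1;
                if (item == 99999)
                    dispatch_semaphore_signal(done);
            });
        }
    });

    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
    printf("transferred %u items\n", expected);
    return 0;
}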

But the important thing is to be aware of this behavior, and to spread that awareness in your organization. Once you're aware of it, you're much better equipped to deal with it. As we say in France, “Un homme averti en vaut deux” (a warned man is worth two).

Here are a few references, for further reading:

This post was initially published with entirely different contents as an April Fools' joke. In the interest of historical preservation, the original content has been moved here.