Besides fused multiply-add, what is the point of ARMv7s?

This post is part of a series on why an iOS developer would want to add a new architecture target to an iOS app project. The posts in this series so far cover ARMv7, ARMv7s (this one), and ARM64 (soon!).

You probably remember the kerfuffle when, at the same time the iPhone 5 was announced (it was not even shipping yet), Apple added ARMv7s as a default architecture in Xcode without warning. But just what is it that ARMv7s brings, and why would you want to take advantage of it?

One thing that ARMv7s definitely brings is support for the VFPv4 and VFPv3 half-precision extensions, which consist of the following: fused floating-point multiply-add, and half-precision floating-point values (only for converting to and from single precision; no other operation supports the half-precision format), as well as the vector versions of these operations. Both have potential applications, even if they are not universally useful, so it was indispensable for Apple to define an ARM architecture version that lets apps use them in practice. Had Apple not defined ARMv7s, no one could have used these instructions even though the iPhone 5 hardware is able to run them, as there would have been no way to safely include them in an iOS app (that is, in a way that does not cause the app to crash when run on earlier devices).
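To make the "fused" part concrete, here is a minimal sketch in plain C of what a fused multiply-add keeps that a separate multiply and add lose: the product is not rounded before the addition. The function names are hypothetical, and the fused result is emulated in double precision rather than produced by the VFPv4 instruction itself (in real code, the C99 function fmaf from <math.h> expresses the same operation).

```c
#include <float.h>

/* Separate multiply, then add: the product is rounded to single precision
   before the addition. The volatile keeps the compiler from fusing the two. */
static float MultiplyThenAdd(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Stand-in for the fused result (an emulation, not the hardware instruction):
   the product of two floats is exact in double precision (24 + 24 significand
   bits fit in 53), so nothing is lost before the addition. */
static float FusedReference(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}

/* Sample inputs chosen so the effect is visible: with a = b = 1 + 2^-23 and
   c = -(1 + 2^-22), the exact value of a*b + c is 2^-46. */
static float DifferenceForSampleInputs(void)
{
    float a = 1.0f + FLT_EPSILON;
    float c = -(1.0f + 2.0f * FLT_EPSILON);
    return FusedReference(a, a, c) - MultiplyThenAdd(a, a, c);
}
```

For these inputs the separate version returns exactly zero, because the 2^-46 term is rounded away with the product, while the fused-style version keeps it. Whether that matters depends entirely on the algorithm, which is why this is useful but not universally so.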

So we have determined that it was necessary for Apple to define ARMv7s, or this new functionality of the iPhone 5 processor would have been added for nothing. Got it. But what if you are not taking advantage of these new floating-point instructions? It is important to realize that you are not taking advantage of them unless you know full well that you are: the compiler will never generate these instructions on its own, so the only way to benefit from this functionality is for your project to include algorithms that were specifically developed to take advantage of them. And if that is not the case, then as far as I can tell using ARMv7s… is simply pointless. That is, there is no tangible benefit.
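As an illustration of what "specifically developed" could look like, here is a hedged sketch of a multiply-accumulate routine that uses the vector fused multiply-add intrinsic only when the slice being compiled supports it. It assumes the compiler defines the ACLE feature macro __ARM_FEATURE_FMA for such a slice; the function and macro names (MultiplyAccumulate, EXAMPLE_HAVE_VECTOR_FMA) are made up for this example.

```c
#include <stddef.h>

/* Assumption: the compiler defines these ACLE macros when the target slice
   has NEON and fused multiply-add. Everywhere else the scalar loop is used. */
#if (defined(__ARM_NEON) || defined(__ARM_NEON__)) && defined(__ARM_FEATURE_FMA)
#include <arm_neon.h>
#define EXAMPLE_HAVE_VECTOR_FMA 1
#else
#define EXAMPLE_HAVE_VECTOR_FMA 0
#endif

/* Computes acc[i] += x[i] * y[i] for every i in [0, count). */
static void MultiplyAccumulate(float *acc, const float *x, const float *y, size_t count)
{
    size_t i = 0;

#if EXAMPLE_HAVE_VECTOR_FMA
    /* Four elements at a time with the fused vector instruction. */
    for (; i + 4 <= count; i += 4)
    {
        float32x4_t sum = vld1q_f32(acc + i);
        sum = vfmaq_f32(sum, vld1q_f32(x + i), vld1q_f32(y + i));
        vst1q_f32(acc + i, sum);
    }
#endif

    /* Scalar loop: handles the tail, and everything on other targets. */
    for (; i < count; i++)
    {
        acc[i] += x[i] * y[i];
    }
}
```

The same source file then builds for every slice, and only the slice whose target has the instructions gets the specialized path. Note that the two paths may round differently, so results are not guaranteed to be bit-identical across slices.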

First, let us remember that adding an ARMv7s slice will almost double the executable binary size compared to shipping a binary with only ARMv7. That may or may not be a significant cost, depending on whether other data (art assets, other resources) already dominates the size of your app, but it remains something to pay attention to. So the decision to include an ARMv7s slice already starts in the red.

Go forth and divide

So let us see what other benefits we can find. The other major improvement of ARMv7s is integer division in hardware. So let us try and see how much it improves things.

#include <stdint.h>

/* Performs one division per loop turn, 4*iterations turns in total. */
int ZPDivisions(void* context)
{
    uint32_t i, accum = 0;
    uint32_t iterations = *((uint32_t*)context);

    for (i = 0; i < 4*iterations; i+=1)
    {
        accum += (4*iterations)/(i+1);
    }

    return accum;
}
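For readers who want to time such a function themselves, here is a minimal sketch of a harness that averages a few runs. It is not the harness used for the numbers in this post; the names (BenchmarkFunction, SampleDivisions, AverageMilliseconds) are hypothetical, and it measures CPU time with the standard clock() function rather than any iOS-specific timer.

```c
#include <stdint.h>
#include <time.h>

/* Same signature as the test functions in this post. */
typedef int (*BenchmarkFunction)(void *context);

/* Stand-in workload with the same form as the division loop above. */
static int SampleDivisions(void *context)
{
    uint32_t i, accum = 0;
    uint32_t iterations = *((uint32_t *)context);

    for (i = 0; i < 4 * iterations; i += 1)
    {
        accum += (4 * iterations) / (i + 1);
    }

    return (int)accum;
}

/* Storing the result here keeps the compiler from discarding the work. */
static volatile int benchmarkSink;

/* Calls the function `runs` times and returns the mean CPU time of one
   call, in milliseconds. */
static double AverageMilliseconds(BenchmarkFunction function, void *context, int runs)
{
    double total = 0.0;
    int run;

    for (run = 0; run < runs; run++)
    {
        clock_t start = clock();
        benchmarkSink = function(context);
        total += (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC;
    }

    return total / runs;
}
```

Calling AverageMilliseconds(SampleDivisions, &iterations, 3) with iterations set to 1000000 would mirror the setup described here; absolute numbers will of course differ with the device, the compiler, and the optimization settings.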

OK, let us measure how much time it takes to execute (iterations equals 1000000, running on an iPhone 5S, averaged over three runs):

test        ARMv7       ARMv7s
divisions   24.951 ms   25.028 ms

…No difference. That can’t be?! The ARMv7 version includes this call to __udivsi3, which should be slower. Let us see in the debugger what happens when it is called:

libcompiler_rt.dylib`__udivsi3:
0x3b767740:  tst    r1, r1
0x3b767744:  beq    0x3b767750                ; __udivsi3 + 16
0x3b767748:  udiv   r0, r0, r1
0x3b76774c:  bx     lr
0x3b767750:  mov    r0, #0
0x3b767754:  bx     lr

D’oh! When run on an ARMv7s device, this runtime function simply uses the hardware instruction to perform the division. On other platforms, such a function may be provided in a library statically linked to your app, whose code is then frozen. But that is not how Apple rolls. On iOS, even such seemingly trivial functionality is provided by the OS: the call to __udivsi3 is a dynamic library call that is resolved at runtime and uses the most efficient implementation for the hardware, even if your binary only has ARMv7. In other words, on devices which would run your ARMv7s slice, you already benefit from the hardware integer division even without providing an ARMv7s slice.
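Read back into C, the disassembly above amounts to the following sketch. The function name is hypothetical and this is only a rendering of those six instructions, not the actual source of the runtime library.

```c
#include <stdint.h>

/* C rendering of the six instructions shown in the disassembly:
   a zero divisor yields 0, anything else goes straight to the divide. */
static uint32_t UnsignedDivideLikeRuntime(uint32_t numerator, uint32_t denominator)
{
    if (denominator == 0)               /* tst r1, r1 ; beq */
    {
        return 0;                       /* mov r0, #0 ; bx lr */
    }

    return numerator / denominator;     /* udiv r0, r0, r1 ; bx lr */
}
```

So on such a device the whole cost of calling the runtime function is the call itself, one test, one branch that is not taken, and the return, wrapped around the same udiv instruction an ARMv7s slice would use directly.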

Take 2

But wait, surely this dynamic library function call has some overhead compared to directly using the instruction, could we not reveal this by improving the test? We need to go deeper. Let’s find out, by performing four divisions during each loop, which will reduce the looping overhead:

int ZPUnrolledDivisions(void* context)
{
    uint32_t i, accum = 0;
    uint32_t iterations = *((uint32_t*)context);
    
    for (i = 0; i < 4*iterations; i+=4)
    {
        accum += (4*iterations)/(i+1) + (4*iterations)/(i+2) + (4*iterations)/(i+3) + (4*iterations)/(i+4);
    }
    
    return accum;
}

And the results are… drum roll… (iterations equals 1000000, running on an iPhone 5S, averaged over three runs):

test                 ARMv7       ARMv7s
unrolled divisions   24.832 ms   24.930 ms

“Come on now, you’re messing with me here, right?” Nope. There is actually a simple explanation for this: even in hardware, integer division is very expensive. A good rule of thumb for the respective costs of execution for the elementary mathematical operations on integers is this:

operation (on 32-bit integers)   +   -   ×        ÷
cost in cycles                   1   1   3 or 4   20+

This approximation remains valid across many different processors. And the amortized cost of a dynamic library function call is pretty low (only slightly more than a regular function call), so it is dwarfed by the execution time of the division instruction itself.

Take 3

I had one last idea of where we could look to observe a penalty when function calls are involved. We need to go even deeper: having these calls to __udivsi3 forces the compiler to put the input variables into the same hardware registers before each division, so the processor is not going to be able to run the divisions in parallel. So let us modify the code so that, in ARMv7s, the divisions could actually run in parallel:

int ZPParallelDivisions(void* context)
{
    uint32_t i, accum1 = 0, accum2 = 0, accum3 = 0, accum4 = 0;
    uint32_t iterations = *((uint32_t*)context);
    
    for (i = 0; i < 4*iterations; i+=4)
    {
        accum1 += (4*iterations)/(i+1);
        accum2 += (4*iterations)/(i+2);
        accum3 += (4*iterations)/(i+3);
        accum4 += (4*iterations)/(i+4);
    }
    
    return accum1 + accum2 + accum3 + accum4;
}

(iterations equals 1000000, running on an iPhone 5S, averaged over three runs):

test                 ARMv7       ARMv7s
parallel divisions   25.353 ms   24.977 ms

…I give up (the difference has no statistical significance).

There might be other benefits to avoiding a function call for each integer division, such as the compiler not needing to consider the values stored in caller-saved registers as being lost across the call, but honestly I do not see these effects as having any measurable impact on real-world code.

If you want to reproduce my results, you can get the source for these tests on Bitbucket.

What else?

We have already looked pretty far in trying to find benefits in directly using the new integer division instruction. What if we set that aside for now and see what else ARMv7s brings? Technically, nothing else: ARMv7s brings VFPv4 and VFPv3-HP, their vector counterparts, integer division in hardware, and that’s it for unprivileged instructions as far as anyone can tell.

However, when compiling an ARMv7s slice, Clang will apparently take advantage of this to optimize the code specifically for the Swift core (according to these patches, via Stack Overflow). These optimizations are of the tuning variety, so do not expect that much from them, but their main limitation is that not that many iOS devices run on Swift, in the grand scheme of things. If you check the awesome iOS Support Matrix (ARMv7s column), you will see for instance that no iPod Touch model runs it, and that the iPad mini skipped it entirely (going directly from ARMv7 to ARM64). So is it worth optimizing specifically for the 4th generation iPad, the iPhone 5, and the iPhone 5C? Maybe not.

What compiling for ARMv7s won’t bring you

And now it’s time for our regular segment, “let us dispel some misconceptions about what a new ARM architecture version really brings”. Adding an ARMv7s slice will not:

  • make your code run more efficiently on ARMv7 devices, since those will still be running the ARMv7 compiled code; this means it could potentially improve your code only on devices where your app already runs faster.
  • improve performance of the Apple frameworks and libraries: those are already optimized for the device they are running on, even if your code is compiled only for ARMv7 (we saw this effect earlier with __udivsi3).
  • fix the few cases where ARMv7s devices run code less efficiently than ARMv7 ones: this happens on these devices even if you only compile for ARMv7, so adding (or switching to) an ARMv7s slice will neither help nor hurt.
  • speed up third-party dependencies whose libraries provide only an ARMv7 slice (you can check with otool -vf <library name>): the code of such a dependency won’t become more efficient if you compile for ARMv7s (if it does provide an ARMv7s slice, compiling for ARMv7s will allow you to use it, maybe making it more efficient).

I need you

Seems clear-cut, right? Not so fast. Sure, we explored some places where we thought direct use of hardware integer division could have improved things, but maybe there are actual improvements in places I did not explore, places with a more complex mix between integer division and other operations. Maybe tuning for Swift does improve things for Cyclone too (which is represented by the ARM64 column devices in iOS Support Matrix), and maybe it could be worth it. Maybe I am wrong and Clang can take advantage of fused multiply-add without you needing to do a thing about it. Maybe I completely missed some other instructions that ARMv7s brings.

And most of all, I have not actually run any real benchmark here, for one good reason: I have little idea of the kind of algorithms iOS apps spend significant CPU time on (outside of the frameworks), so I do not know what kind of benchmark to run in the first place (as for Geekbench, I do not think it really represents tasks commonly done in iOS apps, and in addition I am wary of CPU benchmarks I cannot see the source code of). A good benchmark would keep us from missing the forest for the trees, in case that is what is happening here.

So I need you. I need you to run your app with, and without, an ARMv7s slice on a Swift device (as well as a Cyclone device, if you are so inclined), and report the outcome, such as increased performance or decreased processor usage (the latter is important for battery life). Failing that, I need you to tell me what improvements you remember seeing on Swift devices when you added an ARMv7s slice, or what you concluded when you evaluated adding one, or at least what you can share of it. I need you to tell me if I missed something.

And that is why I am exceptionally going to allow comments on this post. In fact, they should appear immediately without having to go through moderation, in order to facilitate the conversation. But first, be wary of Akismet: if your comment is flagged as spam, try and rework it a bit and post again. Second, comments with nothing to do with the matter at hand will be subject to instant vaporization.

So have at it:

4 Comments

  1. Note that at the time of this writing, the iOS Support Matrix site has not yet been updated for the iPads introduced fall 2013.

  2. mason moore

     /  February 15, 2014

    May I have your permission to post this on my twitter?

  3. @mason moore: sure, you can tweet a link to this post; you can even retweet http://twitter.com/wanderingcoder/status/434624750141583360 if you prefer.

  4. Alleluia! The iOS support matrix site has been updated with the iPads introduced October 2013.
