Grand Central Dispatch is a douchebag

I’ve always had trouble with writing code for multithreading; I could get threading well enough for other purposes (e.g. real-time applications), but introducing parallelism for the specific purpose of exploiting execution resources changed the problem space enough that I had little handle on it. Now thanks to my recent completion of a project that was amenable to multithreading, I have… less trouble with this code.

So here is what I can share of my learnings; this is centered on Apple platforms, but also has wider applications.

A word of warning: technically, not much has changed in Swift's support for concurrency since 2014: the behavior of concurrently executing Swift code is not specified at all, and that is only barely starting to change with structured concurrency. That being said, Swift is better suited for presenting such sample code, and is more future-proof anyway, allowing us to glimpse possible futures; so this article will use Swift.

But if there is no specification, what are the rules, then? It would appear that, as long as you don't access either variables declared var, or class instances, from multiple threads concurrently, you will be fine. Note that the prohibition includes concurrent access through struct copies that end up referring to the same class instance, unless the struct is a well-known type (like Array) which has special handling to support this use case.
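For instance, that rule plays out as follows (Box and Wrapper are made-up types of my own, reduced to the bare minimum for illustration):

```swift
import Foundation

// Hypothetical types for illustration only.
final class Box {
    var contents: UInt32 = 0;
}

struct Wrapper {
    var tag: UInt32; // value: each struct copy gets its own
    var box: Box;    // reference: struct copies share the instance
}

let original = Wrapper(tag: 0, box: Box());
var copy = original;

// Fine to do while another thread reads original.tag: the copy's
// storage for 'tag' is truly independent.
copy.tag = 1;

// NOT fine to do while another thread touches original.box: both
// structs refer to the same Box instance, so this is shared mutable
// state, unlike Array, whose special (copy-on-write) handling of its
// shared storage supports concurrent access to copies.
copy.box.contents = 42;
```

After the last line, original.box.contents is 42 as well: the "copy" mutated state visible through the original, which is exactly what makes concurrent access through such copies unsafe.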

The commandments for efficient multithreading

Even though I did not create it for the day job, I don't want to share my project just yet; instead, I synthesized an example for the purposes of this post. Let us say I have this processing, here in a completely serial form, that I intend to refactor to run in a multithreaded fashion:

//
//  Created by Pierre Lebeaupin on 13/12/2020.
//  Copyright © 2020 Pierre Lebeaupin. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//

import Foundation;

enum Op: String {
    case additive = "+", multiplicative = "*";

    func neutralValue() -> UInt32 {
        return ((self == .additive) ? 0 : 1);
    }

    func combine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a + b) : (a * b));
    }

    func uncombine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a - b) : (a / b));
    }
}


struct ReadyValue {
    let value: UInt32;
    let op: Op?;
    let description: () -> String;
    // Note that as a closure, it can depend on more than
    // the stored properties, contrary to a computed
    // property.

    init(_ inValue: UInt32) {
        value = inValue;
        op = nil;
        description = { return inValue.description;};
    }

    init(value inValue: UInt32,
         op inOp: Op,
         description indescription: @escaping () -> String) {
        value = inValue;
        op = inOp;
        description = indescription;
    }
}

func resolve(_ startGoal: UInt32,
             _ incb: @escaping (_ result: String) -> Void,
             _ primitives: UInt32...)
{
    var l = [ReadyValue]();
    var composedCount = 0;
    var otherComposedCount = 0;

    for element in primitives {
        l.append(ReadyValue(element));
    }

    {
    exploreAdditionOfRightNode(kind: .additive);
    exploreAdditionOfRightNode(kind: .multiplicative);
    }(); // https://bugs.swift.org/browse/SR-12243

    func exploreAdditionOfRightNode(kind: Op) {
        if (l.count != 0) && (l[l.count - 1].op != kind) {
            guard otherComposedCount == 0 else {
                return;
            }

            otherComposedCount = composedCount;
            composedCount = 0;
        }
        defer {
            if (l.count != 0) && (l[l.count - 1].op != kind) {
                composedCount = otherComposedCount;
                otherComposedCount = 0;
            }
        }

        var decomposedValue = [false: kind.neutralValue(),
                               true: kind.neutralValue()];

        {
        iteratePossibleLeftNodes(startingFrom: 0, { _ in return;});
        }(); // https://bugs.swift.org/browse/SR-12243

        func iteratePossibleLeftNodes(startingFrom: Int,
                                      _ walkOperands: @escaping
                                        (_ action: (_ value: ReadyValue,
                                                    _ reverse: Bool)
                                            -> Void)
                                        -> Void) {
            for candidateOffset in
                startingFrom ..< (l.count - composedCount) {
                let rightChild = l.remove(at: candidateOffset);
                if let _ = rightChild.op {
                    otherComposedCount -= 1;
                }
                defer {
                    if let _ = rightChild.op {
                        otherComposedCount += 1;
                    }
                    l.insert(rightChild, at: candidateOffset);
                }

                for phase in 0...1 {
                    let reverse = (phase == 1);
                    { (_ valueComponent: inout UInt32) in
                        valueComponent = kind.combine(valueComponent,
                                                      rightChild.value);
                    }(&decomposedValue[reverse]!);
                    defer {
                        { (_ valueComponent: inout UInt32) in
                            valueComponent = kind.uncombine(valueComponent,
                                                            rightChild.value);
                        }(&decomposedValue[reverse]!);
                    }

                    let selfNode = {(_ action: (_ value: ReadyValue,
                                                _ reverse: Bool)
                                        -> Void)
                        -> Void in
                        action(rightChild, reverse);
                        walkOperands(action);
                    };

                    iteratePossibleLeftNodes(startingFrom: candidateOffset,
                                             selfNode);

                    // close current composition
                    guard ({
                        var num = 0;
                        selfNode({_,_ in num += 1;});
                        return num;
                    }() > 1) && ( (kind == .additive)
                                    ? decomposedValue[false]!
                                       > decomposedValue[true]!
                                    : ((decomposedValue[false]!
                                       % decomposedValue[true]!) == 0) ) else {
                        continue;
                    }

                    let realizedValue = kind.uncombine(decomposedValue[false]!,
                                                       decomposedValue[true]!);
                    let description = { () -> String in
                        var current = "(";
                        selfNode({(_ value: ReadyValue,
                                   _ freverse: Bool) -> Void in
                            current += " ";
                            current += (freverse
                                            ? (kind == .additive ? "-" : "/")
                                            : kind.rawValue);
                            current += " ";
                            current += value.description();
                        });
                        current += ")";

                        return current;
                    };

                    guard l.count > 0 else {
                        if realizedValue == startGoal {
                            incb(description());
                        }
                        continue;
                    }

                    composedCount += 1;
                    l.append(ReadyValue(value: realizedValue,
                                        op: kind,
                                        description: description));
                    defer {
                        l.remove(at: l.count - 1);
                        composedCount -= 1;
                    }

                    exploreAdditionOfRightNode(kind: .additive);
                    exploreAdditionOfRightNode(kind: .multiplicative);
                }
            }
        }

    }
}

(Try it out, e.g. with resolve(70, {print($0);}, 5, 4, 3, 2, 2). You'll note it has changed a bit from the version in the teasers: since then, I discovered that nested functions work just as well as anonymous closures for capturing variables, while avoiding any reference cycle issue.)
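To make that last point concrete, here is what that capture looks like in isolation (countMatches is a made-up helper, not part of the example above):

```swift
// A nested function captures enclosing local variables just like an
// anonymous closure would; but since it cannot be stored beyond the
// enclosing call the way an escaping closure can, it cannot end up in
// a reference cycle.
func countMatches(in values: [Int], threshold: Int) -> Int {
    var count = 0;

    func check(_ v: Int) {
        if v > threshold {
            count += 1; // mutates the captured local
        }
    }

    for v in values {
        check(v);
    }
    return count;
}
```

Calling countMatches(in: [1, 5, 10], threshold: 4) returns 2, the nested function having mutated the captured count twice.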

What is it going to take to make it work in a multithreaded fashion?

  • First and foremost, most of the processing has to occur as part of a function that performs a chunk of the job, and then returns, ready to be called again to perform the next chunk. Just say no to processing that is a single function call that only returns when the whole task is done: that is unacceptable for many, many reasons as we'll soon see.

While recommendations for parallel processing have varied, shall we say, this one has remained useful for decades: for instance, it already applied when you wanted your processing to occur asynchronously on traditional MacOS (where such processing had to occur at what was known as "interrupt time"). And parallel code is, more than anything else, asynchronous code, of which there just happen to be multiple instances running at once.

That being established, how big should that individual chunk of work be?

  • Aim for a single chunk of processing to complete in about a tick (1/60th of a second) on a single instance of the slowest target core: the aim is for the overhead of pausing the work on the chunk (by returning), then resuming, to be amortized over useful processing that is as long as possible while still appearing instant to the user when cancelling.
  • Don’t bother making chunks bigger on faster machines just because you can fit more work on the same time period: with a machine-independent chunk size, what is a good enough amortization on the slowest target machine will also be good enough on other machines.
  • All that implies being able to roughly estimate the workload for a given amount of data, e.g. if work is coming in as packets of data received from the network. In fact, if the workload is short enough, it might not be worth sending to the thread pool at all, and would be better processed synchronously, or kept around and aggregated with later work. But the latter may not always be possible (e.g. if your server is user-interactive, like a telnet server, or in another situation where you would legitimately disable Nagle's algorithm).
  • Do not overcommit chunks of work. Some parallel processing APIs seem to tell you “just expose all your parallelism to us, we’ll sort it out”, but what do you do if and when you need to cancel the processing? You could tell the scheduling object to stop launching previously enqueued tasks, assuming that’s even possible, but that requires the tasks to have been scheduled on a specific object rather than the global one, adding to the complexity. And don’t even think about checking at the start of the chunk whether processing has been canceled: all your neatly parallel chunks would need to start with entering a bottleneck (a mutex or some such) just to make that determination… that makes no sense.
  • Do not start “ongoing” tasks either. Again, they are unacceptable when it comes to cancellation, unless you check that at regular intervals, again introducing bottlenecks in your otherwise neatly parallel tasks, which will confuse the heck out of the parallel processing API. More on that later.
  • And in a related fashion, do not pre-partition the work at the start to have only as many partitions as there are local processing resources, expecting them to complete simultaneously: besides the traditional issues with that (what happens if there is a high-priority task taking up a core in the background? One of the partitions will not get a chance to start until another one completes, and you'll have to wait for that latecomer, doubling processing time), there is now the issue of heterogeneous multiprocessing on the Apple silicon Macs released so far, where some cores are more sober than others.
  • Instead, arrange for a single object to be able to dole out chunks of work to perform on request, also known as work stealing; no problem if that object has to be protected by a mutex itself or if the doling out has to otherwise be performed serially: it's presumably not the part that is going to be a performance concern, and we will see how to have that bit of serialization without confusing GCD. Also, no problem if extracting and spawning such a chunk of work takes far less than a tick: this is not the part of the job where you want to amortize overhead.
  • For now, our best bet is to start off by scheduling only as many chunks as there are local processing resources (on the Mac, use [NS]ProcessInfo.processInfo.activeProcessorCount), let us call that N; then whenever a chunk completes, it returns (as discussed), but not prior to having scheduled on a serial queue (which I call the reference queue)… a request for the “doling out” object to provide a chunk of work and schedule it. In fact, the initial scheduling is best performed by scheduling N such requests on the serial queue.
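In skeleton form, that last scheme might look like this (WorkSource, its Int chunk descriptors, and every name here are placeholders of my own invention, not an established API):

```swift
import Foundation

func process(_ chunk: Int) {
    // Placeholder for roughly a tick's worth of actual work.
}

// Hypothetical "doling out" object. It is only ever used from
// referenceQueue, so it needs no mutex of its own.
final class WorkSource {
    private var remaining = Array(0 ..< 100); // placeholder chunk descriptors

    func nextChunk() -> Int? {
        return remaining.popLast();
    }
}

let referenceQueue = DispatchQueue(label: "reference"); // serial by default
let completion = DispatchGroup();
let source = WorkSource();

func scheduleOneChunk() {
    // Always runs on referenceQueue, so the doling out is serialized.
    guard let chunk = source.nextChunk() else {
        return; // nothing left (or processing was cancelled)
    }

    DispatchQueue.global(qos: .userInitiated).async(group: completion) {
        process(chunk);
        // Request a further chunk asynchronously, then return: the
        // worker never blocks inside a scheduled task.
        referenceQueue.async(group: completion) { scheduleOneChunk(); };
    }
}

// Prime the pump: N requests, one per local processing resource.
for _ in 0 ..< ProcessInfo.processInfo.activeProcessorCount {
    referenceQueue.async(group: completion) { scheduleOneChunk(); };
}

completion.notify(queue: referenceQueue) {
    // All chunks have completed.
}
```

Note that the group never momentarily empties before the work runs out: each worker enqueues its follow-up request (itself a group member) before leaving the group.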

Here is how I apply that to our example:

//
//  Created by Pierre Lebeaupin on 13/12/2020.
//  Copyright © 2020 Pierre Lebeaupin. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//

import Foundation;

enum Op: String {
    case additive = "+", multiplicative = "*";

    func neutralValue() -> UInt32 {
        return ((self == .additive) ? 0 : 1);
    }

    func combine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a + b) : (a * b));
    }

    func uncombine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a - b) : (a / b));
    }
}


struct ReadyValue {
    let value: UInt32;
    let op: Op?;
    let description: () -> String;
    // Note that as a closure, it can depend on more than
    // the stored properties, contrary to a computed
    // property.

    init(_ inValue: UInt32) {
        value = inValue;
        op = nil;
        description = { return inValue.description;};
    }

    init(value inValue: UInt32,
         op inOp: Op,
         description indescription: @escaping () -> String) {
        value = inValue;
        op = inOp;
        description = indescription;
    }
}

class ResolveObject {
    let semaphoreWaitingHost: Thread;
    // Using a thread, as the function is going to spend
    // its time blocked in a semaphore, so better directly
    // use a thread than gum up the libdispatch machinery.

    init(_ startGoal: UInt32,
         _ inqueue: DispatchQueue,
         _ incb: @escaping (_ result: String) -> Void,
         _ completionCb: @escaping () -> Void,
         _ primitives: UInt32...) {
        semaphoreWaitingHost = Thread(block:{
            let /* 's */ avoidOvercommitment =
                DispatchSemaphore(value:
                ProcessInfo.processInfo.activeProcessorCount - 1);
            // We take one from the get go

            let completion = DispatchGroup();
            completion.enter();
            // Compensated when the thread exits

            completion.notify(queue: inqueue,
                              execute: completionCb);

            {
                var referenceL = [ReadyValue]();
                var referenceComposedCount = 0;
                var referenceOtherComposedCount = 0;


                for element in primitives {
                    referenceL.append(ReadyValue(element));
                }

                exploreAdditionOfRightNode(&referenceL,
                                           &referenceComposedCount,
                                           &referenceOtherComposedCount,
                                           isParent: true,
                                           kind: .additive);
                exploreAdditionOfRightNode(&referenceL,
                                           &referenceComposedCount,
                                           &referenceOtherComposedCount,
                                           isParent: true,
                                           kind: .multiplicative);
            }(); // https://bugs.swift.org/browse/SR-12243


            func exploreAdditionOfRightNode(_ l: inout [ReadyValue],
                                            _ composedCount: inout Int,
                                            _ otherComposedCount: inout Int,
                                            isParent: Bool,
                                            kind: Op) {
                if (l.count != 0) && (l[l.count - 1].op != kind) {
                    guard otherComposedCount == 0 else {
                        return;
                    }

                    otherComposedCount = composedCount;
                    composedCount = 0;
                }
                defer {
                    if (l.count != 0) && (l[l.count - 1].op != kind) {
                        composedCount = otherComposedCount;
                        otherComposedCount = 0;
                    }
                }

                var referenceDecomposedValue = [false: kind.neutralValue(),
                                                true: kind.neutralValue()];

                iteratePossibleLeftNodesDispatch(&l,
                                                 &composedCount,
                                                 &otherComposedCount,
                                                 &referenceDecomposedValue,
                                                 isParent: isParent,
                                                 kind: kind,
                                                 startingFrom: 0,
                                                 { _ in return;});
            }

            func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValue],
                                                  _ composedCount: inout Int,
                                                  _ otherComposedCount: inout Int,
                                                  _ decomposedValue: inout [Bool:UInt32],
                                                  isParent: Bool,
                                                  kind: Op,
                                                  startingFrom: Int,
                                                  _ walkOperands: @escaping
                                                    (_ action:
                                                        (_ value: ReadyValue,
                                                         _ reverse: Bool)
                                                        -> Void)
                                                    -> Void) {
                guard isParent else {
                    return iteratePossibleLeftNodes(&l,
                                                    &composedCount,
                                                    &otherComposedCount,
                                                    &decomposedValue,
                                                    isParent: isParent,
                                                    kind: kind,
                                                    startingFrom: startingFrom,
                                                    walkOperands);
                }

                var imminentlyViableOrBetter = false;
                walkOperands({_,_ in imminentlyViableOrBetter = true;});

                let workloadEstimator = l.count + (imminentlyViableOrBetter ? 1 : 0);
                /* Among other properties, this estimator value is monotonic. */

                // 6 may be too many to fit in a tick (1/60th of a second)
                // 4 already means too few possibilities to explore
                // 3 is right out
                if workloadEstimator == 5 {
                    // reseat this divergence over a copy of the whole state
                    var childL = l;
                    var childComposedCount = composedCount;
                    var childOtherComposedCount = otherComposedCount;
                    var childDecomposedValue = decomposedValue;

                    DispatchQueue.global(qos:.userInitiated).async(group: completion) {
                        iteratePossibleLeftNodes(&childL,
                                                 &childComposedCount,
                                                 &childOtherComposedCount,
                                                 &childDecomposedValue,
                                                 isParent: false,
                                                 kind: kind,
                                                 startingFrom: startingFrom,
                                                 walkOperands);

                        avoidOvercommitment.signal();
                    }

                    avoidOvercommitment.wait();
                } else {
                    return iteratePossibleLeftNodes(&l,
                                                    &composedCount,
                                                    &otherComposedCount,
                                                    &decomposedValue,
                                                    isParent: isParent,
                                                    kind: kind,
                                                    startingFrom: startingFrom,
                                                    walkOperands);
                }
            }

            func iteratePossibleLeftNodes(_ l: inout [ReadyValue],
                                          _ composedCount: inout Int,
                                          _ otherComposedCount: inout Int,
                                          _ decomposedValue: inout [Bool:UInt32],
                                          isParent: Bool,
                                          kind: Op,
                                          startingFrom: Int,
                                          _ walkOperands: @escaping
                                            (_ action:
                                                (_ value: ReadyValue,
                                                 _ reverse: Bool)
                                                -> Void)
                                            -> Void) {
                for candidateOffset in
                    startingFrom ..< (l.count - composedCount) {
                    let rightChild = l.remove(at: candidateOffset);
                    if let _ = rightChild.op {
                        otherComposedCount -= 1;
                    }
                    defer {
                        if let _ = rightChild.op {
                            otherComposedCount += 1;
                        }
                        l.insert(rightChild, at: candidateOffset);
                    }

                    for phase in 0...1 {
                        let reverse = (phase == 1);
                        { (_ valueComponent: inout UInt32) in
                            valueComponent = kind.combine(valueComponent,
                                                          rightChild.value);
                        }(&decomposedValue[reverse]!);
                        defer {
                            { (_ valueComponent: inout UInt32) in
                                valueComponent = kind.uncombine(valueComponent,
                                                                rightChild.value);
                            }(&decomposedValue[reverse]!);
                        }

                        let selfNode = {(_ action:
                                            (_ value: ReadyValue,
                                             _ reverse: Bool)
                                            -> Void)
                                         -> Void in
                            action(rightChild, reverse);
                            walkOperands(action);
                        };

                        iteratePossibleLeftNodesDispatch(&l,
                                                         &composedCount,
                                                         &otherComposedCount,
                                                         &decomposedValue,
                                                         isParent: isParent,
                                                         kind: kind,
                                                         startingFrom: candidateOffset,
                                                         selfNode);

                        // close current composition
                        guard ({
                            var num = 0;
                            selfNode({_,_ in num += 1;});
                            return num;
                        }() > 1) && ( (kind == .additive)
                                        ? decomposedValue[false]!
                                           > decomposedValue[true]!
                                        : ((decomposedValue[false]!
                                            % decomposedValue[true]!) == 0) ) else {
                            continue;
                        }

                        let realizedValue = kind.uncombine(decomposedValue[false]!,
                                                           decomposedValue[true]!);
                        let description = { () -> String in
                            var current = "(";
                            selfNode({(_ value: ReadyValue, _ freverse: Bool) -> Void in
                                current += " ";
                                current += (freverse
                                                ? (kind == .additive ? "-" : "/")
                                                : kind.rawValue);
                                current += " ";
                                current += value.description();
                            });
                            current += ")";

                            return current;
                        };

                        guard l.count > 0 else {
                            if realizedValue == startGoal {
                                let solution = description();
                                inqueue.async { incb(solution); };
                            }
                            continue;
                        }

                        composedCount += 1;
                        l.append(ReadyValue(value: realizedValue,
                                            op: kind,
                                            description: description));
                        defer {
                            l.remove(at: l.count - 1);
                            composedCount -= 1;
                        }

                        exploreAdditionOfRightNode(&l,
                                                   &composedCount,
                                                   &otherComposedCount,
                                                   isParent: isParent,
                                                   kind: .additive);
                        exploreAdditionOfRightNode(&l,
                                                   &composedCount,
                                                   &otherComposedCount,
                                                   isParent: isParent,
                                                   kind: .multiplicative);
                    }
                }
            }

            completion.leave();
        });
    }

    func start() {
        semaphoreWaitingHost.start();
    }
}

Cancellation has been left as an exercise for the reader. But the important part is that the bulk of the processing does happen in these tasks sent to Grand Central Dispatch that do work by processing a chunk then returning.

You will note that I couldn't manage to cut the main function into bits that process chunks in the same way, returning at the end of each chunk: due to the recursion there is no simple way to do that; but let me assure you that I did follow my own rules in my own project. So instead I put that main function on its own thread and limited it through a semaphore: it is through the semaphore that completed chunks make a "request" to the central function for a new chunk to be dispatched.

This provides the following benefits:

  • At any given time, there are always at least N bits of work either actively executing or waiting to execute, making sure processing resources can be exploited.
  • Even if there turns out to be fewer available processing resources (other tasks executing, etc.), the small granularity of work chunks means they will be evenly scheduled on the remaining resources, avoiding any lopsided execution, such as spending a large portion of the time with a single task left executing while the other cores remain idle.
  • Cancellation can now be centralized on the “doling out” object and the reference queue: the processing being cancelled simply translates to the “doling out” object refusing to provide more work to schedule, and that can be triggered by scheduling a simple request to the reference queue. As a bonus, the dispatch group object that is normally set up to tell when processing is complete can now instead tell when cancellation is complete.
  • There is now no risk of thread explosion, as libdispatch (the library that implements Grand Central Dispatch, or GCD) is “aware” of the bottleneck: if the tasks instead synchronously scheduled work on the serial queue, they would risk blocking waiting for the kernel, leading libdispatch to spawn more threads to keep processing resources busy.

…Oh, yes, I’d forgotten to tell about the most important commandment:

  • Do not, under any circumstance, synchronously wait from within a scheduled task (any kind of wait: select, dispatch_sync, sem_wait, pthread_mutex_lock, even read(2), even on a local filesystem) unless you know for a fact that what you’re waiting on can arbitrarily scale. The penalty is not merely inheriting the lack of scaling: the penalty is thread explosion.

This is sufficiently important to deserve further development.
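To make the commandment concrete, here is a minimal Swift sketch (the queue and the shared counter are made up for the example) of the forbidden pattern next to its asynchronous replacement:

```swift
import Dispatch

let stateQueue = DispatchQueue(label: "state protection"); // serial
var total = 0; // shared state, only ever touched from stateQueue

// Forbidden: the dispatched task parks its worker thread in the kernel
// until stateQueue gets around to the block; libdispatch sees a core
// going idle and may spawn yet another thread to compensate.
func badTask(adding value: Int) {
    DispatchQueue.global().async {
        stateQueue.sync { // synchronous wait from within a scheduled task
            total += value;
        }
    }
}

// Preferred: hand the update off asynchronously and return at once; the
// continuation is itself scheduled once the serial queue is done.
func goodTask(adding value: Int, completion: @escaping (Int) -> Void) {
    DispatchQueue.global().async {
        stateQueue.async {
            total += value;
            let snapshot = total;
            DispatchQueue.global().async { completion(snapshot); }
        }
        // the task returns here without ever blocking
    }
}
```

Both versions keep the state consistent; the difference is solely in what the worker thread does while the serial queue is busy.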

~ end of chapter 1 ~

The trouble with libdispatch

blocking on the kernel

You probably ought to get up to speed with a proper computer science curriculum if that’s not already the case, but for the purposes of this post I’ll give you the CliffsNotes version.

On a reasonably modern multitasking OS (which includes everything less obsolete than MacOS 9), your program is given direct access to a processor core to perform its own computations on its own data structures, but any other activity has to go through the kernel. This includes access to shared resources such as memory allocation, filesystem access, interprocess I/O (including process management), peripheral I/O including mouse, keyboard, and screen or touchscreen, other I/O such as the network, etc. In 99% of cases, the kernel provides the service in a way that resembles a subroutine call: it is given control, and processing resumes at the point where you gave it control once its job is done.

In some cases, the kernel call actually behaves like a subroutine call: after switching to the kernel environment such as its own protected memory, it will perform its own processing while still running off the same core before transferring back control to your program (though not before having exited the kernel environment).

But that is the exception.

In the general case, the kernel may need more than just processor time to fulfill your request. The typical example is reading from a file, where the kernel will try to fetch the information from cache, and if it can’t, has to request it from the physical drive by sending it a request. And not take any further action: the physical drive will raise an interrupt once the data is ready. This was originally the case for spinning drives; it is still the case today, even for SSDs, and even for NVMe (Flash accessed through PCI)! So it would be pointless for the kernel to twiddle its thumbs on the processor in the meantime; instead, it will mark the thread as “not ready to execute”, or more simply “blocked”, and enter the scheduler as if the time slice of the thread were up, to give another thread the opportunity to execute. If another thread is available to run, this allows the processor core to keep performing useful work; otherwise, well, we tried.

Blocking can also happen when writing to a file: think when write buffers are full and the OS needs your program to slow down writing data; it obviously happens when waiting on (and possibly sending) interprocess I/O, on peripheral I/O, and most famously on network I/O. But it can also happen when taking a lock: your thread may need to wait in the kernel in the contended case.

The important thing to note is that a blocked thread does take up all the structures of a thread (stack memory along with its virtual memory address range, in particular, but also a slot in the scheduler data structures) while simultaneously being unable to contribute work to perform to a processor core. So for heavily I/O-based workloads you can imagine the overhead of blocked threads can become prohibitive, and it would, if ways to improve the situation hadn’t been developed while still keeping the base model. The most famous of which is select (and refinements: poll, epoll, kevent/kqueue, etc.), which allows taking multiple I/O objects and multiplexing any blocking over them on a single thread: as soon as any of these objects (called file descriptors, even if they are for non-file I/O) would be done blocking, the select call returns, allowing the operation to be performed on the object and immediately succeeding. You can even build great asynchronous APIs above that, and thus offload the burden of blocking on network events to a single background thread in your process.
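To illustrate that multiplexing in miniature, here is a sketch using poll(2) (one of the refinements listed above) on a pipe; the descriptors and the payload byte are made up for the example:

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Create a pipe: fds[0] is the read end, fds[1] the write end.
var fds = [Int32](repeating: 0, count: 2);
guard pipe(&fds) == 0 else { fatalError(); }

// Nothing written yet: polling with a zero timeout reports no
// descriptor ready instead of blocking the thread.
var entry = pollfd(fd: fds[0], events: Int16(POLLIN), revents: 0);
let notReady = poll(&entry, 1, 0); // returns 0: reading would block

// Write a byte; the same call now reports readiness, so a read(2)
// performed at this point would succeed immediately.
var byte: UInt8 = 42;
_ = write(fds[1], &byte, 1);
let ready = poll(&entry, 1, 0); // returns 1: fds[0] is ready for reading
```

A real event loop would of course pass many descriptors and a blocking timeout, so that one background thread absorbs all the waiting.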

libdispatch as a feedback loop

That being established, we can now model the transformations that get you from raw tasks to useful work being performed.

At the end you have the desired production: useful work being performed. Yes, we’ll start from the end.

Circles on four conveyor belts, meant to evoke the output of four assembly lines

That is produced by the third stage: execution proper. Prior to that stage, you only have a bunch of threads in various states, depending on the execution time they previously got, but also depending on the makeup of the code in that thread. The execution stage is mostly characterized by the performance and number of the cores.

Thick wavy lines, meant to evoke literal threads

Since we’re focusing on libdispatch, the bunch of threads, in turn, are the production of the second stage: task launching.

a sequence of squares with a large arrow pointing at the rightmost one, meant to evoke a priority queue

This is where the tasks are fired off, and this stage is fed from data structures that together amount to a priority queue of tasks. This stage has a dial/input port for the number of threads, which is its main characteristic.

Finally, the priority queue of tasks is produced by the first stage: dispatch proper. It is fed the raw tasks as you defined them for your processing and submitted them to dispatch_async().

squares reading "^{}" floating around, meant to evoke C language blocks

Out of these, task launching is pretty dumb. Don’t get me wrong, there are plenty of clever algorithms, lock-free data structures, and other such engineering to minimize overhead and scale well; but for the purposes of our model it consists of picking up the top item from a priority queue whenever needed.

Where libdispatch starts displaying interesting behavior is in the feedback/control of that stage. Indeed, libdispatch also has a feedback stage which looks at core occupation (as a proxy for the amount of useful work being performed), and if it falls short of the goal, then it increases the number of threads in the task launching stage: this input is driven by the feedback stage.

all of the above, plus a feedback stage that goes from the execution stage to the launching stage

The goal being, of course, all cores 100% occupied. Decreasing the number of threads is not perfectly symmetrical, if only because you can’t have more core occupation than there are cores in the system, but the principle is the same. And there are subtleties: the number of threads won’t increase if there is no ready task left in the priority queue, for instance.

But on the whole, this can be analyzed with the tools of control theory. And this is where libdispatch breaks down.

where does libdispatch go wrong?

The fundamental assumption of this feedback system is that increasing the number of threads will increase core occupation, as long as there are tasks ready to be dequeued. And I have to admit it’s rare (but possible) for additional threads to end up decreasing core occupation. But there are many common ways for additional threads to instead have no or marginal impact on processor occupation.

Imagine for instance that the tasks being scheduled all begin by attempting to acquire the same mutex, and perform all their processing while holding it. Obviously, on a dual-core machine you will never get more than 50% core occupation, since only a single task will ever get to run at any given time, the others being stuck in the kernel, excluded from the code by the mutex. That, of course, is contrived, but imagine code that frequently accesses a common data structure that is very coarsely locked: much of the processing time of each task could end up having to occur while under the lock, and since other tasks are excluded from taking the lock at the same time, any new thread will not help much with core occupancy, since that bottleneck remains. This is already much more realistic: coarse locking is common when creating the first iteration of thread-safe data structures, and significant reengineering may be required to obtain finer locking.

But wait, there’s worse.

In such a simple scenario (a constant fraction of the processing time of each task having to occur under a common bottleneck), it can be proven that there is a maximum core occupation that can never be exceeded (that value, interestingly, is the inverse of that fraction). So it’s not just that additional threads will help less while still helping: the first threads get you to that maximum, and additional threads beyond that will not help at all. So if libdispatch is not satisfied with core occupation even at that maximum, it will create new threads in vain, since they won’t improve the situation, but it will keep doing so anyway, until an internal maximum is reached. And you get thread explosion.
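That plateau can be checked with a little arithmetic. A simplified model in Swift (the thread counts, the serialized fraction, and the hypothetical 8-core machine are made up for the example): if a fraction f of each task must run under the common bottleneck, then no matter the thread count, at most 1/f tasks’ worth of work can make progress at once.

```swift
// Effective parallelism when a fraction f of each task is serialized
// behind a common bottleneck: the serialized portions queue up, so at
// most 1/f tasks' worth of work can be in flight at any time.
func effectiveParallelism(threads n: Int, serialFraction f: Double) -> Double {
    return min(Double(n), 1.0 / f);
}

// Occupation of the machine, as a fraction of all its cores.
func coreOccupation(threads n: Int, serialFraction f: Double, cores: Int) -> Double {
    return min(effectiveParallelism(threads: n, serialFraction: f),
               Double(cores)) / Double(cores);
}

// With f = 0.25, the plateau is 1/0.25 = 4 tasks' worth of work:
// on 8 cores, occupation never exceeds 50%, however many threads
// libdispatch throws at the problem.
```

Under this model the first 4 threads each improve occupation; every thread past that is pure overhead, which is exactly the situation libdispatch cannot perceive.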

But wait, there’s worse.

Even in more realistic setups (finer, hierarchical locking, or blocking on external limitations such as disk IOPS or bandwidth rather than just locks, etc.) there is almost always such a maximum for processor parallelism which is considered the ability of your system to scale. So it’s not just that additional threads will bring increasingly diminishing returns on core occupation, it’s that this core occupation will tend towards a limit, and if that limit (which libdispatch is not aware of) is not to the satisfaction of libdispatch, you will get thread explosion by the same logic as above.

But wait, there’s worse.

You could have customers report thread explosion even though you never observed it on your machine, and upon closer examination realize this is because their hardware has more cores than yours: on their hardware, libdispatch desperately tries to fill up these cores while on yours libdispatch is satisfied, because it turns out your system’s ability to scale is sufficient to occupy all cores on your machine, but insufficient to occupy all cores on your customer’s machine. The result? You will have to acquire hardware at least as wide (in terms of number of cores if nothing else) as any of your customers’ in order to reproduce their thread explosion problems. And stay on top of your customers as new hardware gets released.

But wait, there’s worse.

Your customer (or you) could reproduce thread explosions at some point, then fail to reproduce the issue, only because of slightly different conditions: other tasks running in the background (which interestingly enough would decrease the odds of the issue occurring), or different filesystem performance because of different disk occupation, or different network server performance! With libdispatch if you aren’t careful you get exposed to numerous potential sources of non-reproducibility. And the only thing worse than a showstopper bug is a hard-to-reproduce showstopper bug.

Here’s how I view Grand Central Dispatch as a result. It’s akin to a manager that claims not to micromanage you, but indirectly forces you to stay at your desk: if you don’t, GCD will put someone else to work at your desk in your absence, and when you come back, everything will be messed up because these computers don’t support multiple user sessions. So you take it as a hint to instead sleep at your desk when you’re waiting to hear from the outside to resume working, but in that case GCD becomes frantic with worry that people aren’t looking busy enough, and tries to acquire additional desks, floor space, whole buildings, and hire the corresponding workers, in an attempt to reach a threshold in the number of visibly busy workers. With the issues that entails: more difficulty reaching your desk, natural resources and land use, etc.

And you know how we call such a person? We call them douchebags.

~ end of chapter 2 ~

Dealing with libdispatch

Now that we know that Grand Central Dispatch is a douchebag, how do we work with it anyway? I haven’t found any better way to avoid these scenarios than by not blocking in the kernel unless you know for a fact that doing so will arbitrarily scale; and consequently, you need only launch a limited number of tasks: with rare exceptions you will not need libdispatch to schedule an unpredictable number of them to reach full core occupation. Any bottleneck, then, has instead to be expressed as tasks on a serial queue that are never waited on by dispatch_sync, but instead work asynchronously: they schedule a new task, as if calling a callback, once they are done (more on that later).
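On the “limited number of tasks” side, here is a minimal sketch (the input array and chunk count are made up for the example) using DispatchQueue.concurrentPerform, which submits a fixed, known number of computational chunks, so libdispatch never has to guess how many threads the work deserves:

```swift
import Dispatch

let input = Array(0 ..< 1_000);
let chunkCount = 8; // bounded, predictable number of tasks
var partialSums = [Int](repeating: 0, count: chunkCount);

// Each chunk owns its own slice of the input and its own slot of the
// output buffer, so no two tasks ever touch the same memory concurrently.
partialSums.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
        let size = input.count / chunkCount;
        buffer[chunk] = input[(chunk * size) ..< ((chunk + 1) * size)].reduce(0, +);
    }
}

let total = partialSums.reduce(0, +); // sum of 0 ..< 1000
```

Note that concurrentPerform blocks its caller until every chunk is done, so it belongs on a thread you own, not inside a dispatched task.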

But how does this translate into what you can and can’t call from a queue? It’s time for a round of “Will It Scale?”:

Allocating memory (big and small)
yes
Nowadays memory allocators have gotten pretty smart, in particular maintaining per-thread pools of memory to allocate from, so as to reduce contention to a minimum. Even if a round trip to the kernel is necessary, such as for big allocations, the kernel too is meant to scale well in this instance.

In fact, memory allocation nowadays has to scale since multithreading libraries rely on it in the first place as part of most operations (e.g. memory allocation occurs as part of Block_copy(), in the case of libdispatch).

Making non-blocking calls on a socket
yes
Note that it would be mostly pointless to make such a call without any indication that the socket is ready for it (e.g. read(2) when no data has been received), so it only makes sense to do this from a function registered to be called when the socket is ready for this operation. By “socket” I mean any file descriptor that is not a regular file or block device; that includes sockets in local domains (AF_UNIX), TTYs, pipes, etc.

Making non-blocking calls to libdispatch
yes
You kind of have to make such calls at some point, e.g. dispatch_async to access data protected by a serial queue.

Making non-blocking async-signal-safe calls
yes
These include posting to a semaphore, for instance.

Making synchronous calls to access files that could reside on a remote filesystem
no
Regardless of how it is performed, and regardless of whether it is for reading or writing, accessing a remote file not only causes blocking in the kernel but will limit the busyness of your program by what the network and the remote server can provide (or absorb, in case you’re writing to it); and if libdispatch catches your code being less busy than the available cores allow, thread explosion will result. Usage of non-blocking flags is pointless: they have no effect for regular files or block devices, even when located on networked filesystems.

Making synchronous calls to access files known to be local
no
If your code is at the point where you are multithreading it, then even the fastest mass storage interface won’t be able to keep up with your code running on all processor cores, so these calls will prevent you from scaling up to that point, and thread explosion will result.

I know this because I tested it: this project reads files as fast as it can while still processing every single byte, and you can configure the number of tasks that are allowed to be simultaneously in flight (note: if you use an M1 Mac as your daily driver, I welcome any patch to add native ARM support). On my 4-core (2 physical ones) machine with integrated flash storage, there are benefits to having more than 4 threads, and even more than 8, thus showing the impressive capabilities of the flash controller, enabled by the NVMe interface. However, performance won’t improve after 12 or so threads, but it doesn’t matter to libdispatch: if I allow 16 in-flight tasks, it will dutifully create 16 threads; if I allow 32, 32 threads will be created for still no performance improvement; and it seems libdispatch will get up to 64 threads or so before backing off, at which point performance has actually degraded from the 32-thread case.

Making synchronous calls to browse a local filesystem
no
You can’t count on that to scale, either. And with libdispatch, “not scaling, with symptoms being threads blocking in the kernel” translates to “thread explosion”.

Making blocking calls on a socket
heck no
Synchronous networking is bad already; do I need to explain how much worse it is when done from a dispatch queue?

Acquiring a mutex
scaling now becomes your responsibility, on penalty of thread explosion if you fail. Are you sure you want it?
The very principle of mutual exclusion means the chunks of your tasks under the lock are serialized, and this limits how your tasks can scale. That limit may be unattainable in practice (e.g. allow for thousand-fold parallelism), but you won’t know that limit until you reach it. And a mutex works by blocking your thread in the kernel until the light is green, as opposed to a serial queue (as long as you never call dispatch_sync on the latter, of course).

So you can’t rule out your mutex being responsible for thread explosion, short of owning the most parallel machine in the world to perform your tests on. Unless your intent was to ensure your ability to scale is always sufficient by way of a formal proof? In that case, don’t forget to update your proof whenever you make a change, no matter how insignificant. Easy.

Calling purely computational libraries
yes
These include system libraries such as Accelerate.

Calling into other third-party code
no
You have no idea of what it could be doing, so you can’t exclude it calling one of the kernel services that don’t scale at some point. In the end, you’d expose yourself to thread explosion.

Calling into other system libraries
no
Same as for other third-party code: you have no idea of what it could be doing. For instance, typesetting can easily become the biggest consumer of CPU time of an iOS app, so why not try and run these jobs concurrently? Unfortunately, what do you think happens when your string contains a character outside the repertoire of the font you’ve specified? Font substitution kicks in so that character can be rendered, which means loading the font from storage, and since Core Text APIs are synchronous, this loading is going to happen synchronously, too (I have seen this take on the order of 600 ms, though that was a long time ago). Congratulations, you are now exposed to thread explosion issues.

So you might be wondering, wait, what is libdispatch good for, anyway? And that is a very good question. Here are the situations where I would reach for it, from most suitable to least.

  • Where it shines most of all is for code you can factor as a factory of micro-batches (again, about one tick of processing time) that can be executed independently, where the inside of micro-batches is computational but not regular at all (involving data structures, allocations, branches, etc.), such that you’d have a hard time parallelizing using other methods, such as with vector-oriented APIs.
  • Since it can do that, it is also good at more regular cases: big vectors, striped image processing, some cases of offline audio or video processing, etc.
  • Combined with Dispatch I/O (more on that in a bit), it’s also good at file processing, though you won’t be able to reuse libraries that access the file formats synchronously, at least not from dispatched tasks, plus you have to contend with asynchronous code (more on that in a bit).
  • The same way, combined with Dispatch Source (more on that in a bit) it’s also good at the initial request processing of a server or daemon, though again you’ll probably have to write that from scratch.
  • And in theory it’s going to be suitable for filesystem indexing (thumbnail generation for instance), though one difficulty would be obtaining directory listings solely using Dispatch I/O.

(But in these cases, why not use threads directly? First, to avoid the overhead of spawning a thread per such micro-batch, and second, to share threads with other such activity in the same process while still optimally exploiting processing resources)

~ end of chapter 3 ~

Working asynchronously

The designers of libdispatch did much the same analysis and reached the same conclusions as to the suitability of kernel entry points to being called from within dispatched tasks, and that is why they created Dispatch I/O, for instance: without it you would be left with crap when it comes to asynchronous file APIs on Mac OS X (interestingly, classic MacOS did have first-class support for asynchronous filesystem I/O, owing to the fact it originally had to run off of floppies, but that neatly applied to networked filesystems as well).

By the same token, they created Dispatch Source to replace any select call: Dispatch Source allows you to make a non-blocking call to a socket from within your dispatched task with the confidence that the socket is ready for it; the only requirement is that your task be launched from the dispatch source.
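A minimal sketch of that pattern, with a pipe standing in for the socket (the queue label, the payload byte, and the observing semaphore are made up for the example): the read(2) call only ever happens once the source has told us data is waiting.

```swift
import Dispatch
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// A pipe stands in for the socket: fds[0] is the read end.
var fds = [Int32](repeating: 0, count: 2);
guard pipe(&fds) == 0 else { fatalError(); }

let q = DispatchQueue(label: "source handling");
let gotData = DispatchSemaphore(value: 0); // only so this example can observe the handler
var received = [UInt8]();

let source = DispatchSource.makeReadSource(fileDescriptor: fds[0], queue: q);
source.setEventHandler {
    // We only get here once the descriptor is known to be ready, so this
    // read(2) returns at once instead of blocking the worker thread.
    var buffer = [UInt8](repeating: 0, count: 16);
    let count = read(fds[0], &buffer, buffer.count);
    if count > 0 {
        received.append(contentsOf: buffer[0 ..< count]);
        gotData.signal();
    }
}
source.resume();

var byte: UInt8 = 42;
_ = write(fds[1], &byte, 1); // the event handler will now fire on q
```

The handler is just another task launched on the queue you chose, so everything it touches can be protected by that same queue.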

As for mutexes, they propose serial queues as an alternative for mutual exclusion, with the intent that the task you submit to that queue itself submits a callback function to run once the transaction is done.

All of these work asynchronously: they schedule a new task, as if calling a callback, once they are done. This way of working has two main drawbacks, let us see them in order.

poor language support

Factoring the continuation of your work as a new function to be provided to an asynchronous API requires you to explicitly move your state off of the stack into some sort of object in order to be able to recover it. This is, it has to be said, a pain in the ass. Even the addition of blocks to the C language has only alleviated this issue a little: in the best case, you get arrow code; in the worst case, code that is error-prone and impossible to follow.

As that last link shows, this is set to improve soonish in Swift, but at the time of this writing this is still experimental (e.g. I haven’t managed to get my example processing to work with it). In the meantime, this is nothing insurmountable as long as you don’t have too much code that needs to work that way; so my advice for the poor language support issue would be: focus on the small chunks of processing that would most benefit from parallelism; everything else should be run the traditional way: in the thread you already have, sequentially, using locks and synchronous calls, etc.

the fake dilemma

Note that these asynchronous libdispatch APIs take a queue, which is where the continuation task will be scheduled; the intent being that you will pass the queue you are running on, so the continuation will run under the same conditions.

If that queue is a concurrent queue, no problem. You weren’t counting on it to protect the consistency of your state anyway, so it’s not a problem to leave your state as-is until you get called back.

If that queue is a serial queue, however, you have a real problem: as soon as your task has “returned” after calling the asynchronous API, there is nothing that prevents another unrelated task on the serial queue from launching, and finding the state not as it was when you usually start a task, but as you left it at the point you needed to call the async API.

So that leaves you with a rotten choice: either keep the synchronous call instead and risk thread explosion, or factor your code so your state is consistent at the point you hand over to the asynchronous API. But, wait. I don’t call that a choice: I call that a fake dilemma. Presumably, you were using the serial queue for the purpose of state protection, and if that queue can’t ensure its purpose, then the problem is with the queue. I haven’t found any way to “reserve” the queue until the completion has run, that does not seem to be possible with libdispatch. There is no simple solution here, to put it bluntly. If you have few enough serial queues in that situation, then I give you special dispensation to perform blocking calls from it, but in compensation every single such serial queue you create has to be considered equivalent to creating a thread, resource-wise.

(You will note that in my example I directly used a thread instead, but that is because the only inter-task communication mechanism I needed in that simple example was a semaphore; in the general case, better standardize on dispatch queues so you can just use dispatch_async for everything.)

If there are too many such serial queues for this to be reasonable, you can also try reworking your code to have a consistent state at any point you make an asynchronous call. Unfortunately, if that is too complex, then you’re out of luck: it is best to sidestep Grand Central Dispatch altogether.

case study: having a consistent state whenever a serial queue is relinquished

As an aside in this discussion, let us see how it could work in practice. Suppose you have some code that requires serialization, and currently works synchronously. By way of example, I will take this code where a value is lazily loaded from a file (synchronous operation) at the first request, and kept around to satisfy future requests. As a corollary, it is necessary to protect this state management with some serialization, but this happens automatically in the single-threaded, synchronous case:

//
//  ObjectCache.swift
//  AsyncObjectCache
//
//  Created by Pierre Lebeaupin on 10/02/2021.
//

import Foundation

class ObjectCache {
    class Slot {
        var data: Double?;

        init() {
            data = nil;
        }

        func getData(_ basePath: String,
                     _ index: UInt32) -> Double {
            if let solidData = data {
                return solidData; // our work here is done.
            }

            // Do the dirty work.
            let path = basePath.appendingFormat("/%lu.sprite", index);

            path.utf8CString.withUnsafeBufferPointer { ptr in
                let fd = open(ptr.baseAddress!, O_RDONLY);

                guard fd >= 0 else {
                    fatalError();
                }

                lseek(fd, off_t(8), SEEK_SET);

                // reserveCapacity() alone would leave the array empty and its
                // buffer unusable for read(2); size it for real instead.
                var accum = ContiguousArray<Int8>(repeating: 0, count: 8);

                accum.withUnsafeMutableBufferPointer { mPtr in
                    guard read(fd, mPtr.baseAddress!, 8) == 8 else {
                        fatalError();
                    }
                };

                close(fd);

                var intermediate = UInt64(0);

                for byte in accum {
                    // decode from big-endian format; bitPattern avoids
                    // trapping on bytes with the high bit set
                    intermediate = (intermediate << 8) + UInt64(UInt8(bitPattern: byte));
                }

                // Alleluia!
                data = Double(bitPattern: intermediate);
            };

            return data!;
        }
    }


    let base: String;
    let slots: Array<Slot>;

    init(base inbase: String, count: UInt32) {
        base = inbase;
        var protoslots = Array<Slot>();
        protoslots.reserveCapacity(Int(count));

        for _ in 0 ..< count {
            protoslots.append(Slot());
        }

        slots = protoslots;
    }

    func getData(_ index: UInt32) -> Double {
        return slots[Int(index)].getData(base, index);
    }
}

Except for the use of Unix filesystem APIs, this seems straightforward enough. So how can we make that work asynchronously, so as to be safely callable from libdispatch tasks? Indeed, even if the synchronous filesystem call happens once per cache slot, this is enough for the first phase of processing to be dominated by filesystem activity, and so enough for thread explosion to happen. The feedback mechanism may not even get around to cleaning up these threads once the processing settles in a compute-bound phase! To avoid that, the name of the game here is to explicitly mark when you’re in a transition state, and if that is the case, hold off incoming requests:

//
//  AsyncObjectCache.swift
//  AsyncObjectCache
//
//  Created by Pierre Lebeaupin on 10/02/2021.
//

import Foundation

class AsyncObjectCache {
    class Slot {
        struct Functor {
            // silly struct just so I can make a
            // recursive function definition
            var invoke: (_ data: Double) -> Functor?;
        }

        var data: Double?;
        var insertNext: ((_ functor: Functor) -> Void)?;
        let q: DispatchQueue;
        var initialFunctor: Functor?;
        var launched: Bool;

        init() {
            q = DispatchQueue(label: "AsyncObjectCache slot protection");
            launched = false;

            insertNext = { [self] functor in
                initialFunctor = functor;
            }
            // Note that at this point there is a strong reference cycle
            // between the class instance and the closure; I see no way
            // around that.
            // The manual cleanup of the line, through initialFunctor,
            // ought to be enough to solve the situation.
        }

        func asyncGetData(_ basePath: String,
                          _ index: UInt32,
                          _ continuationQueue: DispatchQueue,
                          _ continuation: @escaping (_ data: Double) -> Void) {
            q.async { [self] in
                if let solidData = data {
                    continuationQueue.async {
                        continuation(solidData);
                    };
                    return; // our work here is done.
                }

                // TODO: move allocation of closures off of the
                // critical section to improve scalability?
                var next: Functor?

                insertNext!(Functor(invoke: { incomingData in
                    continuationQueue.async {
                        continuation(incomingData);
                    };

                    return next;
                }));

                insertNext = { functor in
                    next = functor;
                };

                guard !launched else {
                    return;
                }
                launched = true;

                // Do the dirty work.
                let path = basePath.appendingFormat("/%lu.sprite", index);

                path.utf8CString.withUnsafeBufferPointer { ptr in
                    let querier = DispatchIO(type: .random,
                                             path: ptr.baseAddress!,
                                             oflag: O_RDONLY,
                                             mode: 0,
                                             queue: q,
                                             cleanupHandler: {
                                                error in guard error == 0 else {
                                                    fatalError();
                                                }
                                             });

                    var accum = DispatchData.empty;

                    // At this point, launched is true, so newly
                    // launched tasks on our queue that are not the
                    // following one will not enter this clause:
                    // our state is consistent at this point, which
                    // is necessary for us to safely relinquish q.
                    querier!.read(offset: 8, length: 8, queue: q) {
                        (done, rawData, error)
                        in
                        guard error == 0 else {
                            fatalError();
                        }

                        accum.append(rawData!);

                        guard done else {
                            return;
                        }

                        var intermediate = UInt64(0);

                        for byte in accum {
                            // decode from big-endian format
                            intermediate = (intermediate << 8) + UInt64(byte);
                        }

                        // Alleluia!
                        let incomingData = Double(bitPattern: intermediate);
                        var current = initialFunctor;
                        while let solidCurrent = current {
                            current = solidCurrent.invoke(incomingData);
                        }

                        // clean up after ourselves
                        data = incomingData;
                        insertNext = nil;
                        initialFunctor = nil;
                    };
                };
            };
        }
    }


    let base: String;
    let slots: Array<Slot>;

    init(base inbase: String, count: UInt32) {
        base = inbase;
        var protoslots = Array<Slot>();
        protoslots.reserveCapacity(Int(count));

        for _ in 0 ..< count {
            protoslots.append(Slot());
        }

        slots = protoslots;
    }

    func asyncGetData(_ index: UInt32,
                      _ continuationQueue: DispatchQueue,
                      _ continuation: @escaping (_ data: Double) -> Void) {
        slots[Int(index)].asyncGetData(base, index, continuationQueue, continuation);
    }
}

Not only is this more complex, it is pretty error-prone: you must remember to enqueue the tasks that were held off, and the temporary strong reference cycle must be broken manually. Not to mention the temptation to cut corners: calling held-off tasks in backwards order and/or doing so recursively would make the code easier, but could easily trip you up at some point (e.g. imagine having to fire off a hundred held-off tasks).

So I don’t think such an operation is viable except for similarly simple cases.

~ end of chapter 4 ~

Future directions

Can Grand Central Dispatch be fixed? What can be learned of it for the purposes of designing concurrent APIs? Here are my humble contributions.

generic thread pool

What if you could have a thread pool that would also be useful for at least some blocking operations? After all, asynchronous code brings unavoidable complexity, with more opportunities for bugs, inefficiencies, etc. In the case of local filesystems, you can imagine a way: the kernel is aware of how busy local storage is, or at the very least it knows when storage starts being saturated: for writes, when dirty data grows; for reads, when requests have to be queued instead of being issued immediately. That information could then be exposed in some way to a thread pool, which could use it as the basis of the feedback mechanism for increasing or decreasing the number of threads, just like libdispatch uses processor occupation as the basis for the same purpose.

I don’t think we will see that added to libdispatch any time soon, though. For one, there can be multiple local drives, each hosting an independent filesystem, so you would need tasks to declare somehow which filesystems they would be accessing, and how; but the API would still eventually map these tasks to the same thread pool. And internally, you would need the launching stage to pick the next task to execute depending on which is busier: the processors, or the local drives (and which one), which is dynamic information, as opposed to just picking the head of the priority queue.

Architecturally and API-wise, this would mean a very different beast. So if you need a thread pool for local filesystem access, better roll your own.

I will note that in theory, the existing feedback system of libdispatch could use more advanced control functions, which could end up detecting when adding a thread has no effect on core occupation without having to obtain information about local storage. Such control functions do exist, but they take a long time to converge, which would be fine for batch processing but is completely incompatible with an API meant for use on interactive systems.

(As for creating threads to deal with network access, Java famously explored that, and it turned out not to scale remotely well, since you need one thread per connection; and you can’t add a feedback mechanism similar to that of libdispatch to limit the number of threads in the pool: how would you know how busy the remote device (or devices) is? Forget about it: I am under the impression this is an open research problem.)

solutions for poor language support

Swift is going to start providing solutions to this soonish, but I think they don’t go far enough. Let us see that with our example.

Some parts of the refactor will forever remain my responsibility, first among them the need to roughly estimate workload for a potential task so as to keep each of them under a tick.

But some parts ought to be managed by the language, with the help of the API: I shouldn’t need to limit the number of dispatched tasks myself. Either the API should allow me to un-dispatch tasks as long as they haven’t been launched (admittedly, easier said than done), or it should run my tasks in such a way that code that ends up dispatching a new task is not run unless that new task can reasonably be expected to launch soon. Otherwise, you risk accumulating tasks that are not launched, but whose launching cannot be cancelled either, adding to cancellation latency; limiting the number of in-flight tasks (whether dispatched or launched) to [NS]ProcessInfo.processInfo.activeProcessorCount is insufficient in the case where your tasks have to contend for cores with other system activity.
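In the meantime, limiting in-flight tasks is on us. Here is a minimal sketch of the usual manual workaround, where a semaphore blocks further submissions until a slot frees up; the task count and the use of a global queue are placeholders for the example, and as noted above this is still insufficient when contending with other system activity:

```swift
import Dispatch;
import Foundation;

let limit = max(ProcessInfo.processInfo.activeProcessorCount, 1);
let gate = DispatchSemaphore(value: limit);
let group = DispatchGroup();
let lock = NSLock();
var completed = 0;

for _ in 0 ..< 32 {
    gate.wait(); // hold off submission until fewer than limit tasks are in flight

    DispatchQueue.global(qos: .utility).async(group: group) {
        defer { gate.signal(); };
        // stand-in for the real workload
        lock.lock();
        completed += 1;
        lock.unlock();
    };
}

group.wait();
// all 32 tasks have drained at this point
```

Note that gate.wait() blocks the submitting thread, which is exactly the kind of blocking the rest of this article warns about; it is tolerable here only because the submitter is a dedicated driver loop.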

Another part that the language ought to take care of is state duplication: it is obviously unsafe by default to access the same state from independent tasks. Swift closures, like blocks in C, help in that regard, but not enough: variables are captured by reference, not value. So you end up having to copy them yourself, as can be seen in our example, and even that may not be sufficient (more on that later).
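The capture behavior can be shown in a few lines (the variable names are mine):

```swift
var counter = 0;

let byReference = { counter };           // captures the variable itself
let byValue = { [counter] in counter };  // a capture list copies the value now

counter = 10;

// byReference() now yields 10; byValue() still yields 0
```

A capture list is the manual copy mentioned above: without it, every closure sharing `counter` sees every later mutation.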

So I propose a construct that provides the basis to solve both issues: await fork. Here is how it would be added to our example processing:

//
//  Created by Pierre Lebeaupin on 13/12/2020.
//  Copyright © 2020 Pierre Lebeaupin. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//

import Foundation;

enum Op: String {
    case additive = "+", multiplicative = "*";

    func neutralValue() -> UInt32 {
        return ((self == .additive) ? 0 : 1);
    }

    func combine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a + b) : (a * b));
    }

    func uncombine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a - b) : (a / b));
    }
}


struct ReadyValue {
    let value: UInt32;
    let op: Op?;
    let description: () -> String;
    // Note that as a closure, it can depend on more than
    // the stored properties, contrary to a computed
    // property.

    init(_ inValue: UInt32) {
        value = inValue;
        op = nil;
        description = { return inValue.description;};
    }

    init(value inValue: UInt32,
         op inOp: Op,
         description indescription: @escaping () -> String) {
        value = inValue;
        op = inOp;
        description = indescription;
    }
}

func resolve(_ startGoal: UInt32,
             _ queue: DispatchQueue,
             _ incb: @escaping (_ result: String) -> Void,
             _ primitives: UInt32...)
{
    var l = [ReadyValue]();
    var composedCount = 0;
    var otherComposedCount = 0;

    var isParent = true;

    for element in primitives {
        l.append(ReadyValue(element));
    }

    {
    await exploreAdditionOfRightNode(kind: .additive);
    await exploreAdditionOfRightNode(kind: .multiplicative);
    }(); // https://bugs.swift.org/browse/SR-12243

    func exploreAdditionOfRightNode(kind: Op) async {
        if (l.count != 0) && (l[l.count - 1].op != kind) {
            guard otherComposedCount == 0 else {
                return;
            }

            otherComposedCount = composedCount;
            composedCount = 0;
        }
        defer {
            if (l.count != 0) && (l[l.count - 1].op != kind) {
                composedCount = otherComposedCount;
                otherComposedCount = 0;
            }
        }

        var decomposedValue = [false: kind.neutralValue(),
                               true: kind.neutralValue()];

        {
        await iteratePossibleLeftNodes(startingFrom: 0, { _ in return;});
        }(); // https://bugs.swift.org/browse/SR-12243

        func iteratePossibleLeftNodes(startingFrom: Int,
                                      _ walkOperands: @escaping
                                        (_ action: (_ value: ReadyValue,
                                                    _ reverse: Bool)
                                            -> Void)
                                        -> Void) async {
            var isChildStart = false;

            if isParent {
                let cost = operationsToExplore(l.count, composedCount, otherComposedCount, startingFrom, walkOperands);

                if cost < maxCost && cost > minCost {
                    guard await fork == .child else {
                        return;
                    }

                    isParent = false;
                    isChildStart = true;
                }
            }
            defer {
                if isChildStart {
                    async exit;
                }
            }

            for candidateOffset in
                startingFrom ..< (l.count - composedCount) {
                let rightChild = l.remove(at: candidateOffset);
                if let _ = rightChild.op {
                    otherComposedCount -= 1;
                }
                defer {
                    if let _ = rightChild.op {
                        otherComposedCount += 1;
                    }
                    l.insert(rightChild, at: candidateOffset);
                }

                for phase in 0...1 {
                    let reverse = (phase == 1);
                    { (_ valueComponent: inout UInt32) in
                        valueComponent = kind.combine(valueComponent,
                                                      rightChild.value);
                    }(&decomposedValue[reverse]!);
                    defer {
                        { (_ valueComponent: inout UInt32) in
                            valueComponent = kind.uncombine(valueComponent,
                                                            rightChild.value);
                        }(&decomposedValue[reverse]!);
                    }

                    let selfNode = {(_ action: (_ value: ReadyValue,
                                                _ reverse: Bool)
                                        -> Void)
                        -> Void in
                        action(rightChild, reverse);
                        walkOperands(action);
                    };

                    await iteratePossibleLeftNodes(startingFrom: candidateOffset,
                                             selfNode);

                    // close current composition
                    guard ({
                        var num = 0;
                        selfNode({_,_ in num += 1;});
                        return num;
                    }() > 1) && ( (kind == .additive)
                                    ? decomposedValue[false]!
                                       > decomposedValue[true]!
                                    : ((decomposedValue[false]!
                                       % decomposedValue[true]!) == 0) ) else {
                        continue;
                    }

                    let realizedValue = kind.uncombine(decomposedValue[false]!,
                                                       decomposedValue[true]!);
                    let description = { () -> String in
                        var current = "(";
                        selfNode({(_ value: ReadyValue,
                                   _ freverse: Bool) -> Void in
                            current += " ";
                            current += (freverse
                                            ? (kind == .additive ? "-" : "/")
                                            : kind.rawValue);
                            current += " ";
                            current += value.description();
                        });
                        current += ")";

                        return current;
                    };

                    guard l.count > 0 else {
                        if realizedValue == startGoal {
                            let solution = description();
                            queue.async { incb(solution); };
                        }
                        continue;
                    }

                    composedCount += 1;
                    l.append(ReadyValue(value: realizedValue,
                                        op: kind,
                                        description: description));
                    defer {
                        l.remove(at: l.count - 1);
                        composedCount -= 1;
                    }

                    await exploreAdditionOfRightNode(kind: .additive);
                    await exploreAdditionOfRightNode(kind: .multiplicative);
                }
            }
        }

    }
}

(While I figure out how to style monospaced block elements without breaking my publishing pipeline, here’s the diff so you can focus on the differences:)

--- AlternateSolver.swift   2021-02-19 23:32:13.000000000 +0100
+++ ImaginarySolver.swift   2021-02-21 22:30:33.000000000 +0100
@@ -56,6 +56,7 @@
 }

 func resolve(_ startGoal: UInt32,
+             _ queue: DispatchQueue,
              _ incb: @escaping (_ result: String) -> Void,
              _ primitives: UInt32...)
 {
@@ -63,16 +64,18 @@
     var composedCount = 0;
     var otherComposedCount = 0;

+    var isParent = true;
+
     for element in primitives {
         l.append(ReadyValue(element));
     }

     {
-    exploreAdditionOfRightNode(kind: .additive);
-    exploreAdditionOfRightNode(kind: .multiplicative);
+    await exploreAdditionOfRightNode(kind: .additive);
+    await exploreAdditionOfRightNode(kind: .multiplicative);
     }(); // https://bugs.swift.org/browse/SR-12243

-    func exploreAdditionOfRightNode(kind: Op) {
+    func exploreAdditionOfRightNode(kind: Op) async {
         if (l.count != 0) && (l[l.count - 1].op != kind) {
             guard otherComposedCount == 0 else {
                 return;
@@ -92,7 +95,7 @@
                                true: kind.neutralValue()];

         {
-        iteratePossibleLeftNodes(startingFrom: 0, { _ in return;});
+        await iteratePossibleLeftNodes(startingFrom: 0, { _ in return;});
         }(); // https://bugs.swift.org/browse/SR-12243

         func iteratePossibleLeftNodes(startingFrom: Int,
@@ -100,7 +103,27 @@
                                         (_ action: (_ value: ReadyValue,
                                                     _ reverse: Bool)
                                             -> Void)
-                                        -> Void) {
+                                        -> Void) async {
+            var isChildStart = false;
+
+            if isParent {
+                let cost = operationsToExplore(l.count, composedCount, otherComposedCount, startingFrom, walkOperands);
+
+                if cost < maxCost && cost > minCost {
+                    guard await fork == .child else {
+                        return;
+                    }
+
+                    isParent = false;
+                    isChildStart = true;
+                }
+            }
+            defer {
+                if isChildStart {
+                    async exit;
+                }
+            }
+
             for candidateOffset in
                 startingFrom ..< (l.count - composedCount) {
                 let rightChild = l.remove(at: candidateOffset);
@@ -135,7 +158,7 @@
                         walkOperands(action);
                     };

-                    iteratePossibleLeftNodes(startingFrom: candidateOffset,
+                    await iteratePossibleLeftNodes(startingFrom: candidateOffset,
                                              selfNode);

                     // close current composition
@@ -171,7 +194,8 @@

                     guard l.count > 0 else {
                         if realizedValue == startGoal {
-                            incb(description());
+                            let solution = description();
+                            queue.async { incb(solution); };
                         }
                         continue;
                     }
@@ -185,8 +209,8 @@
                         composedCount -= 1;
                     }

-                    exploreAdditionOfRightNode(kind: .additive);
-                    exploreAdditionOfRightNode(kind: .multiplicative);
+                    await exploreAdditionOfRightNode(kind: .additive);
+                    await exploreAdditionOfRightNode(kind: .multiplicative);
                 }
             }
         }

The principle of await fork is that, just like the fork syscall, the code after it is run two times: once as a child task, and once as a continuation of the parent task; you know which is which from the return value of the await fork statement. Note how, as a result, the code here is close to the serial version.

await fork can form the basis of the task limitation feature: the system could assign a higher priority to child tasks so that a parent task is never launched as long as one of its own child tasks is available to launch instead. It is not unheard of for the task launcher to be aware of task groups: for instance, in most operating systems the scheduler groups together threads from the same user session, so as to avoid one user taking cycles off of the other users just because that user has many processes running. The outcome would be that the parent would not get to run until all of its extant child tasks have already been launched and a new opportunity for launching occurs, at which point the parent will perform its own processing and end up calling await fork again; at that point, the API could run either the continuation of the parent task or the new child task, and will choose the latter, by application of the priority rule.

So the principle of await fork as being simultaneously a suspension point and submission of a subtask works well in that case, but it is also resilient for other situations: in the possible case where the parent task cannot provide subtasks fast enough to satisfy core occupation (e.g. in case there are a godawful number of cores), then the submitted subtask will be run on an available core, and the core from which await fork was called will simply resume the parent task.

And await fork obviously takes care of state duplication. In our example processing, we were lucky: the description closures referenced by the child’s l only ever access let variables which they are never going to mutate or which are never going to be mutated from under them. But that may not necessarily be the case: I believe it is possible to remove the @escaping qualifier from most closure parameters here, but that can only happen if I can prove to the compiler that they never escape, and that would require replacing l by closures that link to one another in a linked list. At that point the closures that are passed around couldn’t just be shallow copies, otherwise the parent task would end up messing with the child task whenever an element is inserted back into the linked list…

So now you need (or rather you need the language) to make a structured copy, JavaScript-like, of the data that the child task will access. But wait, you can’t just make a structured copy of each closure or structure independently, as some of those could be meant to access common data through a variable capture, and with independent structured copies they would access independent instances. So you need to make a structured copy of the whole data as a single item.

At this point, wouldn’t it be easier to make a structured copy of the whole coroutine stack in the first place? Hence the concept of await fork.

Note that the pervasiveness of copy-on-write behaviors in Swift means some data (data buffers, kernel handles such as file descriptors, etc.) does not need to be copied at that point, and may never need to be, so await fork should reasonably be “pay-as-you-go”. Still, you may object that there is usually little point in the subtask being able to keep accessing its own snapshotted copy of lower stack frames; on the contrary, I think this is almost always useful. For instance, if you’re implementing a multithreaded indexer, at one point or another you’re going to run into filesystem access issues; this is basically unavoidable. A subtask would raise an exception in that case, but if it only has errno to work with, diagnostic information is going to be limited. If, however, it does have access to the snapshot of earlier frames corresponding to the directory hierarchy leading to the file being processed, it will be able to query these directories and provide good diagnostic information. Note that it would be unsafe for this querying to occur as a preflight, prior to the subtask launching: that would raise time-of-check to time-of-use issues. Better to try, then handle failure, and you need everything you can get your hands on when handling failure.

Of course, the point is not to implement await fork using fork(2): the two tasks are meant to be running in the same address space. And here we get to what I admit is the most contentious part of the proposal: how to internally remap references so they point to the child task copies of the data?

The first way would be to copy the coroutine stack wholesale, as a memory block, and have all references, including in data structures and references to captured variables, be relative to a stack base that is passed to the task whenever it is launched (or relaunched, in case of an asynchronous continuation): these relative references would not need to be updated. That does raise some thorny ABI issues as to how to recover that base when going through code that is not aware of this. That also means suboptimal codegen, somewhat compensated by the fact most modern processors implement relative addressing, but suboptimal nevertheless. So this is probably not the right way; I only mention this since it’s by far the simplest and least error-prone way to simulate await fork without language support.

The second way would be to have all allocations on the coroutine stack be made from a specific memory pool, and when await fork occurs, copy that whole pool, but then perform the in-place equivalent of structured copy on the second pool: adjusting references and reference counts on the copied pool until all allocations have been visited. That way the copy is still relatively fast and codegen does not need to change, and a child task terminating with async exit (the counterpart of the await fork) ought not to fragment memory too much: its coroutine stack pool ought to be empty as a result of that exit. Even if it is only almost empty, in case of a result that was exfiltrated and retained as a result for instance, it will likely soon be empty.

The third way would be to just have a structured copy starting from all the references the code has access to at the point of the await fork, without any pool.

Finally, note that classes, since they would not belong to the coroutine stack by any measure, would not be copied (or allocated in the coroutine stack pool, if applicable): the two stacks would have two references to the same instance. That is consistent with classes being reference types. However, closures would have to be copied at least when the references in them would need to be adjusted themselves, since these references could be to structure variables or other value-type variables on the coroutine stack.
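The distinction boils down to plain Swift copy semantics, which a fork would merely extend to the whole coroutine stack (the type names here are made up for illustration):

```swift
struct Frame { var value: Int }   // value type: would live on the coroutine stack
class Shared { var value = 1 }    // reference type: a single heap instance

var parentFrame = Frame(value: 1);
var childFrame = parentFrame;     // the fork would copy this
childFrame.value = 2;
// parentFrame.value is still 1

let instance = Shared();
let other = instance;             // the fork would merely add a reference
other.value = 2;
// instance.value is now 2: both sides see the same object
```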

solutions for the fake dilemma

That one seems rather obvious: the serial queue or equivalent “asynchronous lock” should keep being held across any asynchronous API; that way, you would no longer have to choose between risking thread explosion and reworking your code in a complex and error-prone fashion. And that would be a start; but what happens if your task has the .userInteractive quality of service class, but its filesystem request has to wait among a flurry of background filesystem activity? Your task could end up being delayed by other system activity of lower priority: if the priority only applies to processor access, it won’t get you very far in the end. This is important even for concurrent queues, but doubly so for serial queues given their status as a potential bottleneck.

So there is a need for the priority/QoS/etc. of the queue to be applied across all asynchronous APIs it depends on, instead of the async API “just” maintaining the queue in the state where it was prior to the call. Many APIs have some support for that already, and the matter would be integration with the existing priority levels of these APIs: APFS for instance has support for I/O prioritization. I believe that the future structured concurrency proposals in Swift will support such priority inheritance. That includes the case where you submit a task to a serial queue along with a completion closure, of course: that whole serial queue needs to inherit at least the priority of the submitting queue until that task is complete.
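Grand Central Dispatch already has the building block for the processor side of this; a minimal sketch (the queue label is made up): the .enforceQoS flag boosts the target queue until the work item has run, but that boost does not extend to the I/O the item performs, which is the missing piece argued above.

```swift
import Dispatch;

// A low-priority serial queue guarding some state.
let stateQueue = DispatchQueue(label: "com.example.state", qos: .utility);

let done = DispatchSemaphore(value: 0);
var ran = false;

// .enforceQoS asks the system to raise stateQueue's priority until this
// work item has run, so it is not stuck at the queue's baseline .utility QoS.
stateQueue.async(qos: .userInteractive, flags: .enforceQoS) {
    ran = true;
    done.signal();
};

done.wait();
```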

~ end of chapter 5 ~

Other takes

I read with interest Thomas Clement’s take on Grand Central Dispatch. Funnily enough, I had completed preparatory work (task workload estimation, switching internal state to Swift structs so as to be able to duplicate state for a pseudo-fork, etc.) and was right in the middle of integrating with libdispatch when Michael J. Tsai linked to it, so I had already made my design choices by the time I read it; however, like him I had benefitted from similar collective hindsight on libdispatch’s… flaws.

Still, I would quibble with a few things. Not so much with his recommendations, which amount to the specific case of my “special dispensation”.

First, I would quibble with what he finds hard: what he describes is not multithreading being hard, it’s asynchronous code being hard! Indeed, asynchronous code is hard even when it all happens on the main thread. Of my whole career, the achievement I’m most proud of is still scrubbing in CineXPlayer. And the media player I reused having to run on its own thread (more on that in a bit) wasn’t a problem: for callbacks run from that thread, I would just call performSelectorOnMainThread:withObject:waitUntilDone: (passing NO for waitUntilDone) and be done with it; in some cases such as progress indication I would coalesce the updates at the callback level so as to avoid processing redundant progress updates from the main thread, but that was basically it for code that had to be thread-safe. The remainder could be as thread-unsafe as it wanted… but it was hard nevertheless, because events, whether coming in from user input or these player callbacks (and a few coming from the system, such as network status), could come in and be interleaved in any order, and I had to handle them anyway. I ended up setting up a state machine that would model the desired state of the player, and call the next player API to get closer to it, unless the player was already handling a prior request, in which case I (asynchronously) waited for it to no longer be busy. So my message would be: don’t worry about code that you spin off to a separate thread/task/etc.; do worry about code that has to work asynchronously, regardless of where it runs.

Second, I would quibble with his dismissal of actors. Just like concurrent queues (you have seen earlier how I used them to great effect, the condition being that their contents be known free of kernel calls that don’t scale), actors have their role: when your processing has to be built around its own specific thread. When does that happen? Mostly when your processing has real-time constraints. One example would be a media player, but a more useful one would be an animation engine like CoreAnimation. The thing about an animation engine is that, when you call an API on it, it could either be:

  • Busy with something right now, so whatever you want it to do needs to be queued until such time as the animation engine is ready to handle it,
  • In the middle of something but not currently busy, in which case it needs to know about your API call right now in case that call would impact its future plans (e.g. for animation retargeting)
  • Not doing anything at all, in which case it needs to be woken up.

You could attempt to implement all that yourself. And when you were done, you would realize you had just reimplemented an actor. I should know, because multiple times in my career I have said “OK, this thing that needs its own long-lived thread, I now happen to have the opportunity to do it from scratch as we can’t use an existing implementation, and I know enough to do it properly from the mistakes of the implementations I saw previously,” and once I was done… I realized I had reimplemented an actor.

So the important thing here is not to see actors as the new framework for concurrency, but just for what they are: a replacement for threads when they are used for real-time purposes. That is the general lesson here: concurrent queues should similarly be seen as the way to parallelize tasks that we can already guarantee to be compute-bound, etc. The pitfall here is deeming anything to be “the general framework for concurrency”, and that goes doubly for its vendor. As long as we avoid that pitfall, and do not let any particular abstraction become our universal savior, it will have no power over us and will not be able to hurt us.

Third Teaser: Practical Multithreading

Guys, my future post on practical multithreading is going to be of epic scale (previous teaser). I wish I could split it up into multiple posts, but that would be too risky: I don’t want readers stumbling on a later part without having first read the earlier caveats.

So here is another teaser in the meantime. Suppose you have this code that works in a synchronous fashion:

//
//  ObjectCache.swift
//  AsyncObjectCache
//
//  Created by Pierre Lebeaupin on 10/02/2021.
//

import Foundation

class ObjectCache {
    class Slot {
        var data: Double?;
        
        init() {
            data = nil;
        }
        
        func getData(_ basePath: String,
                     _ index: UInt32) -> Double {
            if let solidData = data {
                return solidData; // our work here is done.
            }
            
            // Do the dirty work.
            let path = basePath.appendingFormat("/%lu.sprite", index);
            
            path.utf8CString.withUnsafeBufferPointer { ptr in
                let fd = open(ptr.baseAddress!, O_RDONLY);
                
                guard fd >= 0 else {
                    fatalError();
                }
                
                lseek(fd, off_t(8), SEEK_SET);
                
                var accum = ContiguousArray<UInt8>(repeating: 0, count: 8);
                
                accum.withUnsafeMutableBufferPointer { mPtr in
                    guard read(fd, mPtr.baseAddress!, 8) == 8 else {
                        fatalError();
                    }
                };
                
                close(fd);
                
                var intermediate = UInt64(0);
                
                for byte in accum {
                    // decode from big-endian format
                    intermediate = (intermediate << 8) + UInt64(byte);
                }
                
                // Alleluia!
                data = Double(bitPattern: intermediate);
            };
            
            return data!;
        }
    }

    
    let base: String;
    let slots: Array<Slot>;
    
    init(base inbase: String, count: UInt32) {
        base = inbase;
        var protoslots = Array<Slot>();
        protoslots.reserveCapacity(Int(count));
        
        for _ in 0 ..< count {
            protoslots.append(Slot());
        }
        
        slots = protoslots;
    }
    
    func getData(_ index: UInt32) -> Double {
        return slots[Int(index)].getData(base, index);
    }
}

Except for the use of Unix filesystem APIs, this seems straightforward enough. But what if I want to make it work asynchronously so as to be safely callable from Grand Central Dispatch tasks? Here’s what you need to do:

//
//  AsyncObjectCache.swift
//  AsyncObjectCache
//
//  Created by Pierre Lebeaupin on 10/02/2021.
//

import Foundation

class AsyncObjectCache {
    class Slot {
        struct Functor {
            // silly struct just so I can make a
            // recursive function definition
            var invoke: (_ data: Double) -> Functor?;
        }

        var data: Double?;
        var insertNext: ((_ functor: Functor) -> Void)?;
        let q: DispatchQueue;
        var initialFunctor: Functor?;
        var launched: Bool;
        
        init() {
            q = DispatchQueue(label: "AsyncObjectCache slot protection");
            launched = false;

            insertNext = { [self] functor in
                initialFunctor = functor;
            }
            // Note that at this point there is a strong reference cycle
            // between the class instance and the closure; I see no way
            // around that.
            // The manual cleanup of the line, through initialFunctor,
            // ought to be enough to solve the situation.
        }
        
        func asyncGetData(_ basePath: String,
                          _ index: UInt32,
                          _ continuationQueue: DispatchQueue,
                          _ continuation: @escaping (_ data: Double) -> Void) {
            q.async { [self] in
                if let solidData = data {
                    continuationQueue.async {
                        continuation(solidData);
                    };
                    return; // our work here is done.
                }

                // TODO: move allocation of closures off of the
                // critical section to improve scalability?
                var next: Functor?

                insertNext!(Functor(invoke: { incomingData in
                    continuationQueue.async {
                        continuation(incomingData);
                    };

                    return next;
                }));
                
                insertNext = { functor in
                    next = functor;
                };
                
                guard !launched else {
                    return;
                }
                launched = true;
                
                // Do the dirty work.
                let path = basePath.appendingFormat("/%lu.sprite", index);
                
                path.utf8CString.withUnsafeBufferPointer { ptr in
                    let querier = DispatchIO(type: .random,
                                             path: ptr.baseAddress!,
                                             oflag: O_RDONLY,
                                             mode: 0,
                                             queue: q,
                                             cleanupHandler: {
                                                error in guard error == 0 else {
                                                    fatalError();
                                                }
                                             });
                    
                    var accum = DispatchData.empty;
                    
                    // At this point, launched is true, so newly
                    // launched tasks on our queue that are not the
                    // following one will not enter this clause:
                    // our state is consistent at this point, which
                    // is necessary for us to safely relinquish q.
                    querier!.read(offset: 8, length: 8, queue: q) {
                        (done, rawData, error)
                        in
                        guard error == 0 else {
                            fatalError();
                        }
                        
                        accum.append(rawData!);
                        
                        guard done else {
                            return;
                        }
                        
                        var intermediate = UInt64(0);
                        
                        for byte in accum {
                            // decode from big-endian format
                            intermediate = (intermediate << 8) + UInt64(byte);
                        }
                        
                        // Alleluia!
                        let incomingData = Double(bitPattern: intermediate);
                        var current = initialFunctor;
                        while let solidCurrent = current {
                            current = solidCurrent.invoke(incomingData);
                        }
                        
                        // clean up after ourselves
                        data = incomingData;
                        insertNext = nil;
                        initialFunctor = nil;
                    };
                };
            };
        }
    }

    
    let base: String;
    let slots: Array<Slot>;
    
    init(base inbase: String, count: UInt32) {
        base = inbase;
        var protoslots = Array<Slot>();
        protoslots.reserveCapacity(Int(count));
        
        for _ in 0 ..< count {
            protoslots.append(Slot());
        }
        
        slots = protoslots;
    }
    
    func asyncGetData(_ index: UInt32,
                      _ continuationQueue: DispatchQueue,
                      _ continuation: @escaping (_ data: Double) -> Void) {
        slots[Int(index)].asyncGetData(base, index, continuationQueue, continuation);
    }
}

Much less straightforward now, unfortunately.
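One part that survives the transformation unchanged is the byte-by-byte big-endian decoding of the eight header bytes; as an aside, it can be checked in isolation (the helper name is mine, extracted for the purpose):

```swift
import Foundation

// Standalone version of the decoding loop shared by both ObjectCache and
// AsyncObjectCache: fold 8 big-endian bytes into a UInt64, then
// reinterpret that bit pattern as an IEEE 754 Double.
func decodeBigEndianDouble(_ bytes: [UInt8]) -> Double {
    var intermediate = UInt64(0);
    for byte in bytes {
        intermediate = (intermediate << 8) + UInt64(byte);
    }
    return Double(bitPattern: intermediate);
}

// 1.0 is 0x3FF0000000000000 in IEEE 754 binary64 representation.
let bytes: [UInt8] = [0x3F, 0xF0, 0, 0, 0, 0, 0, 0];
assert(decodeBigEndianDouble(bytes) == 1.0);
```

Doing the fold manually keeps the code independent of the host’s endianness, which is why both versions avoid a straight memory reinterpretation of the buffer.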

Second Teaser: Practical Multithreading

My future practical multithreading post (previous teaser) is going to be a doozy. Let’s just say it will have diagrams. So here is another teaser to keep you waiting.

In the previous teaser, I showed how I think you should eventually be able to transform a long-lived function in order to parallelize it. Now, it turns out such a transformation is already possible using non-experimental APIs and Swift features, but the result is clearly less elegant and less generalizable:

//
//  Created by Pierre Lebeaupin on 13/12/2020.
//  Copyright © 2020 Pierre Lebeaupin. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//

import Foundation;

enum Op: String {
    case additive = "+", multiplicative = "*";
    
    func neutralValue() -> UInt32 {
        return ((self == .additive) ? 0 : 1);
    }
    
    func combine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a + b) : (a * b));
    }

    func uncombine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a - b) : (a / b));
    }
}


struct ReadyValue {
    let value: UInt32;
    let op: Op?;
    let description: () -> String;
    // Note that as a closure, it can depend on more than the stored properties, contrary to a
    // computed property.

    init(_ inValue: UInt32) {
        value = inValue;
        op = nil;
        description = { return inValue.description;};
    }
    
    init(value inValue: UInt32, op inOp: Op, description indescription: @escaping () -> String) {
        value = inValue;
        op = inOp;
        description = indescription;
    }
}

class ResolveObject {
    let semaphoreWaitingHost: Thread;
    // Using a thread, as the function is going to spend its time blocked in a semaphore,
    // so better directly use a thread than gum up the libdispatch machinery.
    
    init(_ startGoal: UInt32, _ inqueue: DispatchQueue, _ incb: @escaping (_ result: String) -> Void, _ completionCb: @escaping () -> Void, _ primitives: UInt32...) {
        semaphoreWaitingHost = Thread(block:{
            let /* 's */ avoidOvercommitment = DispatchSemaphore(value: ProcessInfo.processInfo.activeProcessorCount - 1);
            // We take one from the get go
            
            let completion = DispatchGroup();
            completion.enter(); // Compensated when the thread exits
            completion.notify(queue: inqueue, execute: completionCb);

            {
                var referenceL = [ReadyValue]();
                var referenceComposedCount = 0;
                var referenceOtherComposedCount = 0;
            
            
                for element in primitives {
                    referenceL.append(ReadyValue(element));
                }
            
                exploreAdditionOfRightNode(&referenceL,
                                           &referenceComposedCount,
                                           &referenceOtherComposedCount,
                                           isParent: true,
                                           kind: .additive);
                exploreAdditionOfRightNode(&referenceL,
                                           &referenceComposedCount,
                                           &referenceOtherComposedCount,
                                           isParent: true,
                                           kind: .multiplicative);
            }(); // https://bugs.swift.org/browse/SR-12243

            
            func exploreAdditionOfRightNode(_ l: inout [ReadyValue],
                                            _ composedCount: inout Int,
                                            _ otherComposedCount: inout Int,
                                            isParent: Bool,
                                            kind: Op) {
                if (l.count != 0) && (l[l.count - 1].op != kind) {
                    guard otherComposedCount == 0 else {
                        return;
                    }
                    
                    otherComposedCount = composedCount;
                    composedCount = 0;
                }
                defer {
                    if (l.count != 0) && (l[l.count - 1].op != kind) {
                        composedCount = otherComposedCount;
                        otherComposedCount = 0;
                    }
                }
                
                var referenceDecomposedValue = [false: kind.neutralValue(), true: kind.neutralValue()];
                
                iteratePossibleLeftNodesDispatch(&l,
                                                 &composedCount,
                                                 &otherComposedCount,
                                                 &referenceDecomposedValue,
                                                 isParent: isParent,
                                                 kind: kind,
                                                 startingFrom: 0,
                                                 { _ in return;});
            }

            func iteratePossibleLeftNodesDispatch(_ l: inout [ReadyValue],
                                                  _ composedCount: inout Int,
                                                  _ otherComposedCount: inout Int,
                                                  _ decomposedValue: inout [Bool:UInt32],
                                                  isParent: Bool,
                                                  kind: Op,
                                                  startingFrom: Int,
                                                  _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void) {
                guard isParent else {
                    return iteratePossibleLeftNodes(&l,
                                                    &composedCount,
                                                    &otherComposedCount,
                                                    &decomposedValue,
                                                    isParent: isParent,
                                                    kind: kind,
                                                    startingFrom: startingFrom,
                                                    walkOperands);
                }
                
                var imminentlyViableOrBetter = false;
                walkOperands({_,_ in imminentlyViableOrBetter = true;});
                
                let workloadEstimator = l.count + (imminentlyViableOrBetter ? 1 : 0);
                /* Among other properties, this estimator value is monotonic. */
                
                // 6 may be too many to fit in a tick (1/60th of a second)
                // 4 already means too few possibilities to explore
                // 3 is right out
                if workloadEstimator == 5 {
                    // reseat this divergence over a copy of the whole state
                    var childL = l;
                    var childComposedCount = composedCount;
                    var childOtherComposedCount = otherComposedCount;
                    var childDecomposedValue = decomposedValue;
                    
                    DispatchQueue.global(qos:.userInitiated).async(group: completion) {
                        iteratePossibleLeftNodes(&childL,
                                                 &childComposedCount,
                                                 &childOtherComposedCount,
                                                 &childDecomposedValue,
                                                 isParent: false,
                                                 kind: kind,
                                                 startingFrom: startingFrom,
                                                 walkOperands);
                        
                        avoidOvercommitment.signal();
                    }
                    
                    avoidOvercommitment.wait();
                } else {
                    return iteratePossibleLeftNodes(&l,
                                                    &composedCount,
                                                    &otherComposedCount,
                                                    &decomposedValue,
                                                    isParent: isParent,
                                                    kind: kind,
                                                    startingFrom: startingFrom,
                                                    walkOperands);
                }
            }

            func iteratePossibleLeftNodes(_ l: inout [ReadyValue],
                                          _ composedCount: inout Int,
                                          _ otherComposedCount: inout Int,
                                          _ decomposedValue: inout [Bool:UInt32],
                                          isParent: Bool,
                                          kind: Op,
                                          startingFrom: Int,
                                          _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void) {
                for candidateOffset in startingFrom ..< (l.count - composedCount) {
                    let rightChild = l.remove(at: candidateOffset);
                    if let _ = rightChild.op {
                        otherComposedCount -= 1;
                    }
                    defer {
                        if let _ = rightChild.op {
                            otherComposedCount += 1;
                        }
                        l.insert(rightChild, at: candidateOffset);
                    }

                    for phase in 0...1 {
                        let reverse = (phase == 1);
                        { (_ valueComponent: inout UInt32) in
                            valueComponent = kind.combine(valueComponent, rightChild.value);
                        }(&decomposedValue[reverse]!);
                        defer {
                            { (_ valueComponent: inout UInt32) in
                                valueComponent = kind.uncombine(valueComponent, rightChild.value);
                            }(&decomposedValue[reverse]!);
                        }

                        let selfNode = {(_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void in
                            action(rightChild, reverse);
                            walkOperands(action);
                        };
                        
                        iteratePossibleLeftNodesDispatch(&l,
                                                         &composedCount,
                                                         &otherComposedCount,
                                                         &decomposedValue,
                                                         isParent: isParent,
                                                         kind: kind,
                                                         startingFrom: candidateOffset,
                                                         selfNode);
                        
                        // close current composition
                        guard ({
                            var num = 0;
                            selfNode({_,_ in num += 1;});
                            return num;
                        }() > 1) && ( (kind == .additive) ? decomposedValue[false]! > decomposedValue[true]! :
                                        ((decomposedValue[false]! % decomposedValue[true]!) == 0) ) else {
                            continue;
                        }
                        
                        let realizedValue = kind.uncombine(decomposedValue[false]!, decomposedValue[true]!);
                        let description = { () -> String in
                            var current = "(";
                            selfNode({(_ value: ReadyValue, _ freverse: Bool) -> Void in
                                current += " ";
                                current += (freverse ? (kind == .additive ? "-" : "/") : kind.rawValue);
                                current += " ";
                                current += value.description();
                            });
                            current += ")";
                            
                            return current;
                        };
                        
                        guard l.count > 0 else {
                            if realizedValue == startGoal {
                                let solution = description();
                                inqueue.async { incb(solution); };
                            }
                            continue;
                        }
                        
                        composedCount += 1;
                        l.append(ReadyValue(value: realizedValue, op: kind, description: description));
                        defer {
                            l.remove(at: l.count - 1);
                            composedCount -= 1;
                        }
                        
                        exploreAdditionOfRightNode(&l,
                                                   &composedCount,
                                                   &otherComposedCount,
                                                   isParent: isParent,
                                                   kind: .additive);
                        exploreAdditionOfRightNode(&l,
                                                   &composedCount,
                                                   &otherComposedCount,
                                                   isParent: isParent,
                                                   kind: .multiplicative);
                    }
                }
            }
            
            completion.leave();
        });
    }
    
    func start() {
        semaphoreWaitingHost.start();
    }
}

Teaser: Practical Multithreading

Before I go offline for the end of year, I have a teaser for a blog post I’m preparing on multithreading.

In short, if I have this non-trivial processing (which I quickly whipped up for the example):

//
//  Created by Pierre Lebeaupin on 13/12/2020.
//  Copyright © 2020 Pierre Lebeaupin. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//

import Foundation;

enum Op: String {
    case additive = "+", multiplicative = "*";
    
    func neutralValue() -> UInt32 {
        return ((self == .additive) ? 0 : 1);
    }
    
    func combine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a + b) : (a * b));
    }

    func uncombine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a - b) : (a / b));
    }
}


struct ReadyValue {
    let value: UInt32;
    let op: Op?;
    let description: () -> String;
    // Note that as a closure, it can depend on more than the stored properties, contrary to a
    // computed property.

    init(_ inValue: UInt32) {
        value = inValue;
        op = nil;
        description = { return inValue.description;};
    }
    
    init(value inValue: UInt32, op inOp: Op, description indescription: @escaping () -> String) {
        value = inValue;
        op = inOp;
        description = indescription;
    }
}

func resolve(_ startGoal: UInt32, _ incb: @escaping (_ result: String) -> Void, _ primitives: UInt32...)
{
    var l = [ReadyValue]();
    var composedCount = 0;
    var otherComposedCount = 0;
    
    var exploreAdditionOfRightNode = { (_ kind: Op) -> Void in return; };
    exploreAdditionOfRightNode = { (_ kind: Op) -> Void in
        if (l.count != 0) && (l[l.count - 1].op != kind) {
            guard otherComposedCount == 0 else {
                return;
            }
            
            otherComposedCount = composedCount;
            composedCount = 0;
        }
        defer {
            if (l.count != 0) && (l[l.count - 1].op != kind) {
                composedCount = otherComposedCount;
                otherComposedCount = 0;
            }
        }

        var decomposedValue = [false: kind.neutralValue(), true: kind.neutralValue()];
        
        var iteratePossibleLeftNodes = { (_ startingFrom: Int, _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void) -> Void in return; }
        iteratePossibleLeftNodes = { (_ startingFrom: Int, _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void) -> Void in
            for candidateOffset in startingFrom ..< (l.count - composedCount) {
                let rightChild = l.remove(at: candidateOffset);
                if let _ = rightChild.op {
                    otherComposedCount -= 1;
                }
                defer {
                    if let _ = rightChild.op {
                        otherComposedCount += 1;
                    }
                    l.insert(rightChild, at: candidateOffset);
                }

                for phase in 0...1 {
                    let reverse = (phase == 1);
                    { (_ valueComponent: inout UInt32) in
                        valueComponent = kind.combine(valueComponent, rightChild.value);
                    }(&decomposedValue[reverse]!);
                    defer {
                        { (_ valueComponent: inout UInt32) in
                            valueComponent = kind.uncombine(valueComponent, rightChild.value);
                        }(&decomposedValue[reverse]!);
                    }
                    
                    let selfNode = {(_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void in
                        action(rightChild, reverse);
                        walkOperands(action);
                    };
                    
                    iteratePossibleLeftNodes(/* starting from */ candidateOffset, selfNode);
                    
                    // close current composition
                    guard ({
                        var num = 0;
                        selfNode({_,_ in num += 1;});
                        return num;
                    }() > 1) && ( (kind == .additive) ? decomposedValue[false]! > decomposedValue[true]! :
                                    ((decomposedValue[false]! % decomposedValue[true]!) == 0) ) else {
                        continue;
                    }
                    
                    let realizedValue = kind.uncombine(decomposedValue[false]!, decomposedValue[true]!);
                    let description = { () -> String in
                        var current = "(";
                        selfNode({(_ value: ReadyValue, _ freverse: Bool) -> Void in
                            current += " ";
                            current += (freverse ? (kind == .additive ? "-" : "/") : kind.rawValue);
                            current += " ";
                            current += value.description();
                        });
                        current += ")";
                            
                        return current;
                    };
                    
                    guard l.count > 0 else {
                        if realizedValue == startGoal {
                            incb(description());
                        }
                        continue;
                    }

                    composedCount += 1;
                    l.append(ReadyValue(value: realizedValue, op: kind, description: description));
                    defer {
                        l.remove(at: l.count - 1);
                        composedCount -= 1;
                    }

                    exploreAdditionOfRightNode(.additive);
                    exploreAdditionOfRightNode(.multiplicative);
                }
            }
        };
        
        iteratePossibleLeftNodes(/* starting from */ 0, { _ in return;});

        // Why, I must ask… Why? Why why why?
        iteratePossibleLeftNodes = { _,_ in return;};
    };

    for element in primitives {
        l.append(ReadyValue(element));
    }

    exploreAdditionOfRightNode(.additive);
    exploreAdditionOfRightNode(.multiplicative);

    // Why, I must ask… Why? Why why why?
    exploreAdditionOfRightNode = { _ in return;};
}

(try it out, e.g. with resolve(70, {print($0);}, 5, 4, 3, 2, 2);)

Given this code, I should be able to parallelize it with these simple changes (changes in bold):

//
//  Created by Pierre Lebeaupin on 13/12/2020.
//  Copyright © 2020 Pierre Lebeaupin. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
//
// 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//

import Foundation;

enum Op: String {
    case additive = "+", multiplicative = "*";
    
    func neutralValue() -> UInt32 {
        return ((self == .additive) ? 0 : 1);
    }
    
    func combine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a + b) : (a * b));
    }

    func uncombine(_ a: UInt32, _ b: UInt32) -> UInt32 {
        return ((self == .additive) ? (a - b) : (a / b));
    }
}


struct ReadyValue {
    let value: UInt32;
    let op: Op?;
    let description: () -> String;
    // Note that as a closure, it can depend on more than the stored properties, contrary to a
    // computed property.

    init(_ inValue: UInt32) {
        value = inValue;
        op = nil;
        description = { return inValue.description;};
    }
    
    init(value inValue: UInt32, op inOp: Op, description indescription: @escaping () -> String) {
        value = inValue;
        op = inOp;
        description = indescription;
    }
}

func resolve(_ startGoal: UInt32, _ queue: DispatchQueue, _ incb: @escaping (_ result: String) -> Void, _ primitives: UInt32...) async
{
    var l = [ReadyValue]();
    var composedCount = 0;
    var otherComposedCount = 0;

    var isParent = true;
    
    let exploreAdditionOfRightNode = { (_ kind: Op) async -> Void in
        if (l.count != 0) && (l[l.count - 1].op != kind) {
            guard otherComposedCount == 0 else {
                return;
            }
            
            otherComposedCount = composedCount;
            composedCount = 0;
        }
        defer {
            if (l.count != 0) && (l[l.count - 1].op != kind) {
                composedCount = otherComposedCount;
                otherComposedCount = 0;
            }
        }

        var decomposedValue = [false: kind.neutralValue(), true: kind.neutralValue()];
        
        let iteratePossibleLeftNodes = { (_ startingFrom: Int, _ walkOperands: @escaping (_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void) async -> Void in
            var isChildStart = false;

            if isParent {
                let cost = operationsToExplore(l.count, composedCount, otherComposedCount, startingFrom, walkOperands);

                if cost < maxCost && cost > minCost {
                    guard await fork == .child else {
                        return;
                    }

                    isParent = false;
                    isChildStart = true;
                }
            }
            defer {
                if isChildStart {
                    async exit;
                }
            }
                

            for candidateOffset in startingFrom ..< (l.count - composedCount) {
                let rightChild = l.remove(at: candidateOffset);
                if let _ = rightChild.op {
                    otherComposedCount -= 1;
                }
                defer {
                    if let _ = rightChild.op {
                        otherComposedCount += 1;
                    }
                    l.insert(rightChild, at: candidateOffset);
                }

                for phase in 0...1 {
                    let reverse = (phase == 1);
                    { (_ valueComponent: inout UInt32) in
                        valueComponent = kind.combine(valueComponent, rightChild.value);
                    }(&decomposedValue[reverse]!);
                    defer {
                        { (_ valueComponent: inout UInt32) in
                            valueComponent = kind.uncombine(valueComponent, rightChild.value);
                        }(&decomposedValue[reverse]!);
                    }
                    
                    let selfNode = {(_ action: (_ value: ReadyValue, _ reverse: Bool) -> Void) -> Void in
                        action(rightChild, reverse);
                        walkOperands(action);
                    };
                    
                    await iteratePossibleLeftNodes(/* starting from */ candidateOffset, selfNode);
                    
                    // close current composition
                    guard ({
                        var num = 0;
                        selfNode({_,_ in num += 1;});
                        return num;
                    }() > 1) && ( (kind == .additive) ? decomposedValue[false]! > decomposedValue[true]! :
                                    ((decomposedValue[false]! % decomposedValue[true]!) == 0) ) else {
                        continue;
                    }
                    
                    let realizedValue = kind.uncombine(decomposedValue[false]!, decomposedValue[true]!);
                    let description = { () -> String in
                        var current = "(";
                        selfNode({(_ value: ReadyValue, _ freverse: Bool) -> Void in
                            current += " ";
                            current += (freverse ? (kind == .additive ? "-" : "/") : kind.rawValue);
                            current += " ";
                            current += value.description();
                        });
                        current += ")";
                            
                        return current;
                    };
                    
                    guard l.count > 0 else {
                        if realizedValue == startGoal {
                            let solution = description();
                            queue.async { incb(solution); };
                        }
                        continue;
                    }

                    composedCount += 1;
                    l.append(ReadyValue(value: realizedValue, op: kind, description: description));
                    defer {
                        l.remove(at: l.count - 1);
                        composedCount -= 1;
                    }

                    await exploreAdditionOfRightNode(.additive);
                    await exploreAdditionOfRightNode(.multiplicative);
                }
            }
        };
        
        await iteratePossibleLeftNodes(/* starting from */ 0, { _ in return;});
    };

    for element in primitives {
        l.append(ReadyValue(element));
    }

    await exploreAdditionOfRightNode(.additive);
    await exploreAdditionOfRightNode(.multiplicative);
}

(The implementation of operationsToExplore() is left as an exercise for the reader).

Details in the forthcoming post…

#FreeFortnite

Don’t underestimate what just happened: Epic remotely adding an in-app circumvention for in-app purchases in Fortnite on mobile is an engagement of hostilities on par with the bombing of Pearl Harbor. Or the people of Paris taking the Bastille, depending on your point of view. This isn’t just the Hey saga: the relationship between Epic and Apple (and Epic and Google) can never be the same after that.

What’s at stake? Just the whole principle of tight control of mobile platforms by their stewards, that’s pretty much it. Indeed, you can’t keep the conversation on the “dirty percent” without then reviewing the agency model and the exclusivity (or extinguishing dominance, in the case of Android) of the platform application store, which brings into question app review, which leads us to Apple’s meddling and misguidance and pushing-around and self-dealing and feeding us garbage and capriciousness (too many incidents to list) and inconsistency and preferential treatments (again, too many) and dual personality and overpowering control and petrification of developers (worth blowing one NYT free read).

Any single one of those is a fundamental injustice, in the sense that each of them represents a damned-if-you-do (Apple could catch you), damned-if-you-don’t (you have to take user criticism/you get punished by the market for being non-competitive/etc.) alternative for the developer. And yet with few exceptions (coverage by Charles Arthur at The Guardian comes to mind), this never reaches the mainstream. But Epic has accomplished that: the news has spread everywhere by now, leading to unprecedented scrutiny of Apple’s actions, and it’s not about to stop. For instance, we’ve always had to assume the best possible intentions from Apple whenever they introduced some restriction, because conspiratorial thinking leads nowhere; but as Steve Troughton-Smith raises, what we could learn through the lawsuit could be very damaging, by revealing what Apple’s planning actually was when introducing those restrictions. Yes, it’s a stunt. Of course it’s a stunt. Of course it’s self-serving. But many significant actions were initially dismissed as stunts; what matters is whether it brings us closer to justice.

I think Epic’s message is rather awkward and is preaching too much to the choir, and that them targeting the Play Store as well (which you could always circumvent through Android side-loading) is muddying the message. But not only did it reach the mainstream as mentioned, Epic is making an important point that through sheer airspace saturation has a good chance of reaching the rank-and-file at Apple (and Google): didn’t you become that which you fought? Such a targeting is something that I feel has been missing in the tech conversation. That’s the second thing they’re doing right.

I also hear they are a big faceless company, and I myself wouldn’t trust them any farther than I can throw them, but I can’t answer this argument better than mcc did: given the sheer size of Apple and Google, how could an indie possibly hope to meaningfully take them on? Still, that leaves the question of what they will actually do if they win… but, as it turns out, that doesn’t matter. It’s not a succession war with a zero-sum outcome of just determining who will lord over us: Epic can’t dethrone Apple as the platform steward. The best they can hope for is having their own store as a peer to the Apple-managed iOS App Store, and if that happens, guess what will happen? Valve will be able to do so as well, and will instantly bring Steam to iOS. Activision Blizzard will probably want in on the action as well, etc. And now you have healthy competition of app stores on the platform, such that none of them has veto power over any developer. That is the third and most important reason why their campaign deserves to be supported.

Look, it can simultaneously be true that Epic is full of shit when they “pass on” the savings for cosmetic add-ons for which they have full pricing power and zero fixed cost, and that they are in the right to take on this Apple policy, because that policy is abusive for the many, many cases where there are fixed costs and pricing power is more limited, so it is abusive in general. And while I think that Epic has little chance of winning their lawsuit given current U.S. law (or elsewhere, should they attempt in France or the E.U.), I also think they should eventually win this fight. So, yeah, #FreeFortnite.

Dear Mr. Cédric O

Ce billet est également disponible en Français

Dear Mr. Cédric O

Small aside for my regular readers: Cédric O is in charge of digital matters in the French government. Of France. No, I am not reducing his name to its initial, this is indeed his name. So while I’ve got your attention, allow me to state that when requested to add name validation that mandates a minimum number of characters, the only ethical solution for a software developer is to refuse to implement that. Back to the matter at hand.

First, thank you for your recent public pronouncements (of which iMore provides good English-language coverage, in my opinion), which allow me to give a theme to this post I intended to write down: that of the open letter.

Indeed, I want to react to the direction you want to steer the StopCovid project towards, which is that of direct standoff with Apple. Now I will readily admit to a bias in favor of Apple, as the coverage of this blog shows: a lot of it centers on Apple technologies, with a focus on iOS. But I hope to convince you anyway that entering a public standoff with Apple would be futile, even if you’d wished for Apple to be pilloried.

Why

It was basically an excellent idea to start and develop the StopCovid project, don’t get me wrong, but I think it’s useful to go over exactly why. Indeed, some observers (some of whom I have a lot of respect for, such as Bruce Schneier) have dismissed this kind of computer-based contact tracing infrastructure as being tech solutionism, and these accusations deserve closer scrutiny. I don’t feel the need to go over the principle and usefulness of contact tracing, not only to warn those a newly diagnosed individual could have infected without his knowledge, but also to attempt to figure out the individual who himself infected the newly diagnosed one, if not known already, and thereby potentially walk back the contamination chain.

However, I do feel it is useful to restate the main use case, which is the containment of reemerging outbreaks, whether upon the easing of lockdown or imported from endemic areas. We can therefore assume a context where a breach has opened and needs repair, which means (as an example) that techniques of little use in the observation phase, such as tests on asymptomatic people, can suddenly be deployed on a large scale as soon as a confirmed case is raised. One of the aims of the lockdown (along with easing the pressure on hospital care) is precisely to get back to a situation where repairing the breaches is feasible anew, as opposed to a situation where there are more breaches than digits available to plug them.

But as we saw when looking for the index case in the Oise (where I’m told it is still being searched for, more than two months after the fact; the expected benefit seems rather thin in my opinion), current techniques appear to be outmatched in the case of Covid-19. And research coming from Oxford University confirms that assessment, as it suggests the inferred epidemiological characteristics of Covid-19 squash any hope of efficient contact tracing by traditional techniques, validating the need for an automated contact recording solution within a more general effort.

That qualification is useful to make, as no application will be a panacea, but rather a link in a chain where all elements must be present. Such as widely available testing (as France appears to be ready to provide as May begins) that also has quick turnaround: if the swab result does not come back until three days later, we reach the contacts, and their tests don’t come back until three days after that, it is obvious that even if the application allowed for instant contact tracing, the disease would nevertheless outpace us. As a result, the buildup of a waiting list for testing must be avoided at any cost (PCR itself taking only a few hours), and we must ensure the swabs can be quickly taken to PCR spots. And no application can ever replace actual tracing squads, if only in order to trace contacts not covered: those where one of the two parties is not equipped with a compatible mobile phone, or any mobile phone at all, for instance. That is why it makes sense to deploy tracing capabilities at the départemental scale, rather than at the regional scale.

How: the physical layer

All that having been said, we can now start going over the different methods for automatically establishing that a contact has occurred. Geolocation can be dismissed outright: it is simultaneously not accurate enough for the need (as GPS is not available everywhere or instantly, smartphones often fall back to cell tower or Wi-Fi hotspot geolocation) and too invasive, for obvious reasons. Bluetooth, on the other hand, is designed to allow direct transmission between digital devices, including peripherals, and its low energy variant has even been designed to support very constrained environments, such as being integrated in a key fob so that it can be found by proximity detection. This notion of proximity is essential: for determining potential contamination, we don’t so much want to compute an exact distance as to reduce testing to a list of potential contacts, erring on the side of caution, rather than having to test everyone in a 100-mile radius around a confirmed case.
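To make the “proximity rather than distance” point concrete, here is a minimal sketch of turning a raw signal intensity (RSSI) reading into a coarse proximity class, using the classic log-distance path-loss model. The calibration constants (the assumed RSSI at one meter, the environment factor) and the bucket thresholds are illustrative assumptions; real systems calibrate them per device model, which is precisely why an exact distance cannot be relied upon:

```swift
import Foundation

// Rough distance estimate from RSSI via the log-distance path-loss model.
// rssiAtOneMeter and environmentFactor are assumed calibration constants.
func estimatedDistance(rssi: Double,
                       rssiAtOneMeter: Double = -59.0,
                       environmentFactor: Double = 2.0) -> Double {
    return pow(10.0, (rssiAtOneMeter - rssi) / (10.0 * environmentFactor));
}

enum Proximity {
    case immediate, near, far;
}

// Reduce the noisy estimate to a coarse class, erring on the side of caution.
func classify(rssi: Double) -> Proximity {
    let distance = estimatedDistance(rssi: rssi);
    switch distance {
    case ..<1.0: return .immediate;
    case ..<4.0: return .near;
    default: return .far;
    }
}
```

The point of the buckets is exactly the one made above: the raw number is too unreliable to be treated as a distance, but it is good enough to sort contacts into “worth testing” and “not worth testing”.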

How: Data processing

OK, so we will retrospectively determine a contact by the fact the mobile phones of the parties in question have been able to exchange (or “exchange strongly enough” by measure of signal intensity among other means) data through Bluetooth. But how should this data be kept (and which data), and where should that computation be performed when the time comes?

Any mobile phone already broadcasts Bluetooth messages to any interested party for various reasons (that is what allows them to appear in the list of reachable devices on the selection interface of a personal computer, for instance). So a first idea would be to set up passive listening of broadcast Bluetooth messages and locally store the Bluetooth identifier of the emitter. But that quickly runs into issues: for privacy reasons, as it happens, that identifier rotates at regular intervals in implementations of recent versions of the standard, such that, being forgotten by its own emitter, it becomes useless; furthermore, many mobile phones will throttle, if not stop, broadcasting when they are not in active use so as to save battery life, which has never ceased to be an important constraint on mobile phone design.
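For illustration, the naive approach described above is easy to sketch with Core Bluetooth; it also makes the failure mode visible, since the same phone keeps reappearing under different identifiers:

```swift
import CoreBluetooth
import Foundation

// Sketch of the naive approach: passively scan and log nearby devices.
// As noted above, this fails in practice: the advertised address rotates
// at regular intervals (so one phone shows up under several identifiers),
// and many phones throttle advertising when idle.
class NaiveContactLogger: NSObject, CBCentralManagerDelegate {
    var central: CBCentralManager! = nil;
    var seen = [UUID: Date]();

    override init() {
        super.init();
        central = CBCentralManager(delegate: self, queue: nil);
    }

    func centralManagerDidUpdateState(_ central: CBCentralManager) {
        if central.state == .poweredOn {
            central.scanForPeripherals(withServices: nil, options: nil);
        }
    }

    func centralManager(_ central: CBCentralManager,
                        didDiscover peripheral: CBPeripheral,
                        advertisementData: [String: Any],
                        rssi RSSI: NSNumber) {
        // peripheral.identifier is itself derived from the (rotating)
        // address, so this log is of little retrospective use.
        seen[peripheral.identifier] = Date();
    }
}
```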

So it seems necessary to install a change of behavior on both sides of the contact, which shifts the problem space: now both sides have to be modified for either side to benefit from the system. As a result, it’s kind of like masks: if the source of the contamination did not previously install the application, the contaminated will get no benefit from having installed it, so reaching a sufficient density of participants is paramount. That could lead to considering providing smart watches (which are harder to forget at home) to those who, as is their right, do not own a compatible mobile phone.

Now that we can freely design the software on both sides of the interaction, the design space is greatly expanded, too much to explore it here. However, one significant choice is that of the processing which determines whether a contact previously occurred with a newly diagnosed individual, processing which therefore needs access to any necessary information for that purpose: should it occur on the mobile phone (either one: the two now being “hermaphroditic”, what one can determine, symmetrically the other will be able to, as well), or on a central server?

In the second case, since the aim is to warn the bearer of the phone involved in the contact, that would by all appearances entail that the server has the means to contact any user of the system (directly or indirectly: if every user must regularly query the server by asking “I am the user with pseudonym x, is there anything new for me? Send the answer to IP address 192.0.2.42”, that is equivalent to being reachable).

In the first case, however, it is impossible to prevent the user from figuring out on which occasion he has been in contact with a newly diagnosed person (though still coming short of directly informing of his identity): even if the processing which determines contact were a difficult to reverse cryptographic function, it would suffice for him to run that function on subsets of his contact database, through dichotomy if need be, until the single event sufficient to make the function return a positive is found.
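The dichotomy argument above can be sketched directly: even if the matching function were opaque and hard to reverse, its mere availability lets the user narrow a positive result down to a single recorded contact event in a logarithmic number of evaluations. Here `Event` and `triggersExposure` are hypothetical placeholders for the local contact log and the matching function:

```swift
// Binary search over the contact log, assuming a single triggering event.
func isolateTriggeringEvent<Event>(
    _ events: [Event],
    _ triggersExposure: ([Event]) -> Bool) -> Event? {
    guard triggersExposure(events) else {
        return nil; // no exposure at all
    }
    var candidates = events[...];
    while candidates.count > 1 {
        let mid = candidates.startIndex + (candidates.count / 2);
        let firstHalf = candidates[candidates.startIndex ..< mid];
        // If the first half alone triggers, the event is in it;
        // otherwise it must be in the second half.
        candidates = triggersExposure(Array(firstHalf))
            ? firstHalf : candidates[mid...];
    }
    return candidates.first;
}
```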

The Apple and Google system

That having been established, let us look at the choices made by Apple and Google, as well as the API provided, over which an application will be able to be built.

To begin with, “Exposure Notification” is a service as far as Bluetooth Low Energy is concerned, that is to say a protocol relying on BLE for over-the-air data exchange, as HTTP is a protocol relying on TCP for over-the-Internet (usually) data transmission. It is documented here as I type this; as such, the consortium managing Bluetooth has provided a protocol identifier specifically for its use. The format as it appears on the wire (so to speak) is simple: beyond the mandatory Bluetooth framing, it’s mostly the rotating proximity identifier, but it comes with a metadata field, whose only purpose so far (beyond versioning the protocol, allowing implementation compatibility to be confirmed) is to improve signal intensity loss computations.

As the name suggests, the rotating proximity identifier rotates more or less regularly: more or less, because if rotation were too regular, that would make it easier to trace people and render these changes useless; in practice, an identifier is used for at least ten minutes and at most twenty. All this is properly detailed in the crypto document, which describes how the data sent by the protocol mentioned above is generated, how data sent and received is stored, and in particular how to determine that a potentially contaminating contact has occurred.
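As a sketch of that schedule, based on my reading of the crypto document: identifiers are derived per ten-minute window (the document numbers these windows from the Unix epoch), while the actual over-the-air change is jittered so that an identifier lives between ten and twenty minutes. The jitter below is an illustration, not the specified algorithm:

```swift
import Foundation

// Ten-minute window number, counted from the Unix epoch.
func intervalNumber(for date: Date) -> UInt32 {
    return UInt32(date.timeIntervalSince1970 / 600.0);
}

// Illustrative jitter: the next rotation happens somewhere between
// ten and twenty minutes after the previous one.
func nextRotation(after lastRotation: Date) -> Date {
    let delay = TimeInterval.random(in: 600.0 ... 1200.0);
    return lastRotation.addingTimeInterval(delay);
}
```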

The most important property of the “Exposure Notification” system is that this determination is performed locally. The system assumes the presence of at least one server, but the latter only broadcasts non-nominative data collected (with permission) from diagnosed individuals so as to enable the recipient to make this determination: nothing is uploaded for users that have not been positively diagnosed yet. Even the data that does get uploaded reveals little, to the extent that it amounts to revealing the randomly generated data that was used for sending the rotating proximity identifiers, without any personal data, least of all identifying data, being involved.
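That local determination can be sketched as follows: the server publishes only the per-day random keys of diagnosed users, and each phone re-derives the identifiers those keys would have produced, then intersects them with the identifiers it actually heard. Here `deriveIdentifiers` is a hypothetical stand-in for the derivation steps of the crypto specification:

```swift
import Foundation

// Local matching: nothing about the local store ever leaves the phone.
func hadContact(diagnosisKeys: [Data],
                heardIdentifiers: Set<Data>,
                deriveIdentifiers: (Data) -> [Data]) -> Bool {
    for key in diagnosisKeys {
        for identifier in deriveIdentifiers(key)
        where heardIdentifiers.contains(identifier) {
            return true;
        }
    }
    return false;
}
```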

The system credibly claims other properties: for instance, it would appear to preclude collecting information emitted by others only to later have it broadcast, passed off as one’s own (innocents won’t be the only people positively diagnosed with Covid-19; you have to assume adversaries will be too, and as a result minimize their nuisance potential).

That being said, the system does not ensure by itself that only positively diagnosed individuals will be able to upload their alleged contamination information: I have a hard time seeing how it could provide any control in this area, therefore it relies on the health authorities for such a filter.

That is apparent in the third and last document, which describes the API an application (here for an Apple device running iOS) will have to use for interfacing with the service. The API manages in the background everything in relation with the Bluetooth service and data storage, but does not involve itself with network interactions, which are the responsibility of the application proper: parts of the API consist of explicitly taking in this data, or providing data to be sent; this is more explicit in the API documentation for a Google device running Android, which otherwise describes the same API, apart from the use of the Java language, as required by Android.

Aside from that, the API in general is unsurprising: it is modern Objective-C, with blocks used for callbacks when data processing is complete, for example. It seems to provide good coverage of future applications’ usage scenarios: an ENManager class for interaction with the system as a whole, such as displaying the current status (allowed or not, mostly) and recovering data to be uploaded in case of a positive test result, and an ENExposureDetectionSession to be used when checking whether the latest batch of uploaded data would not trigger an exposure when combined with the internally stored data. The only surprise is Objective-C: we would have expected Swift instead, given how trendy the language is at Apple, but that does not affect the interface functionally; it is even likely that it can directly be used from Swift.
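For illustration, here is how I imagine an app would drive this API, based solely on the documentation described above; these names come from the early documentation and may well change (ENExposureDetectionSession in particular was later reworked), so treat this as a sketch rather than working code:

```swift
import ExposureNotification

// System-wide interaction: activation, and key recovery after a positive test.
let manager = ENManager();
manager.activate { error in
    guard error == nil else { return; }

    // The upload itself, being networking, is the app's own responsibility;
    // uploadToHealthAuthority is a hypothetical app-provided function.
    manager.getDiagnosisKeys { keys, error in
        guard let keys = keys else { return; }
        uploadToHealthAuthority(keys);
    };
};

// Checking a freshly downloaded batch against the internally stored data;
// downloadedKeys is a hypothetical batch fetched by the app from the server.
let session = ENExposureDetectionSession();
session.activate { error in
    session.addDiagnosisKeys(downloadedKeys) { error in
        session.finishedDiagnosisKeys { summary, error in
            // summary would indicate whether an exposure occurred
        };
    };
};
```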

The API reveals one final intriguing functionality: the possibility to declare one’s risk level to the system as soon as any contamination is suspected, before even a formal positive test result, so as to recursively warn contacts; with a lower intensity, of course. That will nevertheless have to go through the health authorities, so it remains to be seen what they will use this for.

The Apple and Google system: the unsaid

As we see, while the documentation mentions the role played by the server, it remains silent as to how many there will be, how it or they will be managed, and by whom. Multiplying the servers would be unfortunate to the extent it would force the installation of as many apps as there would be applicable servers in order to be properly covered (as much to receive the information broadcast by each server, as to be able to send to each of these servers should it become necessary); that could be acceptable (though we would rather do without) in specific cases, such as that of commuters who cross a border, it is however completely unacceptable for the average user.

Both companies intend to severely restrict access to this API; quoting from their common document gathering answers to frequently asked questions:

10. How will apps get approval to use this system?

Apps will receive approval based on a specific set of criteria designed to ensure they are only administered in conjunction with public health authorities, meet our privacy requirements, and protect user data.

The criteria are detailed separately in agreements that developers enter into to use the API, and are organized around the principles of functionality and user privacy. There will be restrictions on the data that apps can collect when using the API, including not being able to request access to location services, and restrictions on how data can be used.

Apple in particular has the means to enforce these restrictions: an application in general not only cannot be distributed (beyond a handful of copies) without their permission, but it is certain that a sandbox entitlement (the sandbox being the environment in which an iOS app runs, restricting its access to the operating system to the necessary services, with some exceptions only recognized with a signed waiver from Apple) will be necessary, with very few entities being able to claim such an entitlement (state actors, mostly); sorry for those who would like to play with it: com.apple.developer.exposure-notification will be the most restrictive entitlement ever available for the iPhone… It is a sure bet that Apple will not hesitate to invalidate an entitlement that would have leaked or become abused.
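Concretely, the entitlement would presumably take the usual form of a boolean key in the app’s signed entitlements file; the key name is from the documentation, but the exact shape below is my assumption:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>com.apple.developer.exposure-notification</key>
    <true/>
</dict>
</plist>
```

Since the entitlements file is covered by the code signature, Apple can refuse to sign (or later revoke) any app claiming this key without an agreement in place.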

Given the arbitrator position at least Apple holds, I therefore wonder about the lack of any rule or even recommendation on the multiplication of servers. I can conceive that neither Apple nor Google want to expose themselves even more to charges that they are holding themselves as above states, but a confirmation that at most one server would be allowed per country (defined as an entity with diplomatic representations or equivalent) would be desirable, while still allowing each country to have multiple apps if needed, as long as they are all based on the national server. I of course hope that the EU will set up a common server that all member states will adopt, and ideally you would only need one per continent, but it would seem difficult to designate for each continent the political entity that could be put in charge of that (as for having a single global server, if that were viable politically as well as technically, Apple and Google would already have handled it).

Additionally, the documentation mentions a second phase where the system will be able to be activated out of the box through the operating system without requiring any app to be installed, with the suggestion to install one appearing once the system has determined a potentially contaminating contact; but if the system can determine that out of the box, that implies the ability to recover data from a server without any previously installed app, so which one is used?

And for that matter, there is no guidance on the role of health authorities to ensure the reliability of data that their servers would broadcast. I recall an incident that was reported to me while working in the telecommunication industry: wireless networks used by police, firefighters, EMT, etc. provide a “distress call” functionality these customers take seriously, you can understand why. In practice, it is a voice call, but flagged so as to trigger some specific processing (in relation to priority, in particular) and raises alarms at many levels. And when initially interconnected with the system covering a neighboring district, it did not go over exactly as planned. Indeed, even though the interconnection protocol did specify how to transmit the distress status of the call, it left a lot of leeway in how to react to such a status; and as it happens, at least one of the systems would consider that distress status to matter so much as to make it sticky for the channel in question, up until explicitly cleared by a supervisor. Which by itself can make sense, but in the case where the channel would belong to the other interconnected system, that one having a different configuration would as a result never send such a confirmation, such that the channel would perpetually be considered in distress after that, even for ordinary calls. Pretty quickly, all channels in this situation ended up flagged as in distress without any way to clear them, and when all calls are distress calls, well none are any more. They had to be supplied an emergency patch.

So it would appear risky to invest resources (engineering, quality assurance, etc.) in setting up a system without addressing the risk that it all be for naught, should the system end up returning too many false positives to remain usable. I can’t imagine doing without rules to ensure the reliability of the information broadcast by the server or servers.

Finally, still on geographical matters, there is a risk (raised by Moxie Marlinspike) that the database to download could become too heavy (in particular should the epidemic flare back up) to be practical to download, such that it would become necessary to partition it according to location… thus reintroducing geolocation for this purpose, and ruining part of the assurances offered. As with server multiplication, I think this is a matter on which Apple and Google should state their intentions.

The standoff

StopCovid, as with many other projects, was started before the Apple/Google initiative was made public; the project has followed different principles, a process which I respect; the protocol, called ROBERT, is documented. The choice was notably made of an architecture where contamination determination is centralized, with the benefits and drawbacks this entails, I won’t go over them again.

As with the matter of server multiplication, we could already question the necessity of protocol multiplication: will the average user need to install one application per protocol? But that is not where the issue currently lies: since the ROBERT protocol relies on Bluetooth as expected, its implementation on the iPhone runs into well-established restrictions from Apple on the ability of an app that is not in active use to use Bluetooth; these restrictions are documented by Apple, even if in vague terms; I have no doubt the project was able to assess them first hand. They aim at preserving battery life and protecting privacy. They drastically reduce the viability of the solution.

Technologically, it was important to try, if only in order to be credible and not depend on a partner that would have been chosen out of a lack of any alternative. The StopCovid project, notably through its experiments on (approximate) distance measurement using Bluetooth, has already accomplished advances which will be more generally useful, and as a project it could go forward within the framework of a more general protocol, with it adopting another protocol not being synonymous with the end of the project.

Because let’s be clear: I can hear media mention systems that could be “made compatible”, but the only way I know of to make two systems communicate is to have them implement the same protocol. It can consist of two different implementations, but they must be of the same protocol.

For when confronted with these longstanding restrictions, you have chosen the path of a standoff with Apple: you express surprise that Apple would not lift these restrictions on demand, and you insist that the project be delivered to production according to its current principles, invoking a matter of technological sovereignty.

Such a standoff is futile and even harmful, and for at least two reasons.

Futile for now

Beyond these restrictions, Apple has more generally asserted the principle that they control the software that can be distributed on the mobile devices showing their brand. As time went on this control sat less and less well with me, such that I have looked for alternatives, in particular through the web, to be able to distribute my personal projects. This is becoming viable for some applications, such as file processing, but when it comes to Bluetooth there is no alternative to the “native” SDK.

So even if I personally object to some of Apple’s policies, do you seriously believe that after 12 years of dictating their conditions as they saw fit for iPhone software, without exception, except those they defined themselves (there are a few more, less well documented), you were going to be able to just walk in and convince them or force their hand just like that? That is delusional.

It is all the more delusional as in other situations where they were far more exposed (I am thinking in particular of a decryption request from the FBI), Apple did not cave in to pressure, and they were, and remain, proud of it. Among other things, it is a matter of professional pride, as much about preserving battery life as about protecting privacy. Do you really think big words will be enough to make them change their minds?

If you at least advanced some arguments, such as the potential privacy advantages of a centralized solution like ROBERT compared to their own, that could make them think twice about it. But instead, all that is offered is denunciation of their financial health (insolent for sure, but what does it have to do with the matter at hand?) or invocations of technological sovereignty.

You could get the upper hand through legal constraints, but it is certain that will take time, a lot of time. So defending France’s technological sovereignty could make sense… in other circumstances. Because, I don’t know if you noticed, but France is in the middle of a pandemic right now. And France will only be able to eventually get rid of it through herd immunity; and I don’t know about you, but I’d rather acquire it through a vaccine, or failing that as late as possible. But by the time you’d have forced Apple’s hand through legal means, I think a vaccine will have been found.

Therefore, your insistence on technological sovereignty tells me that asserting it in the immediate situation is more important to you than having an efficient contact tracing solution, one able to save lives. These priorities are backwards. Technological sovereignty matters, but there will be opportunities to assert it later, or in other ways.

Maybe it’s unfair for Apple to be dictating their terms in such a way. Maybe it ought to be otherwise. But in the meantime, in the here and now, they hold the keys to the system.

Futile in the long run

Let us now assume Apple has given you the keys. Your troubles are not over. Because the reasons for which they set up these restrictions in the first place have not gone away, especially battery preservation.

So what guarantees that your solution will not excessively drain the battery? In particular, you will have a hard time finding skilled developers on the market when it comes to reasonable usage of Bluetooth when unshackled from these restrictions: even those who are currently doing so on Android may be in for a few surprises once on iPhone.

This is particularly important as some iPhones remain in active usage for far longer than Android devices, thus often with a degraded battery. I personally still haven’t replaced my 6-year-old iPhone 5S, which still works fine, except its battery life is no longer what it once was, and I don’t think I am alone. The matter isn’t merely that I will have to pay more attention to my battery level once StopCovid is installed; the matter is how the general public will react to an application that, once installed, causes the battery to drain more quickly, leading it to become fully drained if not carefully watched. At least a fraction of them will disable the application. And did we not say earlier how important sufficient penetration was for the system to function?

Once again, the established situation matters, and the established situation is that Apple keeps maintaining many devices in the wild that others would consider obsolete (but which do feature Bluetooth Low Energy), and any shift in this equilibrium will be noticed. For instance, do you really want to expose yourself to accusations of trying to drive premature device renewal, once the additional battery drain of StopCovid becomes apparent, potentially forcing some to replace hardware that previously served them better?

You could respond that Apple itself will encounter the same challenges, and risk increasing battery drain, when they implement their own system. That may be the case, but I would rather trust them in that domain than the developers of StopCovid, no offense meant.

Conclusion

I refuse the false dichotomy brought by some commenters, who would reduce the choice to that of which single entity I must entrust myself to. With the right protocol, the right architecture, we can avoid having to trust any particular party more than we are comfortable with, and reconcile seemingly opposite requirements.

While Germany initially had its own project, it recently announced it is joining the Apple and Google system, though that will not prevent it from offering its own application based on that system. What is to prevent France from following the same path? We will all benefit from avoiding a balkanization of solutions, and France here is not in a position of strength.

Cher M. Cédric O

This post is also available in English

Dear Mr. Cédric O,

A quick aside for my regular readers: Cédric O is the Secretary of State for Digital Affairs of the French government. Of France. No, I am not reducing his last name to an initial: that really is his name. I will therefore take the opportunity to say that if you are asked to put in place validation requiring a name of at least N characters, the only ethical option for a developer is to refuse to implement it. End of the aside.
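The problem with such a rule can be stated in a few lines of code; a minimal sketch, where the validator and its threshold are hypothetical, purely for illustration:

```python
def surname_is_valid(surname: str, min_length: int = 2) -> bool:
    """Hypothetical naive validation requiring at least `min_length` characters."""
    return len(surname.strip()) >= min_length

# A perfectly real surname fails the check:
print(surname_is_valid("O"))       # False: the minister's actual surname is rejected
print(surname_is_valid("Dupont"))  # True
```

Any minimum length above one character is guaranteed to reject some real person's name.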

First, I thank you for your recent statements, which allow me to give a theme to this post I had been meaning to write: that of the open letter.

I therefore wish to react to the direction you want to take the StopCovid project in, that of a standoff with Apple. I will readily confess a bias towards Apple, as shown by the topics covered on this blog, which revolve a lot around Apple technologies, iOS in particular. But I nevertheless hope to convince you that going for a standoff with Apple would be futile, even if you were to curse Apple to the heavens.

Why

Starting and developing the StopCovid project was an excellent idea in the first place, let us be clear, but I think it is useful to expand on why. Indeed, some observers (including some I respect a lot, such as Bruce Schneier) have denounced this kind of computerized contact tracing system as technological solutionism, so this accusation deserves to be examined. There is no need to restate here the principle and importance of contact tracing, not only to warn those whom a newly diagnosed individual may have unknowingly contaminated, but also to attempt to find the individual who contaminated the newly diagnosed one if that person is unknown, and thus potentially walk back up the contamination chains.

On the other hand, it is useful to restate the main use case, which is to contain re-emergences of the epidemic, whether upon exiting lockdown or imported from endemic areas. We can therefore assume we are dealing with a breach that must be plugged, and thus, for instance, that resources little deployed during the observation phase, such as tests on non-symptomatic people, can suddenly be mobilized at scale as soon as the first contamination is diagnosed. One of the purposes of lockdown (along with reducing the pressure on hospital services) is in fact to get back to a situation where plugging breaches becomes manageable again, instead of a situation where there are more breaches than fingers available to plug them.

But as we saw in the search for patient zero in the Oise department (where it appears the search is still ongoing; more than two months later, its usefulness seems limited to me), current methods appear to be failing in the case of Covid-19. And this is confirmed by research from the University of Oxford, which suggests that the assumed epidemiological characteristics of Covid-19 make tracing by traditional methods a futile hope, confirming the need for automated tracing as part of a more general apparatus.

This clarification is not useless, because no application will be a panacea; rather, it will be one link in a chain in which every element must be present. Starting with massive testing (which France seems able to provide as of early May), but also fast-turnaround testing: if the sample result only comes back three days later, the contacts are then notified, and their own tests only come back three days after that, it is obvious that even if the application could determine contacts instantly, we would be outrun by the disease. Waiting lists for tests must therefore be avoided at all costs (the PCR itself only takes a few hours), and samples must be able to reach PCR facilities quickly. And an application exempts us less than ever from putting together teams of investigators, if only to trace the contacts not covered, for instance when one of the two parties does not have a compatible phone, or no phone at all. For all these reasons, it is not absurd to represent tracing capacity department by department rather than region by region.

How: the physical layer

That being said, we can now consider the various ways of automatically recording contacts. Let us rule out geolocation from the start: it is both insufficiently precise for the needs (GPS not being available everywhere or instantly, smartphones often fall back on location by cellular tower or Wifi) and too invasive, for obvious reasons. Bluetooth, on the other hand, is designed to allow direct communications between digital devices, including peripherals, and its low-energy variant was even designed to support very constrained applications, such as being embedded in a keychain tag, thus making it possible to determine its proximity. This notion of proximity is essential: when looking for potential contagions, the point is less to determine exact distance than to narrow testing down to a list of potential contacts, even if it includes a few too many, rather than having to test everyone within a 100 km radius in case of a confirmed contamination.

How: data processing

OK, so a contact will be determined retrospectively by the fact that the mobile phones of the parties involved were able to exchange data over Bluetooth (or to “exchange strongly enough”, by measuring signal strength among other things). But how should this data be kept (and which data), and where should the computation be performed when the time comes?

Any mobile phone already broadcasts Bluetooth messages to whoever will listen, for various reasons (this is what allows them to appear in the list of reachable devices on a desktop computer’s selection screen, for instance). So a first idea is to listen for broadcast Bluetooth messages and locally store the Bluetooth identifier of the sender. But this quickly shows its limits: precisely for privacy-protection reasons, this identifier changes regularly in implementations of recent versions of the standard, so that, being forgotten by its own sender, it becomes useless; moreover, many phones will limit if not stop these transmissions when not in active use in order to save battery, which has never ceased to be an important constraint on mobile phone design.

It therefore seems indispensable to set up a behavior modification on both sides of the contact, which changes the terms of the problem: both sides must now be modified for even one side to benefit from the system. As a consequence, it is like masks: if his contaminator did not have the application installed, the contaminated person will derive no benefit from having installed the application, so it matters greatly that a sufficient density of participants be reached. This would lead to considering providing connected watches (which are harder to forget at home) to those who, as is their right, do not own a compatible mobile phone.

Now that we have the freedom to design the software on both sides of the interaction, the design possibilities increase considerably, too much to go over them. However, one significant choice is that of the processing that determines whether a contact with a newly diagnosed individual took place, processing which must therefore have access to the data necessary for that determination: should it happen on the phone (either one: both being now “hermaphroditic”, whatever one can determine, the other will symmetrically be able to determine too), or on a central server?

In the second case, the goal being to warn the bearer of the contacted phone, this a priori entails that the server has the means to contact any user of the system (directly or indirectly: if every user must, at regular intervals, query the server with “I am the user with pseudonym x, is there anything new for me? Send the answer to IP address 192.0.2.42”, that is equivalent to being findable).

In the first case, by contrast, it is impossible to prevent the user from determining on which occasion he was in contact with a newly diagnosed person (short of being directly informed of that person’s identity): even if the determination processing were a cryptographic function that is hard to invert, he would merely need to run it on subsets of his contact database, by dichotomy if necessary, until he finds the single event sufficient for the processing to report a potential contamination.
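The dichotomy above can be sketched in a few lines; here `matches` stands in for the phone’s opaque local detection routine, and all names and the toy contact log are hypothetical:

```python
def find_triggering_contact(contacts, matches):
    """Binary-search a contact log for the single encounter that makes the
    opaque `matches(subset) -> bool` oracle report a potential exposure.
    Assumes exactly one triggering encounter is present."""
    lo, hi = 0, len(contacts)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if matches(contacts[lo:mid]):  # triggering event is in the first half
            hi = mid
        else:                          # ...otherwise it must be in the second
            lo = mid
    return contacts[lo]

# Toy oracle: in reality this would be the phone's exposure detection run
# on a chosen subset of its stored encounters.
log = ["monday/bus", "tuesday/office", "wednesday/shop", "thursday/train"]
oracle = lambda subset: "wednesday/shop" in subset
print(find_triggering_contact(log, oracle))  # → wednesday/shop
```

The number of oracle runs is logarithmic in the size of the log, so keeping the matching function opaque provides no real protection against this.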

The Apple and Google system

With that established, let us now examine the choices made by Apple and Google, as well as the API provided, on top of which an application can be built.

To begin with, “Exposure Notification” is a service from the point of view of Bluetooth Low Energy, that is, a protocol using BLE for the radio transmission of data, just as HTTP is a protocol using TCP to carry data over (generally) the Internet. It is documented here as I write these lines; as such, it has been registered with the consortium that manages Bluetooth, with its own identifier. The format as transmitted is simple: besides the mandatory Bluetooth framing, it is mostly the rolling proximity identifier, accompanied by a metadata field whose only current use (besides versioning the protocol, which allows confirming that implementations are compatible) is to refine the computation of signal strength loss.

As its name indicates, the rolling proximity identifier changes at more or less regular intervals: more or less, because if the change were too regular, it would facilitate tracking and render these changes useless; the change happens at most every 10 minutes, and at least every 20 minutes. All this is described more precisely in the cryptography documentation, which describes the generation of the data sent by the protocol above, how received and sent data is stored, and above all how to determine that a potential contamination occurred.
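As a rough illustration of such a rotation scheme (this is not the actual construction from the cryptography documentation, which is AES-based; HMAC is used here as a stand-in, and all names and parameters besides the 10–20 minute window are assumptions):

```python
import hashlib
import hmac
import random

ROTATION_MIN_S = 10 * 60  # identifier is kept for at least 10 minutes...
ROTATION_MAX_S = 20 * 60  # ...and replaced after at most 20 minutes

def next_rotation_delay(rng: random.Random) -> int:
    """Jittered rotation period: a perfectly regular rotation would itself
    be a tracking signal, so the change happens at a random point in the window."""
    return rng.randint(ROTATION_MIN_S, ROTATION_MAX_S)

def proximity_identifier(day_key: bytes, interval_index: int) -> bytes:
    """Derive a 16-byte identifier for one interval from a per-day key.
    Illustrative only: stands in for the real AES-based derivation."""
    msg = interval_index.to_bytes(4, "big")
    return hmac.new(day_key, msg, hashlib.sha256).digest()[:16]

rng = random.Random()
print(next_rotation_delay(rng))               # somewhere between 600 and 1200 seconds
print(proximity_identifier(b"\x00" * 16, 7).hex())
```

The key property is that successive identifiers are unlinkable to an observer who does not hold the key, while the sender (and, later, a receiver given the published key) can regenerate them all.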

The most important property of the “Exposure Notification” system is that this determination is performed locally. The system assumes the presence of at least one server, but that server merely broadcasts non-nominative data collected (voluntarily) from diagnosed individuals in order to allow this determination by the receiver: nothing is uploaded for users as long as they are not diagnosed positive. Even the data sent reveals little, insofar as it amounts to revealing the randomly generated data that was used to send the rolling proximity identifiers, with no other personal data, notably identifying data, being involved.

The system credibly claims other properties: it seems impossible, for instance, to collect information transmitted by others and then have it broadcast that these others were contaminating by passing this information off as one’s own (innocents will not be the only ones diagnosed positive for Covid-19; we must assume adversaries will be too, and consequently minimize their capacity for harm).

That said, the system does not by itself guarantee that only individuals diagnosed positive will be able to send their information: I can hardly see how it could control anything in that area, so it relies on health authorities to perform that filtering.

This is apparent in the third and last document, which describes the API that an application (here for an Apple device running iOS) will have to use to serve as a front end to the service. The API handles behind the scenes everything related to the Bluetooth protocol and data storage, but takes care of no network interaction, the application having to perform these tasks itself: part of the API functions consist of explicitly being handed this data, or providing the data to be sent; this is more explicit in the API documentation for Google devices running the Android OS, which describes what is otherwise the same API, Java language aside, Android obliging.

Apart from that, the API is generally unsurprising: it is modern Objective-C, with blocks used so the application can be called back once data processing is complete, for instance. It seems to cover well the usage modes of the future applications: an ENManager class for interactions with the system in general, such as displaying the status (authorized or not, mainly) and retrieving the data to send to the server in case of diagnosis, and an ENExposureDetectionSession class to use when checking whether the latest batch of diagnosis information would trigger an exposure when combined with the internally recorded data. The only surprise is the Objective-C: one would rather have expected Swift, given how fashionable the language is at Apple, but this does not change the functionality of the interface, and it is even likely that it can be used directly from Swift.

The API reveals one last interesting feature, which is the ability to declare one’s risk to the system in case of suspected contamination, even before formal diagnosis, in order to warn contacts in cascade; with a lesser intensity, of course. This will still have to go through the health authorities; it remains to be seen what they will do with it.

The Apple and Google system: the unsaid

As we can see, while the system documentation mentions the role of the server, it remains silent on how many exist, how it or they will be managed, and by whom. Multiplying servers would be harmful insofar as it would force the installation of as many applications as there are active servers in order to get decent coverage (both to be able to retrieve the information broadcast by each of these servers, and to be able to send to each of these servers if need be); this could be acceptable (though we could do without it) in special cases, such as cross-border workers, but it is completely unacceptable for the average user.

The two companies have indicated their intention to severely restrict access to these APIs: I quote their joint document gathering answers to frequently asked questions:

10. How will applications obtain authorization to access this system?

Applications will receive this authorization based on a specific set of criteria designed to ensure they are only administered in conjunction with public health authorities, meet our privacy requirements, and protect user data.

These criteria are specified separately in agreements that developers have to accept in order to use the APIs, and are organized around the principles of functionality and privacy protection. There will be restrictions on the data that applications will be able to collect when using the API, including the inability to access location services, as well as limitations on how the data can be used.

Apple in particular has enforcement measures to make sure this is applied: not only can an application in general not be distributed (beyond a handful of copies) without their approval, but it is certain that an entitlement will be needed to access the service from the sandbox (the “sandbox” that hosts applications in iOS, restricting their access to the operating system to the necessary services barring an exemption, recognized only if signed by Apple), with precious few entities potentially able to claim it (mostly state-level ones); sorry for those who would like to play with it: com.apple.developer.exposure-notification will be the most restrictively distributed entitlement of all those available for the iPhone… It is a safe bet that Apple will not hesitate in the least to invalidate an entitlement that leaked or was abused.

Given the arbiter position of at least Apple, the absence of any rule, or even recommendation, on server multiplication puzzles me, then. I can conceive that neither Apple nor Google wishes to expose itself further to accusations of placing itself above nation-states, but it would be desirable to have confirmation that no more than one server would be allowed per country (defined as an entity having diplomatic representations or the equivalent), without this hampering the ability of each country to have several applications as needed, as long as those are based on the national server. I hope of course that the EU will set up a common server that all its countries will adopt, and ideally there would be a single one per continent, but it seems difficult to designate for each continent the political entity that could take this on (as for having a single worldwide server, if that were viable both politically and technically, Apple and Google would already have taken care of it).

Moreover, the documentation mentions a second phase where the system could be enabled “out of the box” by the operating system without any application installed, with a suggestion to install one of the applications if the system determines a potentially contaminating contact; but if the “out of the box” system can determine this, that implies retrieving data from a server without any app having been installed, so which one is used?

And otherwise, no clarification on the role of health authorities in ensuring the reliability of the information broadcast by their server. I am thinking of an incident that was reported to me when I worked in telecoms: wireless communication networks for police and firefighters have a “distress call” feature that these customers take very seriously, and one can understand why. In practice, it is a voice call, but with a flag that triggers some specific processing (of priority, notably) and raises alarms all over the place. And during a first interconnection with the district’s system, which came from a competitor, not everything went as planned. Indeed, even though the interconnection standard specified how to convey the information that a call was a distress call, it left a lot of freedom as to how to act on that information; and it turns out that at least one of the two systems considered the distress status important enough that it had to be remembered for the channel in question, until that status was explicitly cleared by a supervisor. Which is not absurd in itself, but if the channel belongs to the other interconnected system, that system, not having the same configuration, never sent such a confirmation, so that the channel was from then on considered perpetually in distress, even for ordinary calls. Fairly quickly, all the channels in this situation ended up stuck in distress with no way to bring them back down, and when every call is a distress call, well, none of them is. A fix had to be shipped urgently.

It therefore seems risky to invest resources (engineering, validation, etc.) in setting up a system without ruling out the risk that it will all be for nothing if it ends up producing too many false positives to remain usable. Rules to ensure the reliability of the information broadcast by the server or servers seem indispensable to me.

Finally, still on the geographical matter, a risk (raised by Moxie Marlinspike) exists where the database to download could become too large (particularly if the epidemic flares back up) to remain practical to download, such that it would become necessary to partition it according to location… thus reintroducing geolocation for this purpose, ruining part of the assurances offered. As with server multiplication, I think this is a matter on which Apple and Google should state their intentions.

The standoff

StopCovid, as with many other projects, was started before the Apple/Google initiative was made public; the project set out from different principles, a choice which I respect in itself; the protocol, called ROBERT, is documented. The choice was notably made of an architecture where contamination determination is centralized, with the benefits and drawbacks this entails; I won’t go over them again.

As with the matter of server multiplication, we could already question the need for protocol multiplication: will every user need to install one application per protocol? But that is not where the issue currently lies: since the ROBERT protocol relies on Bluetooth, as expected, its implementation on iPhone runs into well-established restrictions by Apple on the ability of an app that is not in active use to use Bluetooth; these restrictions are documented by Apple, even if in vague terms; I have no doubt the project was able to assess them first hand. They aim at preserving battery life and protecting privacy. They drastically reduce the viability of the solution.

Technologically, it was important to try, if only in order to be credible and not depend on a partner that would have been chosen for lack of any alternative. The StopCovid project, notably through its experiments on (approximate) distance measurement using Bluetooth, has already accomplished advances which will be more generally useful, and as a project it could go forward within the framework of a more general protocol; adopting another protocol would not mean the end of the project.

Because let’s be clear: the media mention systems that could be “made compatible”, but the only way I know of to make two systems communicate is to have them implement the same protocol. There can be two different implementations, but they must implement the same protocol.

For when confronted with these restrictions, which are nothing new, you have chosen the path of a standoff with Apple: you express surprise that Apple would not lift these restrictions on demand, and you insist that the project be delivered to production according to its current principles, invoking a matter of technological sovereignty.

Such a standoff is futile and even harmful, and for at least two reasons.

Futile for now

Beyond these restrictions, Apple has more generally established the principle that they control the software that can be distributed on the mobile devices bearing their brand. Over the years this control has sat less and less well with me, such that I have looked for alternatives, in particular through the web, to distribute my personal projects. This is becoming viable for some applications, such as file processing, but when it comes to Bluetooth there is no alternative to the “native” SDK.

So even if I personally object to some of Apple’s policies, do you seriously believe that after they spent 12 years dictating their conditions as they saw fit for iPhone software, without exception, except those whose terms they defined themselves (there are a few more, less well documented), you were going to be able to convince them or force their hand so quickly? That is delusional.

It is all the more delusional as in other situations where they were far more exposed (I am thinking in particular of a decryption request from the FBI), Apple resisted the pressure meant to make them cave, and they were, and remain, proud of it. Among other things, it is a matter of professional pride, as much about preserving battery life as about protecting privacy. Do you believe grandstanding will be enough to make them change their minds?

If you at least advanced some arguments, such as the potential privacy advantages of a centralized solution like ROBERT over their own, that is something they could be receptive to. But no, all that is offered is denunciation of their financial health (insolent, to be sure, but what does that have to do with anything?) or invocations of technological sovereignty.

You could get your way through one legal constraint or another, but it is certain this will take time, a lot of time. So defending France’s technological sovereignty could make sense… in other circumstances. Because, I don’t know if you have noticed, but France is currently in the middle of a pandemic. And France will only eventually be able to rid itself of it through herd immunity; and I don’t know about you, but I would rather acquire mine through an eventual vaccine, or failing that as late as possible. And by the time you forced Apple’s hand through legal means, I believe a vaccine will have been found.

Your insistence on technological sovereignty therefore tells me that asserting it right now is a more important objective for you than having an efficient contact tracing solution, one able to save lives. These priorities are backwards. Technological sovereignty matters, but there will be time to assert it later, or in other ways.

Maybe it is unfair for Apple to dictate its terms in this way. Maybe it ought to be otherwise. But in the meantime, here and now, they hold the keys to the system.

Futile in the long run

Let us now suppose Apple gives you the keys. Your troubles are not over. Because the reasons for which they set up these restrictions in the first place have not gone away, notably battery preservation.

So what guarantees that your solution will not drain the battery too quickly? In particular, you will have a hard time finding skills on the developer market when it comes to reasonable usage of Bluetooth once these restrictions are lifted: even those already doing so on Android may be in for a few surprises once on iPhone.

This is particularly important because iPhones remain in active use far longer than Android devices, thus often with a degraded battery. I myself still have not replaced my more than 6-year-old iPhone 5S, which still works fine, except that its battery life is no longer what it once was, and I don’t think I am alone. The question is not merely that I will have to pay more attention to my battery level once StopCovid is installed; the question is how the general public will react to an application that, once installed, drains the battery more quickly, possibly leading to it running out if they are not careful. At least a fraction of them will disable the application. And did we not say earlier how important sufficient penetration was for the system to function?

Once again, the established situation matters, and the established situation is that Apple keeps maintaining many devices in the wild that others would consider obsolete (but which are equipped with Bluetooth Low Energy), and any shift in this equilibrium will be noticed. For instance, do you want to expose yourself to accusations of encouraging premature device renewal, once the additional battery drain of StopCovid becomes apparent, potentially forcing some to replace hardware that previously served them better?

Vous pourrez me dire qu’Apple eux-mêmes risque de susciter aussi un drain augmenté de la batterie avec l’implémentation de leur système. C’est possible, mais je leur fais plus confiance dans ce domaine qu’aux développeurs de StopCovid, sans vouloir les offenser.

Conclusion

Je refuse la fausse alternative proposée par certains commentateurs, qui voudraient réduire la choix à l’entité à laquelle on accepte de se confier. Avec le bon protocole, la bonne architecture, on peut éviter de devoir faire confiance à quelqu’un plus qu’on le voudrait et concilier des exigence apparemment opposées.

L’Allemagne, après avoir commencé de son côté, a annoncé rejoindre le système de Apple et Google, mais cela ne l’empêchera pas de proposer son application sur la base de ce système. Qu’est-ce qui empêche la France de faire de même? Il est de l’intérêt général d’éviter la balkanisation des solutions, et la France n’est pas ici en position de force.

Pope Cancels April The First

In an unexpected decree entitled Inter Gravissimas, Pope Francis enacted a calendar reform containing but a single change: the skipping of April 1st, 2020. March 31st 2020 is to be followed directly by April 2nd, 2020; however, weekdays are to follow their usual succession; as a result April 2nd 2020, following Tuesday March 31st, is to be a Wednesday, contrary to expectations (Editor’s note: this blog publishing software could not be corrected in time for publication, so it is erroneously going to show April 1st at first. We apologize for the confusion).

“It has come to our attention that the celebration of Easter is drifting out of sync with the cycle of seasons”, wrote Pope Francis as a way of exposing the motives for the change. “To avoid the error becoming any greater, we are correcting the calendar in this way while there is still time to do so before Easter 2020.” This decision has baffled all observers: while the current calendar does have some systemic error, it was not expected to have accumulated to a full day, and the emergency is all the more questionable given that Easter can fall on any of about 36 days in the calendar, making a shift of a few hours compared to the cycle of seasons hardly visible.

Keen observers remark that this may be an attempt by Vatican City to get rid of a celebration they don’t otherwise control. While the papal authorities obviously cannot repeat this every year or otherwise eliminate the day from the calendar going forward, it is rumored they hope the cancellation of this year’s celebrations will disrupt observance of the event and dissuade attempts to reintroduce it.

“This calendaring decision is entirely within the Pope’s prerogatives”, a Vatican City spokesperson was quoted as saying. That may be true inside the Vatican and the Catholic church in general, where the Pope concentrates both legislative and executive powers in his hands. “And who do you think is in charge of the World’s calendars? Julius Caesar? Get real.”, the spokesperson continued.

Meanwhile, all the major software vendors appeared to be scrambling to develop an emergency update to take this change into account, as evidenced by the banners appearing on their web sites apologizing for the erroneous dates being currently shown in their interfaces. However, none of Microsoft, Google, Apple, YouTube, Amazon, Blizzard, Stack Overflow, or Facebook were available for comment, and they have seemingly scrapped all their other plans for the day.

What are the very long-term solutions to Meltdown and Spectre going to look like?

While the long-term solutions to Meltdown and Spectre we saw earlier can and probably will be made to work on mainstream hardware, their inefficiencies call into question the fundamental operation of our processors: rather than try and contain speculation, why not avoid the need for speculation in the first place, for instance? Or, on the contrary, embrace it? But revisiting the design space of processors requires taking a long look at why they work that way in the first place, and whether these assumptions are still suitable.

Inspired by the fact that many solutions to Spectre rely on a model of the speculative execution behavior of the processor, e.g. “Modern CPUs do not speculate on bit masking” (which is fine for short-term mitigations: we can’t really throw away every single computer and computing device to start over from scratch and first principles at this point), some suggest we should officially take the speculative behavior of the processor into account when developing/compiling software, typically by having the processor expose it as part of its model. This is completely untenable long-term: as processors evolve and keep relying on speculation to improve performance, they will end up constrained by the model of speculative execution they initially exposed and that existing programs started relying on, i.e. future processors will have to refrain from speculating even an iota more than they currently do (but still speculate, such that we will keep having to maintain the mitigations). Or else they will improve speculation anyway, in which case we will have to reanalyze all code and update/add mitigations every time a new processor comes out (not to mention the inherent window of vulnerability). Possibly both. Either way this is untenable, as the story of delay slots taught us.

In case you’re new to the concept of delay slots, they were a technique introduced by Berkeley RISC to allow pipelining instructions while avoiding bubbles every time a branch is encountered: the instruction(s) following the branch are executed no matter what (slide 156 out of 378), and any jump only occurs after that. Some RISC architectures such as MIPS and SPARC adopted them, and it worked well. That is, until it was time to create the successor design, with a longer pipeline. You can’t add an additional delay slot, since that would break compatibility with all existing code, so you instead compensate in other ways (branch prediction, etc.), such that delay slots are no longer useful; but you still have to support the existing delay slots, since doing otherwise would also break compatibility. In the end, every future version of your processor has to simulate the original pipeline, because you encoded it as part of the ISA. Oh, and when I said delay slots initially worked well, that was a lie: if your processor takes an interrupt just after a branch and before a delay slot, how can the work be resumed? Returning from the interrupt cannot simply be jumping back to the address of the instruction after the branch; there is also the delayed-branch state to be taken care of. Solutions to these issues were found, obviously, but they singularly complicate interrupt handling.

A separate insight of RISC in that area is that, if we could selectively control which instructions write to the flags, we could prepare the flags well ahead of when they are needed, letting the processor know ahead of time whether the branch will be taken, early enough that the next instruction can be fetched without any stall, removing the need for either delay slots or branch prediction. That is often useful for “runtime configurable” code, and sometimes for more dynamic situations. Compared to delay slots it is a more flexible tool: it does not have the exception-related issues, and the compiler can schedule as many instructions between the test and the branch as the program allows, rather than being limited by an architecturally-defined delay slot size (and conversely, if it can’t, it does not have to waste memory filling slots with nops). However, in many cases the compiler has few or no instructions to put between the test and the branch, so the technique shares many of the same issues as delay slots: as pipelines get deeper, there will be more and more cases where processors have to resort to branch prediction and speculative execution to keep themselves fed when running existing code, using knowledge that is only available at runtime, since the processor has access to context the compiler does not. Furthermore, the internal RISC engine of current x86 processors actually goes as far as fusing conditional branches together with the preceding test or comparison instruction, suggesting such a fused instruction is more amenable to dynamic scheduling.
RISC-V has an interesting approach: it does away with flags entirely (not even multiple flag registers as in PPC for instance (cr0 to cr7)), using instead such fused compare-and-branch instructions… but it is still possible to compute the outcome of a test well ahead of the branch that requires it, simply by setting a regular register to 0 or 1 depending on the test outcome, then having the fused branch compare this register with 0, and presumably implementations will be able to take advantage of this.
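
To make the idea concrete, here is a sketch in C (names and the assembly mapping in the comments are illustrative, not taken from any particular compiler's output) of hoisting the test away from the branch, as a compiler for a flag-free ISA could do:

```c
#include <assert.h>

/* Sketch: compute the branch condition early, into an ordinary
   variable standing in for a general-purpose register; consume it
   later. On RISC-V this would map to something like
     slt  t0, a1, a0    # t0 = (limit < value), computed early
     ...                # independent work scheduled in between
     bnez t0, clamp     # fused compare-and-branch against zero */
int clamp(int value, int limit) {
    int taken = (value > limit);  /* test, hoisted ahead of the branch */
    /* ...independent instructions could be scheduled here... */
    if (taken)                    /* branch consumes the prepared bit */
        return limit;
    return value;
}
```

The further apart the test and the branch, the more likely the outcome is already known when the branch is fetched.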

Generally, there is an unavoidable tension due to the straitjacket of sequential instruction execution, a straitjacket which is itself unavoidable given the need to be able to suspend processing, then resume where it left off. How could we better express our computations in such a way that hardware can execute a lot of it at once, in parallel, while being formally laid out as a sequence of instructions? For instance, while it could be desirable to have vectors of arbitrary length rather than fixed-size ones (as in MMX, Altivec, SSE, NEON, etc.), doing so raises important interruptibility challenges: the Seymour Cray designs, whether at CDC or Cray, supported neither interrupts nor demand-paged virtual memory! If we give up on those, we give up on the basis of preemptive multitasking and memory protection, so we’d end up with a system that would be worse than MacOS 9; and while MacOS 9 went to heroic lengths in making a cooperative multitasking, unprotected memory model work (and even that OS supported interrupts), no one who has known MacOS 9 ever wants to go back to it: it is dead for a reason (from the story of Audion). Alternatively, we could imagine fundamentally different ways of providing the high-level features, but then you still have to solve the issue of “what if the software developer made a mistake and specified an excruciatingly long vector, such that it will take 10 seconds to complete processing?” So either way, there would need to be some way to pause vector processing to do something else, then resume where it left off, which imposes severe constraints on the instruction set: RISC-V is going to attempt that, but I do not know of anyone else.

One assumption we could challenge is whether we need to be writing system software in C and judge processors according to how well they execute that software (via). To my mind, Spectre and Meltdown are as much the result of having to support Unix or systems with Unix-ish semantics, or possibly even just fast interrupts, as they are the result of having to support C: flat memory; context switching/preemptive multitasking hacked on with the absolute minimum of hardware support (OSes have to manually read, then replace, every bit of architectural state individually!), which itself implies limited architectural state, which results in lots of implicit micro-architectural state to compensate; traps likewise hacked on with minimal hardware support, in particular no quick way to switch to a supervisor context to provide services short of a crude memory-range-based protection (and now we see how that turned out)¹; no collection types in the syscall interface (ever looked at select()?), thus forcing all interactions to go through the kernel reading a block of memory from the calling process address space even for non-scalar syscalls; mmap(); etc. However, especially in the domain of traps, it will be necessary to carefully follow the end-to-end principle: the history of computing is littered with hardware features that ended up unused for failing to provide a good match with the high-level, end-to-end need. This is visible for instance in the x86 instruction set: neither the dedicated device I/O instructions (IN/OUT) nor protection rings 1 and 2 (user-level code uses ring 3, and kernels use ring 0) are ever used in general-purpose computing (except maybe on Windows for compatibility reasons).

But Spectre and Meltdown are also indeed the result of having to support C; in particular, memory synchronization primitives are not a very good match for common higher-level language operations, such as reference assignment (whether in a GC environment or an ARC environment). Unfortunately, a lot of research in that area ends up falling back to C…

Revisiting the assumption of C and Unix is no coincidence. While our computers are in many ways descendants of microcomputers, the general design common to current microprocessor ISAs is inherited from minicomputers, and most of it specifically from the PDP-11, where both C and Unix were developed; this includes memory mapped and interrupt-based device I/O rather than channel I/O, demand-paged virtual memory, the latter two of which imply the need to fault at any time, both byte and word addressing, etc. This in turn shapes higher level features and their implementation: preemptive multitasking, memory protection, IPC, etc. Alternatives such as Lisp machines, Intel 432, Itanium, or Sun Rock did not really pan out; but did these failures disprove their fundamental ideas? Hard to tell. And some choices from the minicomputer era, most of which were related to offloading responsibilities from hardware to software, ended up being useful and/or sticking for reasons sometimes different from the original ones, original ones which could most often be summarized as: we need to make do with the least possible amount of transistors/memory cells (parallels with evolutionary biology: an adaptation that was developed at one time may find itself repurposed for a completely different need). For instance, the PDP-11 supported immutable code (while some other archs would assume instructions could be explicitly or sometimes even implicitly modified, e.g. 
as late as 1982 the Burroughs B4900 stored the tracking for branch prediction directly in the branch instruction opcode itself, as I found out while researching the history of branch prediction…), which was immediately useful for having programs directly stored in ROM instead of having to be loaded from mass storage and occupy precious RAM, was also useful because it enabled code sharing (the PDP-11 supported PIC in addition), but today is also indispensable for security purposes: it’s most definitely safer to have mutable data be marked as no-exec, and therefore have executable code be immutable. In the same way, minicomputers eschewed channel I/O to avoid an additional processing unit dedicated to device I/O, thus saving costs and enabling them to fit in a refrigerator-sized enclosure rather than require a dedicated room with raised flooring; but nowadays having the CPU able to interrupt its processing (and later resume it) is mandatory for preemptive multitasking and real-time purposes such as low-latency audio (think Skype). As a result, it is not possible to decide we can do without feature X of processors just because its original purpose is no longer current: feature X may have been repurposed in the meantime. In particular, virtual memory faults are used everywhere to provide just-in-time services, saving resources even today in the process (physical pages not allocated until actually used, CoW memory, etc.). Careful analysis, and even trial and error (can we build a performant system on this idea?), are necessary. As a significant example, we must not neglect how moving responsibilities from hardware to software enabled rapid developer iteration on the UX of computer monitor functionality (job control, I/O, supervision, auditing, debugging, etc.).
Moving back any of this responsibility to hardware would almost certainly cause the corresponding UX to regress, regardless of the efforts the hardware made towards it: the lack of rapid iteration would mean any shortcoming would remain unfixable.

Now that I think about it, one avenue of exploration would be to try and build on a high-level memory model that reflects the performance characteristics of the modern memory hierarchy. Indeed, it is important to realize uniform, randomly addressable memory hasn’t been the “reality” for some time: for instance, row selection in JEDEC protocols (DDR, etc.) means it’s faster to read contiguous blocks than random ones, and besides that we have NUMA, and caches of course (need a refresher? Ulrich Drepper’s got you covered). That being said, the Von Neumann abstraction will remain true at some level. So the game here would be not so much to abandon it as to build a more structured abstraction above it that better matches the underlying performance characteristics.

As you can see, this is still very much an open research space. In the meantime, we will have to make do with what we have.

¹e.g. we could imagine a system similar to fasttraps (themselves typically used for floating-point assistance) where a new syscall instruction would take as arguments the address and number of bytes of memory to copy (even if more may end up being necessary, that would still allow processing to start), and automatically switch to the kernel address space, so that the processor would best know how to manage the caches (in relation to address translation caching for instance) instead of second-guessing what the program is attempting to do.

How to design an architectural processor feature, anyway?

Before I present what could be the very long term solutions to Meltdown and Spectre, I thought it would be interesting to look at a case study in how to (and how not to) implement processor features.

So, imagine you’re in charge of designing a potential replacement for the 6809, and you read this article, with the takeaway that, amazing as it is, that hack would quickly become insufficient given the increase in screen size and resolution (VGA is just around the corner, after all) that is going to outpace processor clock speed.

Of course, one of the first solutions for this issue would be to have better dedicated graphics capabilities, but your new processor may be used in computers where there is zero such hardware, or even if there is, there are always going to be use cases that fall through the cracks and are not well supported by such hardware, use cases where programmers are going to rely on your processor instead. In fact, you don’t want to get too tied up to that particular scenario, and instead think of it as being merely the exemplar of a general need: that of efficient, user-level memory copy of arbitrary length between arbitrary addresses that integrates well with other features, interrupts in particular. That being set up, let us look at prospective solutions.

Repeat next instruction(s) a fixed number of times

That one seems obvious: a new system that indicates the next instruction, or possibly the next few instructions, is to be executed a given number of times, not just once; the point being to pay the instruction overhead (decoding, in particular) only once, then having it perform its work N times at full efficiency. This isn’t a new idea, in fact, programmers for a very long time have been requesting the possibility for an instruction to “stick” so it could operate more than once. How long? Well, how about for as long as there have been programmers?

However, that is not going to work that simply, not with interrupts in play. Let us look at a fictional instruction sequence:

IP -> 0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

But in the middle of the copy, an interrupt is received after the MOV instruction has executed twice, with two more executions remaining. Now does our state (as we enter the interrupt handler) look like this:

      0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

      0100 0001 (saved IP)
SP -> 0102 XXXX

In which case, when we return from the interrupt the MOV will only be executed once more, making it executed only 3 times in total, rather than the expected 4, wreaking havoc in the program; so can we provide this instead:

      0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

      0100 0000 (saved IP)
SP -> 0102 XXXX

Well, no, since then upon return from the interrupt execution will resume at the REP instruction… in which case the MOV instruction will be executed 4 times, even though it has already executed twice, meaning it will execute 2 extra times and 6 times in total.

It’s not possible to modify the REP instruction, since your processor has to support executing code directly from ROM given the price of RAM (and making code read-only is valuable for other reasons, such as being more secure or allowing it to be shared between different processes). How about resetting X and Y to their starting values and restarting all iterations upon return? Except the operation of the whole loop is not idempotent if the two buffers overlap, and there is no reason not to allow that (e.g. memmove allows it), so restarting the whole sequence is not going to be transparent. What about delaying interrupts until all iterations are completed? With four iterations that might be acceptable, but given your processor clock speed, as little as 16 iterations could raise important issues in the latency of interrupt handling, such that real-time deadlines would be missed and sound would be choppy.

Whichever way we look at it, this is not going to work. What will?

Potential inspiration: the effect on the architectural state of the Thumb (T32) If-Then (IT) instruction

Conditional execution (or perhaps better said, predicated execution) is pervasive in ARM, and it is possible in Thumb too, but the latter case requires the If-Then instruction:

(any instruction that sets the EQ condition code)
IP -> 00000100 ITTE EQ
      00000102 ADD R0, R0, R1
      00000104 ST R5, [R3]
      00000106 XOR R0, R0, R0

SP -> 00000200 XXXXXXXX (don’t care)
      00000204 XXXXXXXX

And as if by magic, the ADD and ST instructions only execute if the EQ condition code is set, and XOR, corresponding to the E (for else) in the IT instruction, only executes if the EQ condition code is *not* set, as if you had written this:

(any instruction that sets the EQ condition code)
IP -> 00000100 ADD.EQ R0, R0, R1
      00000102 ST.EQ R5, [R3]
      00000104 XOR.NE R0, R0, R0

That might appear to raise interruptibility challenges as well: what happens if an interrupt has to be handled just after the ADD instruction, or when the ST instruction raises a page fault because the address at R3 must be paged back in? Because when execution resumes at ST, what is to stop XOR from being unconditionally executed?

The answer is ITSTATE, a 4-bit register that is part of the architectural state. What the IT instruction actually does is:

  • take its immediate bits (here, 110), and combine them using a negative-exclusive-or with the repeated condition code bit (we’re going to assume it is 111)
  • set ITSTATE to the result (here, 110), padding missing bits with ones (final result here being 1101)

And that’s it. What then happens is that nearly every T32 instruction (BKPT being a notable exception) starts operation by shifting out the first bit from ITSTATE (shifting in a 1 from the other side), and avoids performing any work if the shifted-out bit was 0.

This means you never need explicitly invoke ITSTATE, but it is very much there, and in particular it is saved upon interrupt entry, which ARM calls an exception, and restored upon exception return, such that predicated processing can resume as if control had never been hijacked: upon exception return to the ST instruction, ST will execute, then XOR will not since it will shift out a 0 from ITSTATE, the latter having been restored on exception return.
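
A minimal model of this mechanism in C may make it clearer; this is a sketch of the shift-out behavior described above, not the exact ARM encoding (in particular the type and function names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of ITSTATE: each predicated instruction shifts one bit out
   of a 4-bit state and executes only if that bit was 1. Ones are
   shifted in from the other side, so instructions past the IT block
   always execute. Because this state is saved and restored across
   exceptions, predication survives interrupts transparently. */
typedef struct { uint8_t bits; } itstate_t; /* low 4 bits used */

int itstate_advance(itstate_t *s) {
    int execute = (s->bits & 0x8) != 0;    /* shifted-out (top) bit */
    s->bits = ((s->bits << 1) | 1) & 0xF;  /* shift, insert a 1 */
    return execute;
}
```

Starting from the 1101 value from the ITTE EQ example (EQ set): the ADD and ST shift out ones and execute, the XOR shifts out a zero and is skipped, and any later instruction shifts out the padded ones and executes normally.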

The lesson is: any extra behavior we want to endow the processor with needs to be expressible as state, so that taking an interrupt, saving the state, and later restoring the state and resuming from a given instruction results in the desired behavior being maintained despite the interrupt.

Repeat next instruction(s) a number of times given by state

Instead of having REP N, let us have a counter register C, and a REP instruction which repeats the next instruction the number of times indicated in the register (we need two instructions for this, as we’re going to see):

IP -> 0000 MOV 4, C
      0001 REP C
      0002 MOV X++, Y++
      0003 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

Now if an interrupt occurs after two iterations, the state is simply going to be:

      0000 MOV 4, C
      0001 REP C
      0002 MOV X++, Y++
      0003 RTS

      0100 0001 (saved IP)
SP -> 0102 XXXX

With C equal to 2. Notice the saved IP points after the MOV to the counter register, but before the REP C; that way, when execution resumes on the REP instruction, the memory-to-memory MOV instruction is executed twice more and the end state will be the same as if all four iterations had occurred in sequence without any interrupt.

Applied in: the 8086 and all subsequent x86 processors, where REP is called the REP prefix and is hardwired to the CX register (later ECX), and you can use it for memory copy by prepending the MOVS instruction with it (instruction which is itself hardwired to SI (later ESI) for its source, and DI (later EDI) for its destination).
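
The crucial property is that the entire loop state lives in architectural registers (count, source, destination), so an interrupt between iterations loses nothing. A C sketch of this resumable behavior (the struct and the `budget` parameter standing in for an interrupt are illustrative, not how any real CPU is implemented):

```c
#include <assert.h>
#include <stddef.h>

/* Model of REP MOVS-style state: count, source and destination are
   all architectural registers (CX, SI, DI on the 8086). */
typedef struct { size_t cx; const char *si; char *di; } cpu_regs_t;

/* Run at most 'budget' iterations, as if an interrupt arrived then.
   Resuming is simply calling again with the saved registers. */
void rep_movsb(cpu_regs_t *r, size_t budget) {
    while (r->cx > 0 && budget-- > 0) {
        *r->di++ = *r->si++; /* one elementary memory-to-memory move */
        r->cx--;             /* counter updated after each iteration */
    }
}
```

No hidden state needs recovering: after an "interrupt" mid-copy, the three registers describe exactly where the copy left off.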

Load/store multiple

The REP C instruction/prefix system has a number of drawbacks; in particular, in order to play well with interrupts, as we just saw, it requires recognizing when handling the interrupt that we are in a special mode, followed by creating the conditions necessary for properly resuming execution. It also requires the elementary memory copy to be feasible as a single instruction, which is incompatible with RISC-style load-store architectures, where an instruction can either load or store memory, not both.

We can observe that the REP C prefix, since it is only going to apply to a single instruction, will not serve many use cases anyway, so why not instead dedicate a handful of instructions to the memory copy use case and scale the PUL/PSH system with more registers?

That is the principle of the load and store multiple instructions. They take a list of registers on one side, and a register containing an address on the other, with additional adjustment modes (predecrement, postincrement, leave unchanged) so as to be less constrained than in the PUL/PSH case. Such a system requires increasing the number of registers in the architectural state so as to amortize the instruction decoding cost, an increase which will add to context switching costs, but we were going to do that anyway with RISC.

So now our fictional instruction sequence can look like this:

IP -> 0000 LOADM X++, A-B-C-D
      0001 STOM A-B-C-D, Y++
      0002 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

We still have to promptly handle interrupts, but for the load/store multiple system the solution is simple, if radical: if an interrupt occurs while such an instruction is partially executed, execution is abandoned in the middle, and it will resume at the start of the instruction when interrupt processing is done. This is OK, since these loads and stores are idempotent: restarting them will not be impacted by any previous partial execution they left over (of course, a change to the register used for the address, if any such change is required, is done as the last step, so that no interrupt can cause the instruction to be restarted once this is done).

Well, this is only mostly OK. For instance, some of the loaded registers may have relationships with one another, such as the stack pointer (in the ABI if not architecturally), and naively loading such a register with a load multiple instruction may violate the constraint if the instruction is restarted. Similar potentially deadly subtleties may exist, such as in relation to virtual memory page faults, where the operating system may have to emulate operation of the instruction… or may omit to do so, in which case load/store multiple instructions are not safe to use even if the processor supports them! I think that was the case for PowerPC lmw/stmw in Mac OS X.

Sidebar: how do you, as a software developer, know whether it is safe to use the load and store multiple instructions of a processor if it has them? An important principle of assembly programming is that you can’t go wrong by imitating the system C compiler, so compile this (or a variant) at -Os, or whichever option optimizes for code size, to asm:

#include <stddef.h> /* for size_t */

struct package
{
    size_t a, b, c;
};

/* Copying a three-word struct: a natural candidate for a load
   multiple followed by a store multiple. */
void packcopy(struct package* src, struct package* dst)
{
    *dst = *src;
}

if this compiles to a load multiple followed by a store multiple, then those instructions are safe to use for your system.

Applied in: the 68000 (MOVEM), PowerPC, ARM (where their use is pervasive, at least pre-ARM64), etc.

decrement and branch if not zero

One observation you could make about the REP C system is that it is best seen as implicitly branching back to the start of the instruction each time it is done executing, so why not make that a plain old branch located after the instruction rather than a prefix? Of course, that branch would handle counter management so that repetition could still be contained in a single instruction, but now repetition can be handled with more common test+branch mechanisms, simplifying processor implementation, especially as it relates to interrupt management, and generalizing to loops involving more than one instruction, meaning there is no need for the elementary copy to be a single instruction:

IP -> 0000 MOV 4, C
loopstart:
      0001 LOAD X++, A
      0002 STO A, Y++
      0003 DBNZ C, loopstart
      0004 RTS
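For reference, the listing above corresponds to something like the following C, which is my own illustration rather than anything from the original article; a compiler for a 68010-class target could lower the counted-down while loop to exactly this kind of decrement-and-branch loop:

```c
#include <stddef.h>

/* Illustrative C equivalent of the DBNZ copy loop (names are mine):
   copy `count` words from src to dst, counting the loop down to zero. */
void copy_words(const size_t *src, size_t *dst, size_t count)
{
    while (count-- > 0) {
        *dst++ = *src++; /* LOAD X++, A then STO A, Y++ */
    }
}
```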

From that framework, you can design the processor to try to recognize cases where the body of the loop is only 1 or 2 instructions long, and handle those specially by no longer fetching or decoding instructions while the loop is ongoing: instead, it repeats operation of the looped instructions. In that case it still needs to handle exiting that mode in case of an interrupt, but at least it can decide by itself whether it can afford to enter that mode: for instance, depending on the type of looped instruction, it could determine that it would not be able to cleanly exit in case of interrupt, and execute the loop the traditional way instead.

The drawback is that it is a bit all-or-nothing: the loop is either executed fully optimized or not at all, and the analysis becomes less and less trivial as we want to support longer loop bodies. Regardless of the size of the loop, if there is a single instruction in the loop body, or a single instruction transition, that the engine cannot set up to loop in a way that lets it handle interrupts, then the whole loop is executed slowly. That being said, it does handle our target use case as specified.

Applied in: the 68010 and later 68k variants such as CPU32-based microcontrollers, where the DBRA/DBcc instructions could trigger a fast loop feature in which instruction fetches are halted and operation of the looped instruction is repeated according to the counter.

instruction caches, pipelining, and branch prediction

You could look at the complexity of implementing interrupt processing in any of these features and conclude that you could almost as easily implement a proper pipeline, including handling interrupts while instructions are in flight, and end up supporting not just this use case, but much more general speedups, just as efficiently. After all, the speed of a memory copy is going to be constrained by the interface to the memory bus; your only contribution is to reduce instruction fetching and decoding overhead as much as possible, which is accomplished if that happens in parallel with the memory copy of the previous instruction.

Accomplishing that also requires a dedicated instruction cache, so instructions can be fetched in parallel with data, but integrating a small amount of memory cells on your processor die is getting cheaper by the day. And keeping the pipeline fed when branches are involved, as here with loops, will require you to add non-trivial branch prediction, but you can at least get loops right with a simple “backward branches are predicted taken” approach. It turns out that simple branch predictors work well in real-life code for branches beyond loops, compensating for the effects of pipelining elsewhere (and if you make the predictor just a little bit more complex you can predict even better, and then a little more complexity will improve performance still, etc.; there is always a performance gain to be had).
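That static heuristic fits in a few lines of C; this is a minimal sketch of my own, not code from any real predictor. The predictor looks only at the branch’s address and its target: a branch that jumps backward is assumed to be a loop back edge, and is therefore predicted taken:

```c
#include <stdbool.h>
#include <stdint.h>

/* Static "backward taken, forward not taken" branch prediction: a branch
   whose target lies at a lower address than the branch itself is assumed
   to be a loop back edge, hence predicted taken. */
bool predict_taken(uint32_t branch_addr, uint32_t target_addr)
{
    return target_addr < branch_addr;
}
```

On the DBNZ loop above, this predicts the backward branch correctly on every iteration except the last, when the loop finally falls through; that loop-exit misprediction is inherent to the heuristic.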

Applied in: all microprocessors used in general-purpose computing by now, starting in the beginning of the 90’s. For instance, x86 processors have an instruction (LOOP) implementing the decrement and branch if not zero functionality, but its use is now discouraged (see section 3.5.1, Instruction Selection, in the Intel optimization reference manual), and modern Intel processors recognize loops even when they use generic instructions and optimize for them, except for the loop exit, which keeps being mispredicted.

With all that in mind, next time we’re going to be able to look at how to redesign our processors to avoid the situation that led us to rampant, insecure speculation in the first place.