What are the very long-term solutions to Meltdown and Spectre going to look like?

While the long-term solutions to Meltdown and Spectre we saw earlier can and probably will be made to work on mainstream hardware, their inefficiencies call into question the fundamental operation of our processors: rather than try and contain speculation, why not avoid the need for speculation in the first place, for instance? Or, on the contrary, embrace it? But revisiting the design space of processors requires taking a long look at why they work that way in the first place, and whether these assumptions are still suitable.

Inspired by the fact that many solutions to Spectre rely on a model of the speculative execution behavior of the processor, e.g. “Modern CPUs do not speculate on bit masking” (which is fine for short-term mitigations: we can’t really throw away every single computer and computing device to start over from scratch and first principles at this point), some suggest we should officially take the speculative behavior of the processor into account when developing/compiling software, typically by having the processor expose it as part of its model. This is completely untenable long-term: as processors evolve and keep relying on speculation to improve performance, they will end up being constrained by the model of speculative execution they initially exposed and that existing programs started relying on. Either future processors refrain from speculating even an iota more than they currently do (while still speculating, so that we keep having to maintain the mitigations), or we have to reanalyze all code and update/add mitigations every time a new processor that improves speculation comes out (not to mention the inherent window of vulnerability); possibly both. This is untenable, as the story of delay slots taught us.

In case you’re new to the concept of delay slots, they were a technique introduced by Berkeley RISC to allow pipelining instructions while avoiding bubbles every time a branch is encountered: the instruction(s) following the branch would be executed no matter what (slide 156 out of 378), and any jump would only occur after that. Some RISC architectures such as MIPS and SPARC used them, and it worked well. That is, until it was time to create the successor design, with a longer pipeline. You can’t add an additional delay slot, since that would break compatibility with all existing code, so you instead compensate in other ways (branch prediction, etc.), such that delay slots are no longer useful; but you still have to support the existing delay slots, as doing otherwise would also break compatibility. In the end, every future version of your processor ends up having to simulate the original pipeline because you encoded it as part of the ISA. Oh, and when I said delay slots initially worked well, that was a lie, because if your processor takes an interrupt just after a branch and before the delay slot instruction, how can the work be resumed? Returning from the interrupt cannot simply be jumping back to the address of the instruction after the branch, since the delayed-branch state also has to be taken care of; solutions to these issues were found, obviously, but they singularly complicate interrupt handling.

A separate insight of RISC in that area is that, if we could selectively inhibit which instructions write to flags, then we could prepare the flags well ahead of when they are needed, allowing the processor to know ahead of time whether the branch will be taken, early enough that the next instruction can be fetched without any stall, removing the need for either delay slots or branch prediction. That is often useful for “runtime configurable” code, and sometimes for more dynamic situations; however, in many cases the compiler does not have many (or any) instructions to put between the test and the branch. So while it can be a useful tool (compared to delay slots, it does not have the exception-related issues, and the compiler can provision as many instructions in there as the program allows, rather than being limited by the architecturally-defined delay slot size; conversely, if it can’t, it does not have to waste memory filling slots with nops), it also shares many of the issues of delay slots: as pipelines get deeper, there will be more and more cases where processors have to resort to branch prediction and speculative execution to keep themselves fed when running existing code, using knowledge that is only available at runtime, since the processor has access to context the compiler does not have. Furthermore, the internal RISC engine of current x86 processors actually goes as far as fusing conditional branches together with the preceding test or comparison instruction, suggesting such a fused instruction is more amenable to dynamic scheduling. RISC-V has an interesting approach: it does away with flags entirely (not even multiple flag registers, as in PPC for instance (cr0 to cr7)), using such fused compare-and-branch instructions instead… but it is still possible to put the result of a test well ahead of the branch that requires it, simply by setting a regular register to 0 or 1 depending on the test outcome, then having the fused branch’s test be a comparison of this register with 0, and presumably implementations will be able to take advantage of this.
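
To make that last point concrete, here is a contrived C sketch (the function, names and workload are mine, purely for illustration) of the flag-less pattern just described: the test outcome is materialized early into an ordinary variable (a plain register, e.g. via slt on RISC-V), independent work can be scheduled in between, and the eventual branch is just a compare-with-zero of that register (a fused beq/bne on RISC-V):

#include <stddef.h>

size_t process(int first, const int *rest, size_t n, int threshold)
{
    int over = first > threshold;   /* test outcome materialized early, in an ordinary register */
    size_t count = 0;
    for (size_t i = 0; i < n; i++)  /* independent work the compiler can schedule in between */
        count += (rest[i] != 0);
    if (over)                       /* the branch is just a compare-with-zero of that register */
        count += 1;
    return count;
}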

Generally, there is an unavoidable tension due to the straightjacket of sequential instruction execution, a straightjacket which is itself unavoidable given the need to be able to suspend processing, then resume where it left off. How could we better express our computations in such a way that hardware can execute a lot of it at once, in parallel, while still being formally laid out as a sequence of instructions? For instance, while it could be desirable to have vectors of arbitrary length rather than fixed-size ones (as in MMX, Altivec, SSE, NEON, etc.), doing so raises important interruptibility challenges: the Seymour Cray designs, whether at CDC or Cray, did not support interrupts or demand-paged virtual memory! If we give up on those, we give up on the basis of preemptive multitasking and memory protection, so we’d end up with a system that would be worse than MacOS 9; and while MacOS 9 went to heroic lengths to make a cooperative multitasking, unprotected memory model work (and even that OS supported interrupts), no one who has known MacOS 9 ever wants to go back to it: it is dead for a reason (from the story of Audion). Alternatively, we could imagine fundamentally different ways of providing the high-level features, but then you still have to solve the issue of “what if the software developer made a mistake and specified an excruciatingly long vector, such that it will take 10 seconds to complete processing?” So either way, there needs to be some way to pause vector processing to do something else, then resume where it left off, which imposes severe constraints on the instruction set: RISC-V is going to attempt that, but I do not know of anyone else.
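
For a rough idea of what resumable vector processing can look like from the software side, here is a hedged C sketch of strip mining, the loop structure the RISC-V vector extension is built around (the function and the fixed strip size are mine, standing in for what a vsetvl-style instruction would negotiate with the hardware): the only state needed to suspend and resume the computation is ordinary architectural state, namely the pointers and the remaining element count:

#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y)
{
    while (n > 0) {
        /* A real vector implementation would ask the hardware how many
           elements it will handle this round; a fixed strip stands in here. */
        size_t vl = n < 64 ? n : 64;
        for (size_t i = 0; i < vl; i++)
            y[i] = a * x[i] + y[i];
        x += vl;          /* after each strip, progress is fully captured */
        y += vl;          /* in these plain variables, so an interrupt can */
        n -= vl;          /* be taken here and the loop resumed later      */
    }
}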

One assumption we could challenge is whether we need to be writing system software in C and judge processors according to how well they execute that software (via). To my mind, Spectre and Meltdown are as much the result of having to support Unix, or systems with Unix-ish semantics, or possibly even just fast interrupts, as they are the result of having to support C: flat memory; context switching/preemptive multitasking hacked on top of the absolute minimum hardware support (OSes have to manually read, then replace, every bit of architectural state individually!), which itself implies limited architectural state, which results in lots of implicit micro-architectural state to compensate; traps also hacked on top of minimal hardware support, in particular no quick way to switch to a supervisor context to provide services short of a crude memory-range-based protection (and now we see how that turned out)¹; no collection types in the syscall interface (ever looked at select()?), thus forcing all interactions to go through the kernel reading a block of memory from the calling process’s address space even for non-scalar syscalls; mmap(); etc. However, especially in the domain of traps, it will be necessary to carefully follow the end-to-end principle: the history of computing is littered with hardware features that ended up unused for failing to provide a good match with the high-level, end-to-end need. This is visible for instance in the x86 instruction set: neither the dedicated device I/O instructions (IN/OUT) nor protection rings 1 and 2 (user-level code uses ring 3, and kernels use ring 0) are ever used in general-purpose computing (except maybe on Windows for compatibility reasons).
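
As an illustration of that select() aside, here is a minimal (hedged) example of what such a non-scalar syscall interface forces: the fd_set and timeval blocks live in the calling process’s memory, and the kernel has to read them from there (and write the results back) rather than receive them by value:

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(STDIN_FILENO, &readable);

    struct timeval timeout = { .tv_sec = 5, .tv_usec = 0 };

    /* The kernel reads the fd_set and timeval blocks from our address
       space, then writes the surviving descriptors back into them. */
    int n = select(STDIN_FILENO + 1, &readable, NULL, NULL, &timeout);
    if (n > 0 && FD_ISSET(STDIN_FILENO, &readable))
        printf("stdin is readable\n");
    return 0;
}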

But Spectre and Meltdown are also indeed the result of having to support C; in particular, memory synchronization primitives are not very good matches for common higher-level language operations, such as reference assignment (whether in a GC environment or an ARC environment). Unfortunately, a lot of research in that area ends up falling back to C…

Revisiting the assumptions of C and Unix is no coincidence. While our computers are in many ways descendants of microcomputers, the general design common to current microprocessor ISAs is inherited from minicomputers, and most of it specifically from the PDP-11, where both C and Unix were developed; this includes memory-mapped and interrupt-based device I/O rather than channel I/O, demand-paged virtual memory (the latter two of which imply the need to be able to fault at any time), both byte and word addressing, etc. This in turn shapes higher-level features and their implementation: preemptive multitasking, memory protection, IPC, etc. Alternatives such as Lisp machines, the Intel 432, Itanium, or Sun Rock did not really pan out; but did these failures disprove their fundamental ideas? Hard to tell. And some choices from the minicomputer era, most of which were related to offloading responsibilities from hardware to software, ended up being useful and/or sticking for reasons sometimes different from the original ones, original ones which could most often be summarized as: we need to make do with the least possible amount of transistors/memory cells (there are parallels with evolutionary biology: an adaptation that was developed at one time may find itself repurposed for a completely different need). For instance, the PDP-11 supported immutable code (while some other architectures would assume instructions could be explicitly or sometimes even implicitly modified; e.g. as late as 1982 the Burroughs B4900 stored the tracking for branch prediction directly in the branch instruction opcode itself, as I found out while researching the history of branch prediction…), which was immediately useful for having programs directly stored in ROM instead of having to be loaded from mass storage and occupy precious RAM, was also useful because it enabled code sharing (the PDP-11 additionally supported PIC), but today is also indispensable for security purposes: it’s most definitely safer to have mutable data be marked as no-exec, and therefore have executable code be immutable. In the same way, minicomputers eschewed channel I/O to avoid an additional processing unit dedicated to device I/O, thus saving costs and enabling them to fit in a refrigerator-sized enclosure rather than require a dedicated room with raised flooring; but nowadays having the CPU be able to interrupt its processing (and later resume it) is mandatory for preemptive multitasking and real-time purposes such as low-latency audio (think Skype). As a result, it is not possible to decide we can do without feature X of processors just because its original purpose is no longer current: feature X may have been repurposed in the meantime. In particular, virtual memory faults are used everywhere to provide just-in-time services, saving resources even today in the process (physical pages not allocated until actually used, CoW memory, etc.). Careful analysis, and even trial and error (can we build a performant system on this idea?), are necessary. As a significant example, we must not neglect how moving responsibilities from hardware to software enabled rapid developer iteration on the UX of computer monitor functionality (job control, I/O, supervision, auditing, debugging, etc.). Moving any of this responsibility back to hardware would almost certainly cause the corresponding UX to regress, regardless of the efforts towards this end by the hardware designers: the lack of rapid iteration would mean any shortcoming would remain unfixable.

Now that I think about it, one avenue of exploration would be to try and build on a high-level memory model that reflects the performance characteristics of the modern memory hierarchy. Indeed, it is important to realize uniform, randomly addressable memory hasn’t been the “reality” for some time: for instance, row selection in JEDEC protocols (DDR, etc.) means it’s faster to read contiguous blocks than random ones, and besides that we have NUMA, and caches of course (need a refresher? Ulrich Drepper’s got you covered). That being said, the Von Neumann abstraction will remain true at some level, so the game here would be not so much to abandon it as to build a more structured abstraction above it that better matches the underlying performance characteristics.
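
If you want to see that non-uniformity for yourself, here is a small hedged C sketch (buffer size and stride are arbitrary choices of mine): both walks below read exactly the same 64 MiB of data, but the sequential one streams through cache lines and DRAM rows while the strided one defeats them, and the timings typically diverge accordingly:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)

/* Touch every byte of buf once, in strides of the given size. */
static double walk(const char *buf, size_t stride)
{
    volatile long sum = 0;
    clock_t t0 = clock();
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < N; i += stride)
            sum += buf[i];
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    char *buf = malloc(N);
    if (!buf)
        return 1;
    for (size_t i = 0; i < N; i++)
        buf[i] = (char)i;
    printf("sequential walk:  %.3f s\n", walk(buf, 1));
    printf("stride-4096 walk: %.3f s\n", walk(buf, 4096));
    free(buf);
    return 0;
}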

As you can see, this is still very much an open research space. In the meantime, we will have to make do with what we have.

¹e.g. we could imagine a system similar to fasttraps (themselves typically used for floating-point assistance) where a new syscall instruction would take as arguments the address and number of bytes of memory to copy (even if more may end up being necessary, that would still allow processing to start), and automatically switch to the kernel address space, so that the processor would best know how to manage the caches (in relation to address translation caching for instance) instead of second-guessing what the program is attempting to do.

How to design an architectural processor feature, anyway?

Before I present what could be the very long term solutions to Meltdown and Spectre, I thought it would be interesting to look at a case study in how to (and how not to) implement processor features.

So, imagine you’re in charge of designing a potential replacement for the 6809, and you read this article, with the takeaway that, amazing as it is, that hack would quickly become insufficient given the increase in screen size and resolution (VGA is just around the corner, after all) that is going to outpace processor clock speed.

Of course, one of the first solutions for this issue would be to have better dedicated graphics capabilities, but your new processor may be used in computers where there is zero such hardware, or even if there is, there are always going to be use cases that fall through the cracks and are not well supported by such hardware, use cases where programmers are going to rely on your processor instead. In fact, you don’t want to get too tied up to that particular scenario, and instead think of it as being merely the exemplar of a general need: that of efficient, user-level memory copy of arbitrary length between arbitrary addresses that integrates well with other features, interrupts in particular. That being set up, let us look at prospective solutions.

Repeat next instruction(s) a fixed number of times

That one seems obvious: a new system that indicates the next instruction, or possibly the next few instructions, is to be executed a given number of times, not just once; the point being to pay the instruction overhead (decoding, in particular) only once, then having it perform its work N times at full efficiency. This isn’t a new idea, in fact, programmers for a very long time have been requesting the possibility for an instruction to “stick” so it could operate more than once. How long? Well, how about for as long as there have been programmers?

However, that is not going to work that simply, not with interrupts in play. Let us look at a fictional instruction sequence:

IP -> 0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

But suppose that in the middle of the copy, an interrupt is received after the MOV instruction has executed two times, with two more executions remaining. Does our state (as we enter the interrupt handler) now look like this:

      0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

      0100 0001 (saved IP)
SP -> 0102 XXXX

In which case, when we return from the interrupt, the MOV will only be executed once more, so it will have executed only 3 times in total rather than the expected 4, wreaking havoc in the program; so can we provide this instead:

      0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

      0100 0000 (saved IP)
SP -> 0102 XXXX

Well, no, since then upon return from the interrupt execution will resume at the REP instruction… in which case the MOV instruction will be executed 4 times, even though it has already executed twice, meaning it will execute 2 extra times and 6 times in total.

It’s not possible to modify the REP instruction, since your processor has to support executing code directly from ROM given the price of RAM (and making code be read-only is valuable for other reasons, such as being more secure or allowing it to be shared between different processes). How about resetting X and Y to their starting values and restarting all iterations on exit? Except operation of the whole loop is not idempotent if the two buffers overlap, and there is no reason not to allow that (e.g. memmove allows it), so restarting the whole sequence is not going to be transparent. What about delaying interrupts until all iterations are completed? With four iterations that might be acceptable, but given your processor clock speed, as little as 16 iterations could raise important issues in the latency of interrupt handling, such that real-time deadlines would be missed and sound would be choppy.

Whichever way we look at it, this is not going to work. What will?

Potential inspiration: the effect on the architectural state of the Thumb (T32) If-Then (IT) instruction

Conditional execution (or perhaps better said, predicated execution) is pervasive in ARM, and it is possible in Thumb too, but in that latter case it requires the If-Then instruction:

(any instruction that sets the EQ condition code)
IP -> 00000100 ITTE EQ
      00000102 ADD R0, R0, R1
      00000104 ST R5, [R3]
      00000106 XOR R0, R0, R0

SP -> 00000200 XXXXXXXX (don’t care)
      00000204 XXXXXXXX

And as if by magic, the ADD and ST instructions only execute if the EQ condition code is set, and XOR, corresponding to the E (for else) in the IT instruction, only executes if the EQ condition code is *not* set, as if you had written this:

(any instruction that sets the EQ condition code)
IP -> 00000100 ADD.EQ R0, R0, R1
      00000102 ST.EQ R5, [R3]
      00000104 XOR.NE R0, R0, R0

That might appear to raise interruptibility challenges as well: what happens if an interrupt has to be handled just after the ADD instruction, or when the ST instruction raises a page fault because the address at R3 must be paged back in? Because when execution resumes at ST, what is to stop XOR from being unconditionally executed?

The answer is ITSTATE, a 4-bit register that is part of the architectural state. What the IT instruction actually does is:

  • take its immediate bits (here, 110), and combine them using a negative-exclusive-or with the repeated condition code bit (we’re going to assume it is 111)
  • set ITSTATE to the result (here, 110), padding missing bits with ones (final result here being 1101)

And that’s it. What then happens is that nearly every T32 instruction (BKPT being a notable exception) starts operation by shifting out the first bit from ITSTATE (shifting in a 1 from the other side), and avoids performing any work if the shifted-out bit was 0.

This means you never need to explicitly invoke ITSTATE, but it is very much there, and in particular it is saved upon interrupt entry (which ARM calls an exception) and restored upon exception return, such that predicated processing can resume as if control had never been hijacked: upon exception return to the ST instruction, ST will execute, then XOR will not, since it will shift out a 0 from ITSTATE, the latter having been restored on exception return.
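
Here is a toy C model of that mechanism as described above (my own simplification, not the exact ARM encoding), using the ITTE EQ example from earlier: the IT instruction seeds the 4-bit shift register, and each subsequent instruction shifts one bit out to decide whether it executes; since this register is architectural, saving and restoring it across an exception preserves the remaining predication:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t itstate;   /* the 4 bits of architectural state */

/* IT instruction: combine the 3 immediate bits with the repeated
   condition bit using XNOR, then pad the missing low bit with a 1. */
static void it_instruction(uint8_t pattern3, bool cond)
{
    uint8_t bits = (uint8_t)(~(pattern3 ^ (cond ? 0x7 : 0x0)) & 0x7);
    itstate = (uint8_t)((bits << 1) | 0x1);
}

/* What (nearly) every following instruction does before any work:
   shift out the top bit, shift a 1 in, execute only if the bit was 1. */
static bool next_instruction_executes(void)
{
    bool execute = (itstate & 0x8) != 0;
    itstate = (uint8_t)(((itstate << 1) | 0x1) & 0xF);
    return execute;
}

int main(void)
{
    it_instruction(0x6 /* T T E = 110 */, true /* EQ holds */);
    printf("ADD executes: %d\n", next_instruction_executes());  /* 1 */
    printf("ST  executes: %d\n", next_instruction_executes());  /* 1 */
    printf("XOR executes: %d\n", next_instruction_executes());  /* 0 */
    return 0;
}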

The lesson is: any extra behavior we want to endow the processor with needs to be expressible as state, so that taking an interrupt, saving the state, and later restoring the state and resuming from a given instruction results in the desired behavior being maintained despite the interrupt.

Repeat next instruction(s) a number of times given by state

Instead of having REP N, let us have a counter register C, and a REP instruction which repeats the next instruction the number of times indicated in the register (we need two instructions for this, as we’re going to see):

IP -> 0000 MOV 4, C
      0001 REP C
      0002 MOV X++, Y++
      0003 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

Now if an interrupt occurs after two iterations, the state is simply going to be:

      0000 MOV 4, C
      0001 REP C
      0002 MOV X++, Y++
      0003 RTS

      0100 0001 (saved IP)
SP -> 0102 XXXX

With C equal to 2. Notice the saved IP points after the MOV to the counter register, but before the REP C; that way, when execution resumes on the REP instruction, the memory-to-memory MOV instruction is executed twice and the end state will be the same as if all four iterations had occurred in sequence without any interrupt.

Applied in: the 8086 and all subsequent x86 processors, where REP is called the REP prefix and is hardwired to the CX register (later ECX), and you can use it for memory copy by prefixing the MOVS instruction with it (an instruction which is itself hardwired to SI (later ESI) for its source, and DI (later EDI) for its destination).
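
For the curious, here is what using that looks like from C today, as a hedged sketch (GCC/Clang inline assembly, compiling only when targeting x86; the function name is mine): the register constraints make the hardwiring visible, with the count in (E/R)CX, the source in (E/R)SI and the destination in (E/R)DI:

#include <stddef.h>

static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
    /* "rep movsb" copies n bytes; the processor updates RDI, RSI and RCX
       as it goes, which is exactly what makes it interruptible/resumable. */
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
}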

Load/store multiple

The REP C instruction/prefix system has a number of drawbacks; in particular, as we just saw, in order to play well with interrupts it requires recognizing, when handling the interrupt, that we are in a special mode, then creating the conditions necessary for properly resuming execution. It also requires the elementary memory copy to be feasible as a single instruction, which is incompatible with RISC-style load-store architectures where an instruction can only load or store memory, not both.

We can observe that the REP C prefix, since it is only going to apply to a single instruction, will not serve many use cases anyway, so why not instead dedicate a handful of instructions to the memory copy use case and scale the PUL/PSH system with more registers?

That is the principle of the load and store multiple instructions. They take a list of registers on one side, and a register containing an address on the other, with additional adjustment modes (predecrement, postincrement, leave unchanged) so as to be less constrained than in the PUL/PSH case. Such a system requires increasing the number of registers in the architectural state so as to amortize the instruction decoding cost, an increase which is going to add to context switching costs, but we were going to do that anyway with RISC.

So now our fictional instruction sequence can look like this:

IP -> 0000 LOADM X++, A-B-C-D
      0001 STOM A-B-C-D, Y++
      0002 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

We still have to promptly handle interrupts, but for the load/store multiple system the solution is simple, if radical: if an interrupt occurs while such an instruction is partially executed, execution is abandoned in the middle, and it will resume at the start of the instruction when interrupt processing is done. This is OK, since these loads and stores are idempotent: restarting them will not be impacted by any previous partial execution they left over (of course, a change to the register used for the address, if any such change is required, is done as the last step, so that no interrupt can cause the instruction to be restarted once this is done).
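
Here is a hedged C model of why restarting is safe for our fictional LOADM (the structure and names are mine, purely illustrative): partial execution only overwrites destination registers, which a restart simply overwrites again, and the address register writeback is kept for the very last step:

#include <stdint.h>

#define NREGS 4

typedef struct {
    uint32_t r[NREGS];   /* A, B, C, D */
    uint32_t x;          /* the X address register */
} cpu_state;

/* LOADM X++, A-B-C-D: restartable because the loads are idempotent and
   the only non-idempotent step, advancing X, happens last. */
static void loadm_postincrement(cpu_state *cpu, const uint32_t *memory)
{
    for (int i = 0; i < NREGS; i++)
        cpu->r[i] = memory[(cpu->x / sizeof(uint32_t)) + i];  /* may fault: just restart */
    cpu->x += NREGS * sizeof(uint32_t);                       /* writeback happens last */
}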

Well, this is only mostly OK. For instance, some of the loaded registers may have relationships with one another, such as the stack pointer (in the ABI if not architecturally), and naively loading such a register with a load multiple instruction may violate the constraint if the load multiple instruction is restarted. Similar potentially deadly subtleties may exist, such as in relation with virtual memory page faults, where the operating system may have to emulate operation of the instruction… or may omit to do so, in which case load/store multiple instructions are not safe to use even if the processor supports them! I think that was the case for the PowerPC load/store multiple instructions (lmw/stmw) in Mac OS X.

Sidebar: how do you, as a software developer, know whether it is safe to use the load and store multiple instructions of a processor if it has them? An important principle of assembly programming is that you can’t go wrong by imitating the system C compiler, so compile this (or a variant) at -Os, or whichever option optimizes for code size, to asm:

#include <stddef.h>   /* for size_t */

struct package
{
    size_t a, b, c;
};

void packcopy(struct package* src, struct package* dst)
{
    *dst = *src;
}

if this compiles to a load multiple followed by a store multiple, then those instructions are safe to use for your system.

Applied in: the 68000 (MOVEM), PowerPC, ARM (where their use is pervasive, at least pre-ARM64), etc.

Decrement and branch if not zero

One observation you could make about the REP C system is that it is best seen as implicitly branching back to the start of the instruction each time it is done executing, so why not make that a plain old branch located after the instruction rather than a prefix? Of course, that branch would handle counter management so that it could still function as a repetition system contained in a single instruction, but now repetition can be handled with more common test+branch mechanisms, simplifying processor implementation especially as it relates to interrupt management, and it generalizes to loops involving more than one instruction, meaning there is no need to have the elementary copy be a single instruction:

IP -> 0000 MOV 4, C
loopstart:
      0001 LOAD X++, A
      0002 STO A, Y++
      0003 DBNZ C, loopstart
      0004 RTS

From that framework, you can design the processor to try and recognize cases where the body of the loop is only 1 or 2 instructions long, and handle those specially by no longer fetching or decoding instructions while the loop is ongoing: it instead repeats operation of the looped instructions. In that case it still needs to handle exiting that mode in case of an interrupt, but at least it can decide by itself whether it can afford to enter that mode: for instance, depending on the type of looped instruction it could decide it would not be able to cleanly exit in case of interrupt and decide to execute the loop the traditional way instead.

The drawback is that it is a bit all-or-nothing: the loop is either executed fully optimized or not at all, with the analysis becoming less and less trivial as we want to increase the number of looped instructions supported: regardless of the size of the loop, if there is a single instruction in the loop body, or a single instruction transition, that the engine would fail to set up to loop in a way where it can handle interrupts, then the whole loop is executed slowly. That being said, it does handle our target use case as specified.

Applied in: the 68010 and later 68k variants such as CPU32-based microcontrollers, where the DBRA/DBcc instruction could trigger a fast loop feature in which instruction fetches are halted and operation of the looped instruction is repeated according to the counter.

Instruction caches, pipelining, and branch prediction

You could look at the complexity of implementing interrupt processing in any of these features and consider that you could almost as easily implement a proper pipeline, including handling interrupts while instructions are in flight, and end up supporting not just this use case but also much more general speedups, just as efficiently. After all, the speed of memory copy is going to be constrained by the interface to the memory bus; your only contribution is to reduce as much as possible the instruction fetching and decoding overhead, which is accomplished if that happens in parallel with the memory copy of the previous instruction. Accomplishing that also requires a dedicated instruction cache so instructions can be fetched in parallel with data, but integrating a small amount of memory cells on your processor die is getting cheaper by the day. And keeping the pipeline fed when branches are involved, as here with loops, will require you to add non-trivial branch prediction, but you can at least get loops right with a simple “backwards branches are predicted to be taken” approach. And it turns out that simple branch predictors work well in real-life code for branches beyond loops, compensating for the effects of pipelining elsewhere (and if you make the predictor just a little bit more complex you can predict even better, and then a little more complexity will improve performance still, etc.; there is always a performance gain to be had).
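
As a hedged illustration of that “backwards branches are predicted to be taken” heuristic (the function is mine, a toy model rather than any real hardware), the whole predictor is a single comparison, with no history to track:

#include <stdbool.h>
#include <stdint.h>

/* Static prediction: loop-closing branches jump backwards, so predicting
   backward branches as taken gets loops right (except for the final exit). */
static bool predict_taken(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc < branch_pc;
}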

Applied in: all microprocessors used in general-purpose computing by now, starting in the beginning of the 90’s. For instance, x86 processors have an instruction implementing the decrement and branch if not zero functionality, but its use is now discouraged (see 3.5.1 Instruction Selection in the Intel optimization reference manual), and modern Intel processors recognise loops even when they use generic instructions and optimize for them, except for the loop exit which keeps being mispredicted.

With all that in mind, next time we’re going to be able to look at how to redesign our processors to avoid the situation that led us to rampant, insecure speculation in the first place.

What will the long-term solutions be to Meltdown and Spectre?

It’s hard to believe it has now been more than one year since the disclosure of Meltdown and Spectre. There was so much frenzy in the first days and weeks that it has perhaps obscured the fact that the solutions we currently have are temporary, barely secure, spackle-everywhere stopgap mitigations; now that the dust has settled on that, I thought I’d look at what researchers and other contributors have come up with in the last year to provide secure processors – without, of course, requiring all of us to rewrite all our software from scratch.

Context

Do I need to remind you of Meltdown and Spectre? No, of course not; even if you’re reading this 20 years from now you will have no trouble finding good intro material to these. So as we discuss solutions, my only contribution would be this: it is important to realize designers were not lazy. For instance, they did not “forget” the caches as part of undoing speculative work in the processor, as you can’t “undo” the effect of speculation on the caches: for one, how would you reload the data that was evicted (necessary in order to be a real undo)? You can’t really have checkpoints in the cache that you roll back to, either: SafeSpec explores that, and besides still leaking state, more importantly it precludes any kind of multi-core or multi-socket configuration (SafeSpec is incompatible with cache coherency protocols), a non-starter in this day and age (undoing cache state is also problematic in multi-core setups, as the cache state resulting from speculative execution would be transitorily visible to other cores).

It is also important to realize preventing aliasing in branch prediction tracking slots would not fundamentally solve anything: even if this was done, attackers could still poison BHS and possibly BTB by coercing the kernel into taking (resp. not taking) the attacked branch, through the execution of ordinary syscalls, and then use speculative execution driven by that to leak data through the caches.

Besides information specific to Meltdown and Spectre, my recommended reading before we begin is Ulrich Drepper on the modern computer memory architecture, still current, and Dan Luu on branch prediction: this will tell you the myriad places where processors generate and store implicit information needed for modern performance.

The goal

As opposed to the current mitigations, we need systemic protection against whole classes of attacks, not just the current ones: it’s not just that hardware cannot be patched, but it also has dramatically longer design cycles, which means protecting only against the risks known at the start of a project would make the protections obsolete by the time the hardware ships. And even if patching were a possibility, it’s not like the patch treadmill is desirable anyway (in particular, adding fences, etc. around manually identified vulnerable sequences feels completely insane to me and amounts to a dangerous game of whack-a-vulnerability: vulnerable sequences will end up being added to kernel code faster than security-conscious people will figure them out). Take, for instance, the Intel doc which described the Spectre (and Meltdown) vulnerability as being a variant of the “confused deputy”; this is its correct classification, but I feel this deputy is confused because he has been given responsibility for the effects of speculative execution of his code paths, a staggering responsibility he never requested in the first place! No, we need to attack these kinds of vulnerabilities at the root, such that they cannot spawn new heads, and the two techniques below do so.

DAWG

First is DAWG. The fundamental idea is very intriguing: it is designed to close off any kind of cache side channel¹, not merely tag state (that is, whether a value is present in the cache or not), and to close off data leaks regardless of which phenomenon would feed any such side channel: it is not limited to speculative execution. How do they ensure that? DAWG does so by having the OS dynamically partition all cache levels, and then assign the partitions, in a fashion similar to PCID.

This means that even with a single processor core, there are multiple caches at each level, one per trust domain, each separate from its siblings, and having a proportional fraction of the size and associativity of the corresponding physical cache of that level (cache line size and cache way size are unaffected). This piggybacks on recent evolutions (Intel CAT) to manage the cache as a resource to provision, but CAT is intended for QoS and offers limited security guarantees.

As long as data stays within its trust domain, that is all there is to it. When cross-partition data transfer is necessary, however, the kernel performs it by first setting up a mixed context where reads are to be considered as belonging to one domain, but writes to another, then performing the reads and writes themselves: this affords the best possible cache usage during and after the transfer.

Such an organization raises a number of sub-problems, but they seem to have done a good job of addressing those. For instance, since each cache level is effectively partitioned, the same cache line may be present in multiple places in the same physical cache level, in different domains, which is not traditional and requires some care in the implementation. The kernel has access to separate controls for where evictions can happen and where hits can happen; this is necessary during a transition period whenever the partitions are resized. DAWG integrates itself with cache coherency protocols by having each cache partition behave mostly, but not exactly, like a logically separate cache for cache coherency purposes: one particularly important limitation we will come back to is that DAWG cannot handle a trust domain attempting to load a line for writing when a different domain already owns that line for writing.
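
To give a flavor of what way-partitioning means in practice, here is a hedged C sketch (my own toy model, not the DAWG hardware: the names, masks and LRU representation are made up) in which each trust domain gets a mask of the ways it may fill into and a mask of the ways it may hit in, so a fill by one domain can never evict another domain’s lines:

#include <stdint.h>

#define NUM_WAYS 16   /* e.g. a 16-way set-associative last-level cache */

typedef struct {
    uint16_t fill_mask;   /* ways this domain may allocate (evict) into */
    uint16_t hit_mask;    /* ways this domain may hit in */
} domain_policy;

/* Pick a victim way for a fill by this domain: only ways inside its
   fill_mask are candidates. lru_order lists ways from least to most
   recently used for the set being filled. */
static int pick_victim_way(domain_policy d, const int lru_order[NUM_WAYS])
{
    for (int i = 0; i < NUM_WAYS; i++) {
        int way = lru_order[i];
        if (d.fill_mask & (1u << way))
            return way;
    }
    return -1;   /* empty mask: this domain may not fill here */
}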

In terms of functional insertion, they have a clever design where they interpose in a limited number of cache operations so as not to insert themselves in the most latency-critical parts (tag detection, hit multiplexing, etc.). It requires some integration with the cache replacement algorithm, and they show how to do so with tree-PLRU (Pseudo Least Recently Used) and NRU (Not Recently Used).

In terms of features, DAWG allows sharing memory read-only, CoW (Copy on Write), and one-way shared memory where only one trust domain can have write access. DAWG only features a modest efficiency hit compared to the insecure baseline, though it depends on policy (CAT has similar policy-dependent behavior).

On the other hand, there are a few, though manageable, limitations.

  • DAWG disallows sharing physical memory between different trust domains where more than one domain has write access, due to the impossibility of managing cache coherence when more than one domain wants to write to two cache lines corresponding to the same physical address. I feel this is manageable: such a technique is probably extremely hard to secure anyway, given the possibility of a side channel through cache coherency management state, as MeltdownPrime and SpectrePrime have demonstrated, so we would need an overview of the main uses where such memory sharing happens; off the top of my head, the typical use is the framebuffer used for IPC with WindowServer/X11, in which case the need in the first place is only for one-way transfer, and the solution would be to change permissions to restrict write rights to one side only.
  • DAWG provides no solution for transfer in/out of shared physical memory between different trust domains where neither is the kernel. But as we just saw, the allocation of such a thing need only be done by specific processes (perhaps those marked with a specific sandbox permission?), and transfer could be performed by the kernel on behalf of the allocating domain through a new syscall.
  • Hot data outside the kernel, such as oft-called functions in shared libraries (think objc_msgSend()), while residing in a single place in physical memory, would end up being copied into every cache partition, thus reducing the effective capacity of all physical caches (hot data from the kernel would only need to be present in the kernel partition, regardless of which process makes the syscall).
  • Efficient operation relies on the kernel managing the partitioning according to the needs of each trust domain, which is not trivial: partition ID management could be done in a fashion similar to PCID, but that still leaves the determination of partition sizes, keeping in mind that the cache at every level needs to be partitioned, including those shared between different cores, which therefore have more clients and thus require more partitioning; and all of this with limited granularity, a granularity which depends on the level: a 16-way set associative cache may be partitioned in increments of 1/16th of its capacity, but a 4-way set associative cache only by fourths of its capacity. Easy.
  • DAWG guards between explicit trust domains, so it cannot protect against an attacker in the same process. This could be mitigated by everyone adopting the Chrome method: sorry Robert, but maybe “mixing code with different trust labels in the same address space” needs to become a black art.

InvisiSpec

The basic idea of InvisiSpec corresponds to the avenue I evoked back then, which is that speculative loads only bring data to the processor without affecting cache state (whether by bringing that data to a cache level where it wasn’t, by modifying the cache replacement policy, or by updating other metadata), with the cache being updated only when the load is validated.

Well, that’s it, good job everyone? Of course not, the devil is in the details, including some I never began to suspect: validation cannot happen just any random way. InvisiSpec details how this is done in practice, the main technique being special loads performed solely for validation purposes: once loaded, the processor only uses this data, if ever, to compare it against the speculatively loaded data kept in a speculation buffer, and if the values match, processing can proceed; and while you would think that could raise ABA issues, it is not the case, as we’re going to see.

Overall, InvisiSpec proposes a very interesting model of a modern processor: first, a speculation part that performs computations while “playing pretend”: it doesn’t matter at that point whether the data is correct (of course, it needs to be correct most of the time to serve any purpose). Then there is the reorder buffer part, which can be seen as the “real” processing that executes according to the sequential model of the processor, except it uses results already computed by the speculative engine, when they exist. In fact, if these results don’t exist (e.g. the data was invalidated), the reorder buffer has to have the speculative engine retry, and the reorder buffer waits for it to be done: it does not execute the calculations (ALU, etc.) inline. A third part makes sure to quickly (i.e. with low latency) feed the speculative engine with data that is right most of the time, and to do so invisibly: loads performed by the speculative engine can fetch from the caches but do not populate any cache, and are instead stored in the speculation buffer, in order to remember which inputs any results were obtained from.

This model piggybacks on existing infrastructure of current out of order processors: the reorder buffer is already the part in charge of making sure instructions “retire”, i.e. commit their effect, in order; in particular, on x86 processors the reorder buffer is responsible for invalidating loads executed out of order, including instructions after those, when it detects cache coherence traffic that invalidates the corresponding cache line. Ever wondered how x86 processors could maintain a strongly ordered memory model while executing instructions out of order? Now you know.

InvisiSpec has to do much more, however, as it cannot rely on cache coherence traffic: since the initial load is invisible, by design, other caches are allowed to think they have exclusive access (Modified/Exclusive/Shared/Invalid, or MESI, model) and won’t externally signal any change. Therefore, if the memory ordering model stipulates that loads must appear to occur in order, then it is necessary for the reorder buffer to perform a full validation, i.e. not only must it perform a fresh, new, non-speculative load as if the load was executed for the first time (thus allowing the caches to be populated), but then it has to wait for it to complete and compare the loaded data with the speculatively loaded one; if they are equal, then the results precomputed by the speculative engine for the downstream computations are correct as well, and the reorder buffer can proceed with these instructions: it does not even matter if A compared equal to A but the memory cell held the value B in between, as the only thing that matters is whether the downstream computation is valid for value A, which is true if and only if the speculative engine was fed an equal value A when it executed.
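
Here is a toy C model of that commit-time validation as I understand it (my own simplification, not the InvisiSpec RTL; the names and the simulated memory are made up): the speculative value is kept in a buffer, and at commit a fresh, cache-visible load is compared against it, with an A→B→A change in between being harmless:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t memory[16];   /* stand-in for RAM */

typedef struct {
    unsigned addr;
    uint64_t spec_value;      /* value the speculative engine computed with */
} spec_entry;

/* Commit-time validation of one speculatively executed load: issue a
   normal, cache-visible load in program order and compare. */
static bool validate(spec_entry e)
{
    uint64_t fresh = memory[e.addr];
    /* If fresh == spec_value, every result computed downstream from
       spec_value remains valid, whatever happened to the cell in between. */
    return fresh == e.spec_value;   /* mismatch => squash and replay */
}

int main(void)
{
    memory[3] = 42;
    spec_entry e = { .addr = 3, .spec_value = 42 };
    printf("retire? %s\n", validate(e) ? "yes" : "squash and replay");
    return 0;
}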

All of this leads to a much more tractable model for security: as far as leaking state is concerned, security researchers only need to look at the operation of the reorder buffer; on the other hand, performance engineers will mostly look at the upstream parts, to make sure speculation is invalidated as rarely as possible, but still look at the reorder buffer to make sure validation latencies will be covered, as far as possible.

Notably, InvisiSpec protects against attackers living in the same address space or trust boundary, and since it is cache-agnostic, it does not restrict memory sharing in any way.

The following limitations can be noted in InvisiSpec:

  • InvisiSpec only protects against speculation-related attacks, not other kinds of attacks that also use the cache as a side channel. Additional techniques will be needed for those.
  • InvisiSpec adds a significant efficiency hit compared to the insecure baseline, both in execution time (22% to 80% increase on SPEC benchmarks, lower is better) and cache traffic (34% to 60% increase on SPEC benchmarks, lower is better), the latter of which is one of the main drivers of power usage. That will need to be improved before people switch to a secure processor, otherwise they will keep using “good enough” mitigations; more work is needed in that area. My analysis would be that most of that efficiency hit is related to the requirement to protect against an attacker in the same address space: any pair of loads could be an attacker/victim pair! The result is that pipelining is mostly defeated to the extent it is used to cover load latencies. I am skeptical with regard to their suggestion for the processor to disable interrupts after a load has committed, and until the next load gets to commit, so as to allow the latter to start validation early (disabling interrupts serves to remove the last potential source of events that could prevent the latter load from committing): this would add an important constraint to interrupt management, which furthermore is unlikely to compose well with similar constraints.

The future

This isn’t the last we will hear of work needed to secure processors post-Meltdown and Spectre; I am sure novel techniques will be proposed. At any rate, we in the computing industry as a whole need to start demanding of Intel and others what systemic protections they are putting in their processors, be they DAWG or InvisiSpec or something else, which will ensure whole classes of attacks become impossible.


  1. At least on the digital side: DAWG does not guard against power usage or electromagnetic radiation leaks, or rowhammer-like attacks.

Software Reenchantment

I’ve had Nikitonsky’s Software Disenchantment post in my mind ever since it was posted… which was four months ago. It’s fair to say I’m obsessed and need to get the following thoughts out of my system.

First because it resonated with me, of course. I recognize many of the situations he describes, and I share many of his concerns. There is no doubt that many evolutions in software development seem incongruous and at odds with recommendations for writing reliable software. The increasing complexity of the software stack, in particular, is undoubtedly a recipe for bugs to thrive, able to hide in that complexity.

Yet some of that complexification, however controversial, is nevertheless progress. The example that comes to mind is Chrome, and more specifically its architecture of running each tab (HTML rendering, JavaScript, etc.) in its own process for reliability and security, and the related decision to develop a high-performance JavaScript engine, V8, that dynamically compiles JavaScript to native code and runs that (if you need a refresher, Scott McCloud’s comic is still relevant). Yes, this makes Chrome a resource hog, and initially I was skeptical about the need: the JavaScript engine controls the generated code, so if it did its work correctly, it would properly enforce same-origin and other restrictions, without the need for the per-tab process architecture and its overhead (creation, memory occupation, etc.) of many shell processes.

But later on I started seeing things differently. It is clear that browser developers have been, for the last few years, engaged in a competition for performance, features, etc., even if they don’t all favor the same benchmarks. In that fast-paced environment, there would be a hard dilemma between going for features and performance at the risk of bugs, especially security vulnerabilities, slipping through the cracks, or moving at a more careful pace, at the risk of being left behind by more innovative browsers and being marginalized; and even if your competitor’s vulnerabilities end up catching up with them in the long term, that still leaves enough time for your browser to be so marginalized that it cannot recover. We’re not far from a variant of the prisoner’s dilemma. Chrome resolved that dilemma by going for performance and features, while at the same time investing up front in an architecture that provides a safety net, so that a single vulnerability doesn’t yet mean the attacker can escape the jail, and bugs of other kinds are mitigated. This frees the developers working on most of the browser code, in particular on the JavaScript engine, from excessively needing to worry about security and bugs, with the few people having the most expertise on that instead working on the sandbox architecture of the browser.

So that’s good for the browser ecosystem, but the benefits extend beyond that: indeed, the one-upmanship from this competition will also democratize software development. Look, C/C++ is my whole career, I love system languages, there are many things you can do only using them even in the applicative space (e.g. real-time code such as for A/V applications), and I intend to work in system languages as long as I possibly can. But I realize these languages, C/C++ in particular, have an unforgiving macho “it’s your fault you failed, you should have been more careful” attitude that makes them unsuitable for most people. Chrome, and the other high-performance browsers that the others have become since then, vastly extend the opportunities of JavaScript development, which is now starting to be credible for many kinds of desktop-like applications. JavaScript has many faults, but it is also vastly more forgiving than C/C++, if only by virtue of providing memory safety and garbage collection. And most web users can discover JavaScript by themselves with “view source”. Even if C/C++ isn’t the only game in town for application development (Java and C# are somewhat more approachable, for instance), this nevertheless removes quite a hurdle to starting application development, and this can only be a good thing.

And of course, the per-tab process architecture of Chrome really means it ends up piggybacking on the well-understood process separation mechanism of the OS, itself relying on the privilege separation feature of the processor, and after Meltdown and Spectre it would seem this bedrock is more fragile than we thought… but process separation still looks like a smart choice even in this context, as a long-term solution will be provided in newer processors for code running in different address spaces (at the cost of more address space separation, itself mitigated by features such as PCID), while running untrusted code in the same address space will have no such solution and is going to become more and more of a black art.

So I hope that, if you considered Chrome to be bloated, you realize now it’s not so clear-cut. So more complexity can be a good thing. On the other hand, I have an inkling that the piling on of dependencies in the npm world in general, and in web development specifically, is soon going to be unsustainable, but I’d love to be shown wrong. We need to take a long, hard look at that area in general.

So yes, it’s becoming harder to even tell if we software engineers can be proud of the current state of our discipline or not. So what can be done to make the situation clearer, and if necessary, improve it?

First, we need to treat software engineering (and processor engineering, while we’re at it) just like any other engineering discipline, by requiring software developers to be licensed in order to work on actual software. Just like a public works engineer needs to be licensed before he can design bridges, a software engineer would need to be licensed before he can design software that handles personal data, with the requirement repeating down to the dependencies: for this purpose, only library software that has itself been developed by licensed software engineers could be used. We would need to grandfather in existing software, of course, but this is necessary because software mistakes, while (generally) not directly lethal, can be just as disruptive to society as a whole when personal data massively leaks. Making software development require a license would in particular provide some protection against pressure from the hierarchy and other stakeholders such as marketing, a necessary prerequisite enabling designers to say “No” to unethical requests.

Second, we need philosophers (either coming from our ranks or outsiders) taking a long hard look at the way we do things and trying to make sense of it, to even figure out the questions that need asking, for instance, so that we can be better informed of what we as a discipline need to work on. I don’t know of anyone right now doing this very important job.

These, to me, are the first prerequisites to an eventual software reenchantment.

Factory Hiro on iPad is the real deal

I tested for you Factory Hiro, the latest attempt at remaking Factory: The Industrial Devolution on iPad (and iPhone, though I tested it on iPad). After the previous attempt towards that goal (if it could even be called an attempt), caution was certainly warranted, after all. So is it worth it?

The short answer: yes. Factory Hiro is the real deal, buy and download it without fear, it is a worthy remake of the original, finally available with a touch interface where it can shine.

The longer answer: compared to the original, graphics have obviously been remade, but everything in the gameplay will otherwise be familiar. Redirection boxes that you tap to toggle between vertical and horizontal direction, the trash destination and the recycle stream (and of course the default final destination: the delivery truck), assembling steps that you sometimes have to turn on or off, etc. It’s all there. In particular, the regulation of assembly line speed so you get your quota done in time for the end of the day, but not so fast that you end up being overwhelmed by the oncoming components to manage, is still the fundamental challenge of the game.

A nicety was added, though: when some trash gets generated (“Oh No!”), the speed will automatically switch to slowest for you, so it is much easier to manage these crises when they come. Other differences exist, but are less significant.

In non-gameplay aspects, the story was changed as well; these days, of course, it is told through cutscenes (created by KC Green) depicting the titular Hiro proving to his bosses that yes, he can get the job done. And you, will you succeed?

Factory Hiro is available on the iOS App Store, as well as on the Google Play store for the Android version, and on PC and Mac through Steam (I got it for 3.49€ on the French iOS App Store; pricing will depend on your region); it was reviewed on an iPad Air 2 running iOS 11.4.1.


P.S.: In similar nostalgia-inspired discoveries, I should mention that I can’t believe I went such a long time without being made aware of Contraption Maker, from the creators of The Incredible Machine; this matters beyond nostalgia, as The Incredible Machine is one of the best pedagogical tools disguised as a game that I have ever come across, and Contraption Maker is a worthy successor, improving it on many points such as the addition of rotation physics (want to make a cat flap? You can now.) or more digital logic elements for laser computing than you can shake a stick at.

And the last such discovery is for Two Point Hospital, in which you will find everything you liked from the original Theme Hospital; I haven’t played it yet, but if you’ve played the original and the trailer doesn’t convince you, I don’t know what will.

Copland 2020

Five years ago, I predicted that Apple would port iOS to the desktop, with a compatibility mode to run existing Mac OS X programs; we are now at the deadline I later set for this prediction, so did Apple announce such a thing at WWDC 2018?

[Image: a big “No” on the WWDC 2018 keynote stage, with Craig Federighi on the left]

Now I think it is worth going into what did come to pass. Apple is definitely going to port UIKit, an important part of iOS, to the desktop for wide usage; and these last few years have seen the possibilities improve for distribution of iOS apps outside the iOS App Store or Apple review, though they remain limited.

But beyond that? I got it all wrong. The norms established by the current Mac application base will remain, with apps ported from the mobile world being only side additions; there will still be no encouragement for sandboxing beyond the clout of the Mac App Store; pointing with pixel accuracy will remain the expectation; most of iOS will remain unported; etc. You are all free to point and laugh at me.

And I can’t help but think: what a missed opportunity.

For instance, in an attempt to revitalize the Mac App Store, Apple announced expanded sandboxing entitlements, with developers on board pledging to put their apps on the store. Besides the fact that some aspects of the store make me wary of it in general, I can’t help but note that this sandboxing effort has been dragging on for years, such that it ought to have been handled as a transition in the first place; it may have been handled as such from a technical standpoint, but definitely not from a platform-management standpoint (I mean, can you tell whether any given app you run is sandboxed? I can’t), even though that could be a differentiator for Apple in this era of generalized privacy violations. Oh, and that is assuming the announced apps eventually manage to be sandboxed this time, which is far from certain: I still remember, back in 2012, the saga of apps that were being reworked to be sandboxed, only for the developers to eventually have to give up…
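To be clear, the information does exist in the system; a developer can query it for their own process through the Security framework, as in the minimal Swift sketch below (my own illustration of one way to do it, not anything Apple puts forward to users). But nothing of the sort is surfaced to the person actually running the app.

    import Foundation
    import Security

    // Minimal sketch: ask the Security framework whether the current
    // process carries the App Sandbox entitlement. This only answers
    // the question for your own process; it is not surfaced to users.
    func currentProcessIsSandboxed() -> Bool {
        guard let task = SecTaskCreateFromSelf(nil) else { return false }
        let value = SecTaskCopyValueForEntitlement(
            task, "com.apple.security.app-sandbox" as CFString, nil)
        return (value as? NSNumber)?.boolValue ?? false
    }

    print("Sandboxed: \(currentProcessIsSandboxed())")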

I mean, Apple could have done it: the user base was in the right mindset (I did not see a lot of negative reactions when the press broke the news of a unified desktop user experience a few months ago, which turned out to be Marzipan, the initiative to run UIKit on the desktop); developers would obviously have been less enthusiastic, but could have been convinced with a “benefit from being one of the first native apps!” incentive; influential users could have been convinced by selling it as a privacy improvement (remember: in Mac OS X, unsandboxed apps can still read all the data of sandboxed apps); etc. But this year Apple explicitly prompted the question in order to answer it in the negative, meaning no such thing is going to happen in the next few years, and that it is in fact investing in alternative solutions like Marzipan instead. Sorry Apple, but I can’t help but think of Copland 2020.

Bruno Le Maire, Apple, and Google

A quickie because everyone’s mentioning it: France’s finance minister Bruno Le Maire announced he’d be suing Apple and Google for anticompetitive behavior with regard to people developing apps for their respective smartphone platforms. Here is the source (via).

The translation in the Bloomberg article (via) is more or less correct, with the exception that Bruno Le Maire only mentions “the data”, not “all their data”. So you’re not missing out on much by not being able to listen to him in the original French.

Now as to the contents of the announcement… I sure hope the actual suit is drawn from better information than what we’ve been given here, because while I’m on the record as deeming the current system of exclusive distribution through an app store (something Google isn’t even guilty of) unsustainable in the long run, a suit can only hope to improve the situation if Apple is blamed for things it is actually doing. For instance, developers do not sell their wares to Apple (or Google) by any definition of that word; they do have to use a price grid, but have full latitude to pick any spot in that grid; and Apple, at least, does not get that much data from apps.

Beyond that, I believe there could be a case here, but I see many more matters where Apple could be, shall we say, convinced to act better through public action from consumer protection agencies, antitrust agencies, or tax enforcement agencies. Why aren’t there announcements about those too? At any rate, I’m not going to breathlessly pin my hopes on that action until I get more, and more serious, information about it.

On the management of the performance of older iPhone models

When the news that Apple had admitted to capping the performance of older iPhone models (when it detects that their battery may no longer be able to withstand the necessary power draw) reached the (French) morning radio news I usually listen to, I knew Apple had a PR crisis of epic proportions on its hands.

Much has been written about this covert update to these iPhones, but the most important lesson here is the illustration that, once again, Apple completely and utterly fails at communication. If this PR crisis does not serve the higher echelons as a wake-up call for Apple to take communication seriously from now on, what will?

Let us back up a bit, to the time when Apple engineering was devising this solution for the (real) issue of some devices spontaneously resetting because the battery could not sustain the instantaneous power draw. We know this was a project of some sort, in particular because the solution was rolled out for some models first, then for others, etc. Such a thing was obviously documented internally: it is an important change of behavior that their own QA teams would notice when qualifying the software for release, and it resolves a support issue, so customer support was obviously in the loop to provide feedback on which compromises were acceptable, etc. And yet, at the end of the day, when the fix is about to land widely in people’s phones, the system inside Apple is so completely stuck on secrecy that not even an extremely expurgated version of this documentation makes it out the door? What is wrong with you people?

As a developer-adjacent person, I can see many developers being affected by this when they can’t understand why the performance of their product has regressed, but this pales in comparison to the general public’s perception of the whole affair… Indeed, many people perceive (with or without justification, it doesn’t matter) their computing products as slowing down over time, and given the lack of an easy-to-pin-down cause (I think it’s partly perception, and partly the compounded effect of multiple reasonable engineering tradeoffs), they are more than ready to embrace conspiracy theories about it. And now you’ve given them an argument that will feed those conspiracy theories for decades! Even for issues completely unrelated to this performance capping (e.g. more pervasive performance pathologies). Stellar job here, Apple! Engineering-wise, I can’t fault the technical solution; however, faced with someone affected by the issue, I can’t see how I or anyone, even an army of Apple geniuses, could credibly defend Apple. For all of us who trust Apple to make reasonable compromises on our behalf (because we want to buy a computer/tablet/phone, not a computer/tablet/phone-tinkering hobby), and who told the people around us to place the same trust, you have damaged our credibility; and yes, that is something that should worry you, Apple, because our credibility is ultimately part of your brand.

On this particular issue, now that it has reached criticality, I think Apple made the right call in offering to replace batteries for $29 when they are degraded enough to trigger the capping; but of course that reeks of PR damage control, because that is (also) exactly what it is, so the bigger question of Apple’s ability to communicate remains open.

And I can’t close without saluting the work by the Geekbench crew to narrow down and confirm this effect. I’ve been critical of their product in the past, and I still am, but I have to recognize that it helped bring this behavior of Apple’s to light, and I am thankful to them for that.

iOS 11 and its built-in App Store: weeks six and seven

Following the disruptive changes in iOS 11 and the fact that we will have to use its redesigned App Store app going forward, I am diving in headfirst and documenting my experience as it occurs on this blog. (previous)

Not much to report on for these two weeks, as I’ve been very busy covering the Saint-Malo comics festival for Fleen, then writing up the reports, then making up for lost time on the day job…

Though in a sense, that is an experience I can report on, as I covered this festival, and in particular took notes, exclusively using an iPhone (5S), an iPad Air 2, and an Apple Wireless Keyboard, and the setup worked very well. Half of the reports themselves were also typed up on the iPad. The notes were taken, appropriately enough, in the Notes app, and the reports typed up directly in the Mail app (I used one or two additional apps, e.g. iBooks to keep offline access to the festival schedule). It is hard to say how much I benefitted from the new system functionality, especially as it relates to multitasking, compared to what was already present in iOS 10, but on the other hand I feel it served me well, with no regression.

  • I did have to relearn how to put two apps side by side, here Notes and Mail, but that was only a small learning bump.
  • The system generally does not allow pasting text without its formatting… which is an issue when composing an email from data copied from many different sources. Get on that, Apple.
  • In the Mail app, the text editor would turn my quotes into French guillemets («»), and while I can explicitly specify straight quotes on the virtual keyboard with a long tap, I have not found any way to do so with a physical keyboard… So I left them in; my editor had to contend with them when editing my piece for publication.

iOS 11 and its built-in App Store: weeks four and five

Following the disruptive changes in iOS 11 and the fact that we will have to use its redesigned App Store app going forward, I am diving in headfirst and documenting my experience as it occurs on this blog. (previous, next)

  • Last time, I forgot to mention that not only did the version number go directly from 11.0 to 11.0.2, but the latter was itself quickly superseded by version 11.0.3 (a software update which I still perform through my Mac, out of paranoid caution, and which regardless requires quite a bit of downloading). I wonder what happened there…
  • I use ellipses (…) quite often, and it took me some time to realize why I sometimes couldn’t find them on the iPad: they are gone from the French keyboard… I have to stick to the English one.
  • I haven’t managed to transfer old apps to older devices yet, but what I have done is uninstall a number of apps, especially ones that get updated often (Candy Crush, anyone?). This has resulted in a notable decrease in the number of apps I have to update at the end of the day, which is both a relief and a reduction in download needs.
  • Speaking of which, the storage preferences attempt to provide the date each app was last used, but they do not take into account the use of an app through its extensions, for instance (which include a provided keyboard, a share sheet extension, etc.).