What are the very long-term solutions to Meltdown and Spectre going to look like?

While the long-term solutions to Meltdown and Spectre we saw earlier can and probably will be made to work on mainstream hardware, their inefficiencies call into question the fundamental operation of our processors: rather than try and contain speculation, why not avoid the need for speculation in the first place? Or, on the contrary, embrace it? But revisiting the design space of processors requires taking a long look at why they work the way they do, and whether the underlying assumptions still hold.

Many solutions to Spectre rely on a model of the speculative execution behavior of the processor, e.g. “Modern CPUs do not speculate on bit masking”. That is fine for short-term mitigations: we can’t really throw away every single computer and computing device and start over from first principles at this point. Inspired by this, some suggest we should officially take the speculative behavior of the processor into account when developing and compiling software, typically by having the processor expose it as part of its model. This is completely untenable long-term: as processors evolve and keep relying on speculation to improve performance, they will end up being constrained by the model of speculative execution they initially exposed and that existing programs started relying on, i.e. future processors will have to refrain from speculating even an iota more than they currently do (while still speculating, so we will have to keep maintaining the mitigations). Unless they do speculate more, in which case we will have to reanalyze all code and update or add mitigations every time a new processor that improves speculation comes out (not to mention the inherent window of vulnerability). Possibly both. The story of delay slots taught us as much.
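
To make the “bit masking” assumption concrete, here is a minimal sketch of the kind of branchless index clamping such mitigations rely on. The names index_mask and load_clamped are hypothetical, invented for this illustration; the idea is similar in spirit to what some kernels do, but this is not any particular project's implementation. In particular, a production version has to stop the compiler from proving the mask is always all-ones and optimizing it away, which this sketch does not attempt.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns all-ones when index < size and zero
 * otherwise, using arithmetic only (no branch to mispredict).
 * Assumes 0 < size <= SIZE_MAX / 2. */
static size_t index_mask(size_t index, size_t size)
{
    const unsigned top = sizeof(size_t) * CHAR_BIT - 1;
    /* The top bit of (index | (size - 1 - index)) is set exactly when
     * index is out of range: either index itself is huge, or the
     * subtraction wrapped around because index >= size. */
    size_t out_of_range = (index | (size - 1 - index)) >> top;
    return out_of_range - 1; /* 1 -> 0, 0 -> all-ones (unsigned wrap) */
}

/* Hypothetical bounds-checked load: the usual branch, then the mask. */
static uint8_t load_clamped(const uint8_t *table, size_t size, size_t index)
{
    assert(table != NULL && size > 0 && size <= SIZE_MAX / 2);
    if (index >= size)
        return 0;
    /* Architecturally a no-op here; if the branch above was
     * mispredicted, the index is forced to 0 before the load. */
    index &= index_mask(index, size);
    return table[index];
}
```

The whole scheme only holds as long as the processor really does not speculate on the result of the masking arithmetic, which is exactly the kind of assumption about speculative behavior discussed above.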

In case you’re new to the concept of delay slots, they were a technique introduced by Berkeley RISC to allow pipelining instructions while avoiding bubbles every time a branch was encountered: the instruction(s) following the branch would be executed no matter what (slide 156 out of 378), and any jump would only occur after that. Some RISC architectures such as MIPS and SPARC used them, and it worked well. That is, until it was time to create the successor design, with a longer pipeline. You can’t add another delay slot, since that would break compatibility with all existing code, so you instead compensate in other ways (branch prediction, etc.), such that delay slots are no longer useful; but you still have to support the existing delay slots, since doing otherwise would also break compatibility. In the end, every future version of your processor ends up having to simulate the original pipeline because you encoded that as part of the ISA. Oh, and when I said delay slots initially worked well, that was a lie: if your processor takes an interrupt just after a branch and before a delay slot, how can the work be resumed? Returning from the interrupt cannot simply be a jump back to the address of the instruction after the branch, since there is also the delayed branch state to take care of; solutions to these issues were found, obviously, but they significantly complicate interrupt handling.
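
The interrupt problem is easier to see on a toy machine. The sketch below is entirely made up for illustration (the three-opcode ISA, the struct cpu layout and the function names are all assumptions); it is loosely modeled on keeping a pair of program counters, the current one and the next one, as SPARC does with PC and nPC. It takes an “interrupt” right after a branch and before its delay slot, then resumes either from the full saved state or from the program counter alone.

```c
#include <assert.h>
#include <stddef.h>

/* Toy ISA, invented for this sketch. */
enum op { OP_ADD, OP_BRANCH, OP_HALT };
struct insn { enum op op; int arg; };

/* pc: instruction executing now; npc: the one executing right after. */
struct cpu { size_t pc, npc; int acc; int halted; };

enum { PROG_LEN = 5 };
static const struct insn prog[PROG_LEN] = {
    { OP_BRANCH, 4 },  /* 0: jump to 4, but only after the delay slot */
    { OP_ADD, 1 },     /* 1: delay slot, always executed */
    { OP_ADD, 10 },    /* 2: skipped */
    { OP_ADD, 100 },   /* 3: skipped */
    { OP_HALT, 0 },    /* 4 */
};

static void step(struct cpu *c)
{
    assert(c->pc < PROG_LEN);
    struct insn in = prog[c->pc];
    size_t after_next = c->npc + 1;
    switch (in.op) {
    case OP_ADD:    c->acc += in.arg; break;
    case OP_BRANCH: after_next = (size_t)in.arg; break; /* delayed jump */
    case OP_HALT:   c->halted = 1; return;
    }
    c->pc = c->npc;
    c->npc = after_next;
}

static int run_to_halt(struct cpu c)
{
    while (!c.halted)
        step(&c);
    return c.acc;
}

/* Take an "interrupt" after the branch and before its delay slot,
 * then resume either from the full state or from the pc alone. */
static int resume_after_interrupt(int save_npc)
{
    struct cpu c = { 0, 1, 0, 0 };
    step(&c);                       /* executes the branch: pc=1, npc=4 */
    struct cpu saved = c;
    if (!save_npc)
        saved.npc = saved.pc + 1;   /* naive handler: delayed branch lost */
    return run_to_halt(saved);
}
```

Resuming with both counters yields the same accumulator as an uninterrupted run, while resuming from pc alone falls through into the instructions the branch was supposed to skip. That extra piece of state is what every interrupt handler on such an architecture has to save and restore.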

A separate insight of RISC in that area is that, if we could choose which instructions write to flags, then we could prepare flags well ahead of when they are needed, letting the processor know ahead of time whether the branch will be taken, early enough that the next instruction can be fetched without any stall, removing the need for either delay slots or branch prediction. That is often useful for “runtime configurable” code, and sometimes for more dynamic situations; however, in many cases the compiler has few or no instructions to put between the test and the branch. So while it can be a useful tool, it also has many of the same issues as delay slots. On the plus side, compared to delay slots, it does not have the exception-related issues; the compiler can place as many instructions in there as the program allows, rather than being limited by the architecturally-defined delay slot size; and conversely, if it can’t, it does not have to waste memory filling the slots with nops. On the minus side, as pipelines get deeper there will be more and more cases where processors have to resort to branch prediction and speculative execution to keep themselves fed when running existing code, using knowledge that is only available at runtime, since the processor has access to context the compiler does not have. Furthermore, the internal RISC engine of current x86 processors actually goes as far as fusing conditional branches together with the preceding test or comparison instruction, suggesting such a fused instruction is more amenable to dynamic scheduling.
RISC-V has an interesting approach: it does away with flags entirely (not even multiple flag registers, as in PPC with cr0 to cr7), using such fused instructions instead… but it is still possible to compute the result of a test well ahead of the branch that requires it, simply by setting a regular register to 0 or 1 depending on the test outcome, then having the fused branch’s test be a comparison of this register with 0; presumably implementations will be able to take advantage of this.
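
In source terms, the RISC-V pattern described above looks roughly like the following sketch. The function and its name are hypothetical, and the RISC-V instructions in the comments (slt, bnez) only indicate what a compiler could plausibly emit: the compiler remains free to schedule things differently, so this shows the shape of the idea rather than guaranteed code generation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sums v[0..n), negating the total when a < b. */
static long sum_maybe_negated(const long *v, size_t n, long a, long b)
{
    assert(v != NULL || n == 0);

    /* The test outcome is materialized early, as 0 or 1, in an ordinary
     * variable; on RISC-V this could be something like: slt t0, a2, a3 */
    long negate = (a < b);

    /* Unrelated work sits between the test and the branch. */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];

    /* The branch itself only compares that value with zero,
     * e.g. something like: bnez t0, <label> */
    if (negate != 0)
        sum = -sum;
    return sum;
}
```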

Generally, there is an unavoidable tension due to the straitjacket of sequential instruction execution, a straitjacket which is itself unavoidable due to the need to be able to suspend processing, then resume where it left off. How could we better express our computations in such a way that hardware can execute a lot of them at once, in parallel, while they are formally laid out as a sequence of instructions? For instance, while it could be desirable to have vectors of arbitrary length rather than fixed-size ones (as in MMX, Altivec, SSE, NEON, etc.), doing so raises important interruptibility challenges: Seymour Cray’s designs, whether at CDC or at Cray, did not support interrupts or demand-paged virtual memory! If we give up on those, we give up on the basis of preemptive multitasking and memory protection, so we’d end up with a system that would be worse than MacOS 9 (even that OS supported interrupts). And while MacOS 9 went to heroic lengths in riding a cooperative multitasking, unprotected memory model, no one who has known MacOS 9 ever wants to go back to it: it is dead for a reason (from the story of Audion). Alternatively, we could imagine fundamentally different ways of providing these high-level features, but then you still have to solve the issue of “what if the software developer made a mistake and specified an excruciatingly long vector, such that it will take 10 seconds to complete processing?” So either way, there would need to be some way to pause vector processing to do something else, then resume where it left off, which imposes severe constraints on the instruction set: RISC-V is going to attempt that, but I do not know of anyone else.

One assumption we could challenge is whether we need to be writing system software in C and judging processors according to how well they execute that software (via). To my mind, Spectre and Meltdown are as much the result of having to support Unix or systems with Unix-ish semantics, or possibly even just fast interrupts, as they are the result of having to support C. Consider: flat memory; context switching/preemptive multitasking hacked on top of the absolute minimum of hardware support (OSes have to read, then replace, every bit of architectural state individually and manually!), which itself implies limited architectural state, which results in lots of implicit micro-architectural state to compensate; traps also hacked on top of absolute minimal hardware support, in particular with no quick way to switch to a supervisor context to provide services short of a crude memory-range-based protection (and now we see how that turned out)¹; no collection types in the syscall interface (ever looked at select()?), thus forcing every non-scalar interaction to go through the kernel reading a block of memory from the calling process’s address space; mmap(); etc. However, especially in the domain of traps, it will be necessary to carefully follow the end-to-end principle: the history of computing is littered with hardware features that ended up unused for failing to match the high-level, end-to-end need. This is visible for instance in the x86 instruction set: neither the dedicated device I/O instructions (IN/OUT) nor protection rings 1 and 2 (user-level code uses ring 3, and kernels use ring 0) are ever used in general purpose computing (except maybe on Windows for compatibility reasons).
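
For those who have never looked at select(), here is a minimal sketch of what calling it involves. The helper names is_readable and pipe_demo are hypothetical. The point to notice is that the “set of descriptors” is not a type the kernel interface knows about: it is a block of caller memory (fd_set, typically a fixed-size bitmap) that the kernel reads from the calling process and overwrites in place with the result.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: 1 if fd is readable right now, 0 if not,
 * -1 on error. */
static int is_readable(int fd)
{
    fd_set readfds;                    /* the "collection", in our memory */
    struct timeval timeout = { 0, 0 }; /* poll, do not block */

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    /* The kernel reads the set through the pointer, then overwrites it
     * with the subset of descriptors that are ready. */
    int ready = select(fd + 1, &readfds, NULL, NULL, &timeout);
    if (ready < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}

/* Usage sketch: a fresh pipe is not readable until something is written.
 * Returns 0 when that is what was observed, -1 otherwise. */
static int pipe_demo(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    int before = is_readable(fds[0]);
    int wrote = (write(fds[1], "x", 1) == 1);
    int after = is_readable(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return (before == 0 && wrote && after == 1) ? 0 : -1;
}
```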

But Spectre and Meltdown are also indeed the result of having to support C; in particular, memory synchronization primitives are not very good matches for common higher-level language operations, such as reference assignment (whether in a GC environment or an ARC environment). Unfortunately, a lot of research in that area ends up falling back to C…

Revisiting the assumption of C and Unix is no coincidence. While our computers are in many ways descendants of microcomputers, the general design common to current microprocessor ISAs is inherited from minicomputers, and most of it specifically from the PDP-11, where both C and Unix were developed; this includes memory-mapped and interrupt-based device I/O rather than channel I/O, demand-paged virtual memory (the latter two of which imply the need to be able to fault at any time), both byte and word addressing, etc. This in turn shapes higher-level features and their implementation: preemptive multitasking, memory protection, IPC, etc. Alternatives such as Lisp machines, Intel 432, Itanium, or Sun Rock did not really pan out; but did these failures disprove their fundamental ideas? Hard to tell. And some choices from the minicomputer era, most of which were related to offloading responsibilities from hardware to software, ended up being useful and/or sticking for reasons sometimes different from the original ones, which could most often be summarized as: we need to make do with the fewest possible transistors/memory cells. (There are parallels with evolutionary biology: an adaptation that was developed at one time may find itself repurposed for a completely different need.) For instance, the PDP-11 supported immutable code, while some other architectures would assume instructions could be explicitly or sometimes even implicitly modified: e.g. as late as 1982 the Burroughs B4900 stored the tracking for branch prediction directly in the branch instruction opcode itself (something I found out while researching the history of branch prediction…). Immutable code was immediately useful for having programs directly stored in ROM instead of having to be loaded from mass storage and occupy precious RAM; it was also useful because it enabled code sharing (the PDP-11 supported PIC in addition); but today it is also indispensable for security purposes: it’s most definitely safer to have mutable data be marked as no-exec, and therefore have executable code be immutable. In the same way, minicomputers eschewed channel I/O to avoid an additional processing unit dedicated to device I/O, thus saving costs and enabling them to fit in a refrigerator-sized enclosure rather than require a dedicated room with raised flooring, but nowadays having the CPU be able to interrupt its processing (and later resume it) is mandatory for preemptive multitasking and real-time purposes such as low-latency audio (think Skype). As a result, it is not possible to decide we can do without feature X of processors just because its original purpose is no longer current: feature X may have been repurposed in the meantime. In particular, virtual memory faults are used everywhere to provide just-in-time services, saving resources even today in the process (physical pages not allocated until actually used, CoW memory, etc.). Careful analysis, and even trial and error (can we build a performant system on this idea?), are necessary. As a significant example, we must not neglect how moving responsibilities from hardware to software enabled rapid developer iteration on the UX of computer monitor functionality (job control, I/O, supervision, auditing, debugging, etc.). Moving any of this responsibility back to hardware would almost certainly cause the corresponding UX to regress, regardless of the hardware’s efforts towards this end: the lack of rapid iteration would mean any shortcoming would remain unfixable.
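
As a small illustration of virtual memory faults providing just-in-time services, the sketch below reserves a large anonymous mapping and touches a single byte in the middle of it. The function names are hypothetical. On typical Unix-like systems the mmap() call itself only sets up address space, and a physical page gets allocated (and zero-filled) by the fault handler when the byte is first touched; that behavior is an assumption about the OS, not something the code can check by itself.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on some libcs */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve len bytes of zero-filled anonymous
 * memory. Returns NULL on failure. */
static unsigned char *reserve_lazy(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Reserve len bytes, touch one byte at offset, release everything.
 * Returns 1 if the byte read back as zero and then held the value
 * written to it, 0 if not, -1 if the mapping could not be created. */
static int touch_sparse(size_t len, size_t offset)
{
    assert(offset < len);
    unsigned char *mem = reserve_lazy(len);
    if (mem == NULL)
        return -1;
    /* The first access to the page faults; the OS supplies a page then. */
    int was_zero = (mem[offset] == 0);
    mem[offset] = 42;
    int ok = was_zero && mem[offset] == 42;
    munmap(mem, len);
    return ok;
}
```

Calling touch_sparse with, say, a 64 MiB length and an offset in the middle should succeed while only ever needing a page or so of physical memory, under the assumption stated above.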

Now that I think about it, one avenue of exploration would be to try and build on a high-level memory model that reflects the performance characteristics of the modern memory hierarchy. Indeed, it is important to realize that uniform, randomly addressable memory hasn’t been the “reality” for some time: for instance, row selection in JEDEC protocols (DDR, etc.) means it’s faster to read contiguous blocks than random ones, and besides that we have NUMA, and caches of course (need a refresher? Ulrich Drepper’s got you covered). That being said, the von Neumann abstraction will remain true at some level. So the game here would be not so much to abandon it as to build a more structured abstraction above it that better matches the underlying performance characteristics.
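
The non-uniformity is easy to run into from plain C. In the sketch below (function names hypothetical), both functions compute the same sum over the same matrix, and the flat abstraction says they are equivalent; the first walks memory contiguously while the second jumps by a whole row at each step. On most current hardware the contiguous walk tends to be noticeably faster for large matrices, though by how much depends on the machine, so no timings are claimed here.

```c
#include <assert.h>
#include <stddef.h>

/* Walks the matrix in the order it is laid out in memory. */
static long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];   /* consecutive addresses */
    return sum;
}

/* Same result, but each access lands cols elements after the last. */
static long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];   /* stride of cols elements */
    return sum;
}
```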

As you can see, this is still very much an open research space. In the meantime, we will have to make do with what we have.

¹e.g. we could imagine a system similar to fasttraps (themselves typically used for floating-point assistance) where a new syscall instruction would take as arguments the address and number of bytes of memory to copy (even if more may end up being necessary, that would still allow processing to start), and automatically switch to the kernel address space, so that the processor would best know how to manage the caches (in relation to address translation caching for instance) instead of second-guessing what the program is attempting to do.
