It’s hard to believe it has now been more than one year since the disclosure of Meltdown and Spectre. There was so much frenzy in the first days and weeks that it has perhaps obscured the fact any solutions we currently have are temporary, barely secure, spackle-everywhere stopgap mitigations, and now that the dust has settled on that, I thought I’d look at what researchers and other contributors have come up with in the last year to provide secure processors – without of course requiring all of us to rewrite all our software from scratch.
Do I need to remind you of Meltdown and Spectre? No, of course not; even if you’re reading this 20 years from now you will have no trouble finding good intro material to these. So as we discuss solutions, my only contribution would be this: it is important to realize designers were not lazy. For instance, they did not “forget” the caches as part of undoing speculative work in the processor, as you can’t “undo” the effect of speculation on the caches: for one, how would you reload the data that was evicted (necessary in order to be a real undo)? You can’t really have checkpoints in the cache that you roll back to, either: SafeSpec explores that, and besides still leaking state, more importantly it precludes any kind of multi-core or multi-socket configuration (SafeSpec is incompatible with cache coherency protocols), a non-starter in this day and age (undoing cache state is also problematic in multi-core setups, as the cache state resulting from speculative execution would be transitorily visible to other cores).
It is also important to realize preventing aliasing in branch prediction tracking slots would not fundamentally solve anything: even if this was done, attackers could still poison BHS and possibly BTB by coercing the kernel into taking (resp. not taking) the attacked branch, through the execution of ordinary syscalls, and then use speculative execution driven by that to leak data through the caches.
Besides information specific to Meltdown and Spectre, my recommended reading before we begin is Ulrich Drepper on the modern computer memory architecture, still current, and Dan Luu on branch prediction: this will tell you the myriad places where processors generate and store implicit information needed for modern performance.
As opposed to the current mitigations, we need systemic protection against whole classes of attacks, not just the current ones: it’s not just that hardware cannot be patched, but it also has dramatically longer design cycles which means protecting only against known risks at the start of a projet would make the protections obsolete by the time the hardware ships. And even if patching was a possibility, it’s not like the patch treadmill is desirable, anyway (in particular, adding fences, etc. around manually identified vulnerable sequences feels completely insane to me and amounts to a dangerous game of whack-a-vulnerability: vulnerable sequences will end up being added to kernel code faster than security-conscious people will figure them out). Take for instance, the Intel doc which described the Spectre (and Meltdown) vulnerability as being a variant of the “confused deputy”; this is its correct classification, but I feel this deputy is confused because he has been given responsibility of the effect of speculative executions of his code paths, a staggering responsibility he has never requested in the first place! No, we need to attack these kinds of vulnerabilities at the root, such that they cannot spawn new heads, and those two techniques do so.
First is DAWG. The fundamental idea is very intriguing: it is designed to close off any kind of cache side channel state¹, not merely tag state (that is, whether a value is present in the cache or not), and designed to close off data leaks regardless of which phenomenon would feed any such side channel: it is not limited to speculative execution. How do they ensure that? DAWG does so by having the OS dynamically partition all cache levels, and then assign the partitions, in a fashion similar to PCID.
This means that even with a single processor core, there are multiple caches at each level, one per trust domain, each separate from its siblings, and having a proportional fraction of the size and associativity of the corresponding physical cache of that level (cache line size and cache way size are unaffected). This piggybacks on recent evolutions (Intel CAT) to manage the cache as a resource to provision, but CAT is intended for QoS and offers limited security guarantees.
As long as data stays within its trust domain, that is all there is to it. When cross-partition data transfer is necessary, however, the kernel performs it by first setting up a mixed context where reads are to be considered as belonging to one domain, but writes to another, then performs the reads and writes themselves: it affords best possible cache usage during and after transfer.
Such an organization raises a number of sub-problems, but they seem to have done a good job of addressing those. For instance, since each cache level is effectively partitioned, the same cache line may be in multiple places in the same physical cache level, in different domains, which is not traditional and requires some care in the implementation. The kernel has access to separate controls for where eviction can happen, and where hits can happens, this is necessary for a transition period whenever resizing the partitions. DAWG integrates itself with cache coherency protocols, by having each cache partition behave mostly, but not exactly, like logically separate cache for cache coherency purposes: one particularly important limitation we will come back to is that DAWG cannot handle a trust domain attempting to load a line for writing when a different domain already owns that line for writing.
In terms of functional insertion, they have a clever design where they interpose in a limited number of cache operations so as not to insert themselves in the most latency-critical parts (tag detection, hit multiplexing, etc.). It requires some integration with the cache replacement algorithm, and they show how to do so with tree-PLRU (Pseudo Least Recently Used) and NRU (Not Recently Used).
In terms of features, DAWG allows sharing memory read-only, CoW (Copy on Write), and one-way shared memory where only one trust domain can have write access. DAWG only features a modest efficiency hit compared to the insecure baseline, though it depends on policy (CAT has similar policy-dependent behavior).
On the other hand, there are a few, though manageable, limitations.
- DAWG disallows sharing physical memory between different trust domains where more than one domain has write access, due to impossibility to manage cache coherence when more than one domain wants to write to two cache lines corresponding to the same physical address. I feel this is manageable: such a technique is probably extremely hard to secure given the possibility of a side channel through cache coherency management state, as MeltdownPrime and SpectrePrime have demonstrated, so we would need an overview of the main uses of where such memory sharing happens; off the top of my head, the typical use is for the framebuffer used for IPC with WindowServer/X11, in which case the need in the first place is only for one-way transfer, the solution here would be to change permissions to restrict write rights to one side only.
- DAWG provides no solution for transfer in/out of shared physical memory between different trust domains where neither is the kernel. But as we just saw, the allocation of such a thing need only be done by specific processes (perhaps those marked with a specific sandbox permission?), and transfer could be performed by the kernel on behalf of the allocating domain through a new syscall.
- Hot data outside the kernel such as oft-called functions in shared libraries (think objc_msgSend()), while residing in a single place in physical memory, would end up being copied in every cache partition, thus reducing effective capacity of all physical caches (hot data from the kernel would only need to be present in the kernel partition, regardless of which process makes the syscall).
- Efficient operation relies on the kernel managing the partitioning according to the needs of each trust domain, which is not trivial: partition ID management could be done in a fashion similar to PCID, however that still leaves the determination of partition size, keeping in mind that the cache at every level needs to be partitioned, including those shared between different cores which therefore have more clients and thus require more partitioning, additionally with limited granularity, granularity which depends on the level: a 16-way set associative cache may be partitioned in increments of 1/16th of its capacity, but a 4-way set associative cache only by fourths of its capacity. Easy.
- DAWG guards between explicit trust domains, so it cannot protect against an attacker in the same process. This could be mitigated by everyone adopting the Chrome method: sorry Robert, but maybe “mixing code with different trust labels in the same address space” needs to become a black art.
The basic idea of InvisiSpec corresponds to the avenue I evoked back then, which is that speculative loads only bring data to the processor without affecting cache state (either bringing that data to a cache level where it wasn’t, or modifying cache replacement policy, or other metadata), with the cache being updated only when the load is validated.
Well, that’s it, god job everyone? Of course not, the devil is in the details, including some I never began to suspect: validation cannot happen just any random way. InvisiSpec details how this is done in practice, the main technique being special loads performed solely for validation purposes: once loaded, the processors only uses this data, if ever, to compare it against the speculatively loaded data kept in a speculation buffer, and if the values match, processing can proceed; and while you would think that could raise ABA issues, it is not the case, as we’re going to see.
Overall, InvisiSpec proposes a very interesting model of a modern processor: first, a speculation part that performs computations while “playing pretend”: it doesn’t matter at that point whether data is correct (of course, it needs to be correct most of the time to serve any purpose), then the reorder buffer part, which can be seen as the “real” processing that executes according to the sequential model of the processor, except it uses results already computed by the speculative engine, when they exist. In fact, if these results don’t exist (e.g. the data was invalidated), the reorder buffer has to have the speculative engine retry, and the reorder buffer waits for it to be done: it does not execute the calculations (ALU, etc.) inline. A third part makes sure to quickly (i.e. with low latency) feed the speculative engine with data that is right most of the time, and do so invisibly: loads performed by the speculative engine can fetch from the caches but do not populate any cache, and are instead stored in the speculation buffer in order to remember that any results were obtained from these inputs.
This model piggybacks on existing infrastructure of current out of order processors: the reorder buffer is already the part in charge of making sure instructions “retire”, i.e. commit their effect, in order; in particular, on x86 processors the reorder buffer is responsible for invalidating loads executed out of order, including instructions after those, when it detects cache coherence traffic that invalidates the corresponding cache line. Ever wondered how x86 processors could maintain a strongly ordered memory model while executing instructions out of order? Now you know.
InvisiSpec has to do much more, however, as it cannot rely on cache coherence traffic: since the initial load is invisible, by design, other caches are allowed to think they have exclusive access (Modified/Exclusive/Shared/Invalid, or MESI, model) and won’t externally signal any change. Therefore, if the memory ordering model stipulates that loads must appear to occur in order, then it is necessary for the reorder buffer to perform a full validation, i.e. not only must it perform a fresh, new, non-speculative load as if the load was executed for the first time (thus allowing the caches to be populated), but then it has to wait for it to complete and compare the loaded data with the speculatively loaded one; if they are equal, then the results precomputed by the speculative engine for the downstream computations are correct as well, and the reorder buffer can proceed with these instructions: it does not even matter if A compared equal to A but the memory cell held the value B in between, as the only thing that matters is whether the downstream computation is valid for value A, which is true if and only if the speculative engine was fed an equal value A when it executed.
This leads into a much more tractable model for security: as far as leaking state is concerned, security researchers only need to look at operation of the reorder buffer; on the other hand, performance engineers will mostly look at the upstream parts, to make sure speculation will be invalidated as rarely as possible, but still look at the reorder buffer to make sure validation latencies will be covered, as far as possible.
Notably, InvisiSpec protects against attackers living in the same address space or trust boundary, and since it is cache-agnostic, it does not restrict memory sharing in any way.
The following limitations can be noted in InvisiSpec:
- InvisiSpec only protects against speculation-related attacks, not other kinds of attacks that also use the cache as a side channel. Additional techniques will be needed for those.
- InvisiSpec adds a significant efficiency hit compared to insecure baseline, both in execution time (22% to 80% increase on SPEC benchmarks, lower is better) and cache traffic (34% to 60% increase on SPEC benchmarks, lower is better), the latter of which is one of the main drivers of power usage. That will need to be improved before people will switch to a secure processor, otherwise they will keep using “good enough” mitigations; more work is needed in that area. My analysis would be that most of that efficiency hit is related to the requirement to protect against an attacker in the same address space: any two pair of loads could be an attacker/victim pair! The result is that pipelining is mostly defeated to the extent it is used to protect against load latencies. I am skeptical with regard to their suggestion for the processor to disable interrupts after a load has committed, and until the next load gets to commit, so as to allow the latter to start validation early (disabling interrupts serves to remove the last potential source of events that could prevent the latter load from committing): this would add an important constraint to interrupt management, which furthermore is unlikely to compose well with similar constraints.
This isn’t the last we will hear of work needed to secure processors post-Meltdown and Spectre; I am sure novel techniques will be proposed. At any rate, we in the computing industry as a whole need to start demanding of Intel and others what systemic protections they are putting in their processors, be they DAWG or InvisiSpec or something else, which will ensure whole classes of attacks become impossible.
- At least on the digital side: DAWG does not guard against power usage or electromagnetic radiation leaks, or rowhammer-like attacks.↩