How to design an architectural processor feature, anyway?

Before I present what could be the very long term solutions to Meltdown and Spectre, I thought it would be interesting to look at a case study in how to (and how not to) implement processor features.

So, imagine you’re in charge of designing a potential replacement for the 6809, and you read this article, with the takeaway that, amazing as it is, that hack would quickly become insufficient given the increase in screen size and resolution (VGA is just around the corner, after all) that is going to outpace processor clock speed.

Of course, one of the first solutions for this issue would be to have better dedicated graphics capabilities, but your new processor may be used in computers where there is zero such hardware, or even if there is, there are always going to be use cases that fall through the cracks and are not well supported by such hardware, use cases where programmers are going to rely on your processor instead. In fact, you don’t want to get too tied up to that particular scenario, and instead think of it as being merely the exemplar of a general need: that of efficient, user-level memory copy of arbitrary length between arbitrary addresses that integrates well with other features, interrupts in particular. That being set up, let us look at prospective solutions.

Repeat next instruction(s) a fixed number of times

That one seems obvious: a new system that indicates the next instruction, or possibly the next few instructions, is to be executed a given number of times, not just once; the point being to pay the instruction overhead (decoding, in particular) only once, then having it perform its work N times at full efficiency. This isn’t a new idea, in fact, programmers for a very long time have been requesting the possibility for an instruction to “stick” so it could operate more than once. How long? Well, how about for as long as there have been programmers?

However, that is not going to work that simply, not with interrupts in play. Let us look at a fictional instruction sequence:

IP -> 0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

But in the middle of the copy, an interrupt is received after the MOV instruction has executed two times, with two more executions remaining. Now does our state (as we enter the interrupt handler) look like this:

      0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

      0100 0001 (saved IP)
SP -> 0102 XXXX

In which case, when we return from the interrupt the MOV will only be executed once more, making it executed only 3 times in total, rather than the expected 4, wreaking havoc in the program; so can we provide this instead:

      0000 REP 4
      0001 MOV X++, Y++
      0002 RTS

      0100 0000 (saved IP)
SP -> 0102 XXXX

Well, no, since then upon return from the interrupt execution will resume at the REP instruction… in which case the MOV instruction will be executed 4 times, even though it has already executed twice, meaning it will execute 2 extra times and 6 times in total.

It’s not possible to modify the REP instruction since your processor has to support executing code directly from ROM given the price of RAM (and making code be read-only is valuable for other reasons, such as being more secure or allowing it to be shared between different processes). How about resetting X and Y to their starting values and resume all iterations on exit? Except operation of the whole loop is not idempotent if the two buffers overlap, and there is no reason not to allow that (e.g. memmove allows it), so restarting the whole sequence is not going to be transparent. What about delaying interrupts until all iterations are completed? With four iterations that might be acceptable, but given your processor clock speed, as little as 16 iterations could raise important issues in the latency of interrupt handling, such that real-time deadlines would be missed and sound be choppy.

Whichever way we look at it, this is not going to work. What will?

Potential inspiration: the effect on the architectural state of the Thumb (T32) If-Then (IT) instruction

Conditional execution (or perhaps better said, predicated execution) is pervasive in ARM, and it is possible in Thumb too, but that latter case requires the If-Then instruction:

(any instruction that sets the EQ condition code)
IP -> 00000100 ITTE EQ
      00000102 ADD R0, R0, R1
      00000104 ST R5, [R3]
      00000106 XOR R0, R0, R0

SP -> 00000200 XXXXXXXX (don’t care)
      00000204 XXXXXXXX

And as if by magic, the ADD and ST instructions only execute if the EQ condition code is set, and XOR, corresponding to the E (for else) in the IT instruction, only executes if the EQ condition code is *not* set, as if you had written this:

(any instruction that sets the EQ condition code)
IP -> 00000100 ADD.EQ R0, R0, R1
      00000102 ST.EQ R5, [R3]
      00000104 XOR.NE R0, R0, R0

That might appear to raise interruptibility challenges as well: what happens if an interrupt has to be handled just after the ADD instruction, or when the ST instruction raises a page fault because the address at R3 must be paged back in? Because when execution resumes at ST, what is to stop XOR from being unconditionally executed?

The answer is ITSTATE, a 4-bit register that is part of the architectural state. What the IT instruction actually does is:

  • take its immediate bits (here, 110), and combine them using a negative-exclusive-or with the repeated condition code bit (we’re going to assume it is 111)
  • set ITSTATE to the result (here, 110), padding missing bits with ones (final result here being 1101)

And that’s it. What then happens is that nearly every T32 instruction (BKPT being a notable exception) starts operation by shifting out the first bit from ITSTATE (shifting in a 1 from the other side), and avoids performing any work if the shifted out bit was 0

This means you never need explicitly invoke ITSTATE, but it is very much there, and in particular it is saved upon interrupt entry, which ARM calls an exception, and restored upon exception return, such that predicated processing can resume as if control had never been hijacked: upon exception return to the ST instruction, ST will execute, then XOR will not since it will shift out a 0 from ITSTATE, the latter having been restored on exception return.

The lesson is: any extra behavior we want to endow the processor with needs to be expressible as state, so that taking an interrupt, saving the state, and later restoring the state and resuming from a given instruction results in the desired behavior being maintained despite the interrupt.

Repeat next instruction(s) a number of times given by state

Instead of having REP N, let us have a counter register C, and a REP instruction which repeats the next instruction the number of times indicated in the register (we need two instructions for this, as we’re going to see):

IP -> 0000 MOV 4, C
      0001 REP C
      0002 MOV X++, Y++
      0003 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

Now if an interrupt occurs after two iterations, the state is simply going to be:

      0000 MOV 4, C
      0001 REP C
      0002 MOV X++, Y++
      0003 RTS

      0100 0001 (saved IP)
SP -> 0102 XXXX

With C equal to 2. Notice the saved IP points to after the MOV to the counter register, but before the REP C, that way, when execution resumes on the REP instruction the memory-to-memory MOV instruction is executed twice and the end state will be the same as if all four iterations had occurred in sequence without any interrupt.

Applied in: the 8086 and all subsequent x86 processors, where REP is called the REP prefix and is hardwired to the CX register (later ECX), and you can use it for memory copy by prepending the MOVS instruction with it (instruction which is itself hardwired to SI (later ESI) for its source, and DI (later EDI) for its destination).

Load/store multiple

The REP C instruction/prefix system has a number of drawbacks, in particular in order to play well with interrupts as we just saw it requires recognition when handling the interrupt that we are in a special mode, followed by creating the conditions necessary for properly resuming execution. It also requires the elementary memory copy to be feasible as a single instruction, which is incompatible with RISC-style load-store architectures where an instruction can only load or store memory, not both.

We can observe that the REP C prefix, since it is only going to apply to a single instruction, will not serve many use cases anyway, so why not instead dedicate a handful of instructions to the memory copy use case and scale the PUL/PSH system with more registers?

That is the principle of the load and store multiple instructions. They take a list of registers on one side, and a register containing an address on the other, with additional adjustement modes (predecrement, postincrement, leave unchanged) so as to be less constrained as with the PUL/PSH case. Such a system requires increasing the number of registers in the architectural state so as to amortize the instruction decoding cost, increase which is going to add to context switching costs, but we were going to do that anyway with RISC.

So now our fictional instruction sequence can look like this:

IP -> 0000 LOADM X++, A-B-C-D
      0001 STOM A-B-C-D, Y++
      0002 RTS

SP -> 0100 XXXX (don’t care)
      0102 XXXX

We still have to promptly handle interrupts, but for the load/store multiple system the solution is simple, if radical: if an interrupt occurs while such an instruction is partially executed, execution is abandoned in the middle, and it will resume at the start of the instruction when interrupt processing is done. This is OK, since these loads and stores are idempotent: restarting them will not be impacted by any previous partial execution they left over (of course, a change to the register used for the address, if any such change is required, is done as the last step, so that no interrupt can cause the instruction to be restarted once this is done).

Well, this is only mostly OK. For instance, some of the loaded registers may have relationships with one another, such as the stack pointer (in the ABI if not architecturally), and naively loading such a register with a load multiple instruction may violate the constraint if the load multiple instruction is restarted. Similar potentially deadly subtleties may exist, such as in relation with virtual memory page faults where the operating system may have to emulate operation of the instruction… or may omit to do so, in which case load/store multiple instructions are not safe to use even if the processor supports them! I think it was the case for PowerPC ldm/stm in Mac OS X.

Sidebar: how do you, as a software developer, know whether it is safe to use the load and store multiple instructions of a processor if it has them? An important principle of assembly programming is that you can’t go wrong by imitating the system C compiler, so compile this (or a variant) at -Os, or whichever option optimizes for code size, to asm:

struct package
{
    size_t a, b, c;
};

void packcopy(struct package* src, struct package* dst)
{
    *dst = *src;
}

if this compiles to a load multiple followed by a store multiple, then those instructions are safe to use for your system.

Applied in: the 68000 (MOVEM), PowerPC, ARM (where their use is pervasive, at least pre-ARM64), etc.

decrement and branch if not zero

One observation you could make about the REP C system would be that it is best seen as implicitly branching back to the start of the instruction each time it is done executing, so why not put that as a plain old branch located after the instruction rather than as a prefix? Of course, that branch would handle counter management so that it could still function as a repetition system contained in a single instruction, but now repetition can be handled with more common test+branch mechanisms, simplifying processor implementation especially as it relates to interrupt management, and generalizes to loops involving more than one instruction, meaning there is no need to have the elementary copy be a single instruction:

IP -> 0000 MOV 4, C
loopstart:
      0001 LOAD X++, A
      0002 STO A, Y++
      0003 DBNZ C, loopstart;
      0004 RTS

From that framework, you can design the processor to try and recognize cases where the body of the loop is only 1 or 2 instructions long, and handle those specially by no longer fetching or decoding instructions while the loop is ongoing: it instead repeats operation of the looped instructions. In that case it still needs to handle exiting that mode in case of an interrupt, but at least it can decide by itself whether it can afford to enter that mode: for instance, depending on the type of looped instruction it could decide it would not be able to cleanly exit in case of interrupt and decide to execute the loop the traditional way instead.

The drawback is that it is a bit all-or-nothing: the loop is either executed fuly optimized or not at all, with the analysis becoming less and less trivial as we want to increase the number of looped instructions supported: regardless of the size of the loop, if there is a single instruction in the loop body, or a single instruction transition, where the engine would fail to set them up to loop in a way where it can handle interrupts, then the whole loop is executed slowly. That being said, it does handle our target use case as specified.

Applied in: the 68010 and later 68k variants such as CPU32-based microcontrollers, where the DBRA/DBcc instruction could trigger a fast loop feature where instructions fetches are halted and operation of the looped instruction is repeated according to the counter.

instruction caches, pipelining, and branch prediction

You could look at the complexity of implementing interrupt processing in any of these features and consider that you could almost as easily implement a proper pipeline, including handling interrupts while instructions are in flight, and end up supporting the use case, but also much more general speedups, just as efficiently. After all, the speed of memory copy is going to be constrained by the interface to the memory bus, your only contribution is to reduce as much as possible instruction fetching and decoding overhead, which is going to be accomplished if that happens in parallel with the memory copy of the previous instruction. Accomplishing that also requires a dedicated instruction cache so instruction can be fetched in parallel with data, but integrating a small amount of memory cells on your processor die is getting cheaper by the day. And keeping the pipeline fed when branches are involved, as here with loops, will require you to add non-trivial branch prediction, but you can at least get loops right with a simple “backwards branches are predicted to be taken” approach. And it turns out that simple branch predictors work well in real-life code for branches beyond loops, compensating the effects of pipelining elsewhere (and if you make the predictor just a little bit more complex you can predict even better, and then a little more complexity will improve performance still, etc.; there is always a performance gain to be had).

Applied in: all microprocessors used in general-purpose computing by now, starting in the beginning of the 90’s. For instance, x86 processors have an instruction implementing the decrement and branch if not zero functionality, but its use is now discouraged (see 3.5.1 Instruction Selection in the Intel optimization reference manual), and modern Intel processors recognise loops even when they use generic instructions and optimize for them, except for the loop exit which keeps being mispredicted.

With all that in mind, next time we’re going to be able to look at how to redesign our processors to avoid the situation that led us to rampant, insecure speculation in the first place.

Leave a Reply

Name *
Email *
Website