Lukas' Notes

In a classic five-stage pipelined processor, a branch is resolved in the execute stage. By then, two younger instructions have already been fetched and decoded. If the branch is taken, both lie on the wrong path and must be flushed — a two-cycle penalty for every taken branch.

Many branches, however, test only for equality, a comparison far simpler than a full ALU operation. That makes it feasible to decide such branches one stage earlier, in the decode stage.

Two things must move from the execute stage to the decode stage:

  1. Branch target address computation. The program counter and the immediate offset are already latched in the IF/ID pipeline register. Moving the target adder from the execute stage to the decode stage is therefore a straightforward relocation — the operands are already there.

  2. Register comparison. Instead of routing the two source registers through the ALU, a dedicated equality comparator is placed in the decode stage. The comparator is far cheaper and faster than a full ALU: it only has to XOR the two values bit by bit and check that every result bit is zero, so it needs no carry chain.
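The two relocated pieces can be sketched in a few lines of Python. This is only an illustrative model, not a hardware description: it assumes a MIPS-style beq with 32-bit registers, a 16-bit word offset, and a target taken relative to PC + 4, none of which these notes specify. All function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath


def sign_extend16(imm):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc_plus_4, imm16):
    # Assumed MIPS-style encoding: the offset counts words, relative to PC + 4.
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32


def regs_equal(a, b):
    # Equality comparator: XOR the operands, then check that no bit differs.
    return ((a ^ b) & MASK32) == 0


def decode_stage_branch(pc_plus_4, imm16, rs_val, rt_val):
    """Return (taken, next_pc) for an equality branch evaluated in decode."""
    taken = regs_equal(rs_val, rt_val)
    next_pc = branch_target(pc_plus_4, imm16) if taken else pc_plus_4
    return taken, next_pc
```

For example, decode_stage_branch(0x1004, 3, 7, 7) reports a taken branch to 0x1010, while unequal register values leave the next PC at 0x1004.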

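The trade-off is that forwarding and hazard detection must now reach into the decode stage as well, since the comparison may depend on a result still in flight. When that result is not yet available, for example because the producing instruction is immediately ahead of the branch, the branch must stall in decode until it is.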
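How long the branch may have to wait can be estimated with a small model. The sketch below assumes a textbook five-stage pipeline in which ALU results can be forwarded to decode from the EX/MEM register, load data is available only after the memory stage, and the register file is written before it is read within a cycle. These assumptions and the function name are illustrative and not taken from the notes.

```python
def branch_stall_cycles(distance, producer_is_load):
    """Stall cycles before a decode-stage branch can read a source register.

    distance: how many instructions ahead of the branch the producer sits
              (1 = immediately before the branch).
    producer_is_load: True if the producer is a load, False for an ALU op.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # Wait needed when the producer is immediately ahead of the branch:
    # an ALU result is ready after EX (1 cycle), load data after MEM (2 cycles).
    wait_if_adjacent = 2 if producer_is_load else 1
    # Every extra instruction in between hides one cycle of that wait.
    return max(0, wait_if_adjacent - (distance - 1))
```

Under these assumptions, an ALU instruction directly before the branch costs one stall cycle and a load directly before it costs two. With one unrelated instruction in between, the ALU case needs no stall and the load case needs one.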

The payoff is immediate. When the branch outcome is known in the decode stage, only the single instruction in the fetch stage lies on the wrong path. A taken branch now costs one cycle instead of two.

Therefore, moving the branch decision from EX to ID halves the taken-branch penalty, from two cycles to one, for the common case of simple equality branches. The cost is a dedicated comparator and additional forwarding and hazard-detection logic in the decode stage, hardware that pays for itself on every taken branch.
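The effect on average performance can be estimated with a simple CPI model. The sketch below assumes a base CPI of 1 and that only taken branches pay a penalty, equal to the number of stages in front of the one that resolves the branch. It ignores any stalls added by decode-stage data hazards, and the branch and taken fractions in the example are hypothetical numbers chosen for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def taken_branch_penalty(resolve_stage):
    """Wrong-path instructions fetched before the branch resolves."""
    # One younger instruction occupies each stage in front of the resolving one.
    return STAGES.index(resolve_stage)


def effective_cpi(branch_fraction, taken_fraction, penalty, base_cpi=1.0):
    """Average cycles per instruction when only taken branches are penalised."""
    return base_cpi + branch_fraction * taken_fraction * penalty


# Hypothetical workload: 20% branches, 60% of them taken.
cpi_ex = effective_cpi(0.20, 0.60, taken_branch_penalty("EX"))
cpi_id = effective_cpi(0.20, 0.60, taken_branch_penalty("ID"))
```

With these made-up fractions the model gives a CPI of about 1.24 when branches resolve in EX and about 1.12 when they resolve in ID.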