Lukas' Notes

Definition

Superscalar Pipelined Processor

A superscalar pipelined processor is a pipelined processor that fetches, decodes, and issues multiple instructions per cycle, enabling a CPI below 1 (IPC above 1).

The hardware examines the instruction stream dynamically and selects an issue bundle each cycle. The compiler assists by reordering instructions to expose instruction-level parallelism. Hazards are resolved at runtime using scoreboarding, register renaming, and forwarding.

Pipeline Structure

A superscalar out-of-order pipeline extends the single-issue OoO pipeline with wider fetch, multiple decode lanes, and multiple issue slots. The archetypal stage flow is fetch, decode, dispatch, issue, execute, finish, commit.

Fetch, decode, dispatch, and commit proceed in order. Issue, execute, and finish may proceed out of order.
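
The split between out-of-order finish and in-order commit can be sketched in Python. This is a minimal model, assuming a reorder-buffer-style queue; the notes do not name the structure, and the class name, method names, and commit width are illustrative choices rather than a description of any particular processor.

```python
from collections import deque


class ReorderBuffer:
    """Toy model: instructions finish in any order but commit in program order."""

    def __init__(self, commit_width):
        self.commit_width = commit_width  # max instructions committed per cycle
        self.entries = deque()            # tags in program order, oldest on the left
        self.finished = set()             # tags whose execution has finished

    def dispatch(self, tag):
        # Dispatch is in order: append at the tail.
        self.entries.append(tag)

    def finish(self, tag):
        # Finish may happen out of order: just record it.
        self.finished.add(tag)

    def commit(self):
        # Commit only from the head, and only while the head has finished.
        retired = []
        while (self.entries
               and len(retired) < self.commit_width
               and self.entries[0] in self.finished):
            tag = self.entries.popleft()
            self.finished.discard(tag)
            retired.append(tag)
        return retired
```

With three dispatched instructions, finishing the youngest first commits nothing; commit only advances once the oldest entry has finished, and then retires at most `commit_width` entries per call.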

Hardware additions over a scalar OoO pipeline:

  • Wider instruction fetch (e.g. dual fetch for a peak IPC of 2)
  • Multiple decode lanes
  • Larger issue buffer and reservation stations
  • Multiple issue ports and functional units
  • Dependence checking across parallel instructions
  • FU hazard checking
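
The last two items, dependence checking across parallel instructions and FU hazard checking, can be illustrated with a small Python sketch for a two-wide issue bundle. It assumes register renaming has already removed WAR and WAW hazards, so only RAW is checked; the `Instr` record, the `can_issue_together` helper, and the functional-unit names are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dst: Optional[str]      # destination register, None for stores and branches
    srcs: Tuple[str, ...]   # source registers
    fu: str                 # functional-unit class the instruction needs


def can_issue_together(first: Instr, second: Instr, fu_count: Dict[str, int]) -> bool:
    """Return True if `second` may issue in the same cycle as the older `first`."""
    # Dependence check: `second` must not read what `first` writes (RAW).
    raw = first.dst is not None and first.dst in second.srcs
    # FU hazard check: both need the same unit class but fewer than two exist.
    fu_conflict = first.fu == second.fu and fu_count.get(first.fu, 0) < 2
    return not (raw or fu_conflict)


lw_a = Instr("lw", "x28", ("x20",), "LU")
lw_b = Instr("lw", "x29", ("x20",), "LU")
add_a = Instr("add", "x28", ("x28", "x21"), "ALU")
```

Under these assumptions, `lw_a` and `add_a` cannot share a bundle because of the RAW dependence on x28, and `lw_a` and `lw_b` can share one only if `fu_count` provides two load units.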

Instruction Scheduling

Wakeup and Selection

Two operations govern dynamic multi-issue:

  1. Wakeup — broadcasts the tags of parent instructions that have been selected. Dependent instructions match their source tags and determine whether operands are ready. Resolves RAW data dependencies.
  2. Selection — picks instructions from the pool of ready instructions to issue, respecting resource limits (issue bandwidth, available FUs, memory ports).
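
The two operations can be sketched as a pair of Python functions over an issue pool. This is a simplified model under stated assumptions: producers have single-cycle latency (so a tag is broadcast as soon as its producer is selected), selection is oldest-first, and the names `Entry`, `wakeup`, and `select` are illustrative rather than taken from any real design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    name: str
    fu: str                                         # functional unit required
    waiting: Set[str] = field(default_factory=set)  # source tags not yet broadcast


def wakeup(pool: List[Entry], broadcast_tags: Set[str]) -> None:
    """Clear matching source tags; an entry with no tags left is ready."""
    for entry in pool:
        entry.waiting -= broadcast_tags


def select(pool: List[Entry], issue_width: int, free_fus: Dict[str, int]) -> List[Entry]:
    """Pick ready entries, oldest first, within issue width and FU limits."""
    free = dict(free_fus)   # copy so the caller's counts are not modified
    chosen = []
    for entry in pool:      # the pool is kept in program order
        if len(chosen) == issue_width:
            break
        if not entry.waiting and free.get(entry.fu, 0) > 0:
            free[entry.fu] -= 1
            chosen.append(entry)
    for entry in chosen:
        pool.remove(entry)
    return chosen
```

For a lw, add, sw chain where the add waits on the load's tag and the store waits on the add's tag, alternating `select` and `wakeup` issues one instruction per cycle even with an issue width of 2, because only one instruction is ready at a time.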

Compiler Support

Superscalar performance depends on the compiler exposing enough independent instructions per cycle. Loop unrolling is the primary technique:

# Before unrolling (1 independent op per iteration)
Loop:
  lw   x31, 0(x20)
  add  x31, x31, x21
  sw   x31, 0(x20)
  addi x20, x20, -4
  blt  x22, x20, Loop

# After unrolling by 4
Loop:
  lw   x28,    0(x20)
  lw   x29,   -4(x20)
  lw   x30,   -8(x20)
  lw   x31,  -12(x20)
  add  x28, x28, x21
  add  x29, x29, x21
  add  x30, x30, x21
  add  x31, x31, x21
  sw   x28,    0(x20)
  sw   x29,   -4(x20)
  sw   x30,   -8(x20)
  sw   x31,  -12(x20)
  addi x20, x20, -16
  blt  x22, x20, Loop

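Unrolling by 4 gives each load/add/store chain its own register (x28 to x31), so the four chains are independent of one another and the dual-issue pipeline can overlap them. It also amortizes the addi and blt overhead over four elements: 14 instructions instead of the 20 that four iterations of the original loop would execute.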
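
One rough way to quantify the parallelism in each loop body is to divide its instruction count by the length of its longest RAW dependence chain. The Python sketch below does this for the two listings above. It is an idealized estimate, assuming unit latency, register renaming (only RAW dependencies count), and no memory dependencies or FU limits; the function names and the `(dst, srcs)` encoding are illustrative.

```python
def critical_path(body):
    """Length of the longest RAW dependence chain in a straight-line body."""
    depth_of = {}   # register -> chain depth of the instruction that last wrote it
    longest = 0
    for dst, srcs in body:
        # An instruction sits one level below its deepest producer.
        depth = 1 + max((depth_of.get(reg, 0) for reg in srcs), default=0)
        if dst is not None:
            depth_of[dst] = depth
        longest = max(longest, depth)
    return longest


def available_ilp(body):
    """Instructions per dependence level: an upper bound on useful issue width."""
    return len(body) / critical_path(body)


ROLLED = [
    ("x31", ("x20",)),         # lw   x31, 0(x20)
    ("x31", ("x31", "x21")),   # add  x31, x31, x21
    (None, ("x31", "x20")),    # sw   x31, 0(x20)
    ("x20", ("x20",)),         # addi x20, x20, -4
    (None, ("x22", "x20")),    # blt  x22, x20, Loop
]

REGS = ["x28", "x29", "x30", "x31"]
UNROLLED = (
    [(r, ("x20",)) for r in REGS]          # four lw
    + [(r, (r, "x21")) for r in REGS]      # four add
    + [(None, (r, "x20")) for r in REGS]   # four sw
    + [("x20", ("x20",)), (None, ("x22", "x20"))]   # addi, blt
)
```

Both bodies have a longest chain of three instructions (lw, add, sw), but the unrolled body spreads 14 instructions over those three levels where the original spreads 5, so within a single iteration only the unrolled version offers more than two independent instructions per level.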

Superscalar vs VLIW

A VLIW processor also achieves CPI < 1, but shifts scheduling responsibility to the compiler. Superscalar hardware performs scheduling dynamically at runtime, which requires more complex hardware but avoids recompilation when the binary runs on different processor implementations with different issue widths.

Example

Dual-Issue Superscalar with Unrolled Loop

14 instructions from the unrolled loop execute on a dual-issue superscalar
pipeline (IF, IS, IB, RO, EX, WB, CO). Two instructions can be fetched and
issued per cycle.

Key stages: IF (fetch), IS (issue), IB (issue buffer, waiting),
RO (read operands), EX (execute), WB (write-back), CO (commit).

addi x20, x20, -16 — Fetched at cycle 1, issues immediately. No dependencies. Completes at 5.

lw chain (4 loads): Each load occupies the load unit (LU) for 2 cycles. The
second load of a pair is fetched in the same cycle as the first. Load 4
issues one cycle after load 3, so their execution overlaps.

add chain (4 adds): Each add waits in IB for its corresponding load
(RAW), then issues to ALU. The adds interleave with the stores:
add x28 issues while lw x29 is still in LU.

sw chain (4 stores): Each store waits in IB for its corresponding add
(RAW). Store 1 issues to SU at cycle 8, spends 1 cycle in the store
buffer (SB), and commits at cycle 10. The four stores drain with one
cycle of separation.

blt x22, x20, Loop — Waits for addi (RAW on x20). Issues at
cycle 9, commits at 13.

Result: 14 instructions in 14 cycles end to end (5-cycle ramp-up, 7 effective
cycles). Over the 7 effective cycles, CPI = 7/14 = 0.5 and IPC = 2. In steady
state the dual-issue pipeline achieves its theoretical peak throughput because
the unrolled code exposes enough ILP.
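
The arithmetic behind the result can be checked with a short Python sketch. The helper names are illustrative; the instruction and cycle counts are the ones quoted in the example. The quoted CPI of 0.5 corresponds to the 7 effective cycles, while dividing by all 14 end-to-end cycles gives a CPI of 1.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Instructions per cycle, the reciprocal of CPI."""
    return instructions / cycles


INSTRUCTIONS = 14       # unrolled loop body
EFFECTIVE_CYCLES = 7    # steady-state cycles quoted in the example
TOTAL_CYCLES = 14       # end-to-end cycles quoted in the example

steady_cpi = cpi(EFFECTIVE_CYCLES, INSTRUCTIONS)
steady_ipc = ipc(EFFECTIVE_CYCLES, INSTRUCTIONS)
end_to_end_cpi = cpi(TOTAL_CYCLES, INSTRUCTIONS)
```

The gap between the two figures shrinks as the loop runs for more iterations, since the ramp-up is paid once while the steady-state rate applies to every iteration.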