Definition
Superscalar Pipelined Processor
A superscalar pipelined processor is a pipelined processor that fetches, decodes, and issues multiple instructions per cycle, enabling a CPI below 1 (IPC above 1).
The hardware examines the instruction stream dynamically and selects an issue bundle each cycle. The compiler assists by reordering instructions to expose instruction-level parallelism. Hazards are resolved at runtime using scoreboarding, register renaming, and forwarding.
Pipeline Structure
A superscalar out-of-order pipeline extends the single-issue OoO pipeline with wider fetch, multiple decode lanes, and multiple issue slots. The archetypal stage flow is fetch, decode, dispatch, issue, execute, finish, and commit.
Fetch, decode, dispatch, and commit proceed in order. Issue, execute, and finish may proceed out of order.
Hardware additions over a scalar OoO pipeline:
- Wider instruction fetch (e.g. dual fetch for a peak IPC of 2)
- Multiple decode lanes
- Larger issue buffer and reservation stations
- Multiple issue ports and functional units
- Dependence checking across parallel instructions
- FU hazard checking
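The last two items, dependence checking across parallel instructions and FU hazard checking, can be sketched for a dual-issue bundle as follows. This is a minimal illustration, not a description of any particular processor: the `Instr` record, the `can_dual_issue` function, and the FU class names are hypothetical, and the sketch assumes no register renaming, so a WAW conflict inside the bundle also blocks pairing.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction used only for this sketch."""
    op: str
    dest: Optional[str]        # destination register, None for stores/branches
    srcs: Tuple[str, ...]      # source registers
    fu: str                    # functional-unit class, e.g. "ALU", "LU", "SU"


def can_dual_issue(first: Instr, second: Instr, free_fus: Dict[str, int]) -> bool:
    """Return True if `second` may issue in the same cycle as `first`."""
    # RAW inside the bundle: second reads a register that first writes.
    if first.dest is not None and first.dest in second.srcs:
        return False
    # WAW inside the bundle (assumption: no renaming in this sketch).
    if first.dest is not None and first.dest == second.dest:
        return False
    # FU hazard: count the units each class needs and compare with what is free.
    needed: Dict[str, int] = {}
    for ins in (first, second):
        needed[ins.fu] = needed.get(ins.fu, 0) + 1
    return all(free_fus.get(fu, 0) >= n for fu, n in needed.items())
```

With one load unit, two loads cannot pair even though they are independent; with an ALU free, a load and an unrelated add can. Real hardware performs these comparisons in parallel for every pair in the bundle, which is why the checking cost grows quickly with issue width.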
Instruction Scheduling
Wakeup and Selection
Two operations govern dynamic multi-issue:
- Wakeup — broadcasts the tags of parent instructions that have been selected. Dependent instructions match their source tags and determine whether operands are ready. Resolves RAW data dependencies.
- Selection — picks instructions from the pool of ready instructions to issue, respecting resource limits (issue bandwidth, available FUs, memory ports).
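The two operations can be sketched as a pair of functions over an issue buffer. This is a simplified model under stated assumptions: the `Entry` record and the tag names are hypothetical, selection is oldest-first, and wakeup happens as soon as the parent is selected, ignoring execution latency.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Entry:
    """Hypothetical issue-buffer entry for this sketch."""
    name: str
    dest_tag: str          # tag broadcast when this entry is selected
    waiting: Set[str]      # source tags that are not ready yet
    fu: str                # functional-unit class this entry needs


def wakeup(buffer: List[Entry], broadcast_tags: Set[str]) -> None:
    """Clear every waiting source tag that matches a broadcast tag."""
    for entry in buffer:
        entry.waiting -= broadcast_tags


def select(buffer: List[Entry], issue_width: int,
           free_fus: Dict[str, int]) -> List[Entry]:
    """Pick ready entries, oldest first, within issue width and FU limits."""
    fus = dict(free_fus)   # local copy so the caller's counts are untouched
    chosen: List[Entry] = []
    for entry in buffer:   # buffer is assumed to be kept in program order
        if len(chosen) == issue_width:
            break
        if not entry.waiting and fus.get(entry.fu, 0) > 0:
            fus[entry.fu] -= 1
            chosen.append(entry)
    for entry in chosen:
        buffer.remove(entry)
    return chosen
```

A scheduler loop would call `select` each cycle and then `wakeup` with the `dest_tag` of every selected entry, so that dependents become eligible in a later cycle.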
Compiler Support
Superscalar performance depends on the compiler exposing enough independent instructions per cycle. Loop unrolling is the primary technique:
# Before unrolling (one dependent lw -> add -> sw chain per iteration)
Loop:
lw x31, 0(x20)
add x31, x31, x21
sw x31, 0(x20)
addi x20, x20, -4
blt x22, x20, Loop
# After unrolling by 4
Loop:
lw x28, 0(x20)
lw x29, -4(x20)
lw x30, -8(x20)
lw x31, -12(x20)
add x28, x28, x21
add x29, x29, x21
add x30, x30, x21
add x31, x31, x21
sw x28, 0(x20)
sw x29, -4(x20)
sw x30, -8(x20)
sw x31, -12(x20)
addi x20, x20, -16
blt x22, x20, Loop
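Unrolling creates independent work the dual-issue pipeline can exploit in parallel: each of the four lw/add/sw chains uses its own register (x28 to x31), so the chains do not depend on one another, and the loop overhead (addi, blt) is paid once per four elements instead of once per element.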
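The effect of the transformation can be checked with a small behavioural model. The sketch below is an illustration only: the function names are hypothetical, memory is modelled as a list of words indexed by byte address divided by 4, and the unrolled version assumes the trip count is a multiple of 4.

```python
from typing import List


def rolled(mem: List[int], x20: int, x21: int, x22: int) -> List[int]:
    """Model of the original loop: one element per iteration."""
    while True:
        mem[x20 // 4] += x21          # lw / add / sw on 0(x20)
        x20 -= 4                      # addi x20, x20, -4
        if not (x22 < x20):           # blt x22, x20, Loop
            break
    return mem


def unrolled_by_4(mem: List[int], x20: int, x21: int, x22: int) -> List[int]:
    """Model of the unrolled loop: four independent chains per iteration."""
    while True:
        # Four loads into distinct temporaries (x28..x31 in the assembly).
        a, b = mem[x20 // 4], mem[(x20 - 4) // 4]
        c, d = mem[(x20 - 8) // 4], mem[(x20 - 12) // 4]
        # Four independent adds.
        a, b, c, d = a + x21, b + x21, c + x21, d + x21
        # Four stores back to the same addresses.
        mem[x20 // 4], mem[(x20 - 4) // 4] = a, b
        mem[(x20 - 8) // 4], mem[(x20 - 12) // 4] = c, d
        x20 -= 16                     # addi x20, x20, -16
        if not (x22 < x20):           # blt x22, x20, Loop
            break
    return mem
```

For eight elements, the rolled loop executes 8 × 5 = 40 instructions, while the unrolled loop executes 2 × 14 = 28, in addition to exposing four independent chains per iteration.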
Superscalar vs VLIW
A VLIW processor also achieves CPI < 1, but shifts scheduling responsibility to the compiler. Superscalar hardware performs scheduling dynamically at runtime, which requires more complex hardware but avoids recompilation when the binary runs on different processor implementations with different issue widths.
Example
Dual-Issue Superscalar with Unrolled Loop
The 14 instructions from the unrolled loop execute on a dual-issue
superscalar pipeline (IF→IS→IB→RO→EX→WB→CO). Two instructions can be
fetched and issued per cycle.
Key stages: IF (fetch), IS (issue), IB (issue buffer, waiting),
RO (read operands), EX (execute), WB (write-back), CO (commit).
addi x20, x20, -16: Fetched at cycle 1, issues immediately. No dependencies. Completes at cycle 5.
lw chain (4 loads): Each load takes LU for 2 cycles. The second
load of a pair is fetched in the same cycle as the first. Load 4 issues
one cycle after load 3, overlapping their execution.
add chain (4 adds): Each add waits in IB for its corresponding load
(RAW), then issues to ALU. The adds interleave with the stores:
add x28 issues while lw x29 is still in LU.
sw chain (4 stores): Each store waits in IB for its corresponding add
(RAW). Store 1 issues to SU at cycle 8, spends 1 cycle in the store
buffer (SB), and commits at cycle 10. The four stores drain with one
cycle of separation.
blt x22, x20, Loop: Waits for addi (RAW on x20). Issues at
cycle 9, commits at cycle 13.
Result: 14 instructions in 14 cycles (5-cycle ramp-up, 7 effective
cycles). Counting only the 7 effective cycles, CPI = 7/14 = 0.5 and
IPC = 2. The dual-issue pipeline achieves its theoretical peak
throughput because the unrolled code exposes enough ILP.
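The arithmetic behind the result can be sketched in a few lines. The figures (14 instructions, 14 total cycles, 7 effective cycles) are taken from the example; the `amortized_cpi` helper is a hypothetical illustration that assumes each additional loop iteration costs only the 7 effective cycles and that the remaining non-overlapped cycles are paid once, ignoring branch mispredictions and cache misses.

```python
INSTRUCTIONS = 14        # instructions in one pass of the unrolled loop
TOTAL_CYCLES = 14        # cycles for a single pass, as stated in the example
EFFECTIVE_CYCLES = 7     # steady-state cycles per pass, as stated in the example


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction."""
    return cycles / instructions


def amortized_cpi(iterations: int) -> float:
    """CPI over many passes, assuming the one-time overhead is paid once."""
    overhead = TOTAL_CYCLES - EFFECTIVE_CYCLES
    cycles = overhead + EFFECTIVE_CYCLES * iterations
    return cpi(cycles, INSTRUCTIONS * iterations)


steady_state_cpi = cpi(EFFECTIVE_CYCLES, INSTRUCTIONS)   # 0.5
steady_state_ipc = 1 / steady_state_cpi                  # 2.0
```

Under these assumptions a single pass has a CPI of 1.0, and the value approaches the steady-state 0.5 as the number of iterations grows, which is why the peak figure is quoted over the effective cycles.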