Author: Minh Bui @ UCSB
Branch: minh_pipeline
License: BSD 3-Clause — see LICENSE
A cycle-level RTL prototype of a multi-slice, out-of-order RISC-V core inspired by The Sharing Architecture (ASPLOS 2014). Instead of one wide superscalar core, the design is built from several narrower slices that cooperatively rename, issue, execute, and commit a single instruction stream — sharing register state and operands across slices through an explicit on-chip network. The goal is to study how useful throughput scales with slice count, issue width, and inter-slice communication topology once real bottlenecks (branch recovery, cross-slice operand transport, memory ordering, rename synchronization) are modeled in RTL.
The RTL is complete enough to fetch, decode, rename, speculate past branches,
issue out of order, execute, write back, and commit — running bare-metal
riscv-tests and Embench-IoT workloads through a Verilator + cocotb harness and
emitting cycle-level IPC and bottleneck counters.
- High-Level View
- Slice Anatomy
- Register Renaming
- Branch Speculation (Slice-Native)
- Scalar Operand Network
- Memory Ordering
- Execution & Scoreboard
- Key Parameters
- Status
- Building & Testing
- Repository Layout
- License
- References
A single instruction stream is fetched and distributed across NUM_SLICES slices.
Each slice is a 2-way superscalar in-order front end feeding an out-of-order
execution back end. The slices are stitched together by three shared resources:
- a direct rename fabric that keeps the cross-slice register map coherent,
- a scalar operand NoC that carries remote operands, deallocations, and branch/control events as opcode-tagged packets, and
- a global memory-order arbiter that keeps cross-slice loads/stores correctly ordered.
There is no central control oracle. Branch policy, rename coordination, and
operand delivery are all distributed to the slices; top.sv only wires the shared
fabrics, the memory-order arbiter, and performance-counter aggregation. This is a
deliberate design direction — every cross-slice interaction is meant to look like a
packet at a slice boundary, not a privileged top-level wire.
Each slice performs in-order fetch / decode / rename / commit with out-of-order dispatch, execution, and writeback, 2-way superscalar, with speculative branching. The figure below is the canonical reference for one slice and its inter-slice interfaces:
Walking left to right:
| Block | Role |
|---|---|
| Frontend (HPDCACHE-as-I$, IQ, fetch control, bimodal predictor) | Fetches instruction windows, predicts B-type/JAL with a bimodal counter, and steers fetch via a fetch-time BTB. Refetches on redirect. |
| N-way decode / admission | Decodes a fetch window into renameable micro-ops and admits them into the rename cluster. |
| Rename cluster (S1 → S4) | A register-staged pipeline: S1 logical-register rename → S2 cross-slice dependency-mapping broadcast → S3/S4 physical-register rename. Backed by the LRAT and GRAT maps. |
| ISQ | Issue queue holding renamed micro-ops until their operands are ready (scoreboard-style Qj/Qk wakeup). |
| SON connector + credit control | Packs/unpacks remote operand req/resp and control traffic to/from the Scalar Operand NoC. |
| Functional units | Branch, ALU (×NUM_ALU), LDST, plus FPU placeholders. |
| Physical RF | Per-slice physical register file holding the speculative in-flight split of the 32 architectural registers. |
| ROB | In-order reorder buffer; retires instructions and frees old mappings. Emits the global memory-correct-order request. |
| Slice logic & speculation controller | Local program-order maintainer / cross-slice service handler (remote-slice NoC response controller, local mapping order arbiter, frontend realignment) and the speculative branching resolution unit (global in-flight branch tracker, rename replay table, ROB commit controller). |
Renaming is two-level, which is what lets multiple slices own and share a single 32-entry architectural register space:
- GRAT (global RAT): architectural register → {logical_reg, slice_id}. Synchronized across slices so every slice agrees on which slice owns each architectural register.
- LRAT (local RAT): {logical_reg} → local PRF entry. Private to each slice.
A consumer on slice A that needs a value produced on slice B looks up the GRAT
to find the owning slice/logical register, then fetches the physical value over the
Scalar Operand NoC. Rename stays on a direct, specialized fabric
(sharing_rename_fabric.sv) rather than routed through the packet NoC — global
rename is already a dominant serialization point, so it is kept on a fast path while
the NoC experiments focus on operand/writeback/branch traffic.
Correctness across slices is protected by program-order arbitration, dependency holds set at rename-broadcast time, and (historically) generation tags; same-window RAW hazards resolve through the combinational LRAT update path so way 1 sees way 0's mapping in the same cycle.
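The two-level lookup described above can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the RTL: `RenameModel`, `GratEntry`, and the method names are hypothetical, and the sketch ignores the S1 to S4 staging, the rename fabric timing, and speculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GratEntry:
    logical_reg: int  # logical register id inside the owning slice
    slice_id: int     # slice that currently owns the architectural register


class RenameModel:
    """Toy model: one shared GRAT plus one private LRAT per slice (hypothetical names)."""

    def __init__(self, num_slices=2):
        self.grat = {}                                   # arch reg -> GratEntry
        self.lrat = [dict() for _ in range(num_slices)]  # per slice: logical reg -> PRF index

    def rename_dest(self, slice_id, arch_reg, logical_reg, prf_index):
        # A writer on `slice_id` takes ownership of `arch_reg` in the global map
        # and records the physical register in its own private map.
        self.grat[arch_reg] = GratEntry(logical_reg, slice_id)
        self.lrat[slice_id][logical_reg] = prf_index

    def lookup_source(self, consumer_slice, arch_reg):
        # The consumer only consults the GRAT: it learns the owner and the logical
        # register, and whether the operand has to travel over the operand network.
        entry = self.grat[arch_reg]
        is_remote = entry.slice_id != consumer_slice
        return entry.slice_id, entry.logical_reg, is_remote

    def resolve_on_owner(self, owner_slice, logical_reg):
        # Only the owning slice translates logical -> physical through its LRAT.
        return self.lrat[owner_slice][logical_reg]
```

In this model a consumer on slice 0 reading a register last written on slice 1 gets `is_remote=True`, which stands in for issuing an operand request over the Scalar Operand NoC; the physical index never leaves the owning slice.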
Branch handling is distributed to the slices — there is no top-level branch oracle:
- Each slice instantiates its own sharing_branch_control. Branches decoded anywhere are broadcast on the Scalar Operand NoC so every slice sees every branch, and each slice's branch-control reads a synchronous branch bus so the per-slice copies stay in lockstep.
- The frontend predicts B-type / JAL with a bimodal counter and steers fetch with a fetch-time BTB; JALR is effectively unpredicted and resolves at execute.
- Up to BRANCH_DEPTH branches may be in flight per slice (default 2). On a misprediction the slice squashes wrong-path state and redirects fetch; the per-slice branch resolution reorder buffer is sized to BRANCH_DEPTH and emits oldest-first to keep the per-slice branch queues in lockstep.
- Branch packets carry a 3-bit epoch tag (incremented on every squash). A per-slice branch-control drops any predict-redirect whose epoch is stale, which kills the stale-predict clobber that otherwise corrupts deeper speculation.
BRANCH_DEPTH=2 is the committed, fully-passing default. Deeper speculation
(BRANCH_DEPTH=4) is the active IPC frontier and the source of the largest
remaining throughput win.
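The epoch check on branch packets can be illustrated with a short behavioral sketch. It assumes only what the text states (a 3-bit tag, incremented on every squash, stale predict-redirects dropped); the class and method names are hypothetical and the real sharing_branch_control logic may differ.

```python
EPOCH_BITS = 3
EPOCH_MASK = (1 << EPOCH_BITS) - 1  # 3-bit tag wraps modulo 8


class BranchEpochFilter:
    """Hypothetical per-slice epoch filter for predict-redirect packets."""

    def __init__(self):
        self.epoch = 0

    def tag(self):
        # Epoch stamped on packets generated under the current speculation window.
        return self.epoch

    def squash(self):
        # Every squash starts a new epoch; the counter wraps at 2**EPOCH_BITS.
        self.epoch = (self.epoch + 1) & EPOCH_MASK

    def accept(self, packet_epoch):
        # A redirect produced before the latest squash carries an old epoch
        # and is dropped instead of steering fetch.
        return packet_epoch == self.epoch
```

One consequence of a 3-bit tag is that an epoch value repeats after eight squashes, so a scheme like this relies on no packet staying in flight that long. Whether the RTL enforces that bound is not stated here.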
Cross-slice communication is an explicit, packetized Scalar Operand Network (SON), not idealized all-to-all wires:
- sharing_son.sv (slice-local) packs four message classes into opcode-tagged scalar_msg_t packets: operand request, operand response / writeback broadcast, dealloc, and branch/control.
- sharing_scalar_noc.sv is an opaque packet router with two topologies:
  - 0: direct all-to-all (old idealized timing), and
  - 2: staged butterfly of 2×2 switches with valid/ready backpressure and arbitration.
- OPERAND_REQ is absorbed into a small per-source RX FIFO with per-destination credit counters, making the shared channel provably unable to assert "not ready" on any opcode, i.e. transport-level deadlock-free even under deep speculation.
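The credit scheme for operand requests can be sketched as follows. This is a generic credit-based flow-control model, assuming one credit per receiver FIFO slot; `CreditedLink`, its methods, and the FIFO depth are hypothetical and are not taken from sharing_son.sv.

```python
from collections import deque


class CreditedLink:
    """Sender-side credit counter guarding a fixed-depth receiver FIFO (illustrative)."""

    def __init__(self, fifo_depth):
        self.fifo_depth = fifo_depth
        self.credits = fifo_depth   # one credit per free RX FIFO slot
        self.rx_fifo = deque()

    def try_send(self, packet):
        # Without a credit the sender keeps the packet locally, so nothing
        # enters the shared channel that the receiver could not absorb.
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_fifo.append(packet)
        return True

    def drain_one(self):
        # Consuming an entry frees a slot and returns its credit to the sender.
        packet = self.rx_fifo.popleft()
        self.credits += 1
        return packet
```

The invariant is that `len(rx_fifo) + credits == fifo_depth`, so the receiver FIFO can never overflow and never needs to push back on the channel; stalls happen at the source instead.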
Because the SON's latency and contention are parameterized, the design can measure the real cost of non-ideal inter-slice communication as slice count grows — the central thesis experiment.
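To make the scaling question concrete, the sketch below computes stage and switch counts for a textbook butterfly and a destination-tag route through it. It assumes a power-of-two port count, one port per slice, and MSB-first routing; these are properties of the standard topology and may not match sharing_butterfly_network.sv exactly. All function names are hypothetical.

```python
def butterfly_stages(num_ports):
    """Number of 2x2 switch stages in a textbook butterfly with num_ports endpoints."""
    stages = num_ports.bit_length() - 1
    if num_ports < 1 or num_ports != 1 << stages:
        raise ValueError("num_ports must be a power of two")
    return stages


def butterfly_route(dest, num_ports):
    """Destination-tag routing: at stage s, take output 0 or 1 according to
    the s-th most significant bit of the destination port number."""
    stages = butterfly_stages(num_ports)
    return [(dest >> (stages - 1 - s)) & 1 for s in range(stages)]


def switch_count(num_ports):
    """Each stage holds num_ports / 2 two-by-two switches."""
    return butterfly_stages(num_ports) * (num_ports // 2)
```

Under these assumptions the hop count grows as log2 of the slice count (1, 2, 3 stages for 2, 4, 8 slices), while the direct all-to-all option keeps a single hop at the cost of wiring that grows quadratically.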
The memory model is correctness-first. A directed test exposed a real cross-slice bug (a younger store on one slice incorrectly forwarding into an older load on another). The current design uses:
- A distributed store queue (sharing_sq.sv) with byte-granular, age-ordered store-to-load forwarding across slices, and
- A global age-based memory-order arbiter in top.sv that grants one data-memory operation per cycle and blocks younger memory ops behind older pending ones.
This is functionally conservative — it serializes more memory traffic than a real LSQ and stores still write at execute/grant time rather than commit — but it preserves cross-slice ordering for the bare-metal suites and is documented as a known IPC cost.
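Byte-granular, age-ordered forwarding can be illustrated with a small reference model. It assumes each memory operation carries a global program-order age (smaller means older) and that pending stores from all slices are visible as one list; the function name and data layout are hypothetical and do not reflect the sharing_sq.sv interface.

```python
def forward_load(load_age, load_addr, load_size, stores, memory):
    """Return the bytes a load should observe.

    stores: iterable of (age, addr, data_bytes); a smaller age is older in program order.
    memory: dict mapping byte address -> byte value for already-written state.
    """
    result = bytearray()
    for offset in range(load_size):
        byte_addr = load_addr + offset
        best_age, best_byte = None, None
        for age, addr, data in stores:
            if age >= load_age:
                continue  # a younger store must never forward into an older load
            if addr <= byte_addr < addr + len(data):
                # Among older stores covering this byte, the youngest one wins.
                if best_age is None or age > best_age:
                    best_age, best_byte = age, data[byte_addr - addr]
        if best_age is not None:
            result.append(best_byte)
        else:
            result.append(memory.get(byte_addr, 0))
    return bytes(result)
```

The `age >= load_age` check is the condition the directed test described above exercises: without it, a younger store on another slice could supply data to an older load.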
The back end is scoreboard-driven (CDC 6600 lineage), not a fixed in-order pipeline. Renamed micro-ops sit in the ISQ; a producer's writeback wakes waiting consumers via Qj/Qk producer tracking. Each functional unit walks:
IDLE → READ_OPS → EXEC → WRITEBACK → (back to IDLE)
The scoreboard tracks the four classic hazards:
- RAW (true dependency): consumer waits on Qj/Qk until the producing FU writes back, including cross-slice via the SON.
- WAW: masked by rename (each write gets a fresh physical destination).
- WAR: a writeback stalls until in-flight readers of the destination have consumed it.
- Structural: issue stalls when no FU of the required type is free.
Writeback marks a physical value ready; commit (ROB retirement) is what makes architectural state non-speculative and frees old mappings.
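The Qj/Qk wakeup can be sketched as a tag broadcast over the issue queue. This is a generic scoreboard-style model, assuming each waiting operand records the tag of its producer; the class, field, and tag names are hypothetical rather than taken from sharing_scoreboard.sv.

```python
class IssueEntry:
    """One issue-queue entry; qj/qk hold the producer tag an operand waits on, or None."""

    def __init__(self, name, qj=None, qk=None):
        self.name = name
        self.qj = qj
        self.qk = qk

    @property
    def ready(self):
        # An entry may leave for READ_OPS only when neither operand is pending.
        return self.qj is None and self.qk is None


def writeback(isq, producer_tag):
    """Broadcast a producer tag, clear matching Qj/Qk fields, and
    return the names of entries that became ready on this writeback."""
    woken = []
    for entry in isq:
        was_ready = entry.ready
        if entry.qj == producer_tag:
            entry.qj = None
        if entry.qk == producer_tag:
            entry.qk = None
        if entry.ready and not was_ready:
            woken.append(entry.name)
    return woken
```

An entry waiting on two producers wakes only after both tags have been broadcast, which is the RAW behavior listed above; a remote producer would deliver the same tag through the SON instead of a local writeback bus.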
Parameters are set at compile time in src/rtl/sharing_pkg.sv (defaults shown; NUM_SLICES is sweepable via SHARING_NUM_SLICES):
| Parameter | Default | Meaning |
|---|---|---|
| NUM_SLICES | 2 | Slices (swept 1 / 2 / 4 / 8) |
| NUM_ISSUES / NUM_WAYS | 2 | Superscalar issue width per slice |
| ARCH_REGS | 32 | Architectural registers (RV32I/RV64I) |
| NUM_PRF / LREGS_PER_SLICE | 64 | Physical / logical registers per slice |
| GLOBAL_REGS | 128 | LREGS_PER_SLICE × NUM_SLICES |
| NUM_ALU | 4 | Integer ALUs per slice |
| NUM_FPU | 2 | FPU placeholders |
| NUM_LDST | 1 | Load/store units |
| NUM_BRANCH | 1 | Branch units |
| NUM_FU | 8 | Total FUs per slice (ALU [0:3], FPU [4:5], LDST [6], BRANCH [7]) |
| BRANCH_DEPTH | 2 | In-flight speculative branches per slice |
| ROB_ENTRIES | 32 | Reorder buffer depth |
| IQ_DEPTH | 32 | Issue queue depth |
| MEM_DATA_WIDTH | 128 | Fetch/memory line width (bits) |
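The derived entries in the table follow from the base parameters. The sketch below recomputes them in Python as a cross-check; it assumes the FU index ranges are assigned contiguously in the order ALU, FPU, LDST, BRANCH, as the NUM_FU row suggests. The function name and dictionary keys are illustrative, and sharing_pkg.sv remains the source of truth.

```python
def derive_params(num_slices=2, lregs_per_slice=64,
                  num_alu=4, num_fpu=2, num_ldst=1, num_branch=1):
    """Recompute GLOBAL_REGS, NUM_FU, and per-type FU index ranges from base parameters."""
    fu_ranges = {}
    base = 0
    for name, count in (("ALU", num_alu), ("FPU", num_fpu),
                        ("LDST", num_ldst), ("BRANCH", num_branch)):
        fu_ranges[name] = range(base, base + count)  # contiguous index block per FU type
        base += count
    return {
        "GLOBAL_REGS": lregs_per_slice * num_slices,
        "NUM_FU": base,
        "FU_RANGES": fu_ranges,
    }
```

With the defaults this reproduces the table (GLOBAL_REGS 128, NUM_FU 8). Note that GLOBAL_REGS depends on the slice count, so under this formula an 8-slice sweep would give 512.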
- RV32UI riscv-tests pass 41/41 at the default NUM_SLICES=2, NUM_WAYS=2, BRANCH_DEPTH=2 configuration; the matrix also runs across NUM_SLICES=1/2/4/8. fence_i is intentionally skipped (the harness has split instruction/data memories).
- RV64UI integer subset brought up across the slice matrix.
- Embench-IoT bare-metal workloads run through the same tohost harness.
- IPC and bottleneck counters are emitted as CSV and plotted; recent fetch/ALU work (I-cache line, 128-bit memory, NUM_ALU=4) pushed hot-loop IPC to ~1.32.
- Preliminary Yosys / OpenROAD synthesis flow under synth/ (one-slice RV32 elaboration with full HPDC RTL passes; HPDC-blackboxed synthesis runs).
Not yet implemented: the M/A extensions, privileged mode / CSRs / traps,
MMU/TLB, interrupts, fence_i coherence, and a precise commit-time LSQ. The
project is an experimental microarchitecture platform, not a Linux-capable or
fully ISA-compliant core.
Requires cocotb, Verilator, and a bare-metal RISC-V GCC (riscv64-unknown-elf-gcc) on PATH. Initialize submodules first:

    git submodule update --init --recursive
    command -v riscv64-unknown-elf-gcc

If your compiler has a different name, pass RISCV_CC=/path/to/gcc to the commands below.
Run a single riscv-test:

    cd tests/unit
    make -C full_design -s run-riscv-hex \
      SIM=verilator TRACE=0 \
      NUM_SLICES=2 NUM_WAYS=2 \
      RISCV_TEST=rv32ui/add \
      WB_BCAST_TOPOLOGY=2 DEALLOC_NET_TOPOLOGY=2 RENAME_NET_TOPOLOGY=0 \
      RISCV_TIMEOUT_CYCLES=20000

Run the riscv-tests matrix across slice counts:

    cd tests/unit
    python3 full_design/tools/run_riscv_matrix.py --suite rv32ui --slices 1 2 4 8
    python3 full_design/tools/run_riscv_matrix.py --suite rv64ui --slices 1 2 4 8

Each row writes <suite>-summary.csv (pass/fail), <suite>-ipc.csv (cycles / retired / IPC / branch / stall counters), and IPC plots.
Run an Embench-IoT benchmark:

    cd tests/unit
    make -C full_design -s run-embench-hex \
      SIM=verilator TRACE=0 \
      NUM_SLICES=2 NUM_WAYS=2 \
      EMBENCH_BENCH=crc32 \
      EMBENCH_LOCAL_SCALE_FACTOR_OVERRIDE=1 \
      EMBENCH_TIMEOUT_CYCLES=2000000

| Knob | Effect |
|---|---|
| NUM_SLICES=1\|2\|4\|8 | Slice count under test |
| NUM_WAYS=1\|2 | Superscalar width |
| TRACE=0\|1 | FST waveform off/on |
| WB_BCAST_TOPOLOGY / SCALAR_NOC_TOPOLOGY | 0 direct, 2 butterfly |
| RISCV_METRICS_CSV=out.csv | Dump IPC/stall counters |
| RISCV_TIMEOUT_CYCLES=N | Per-test cycle budget |
| SIM_BUILD=dir | Per-config build dir (use a distinct one per -D change) |
Build-cache caveat: a given Verilator SIM_BUILD is keyed by directory, not by the -D macros. Use a distinct SIM_BUILD per configuration or a sweep will silently reuse the wrong binary.
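One way to follow the caveat in a sweep script is to derive the build directory from the configuration itself. The helper below is a minimal sketch, not part of the repository: `sim_build_dir`, `make_args`, and the `sim_build` root are hypothetical, and only the NUM_SLICES, NUM_WAYS, and SIM_BUILD knobs come from the table above.

```python
def sim_build_dir(config, root="sim_build"):
    """Build a directory name that changes whenever any knob value changes."""
    # Sorting the keys makes the name independent of dict insertion order.
    parts = [f"{key}-{config[key]}" for key in sorted(config)]
    return root + "/" + "__".join(parts)


def make_args(config, root="sim_build"):
    """Turn a knob dictionary into KEY=VALUE arguments for make, plus a matching SIM_BUILD."""
    args = [f"{key}={value}" for key, value in sorted(config.items())]
    args.append(f"SIM_BUILD={sim_build_dir(config, root)}")
    return args
```

For example, `make_args({"NUM_SLICES": 4, "NUM_WAYS": 1})` yields arguments that could be appended to the `make -C full_design` invocations above, so two configurations never share a Verilator build directory. Every knob that affects the `-D` macros needs to be in the dictionary for this to hold.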
src/
├── top.sv # Slice instances + shared fabrics, memory-order arbiter, counters
└── rtl/
├── sharing_pkg.sv # Parameters and types
├── sharing_slice.sv # The slice: frontend→rename→issue→execute→commit
├── sharing_branch_control.sv # Per-slice branch policy (slice-native)
├── sharing_son.sv # Scalar Operand Network connector (packetize/depacketize)
├── sharing_sq.sv # Distributed store queue (cross-slice forwarding)
├── sharing_rename_fabric.sv # Direct cross-slice rename fabric
├── sharing_scoreboard.sv # CDC 6600-style issue/wakeup
├── sharing_rob.sv # In-order commit / deallocation
├── sharing_regfile.sv # Physical register file
├── sharing_alu.sv # RV32I/RV64I integer ALU
├── sharing_icache.sv # HPDCACHE-as-I$ wrapper
├── decode_stage.sv # Per-way decoder
├── frontend/
│ ├── sharing_frontend_pipelined.sv # Fetch + bimodal BP + BTB steer (current)
│ └── sharing_frontend.sv # Earlier non-pipelined frontend
└── interconnect/
├── sharing_scalar_noc.sv # Opaque packet router
├── sharing_butterfly_network.sv # Staged butterfly topology
└── sharing_switch2x2.sv # 2×2 switch with arbitration
tests/unit/full_design/ # riscv-tests / Embench harness, metrics, IPC plotting
synth/ # Yosys + OpenROAD (Docker) synthesis flow
docs/ # Design plans (rename, scoreboard, NoC, IQ)
thesis/ # Thesis text + figures
This project is licensed under the BSD 3-Clause License. See the
LICENSE file for the full text. You are free to use, modify, and
redistribute it (including in proprietary/closed-source and commercial work)
provided you retain the copyright notice and disclaimer and do not use the
author's name to endorse derived products. The project's own RTL carries an
// SPDX-License-Identifier: BSD-3-Clause header.
A few in-tree files retain their upstream permissive licenses (all compatible with BSD-3-Clause) and are not relicensed:
- src/include/riscv_pkg.sv: ETH Zurich / University of Bologna, Solderpad Hardware License v0.51.
- src/rtl/common/sharing_fxarb.sv, src/rtl/common/sharing_prio_encoder.sv: CEA, Apache-2.0 WITH SHL-2.1.
The riscv-tests, embench-iot, cv-hpdcache, and OpenROAD submodules likewise
carry their own upstream licenses.
- The Sharing Architecture: Y. Zhou and D. Wentzlaff, The Sharing Architecture: Sub-Core Configurability for IaaS Clouds (ASPLOS 2014).
- CDC 6600 Scoreboard: J. E. Thornton, Parallel Operation in the Control Data 6600 (1964).
- Design notes under docs/, CONTEXT.md, and ROADMAP.md.

