UCSBarchlab/sharing-arch

Sharing Architecture RISC-V Core

Author: Minh Bui @ UCSB
Branch: minh_pipeline
License: BSD 3-Clause (see LICENSE)

A cycle-level RTL prototype of a multi-slice, out-of-order RISC-V core inspired by The Sharing Architecture (ASPLOS 2014). Instead of one wide superscalar core, the design is built from several narrower slices that cooperatively rename, issue, execute, and commit a single instruction stream — sharing register state and operands across slices through an explicit on-chip network. The goal is to study how useful throughput scales with slice count, issue width, and inter-slice communication topology once real bottlenecks (branch recovery, cross-slice operand transport, memory ordering, rename synchronization) are modeled in RTL.

The RTL is complete enough to fetch, decode, rename, speculate past branches, issue out of order, execute, write back, and commit — running bare-metal riscv-tests and Embench-IoT workloads through a Verilator + cocotb harness and emitting cycle-level IPC and bottleneck counters.


Table of Contents

  1. High-Level View
  2. Slice Anatomy
  3. Register Renaming
  4. Branch Speculation (Slice-Native)
  5. Scalar Operand Network
  6. Memory Ordering
  7. Execution & Scoreboard
  8. Key Parameters
  9. Status
  10. Building & Testing
  11. Repository Layout
  12. License
  13. References

High-Level View

A single instruction stream is fetched and distributed across NUM_SLICES slices. Each slice is a 2-way superscalar in-order front end feeding an out-of-order execution back end. The slices are stitched together by three shared resources:

  • a direct rename fabric that keeps the cross-slice register map coherent,
  • a scalar operand NoC that carries remote operands, deallocations, and branch/control events as opcode-tagged packets, and
  • a global memory-order arbiter that keeps cross-slice loads/stores correctly ordered.

[Figure: Top-level integration]

There is no central control oracle. Branch policy, rename coordination, and operand delivery are all distributed to the slices; top.sv only wires the shared fabrics, the memory-order arbiter, and performance-counter aggregation. This is a deliberate design direction — every cross-slice interaction is meant to look like a packet at a slice boundary, not a privileged top-level wire.


Slice Anatomy

Each slice is 2-way superscalar with speculative branching: fetch, decode, rename, and commit are in order, while dispatch, execution, and writeback are out of order. The figure below is the canonical reference for one slice and its inter-slice interfaces:

[Figure: Slice overview and inter-slice interfaces]

Walking left to right:

| Block | Role |
| --- | --- |
| Frontend (HPDCACHE-as-I$, IQ, fetch control, bimodal predictor) | Fetches instruction windows, predicts B-type/JAL with a bimodal counter, and steers fetch via a fetch-time BTB. Refetches on redirect. |
| N-way decode / admission | Decodes a fetch window into renameable micro-ops and admits them into the rename cluster. |
| Rename cluster (S1 → S4) | A register-staged pipeline: S1 logical-register rename → S2 cross-slice dependency-mapping broadcast → S3/S4 physical-register rename. Backed by the LRAT and GRAT maps. |
| ISQ | Issue queue holding renamed micro-ops until their operands are ready (scoreboard-style Qj/Qk wakeup). |
| SON connector + credit control | Packs/unpacks remote operand req/resp and control traffic to/from the Scalar Operand NoC. |
| Functional units | Branch, ALU (×NUM_ALU), LDST, plus FPU placeholders. |
| Physical RF | Per-slice physical register file holding the speculative in-flight split of the 32 architectural registers. |
| ROB | In-order reorder buffer; retires instructions and frees old mappings. Emits the global memory-correct-order request. |
| Slice logic & speculation controller | Local program-order maintainer / cross-slice service handler (remote-slice NoC response controller, local mapping order arbiter, frontend realignment) and the speculative branching resolution unit (global in-flight branch tracker, rename replay table, ROB commit controller). |

Register Renaming

Renaming is two-level, which is what lets multiple slices own and share a single 32-entry architectural register space:

  1. GRAT (global RAT): architectural register → {logical_reg, slice_id}. Synchronized across slices so every slice agrees on which slice owns each architectural register.
  2. LRAT (local RAT): {logical_reg} → local PRF entry. Private to each slice.

A consumer on slice A that needs a value produced on slice B looks up the GRAT to find the owning slice/logical register, then fetches the physical value over the Scalar Operand NoC. Rename stays on a direct, specialized fabric (sharing_rename_fabric.sv) rather than routed through the packet NoC — global rename is already a dominant serialization point, so it is kept on a fast path while the NoC experiments focus on operand/writeback/branch traffic.

Correctness across slices is protected by program-order arbitration, dependency holds set at rename-broadcast time, and (historically) generation tags; same-window RAW hazards resolve through the combinational LRAT update path so way 1 sees way 0's mapping in the same cycle.
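
The two-level lookup described above can be sketched as a small behavioral model in Python. This is an illustration only, not the RTL: the class names, the dictionary-based tables, and the `read_operand` helper are all hypothetical, and the remote fetch is modeled as a direct read rather than a NoC round trip.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class GlobalMapping:
    """One GRAT entry: which slice owns the register, and under which logical name."""
    logical_reg: int
    slice_id: int

@dataclass
class Slice:
    """Hypothetical per-slice state: a private LRAT and a private PRF."""
    slice_id: int
    lrat: dict = field(default_factory=dict)  # logical_reg -> PRF index
    prf: dict = field(default_factory=dict)   # PRF index -> value

def read_operand(arch_reg, consumer, grat, slices):
    """Resolve an architectural register; return (value, went_remote)."""
    mapping = grat[arch_reg]                    # level 1: GRAT names the owner
    owner = slices[mapping.slice_id]
    prf_index = owner.lrat[mapping.logical_reg]  # level 2: owner's private LRAT
    went_remote = owner.slice_id != consumer.slice_id
    return owner.prf[prf_index], went_remote

slices = [Slice(0), Slice(1)]
grat = {5: GlobalMapping(logical_reg=12, slice_id=1)}  # x5 lives on slice 1
slices[1].lrat[12] = 40
slices[1].prf[40] = 0xCAFE

value, remote = read_operand(5, slices[0], grat, slices)
print(hex(value), remote)  # 0xcafe True
```

In the model, only the GRAT has to agree across slices; the LRAT and PRF indices never leave the owning slice, which is why a remote consumer needs an operand request rather than a direct PRF read.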


Branch Speculation (Slice-Native)

Branch handling is distributed to the slices — there is no top-level branch oracle:

  • Each slice instantiates its own sharing_branch_control. Branches decoded anywhere are broadcast on the Scalar Operand NoC so every slice sees every branch, and each slice's branch-control reads a synchronous branch bus so the per-slice copies stay in lockstep.
  • The frontend predicts B-type / JAL with a bimodal counter and steers fetch with a fetch-time BTB; JALR is effectively unpredicted and resolves at execute.
  • Up to BRANCH_DEPTH branches may be in flight per slice (default 2). On a misprediction the slice squashes wrong-path state and redirects fetch; the per-slice branch resolution reorder buffer is sized to BRANCH_DEPTH and emits oldest-first to keep the per-slice branch queues in lockstep.
  • Branch packets carry a 3-bit epoch tag (incremented on every squash). Each slice's branch control drops any predict-redirect whose epoch is stale, which eliminates the stale-predict clobber that otherwise corrupts deeper speculation.

BRANCH_DEPTH=2 is the committed, fully-passing default. Deeper speculation (BRANCH_DEPTH=4) is the active IPC frontier and the source of the largest remaining throughput win.
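
The epoch filter can be illustrated with a minimal Python model. It assumes that "stale" means "not equal to the slice's current epoch" and that the epoch wraps modulo 8; the class and method names are hypothetical and do not mirror sharing_branch_control.sv.

```python
EPOCH_BITS = 3
EPOCH_MASK = (1 << EPOCH_BITS) - 1

class BranchControl:
    """Hypothetical per-slice view of the current speculation epoch."""

    def __init__(self):
        self.epoch = 0
        self.fetch_pc = 0

    def squash(self, redirect_pc):
        # Every squash opens a new epoch (3-bit counter, wraps at 8).
        self.epoch = (self.epoch + 1) & EPOCH_MASK
        self.fetch_pc = redirect_pc

    def predict_redirect(self, packet_epoch, target_pc):
        # Assumption: a packet is stale when its tag differs from the current epoch.
        if packet_epoch != self.epoch:
            return False  # dropped: it was issued before the latest squash
        self.fetch_pc = target_pc
        return True

bc = BranchControl()
stale_epoch = bc.epoch
bc.squash(0x100)                                # misprediction recovery
print(bc.predict_redirect(stale_epoch, 0x200))  # False
print(hex(bc.fetch_pc))                         # 0x100
```

One property of this sketch is that a 3-bit tag aliases after eight squashes, so it only works if no predict-redirect can stay in flight that long.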


Scalar Operand Network

Cross-slice communication is an explicit, packetized Scalar Operand Network (SON), not idealized all-to-all wires:

  • sharing_son.sv (slice-local) packs four message classes into opcode-tagged scalar_msg_t packets: operand request, operand response / writeback broadcast, dealloc, and branch/control.
  • sharing_scalar_noc.sv is an opaque packet router with two topologies:
    • 0: direct all-to-all (old idealized timing), and
    • 2: staged butterfly of 2×2 switches with valid/ready backpressure and arbitration.
  • OPERAND_REQ is absorbed into a small per-source RX FIFO with per-destination credit counters, making the shared channel provably unable to assert "not ready" on any opcode — i.e. transport-level deadlock-free even under deep speculation.
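
The credit mechanism can be sketched in Python as follows. This is a generic credit-based flow-control model, not the project's RTL: the class name and FIFO depth are hypothetical, and link latency for packets and returned credits is ignored.

```python
from collections import deque

class CreditedLink:
    """One source-to-destination channel backed by a fixed-depth RX FIFO."""

    def __init__(self, depth):
        self.depth = depth
        self.credits = depth      # sender-side counter, one credit per FIFO slot
        self.rx_fifo = deque()

    def try_send(self, packet):
        if self.credits == 0:
            return False          # sender holds the packet; receiver never refuses
        self.credits -= 1
        self.rx_fifo.append(packet)
        assert len(self.rx_fifo) <= self.depth  # invariant: FIFO cannot overflow
        return True

    def consume(self):
        packet = self.rx_fifo.popleft()
        self.credits += 1         # freeing a slot returns a credit to the sender
        return packet

link = CreditedLink(depth=2)
print(link.try_send("req0"), link.try_send("req1"), link.try_send("req2"))  # True True False
print(link.consume())         # req0
print(link.try_send("req2"))  # True
```

Because the sender never injects more packets than there are free slots, the receiving side always has room for whatever arrives, which is the property that lets the shared channel stay "ready".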

Because the SON's latency and contention are parameterized, the design can measure the real cost of non-ideal inter-slice communication as slice count grows — the central thesis experiment.
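
To give a feel for how butterfly cost grows with slice count, the sketch below uses textbook destination-tag routing: a network of N = 2^k endpoints has k stages, and each 2×2 switch picks its output port from one bit of the destination ID. The source does not describe the routing function in sharing_butterfly_network.sv, so the function name and the bit ordering here are assumptions.

```python
def butterfly_route(dest, num_slices):
    """Return the output port (0 or 1) chosen at each stage, first stage first."""
    if num_slices < 2 or num_slices & (num_slices - 1):
        raise ValueError("num_slices must be a power of two >= 2")
    stages = num_slices.bit_length() - 1   # log2(num_slices)
    # Assumed ordering: the most significant destination bit steers the first stage.
    return [(dest >> (stages - 1 - s)) & 1 for s in range(stages)]

for n in (2, 4, 8):
    print(n, "slices:", len(butterfly_route(0, n)), "stage(s)")
# 2 slices: 1 stage(s)
# 4 slices: 2 stage(s)
# 8 slices: 3 stage(s)

print(butterfly_route(5, 8))  # [1, 0, 1]
```

Under these assumptions the uncontended hop count grows as log2 of the slice count, while the direct topology stays at one hop; contention and arbitration add to that in the real network.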


Memory Ordering

The memory model is correctness-first. A directed test exposed a real cross-slice bug (a younger store on one slice incorrectly forwarding into an older load on another). The current design uses:

  • A distributed store queue (sharing_sq.sv) with byte-granular, age-ordered store-to-load forwarding across slices, and
  • A global age-based memory-order arbiter in top.sv that grants one data-memory operation per cycle and blocks younger memory ops behind older pending ones.

This is functionally conservative — it serializes more memory traffic than a real LSQ and stores still write at execute/grant time rather than commit — but it preserves cross-slice ordering for the bare-metal suites and is documented as a known IPC cost.
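
Byte-granular, age-ordered forwarding can be illustrated with a short Python model. For each byte of a load, it picks the youngest store that is still older than the load, and falls back to memory otherwise. The `Store` record, the integer age field (smaller means older, no wraparound), and the dictionary-backed memory are hypothetical simplifications, not the sharing_sq.sv interface.

```python
from dataclasses import dataclass

@dataclass
class Store:
    age: int      # program-order age; smaller is older (wraparound ignored)
    addr: int
    data: bytes

def forward_load(load_age, addr, size, stores, memory):
    """Assemble a load byte by byte from older stores, else from memory."""
    result = bytearray()
    for byte_addr in range(addr, addr + size):
        best = None
        for st in stores:
            covers = st.addr <= byte_addr < st.addr + len(st.data)
            older = st.age < load_age          # younger stores must never forward
            if covers and older and (best is None or st.age > best.age):
                best = st                      # keep the youngest qualifying store
        if best is not None:
            result.append(best.data[byte_addr - best.addr])
        else:
            result.append(memory.get(byte_addr, 0))
    return bytes(result)

memory = {0x10: 0xAA, 0x11: 0xBB, 0x12: 0xCC, 0x13: 0xDD}
stores = [
    Store(age=1, addr=0x10, data=b"\x01\x02"),
    Store(age=3, addr=0x11, data=b"\x09"),
    Store(age=7, addr=0x12, data=b"\xff"),   # younger than the load below
]
print(forward_load(5, 0x10, 4, stores, memory).hex())  # 0109ccdd
```

The age-7 store is ignored even though it overlaps the load, which is exactly the case the directed test caught: a younger store forwarding into an older load.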


Execution & Scoreboard

The back end is scoreboard-driven (CDC 6600 lineage), not a fixed in-order pipeline. Renamed micro-ops sit in the ISQ; a producer's writeback wakes waiting consumers via Qj/Qk producer tracking. Each functional unit walks:

IDLE → READ_OPS → EXEC → WRITEBACK → (back to IDLE)

The scoreboard tracks the four classic hazards:

  • RAW (true dependency) — consumer waits on Qj/Qk until the producing FU writes back, including cross-slice via the SON.
  • WAW — masked by rename (each write gets a fresh physical destination).
  • WAR — a writeback stalls until in-flight readers of the destination have consumed it.
  • Structural — issue stalls when no FU of the required type is free.

Writeback marks a physical value ready; commit (ROB retirement) is what makes architectural state non-speculative and frees old mappings.
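
The Qj/Qk wakeup for RAW hazards can be sketched in Python. The model assumes Qj/Qk hold the physical tag of the pending producer (or None once the operand is ready); the source only says "producer tracking", so the tag encoding and all names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IsqEntry:
    name: str
    qj: Optional[int]   # tag of the producer for operand j, None when ready
    qk: Optional[int]   # tag of the producer for operand k, None when ready

    @property
    def ready(self):
        return self.qj is None and self.qk is None

def writeback(isq, tag):
    """Broadcast a completed tag; return the names of entries it just woke."""
    woken = []
    for entry in isq:
        was_ready = entry.ready
        if entry.qj == tag:
            entry.qj = None
        if entry.qk == tag:
            entry.qk = None
        if entry.ready and not was_ready:
            woken.append(entry.name)
    return woken

isq = [
    IsqEntry("add", qj=10, qk=None),
    IsqEntry("sub", qj=10, qk=11),
    IsqEntry("xor", qj=None, qk=None),   # already issuable
]
print(writeback(isq, 10))  # ['add']
print(writeback(isq, 11))  # ['sub']
```

In the real design a cross-slice producer would reach this broadcast through the SON rather than a local bus, but the wakeup condition is the same.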


Key Parameters

Compile-time, in src/rtl/sharing_pkg.sv (defaults shown; NUM_SLICES is sweepable via SHARING_NUM_SLICES):

| Parameter | Default | Meaning |
| --- | --- | --- |
| NUM_SLICES | 2 | Slices (swept 1 / 2 / 4 / 8) |
| NUM_ISSUES / NUM_WAYS | 2 | Superscalar issue width per slice |
| ARCH_REGS | 32 | Architectural registers (RV32I/RV64I) |
| NUM_PRF / LREGS_PER_SLICE | 64 | Physical / logical registers per slice |
| GLOBAL_REGS | 128 | LREGS_PER_SLICE × NUM_SLICES |
| NUM_ALU | 4 | Integer ALUs per slice |
| NUM_FPU | 2 | FPU placeholders |
| NUM_LDST | 1 | Load/store units |
| NUM_BRANCH | 1 | Branch units |
| NUM_FU | 8 | Total FUs per slice (ALU [0:3], FPU [4:5], LDST [6], BRANCH [7]) |
| BRANCH_DEPTH | 2 | In-flight speculative branches per slice |
| ROB_ENTRIES | 32 | Reorder buffer depth |
| IQ_DEPTH | 32 | Issue queue depth |
| MEM_DATA_WIDTH | 128 | Fetch/memory line width |
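
The two derived defaults in the table follow from the others. The Python check below mirrors that arithmetic; it assumes GLOBAL_REGS scales with the swept slice count exactly as the table's formula states, and the `derived` helper is hypothetical rather than part of sharing_pkg.sv.

```python
# Defaults copied from the table above.
LREGS_PER_SLICE = 64
NUM_ALU, NUM_FPU, NUM_LDST, NUM_BRANCH = 4, 2, 1, 1

def derived(num_slices):
    """Recompute the derived parameters for a given slice count."""
    return {
        "GLOBAL_REGS": LREGS_PER_SLICE * num_slices,
        "NUM_FU": NUM_ALU + NUM_FPU + NUM_LDST + NUM_BRANCH,
    }

print(derived(2))  # {'GLOBAL_REGS': 128, 'NUM_FU': 8}
for n in (1, 4, 8):
    print(n, derived(n)["GLOBAL_REGS"])  # 1 64, then 4 256, then 8 512
```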

Status

  • RV32UI riscv-tests pass 41/41 at the default NUM_SLICES=2, NUM_WAYS=2, BRANCH_DEPTH=2 configuration; the matrix also runs across NUM_SLICES=1/2/4/8. fence_i is intentionally skipped (the harness has split instruction/data memories).
  • RV64UI integer subset brought up across the slice matrix.
  • Embench-IoT bare-metal workloads run through the same tohost harness.
  • IPC and bottleneck counters are emitted as CSV and plotted; recent fetch/ALU work (I-cache line, 128-bit memory, NUM_ALU=4) pushed hot-loop IPC to ~1.32.
  • Preliminary Yosys / OpenROAD synthesis flow under synth/ (one-slice RV32 elaboration with full HPDC RTL passes; HPDC-blackboxed synthesis runs).

Not yet implemented: the M/A extensions, privileged mode / CSRs / traps, MMU/TLB, interrupts, fence_i coherence, and a precise commit-time LSQ. The project is an experimental microarchitecture platform, not a Linux-capable or fully ISA-compliant core.


Building & Testing

Prerequisites

cocotb, Verilator, and a bare-metal RISC-V GCC (riscv64-unknown-elf-gcc) on PATH. Initialize submodules first:

git submodule update --init --recursive
command -v riscv64-unknown-elf-gcc

If your compiler has a different name, pass RISCV_CC=/path/to/gcc to the commands below.

Run a single riscv-tests program

cd tests/unit
make -C full_design -s run-riscv-hex \
  SIM=verilator TRACE=0 \
  NUM_SLICES=2 NUM_WAYS=2 \
  RISCV_TEST=rv32ui/add \
  WB_BCAST_TOPOLOGY=2 DEALLOC_NET_TOPOLOGY=2 RENAME_NET_TOPOLOGY=0 \
  RISCV_TIMEOUT_CYCLES=20000

Run the full RV32UI / RV64UI matrix

cd tests/unit
python3 full_design/tools/run_riscv_matrix.py --suite rv32ui --slices 1 2 4 8
python3 full_design/tools/run_riscv_matrix.py --suite rv64ui --slices 1 2 4 8

Each row writes <suite>-summary.csv (pass/fail), <suite>-ipc.csv (cycles / retired / IPC / branch / stall counters), and IPC plots.

Run an Embench-IoT workload

cd tests/unit
make -C full_design -s run-embench-hex \
  SIM=verilator TRACE=0 \
  NUM_SLICES=2 NUM_WAYS=2 \
  EMBENCH_BENCH=crc32 \
  EMBENCH_LOCAL_SCALE_FACTOR_OVERRIDE=1 \
  EMBENCH_TIMEOUT_CYCLES=2000000

Useful knobs

| Knob | Effect |
| --- | --- |
| NUM_SLICES=1\|2\|4\|8 | Slice count under test |
| NUM_WAYS=1\|2 | Superscalar width |
| TRACE=0\|1 | FST waveform off/on |
| WB_BCAST_TOPOLOGY / SCALAR_NOC_TOPOLOGY | 0 = direct, 2 = butterfly |
| RISCV_METRICS_CSV=out.csv | Dump IPC/stall counters |
| RISCV_TIMEOUT_CYCLES=N | Per-test cycle budget |
| SIM_BUILD=dir | Per-config build dir (use a distinct one per -D change) |

Build-cache caveat: a given Verilator SIM_BUILD is keyed by directory, not by the -D macros. Use a distinct SIM_BUILD per configuration or a sweep will silently reuse the wrong binary.
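
One way to follow this rule in a sweep is to derive SIM_BUILD from the configuration itself. The Python sketch below only builds the command lines (it does not invoke make); it reuses the make variables shown earlier, while the `sim_build_s<N>_w<W>` directory naming and the `sweep_commands` helper are hypothetical.

```python
import itertools
import shlex

def sweep_commands(slices=(1, 2, 4, 8), ways=(1, 2), test="rv32ui/add"):
    """Return one make command per configuration, each with its own SIM_BUILD."""
    commands = []
    for n, w in itertools.product(slices, ways):
        sim_build = f"sim_build_s{n}_w{w}"   # hypothetical naming scheme
        args = [
            "make", "-C", "full_design", "-s", "run-riscv-hex",
            "SIM=verilator", "TRACE=0",
            f"NUM_SLICES={n}", f"NUM_WAYS={w}",
            f"RISCV_TEST={test}",
            f"SIM_BUILD={sim_build}",
        ]
        commands.append(shlex.join(args))
    return commands

for cmd in sweep_commands(slices=(1, 2), ways=(2,)):
    print(cmd)
```

Any other knob that changes a -D macro (for example a topology setting) would need to be folded into the directory name in the same way.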


Repository Layout

src/
├── top.sv                          # Slice instances + shared fabrics, memory-order arbiter, counters
└── rtl/
    ├── sharing_pkg.sv              # Parameters and types
    ├── sharing_slice.sv            # The slice: frontend→rename→issue→execute→commit
    ├── sharing_branch_control.sv   # Per-slice branch policy (slice-native)
    ├── sharing_son.sv              # Scalar Operand Network connector (packetize/depacketize)
    ├── sharing_sq.sv               # Distributed store queue (cross-slice forwarding)
    ├── sharing_rename_fabric.sv    # Direct cross-slice rename fabric
    ├── sharing_scoreboard.sv       # CDC 6600-style issue/wakeup
    ├── sharing_rob.sv              # In-order commit / deallocation
    ├── sharing_regfile.sv          # Physical register file
    ├── sharing_alu.sv              # RV32I/RV64I integer ALU
    ├── sharing_icache.sv           # HPDCACHE-as-I$ wrapper
    ├── decode_stage.sv             # Per-way decoder
    ├── frontend/
    │   ├── sharing_frontend_pipelined.sv  # Fetch + bimodal BP + BTB steer (current)
    │   └── sharing_frontend.sv            # Earlier non-pipelined frontend
    └── interconnect/
        ├── sharing_scalar_noc.sv          # Opaque packet router
        ├── sharing_butterfly_network.sv   # Staged butterfly topology
        └── sharing_switch2x2.sv           # 2×2 switch with arbitration

tests/unit/full_design/   # riscv-tests / Embench harness, metrics, IPC plotting
synth/                    # Yosys + OpenROAD (Docker) synthesis flow
docs/                     # Design plans (rename, scoreboard, NoC, IQ)
thesis/                   # Thesis text + figures

License

This project is licensed under the BSD 3-Clause License. See the LICENSE file for the full text. You are free to use, modify, and redistribute it (including in proprietary/closed-source and commercial work) provided you retain the copyright notice and disclaimer and do not use the author's name to endorse derived products. The project's own RTL carries an // SPDX-License-Identifier: BSD-3-Clause header.

A few in-tree files retain their upstream permissive licenses (all compatible with BSD-3-Clause) and are not relicensed:

  • src/include/riscv_pkg.sv — ETH Zurich / University of Bologna, Solderpad Hardware License v0.51.
  • src/rtl/common/sharing_fxarb.sv, src/rtl/common/sharing_prio_encoder.sv — CEA, Apache-2.0 WITH SHL-2.1.

The riscv-tests, embench-iot, cv-hpdcache, and OpenROAD submodules likewise carry their own upstream licenses.


References

  • The Sharing Architecture: Reducing Maintenance Overhead in General-Purpose Architectures via Processor-in-Memory and a Sharing-Based Microarchitecture (ASPLOS 2014).
  • CDC 6600 Scoreboard — J. E. Thornton, Parallel Operation in the Control Data 6600 (1964).
  • Design notes under docs/, CONTEXT.md, and ROADMAP.md.
