nvme: add SR-IOV emulation #3650

Draft
jstarks wants to merge 10 commits into microsoft:main from jstarks:nvme_vf
Conversation

@jstarks jstarks commented Jun 3, 2026

Member

Add optional SR-IOV support to the emulated NVMe controller. When configured, the PF exposes virtual functions that appear as independent NVMe controllers to the guest. Each VF has its own config space, MSI-X, register state, admin/IO queues, and worker tasks. The guest enables VFs through the standard PCIe SR-IOV mechanism, then uses the NVMe Virtualization Management command to bring secondary controllers online and attach namespaces.

The SR-IOV extended capability implementation in pci_core is device-agnostic and reusable for future SR-IOV devices.

VF resources (queues, interrupts) are all private (CRT=0); there is no flexible resource pool. VFs get fixed queue capacity at construction time.

It's best to review this by commit.

Commits

  • pci_core: add SR-IOV extended capability emulator — generic SriovExtendedCapability implementing PciExtendedCapability, with VF BAR probing, control/status registers, callback on VF Enable, and SaveRestore.
  • nvme: add SR-IOV VF config space routing (Phase 2) — PF creates VF instances on VF Enable, routes config space reads/writes to VFs via pci_cfg_read/write_with_routing.
  • nvme/sriov: phase 3 — VF BAR MMIO routing via PF's SR-IOV capability — PF manages VF MMIO intercepts, computes per-VF BAR addresses from the SR-IOV capability, routes MMIO to VF BAR0 and MSI-X handlers.
  • nvme: add SR-IOV admin commands (Phase 4) — Virtualization Management (secondary online/offline), Namespace Attachment, PCC (CNS 0x14), Secondary Controller List (CNS 0x15).
  • nvme: implement VF NVMe controller logic (Phase 5) — each VF is a full NVMe controller sharing the NvmeRegisterIo trait with the PF. VFs read namespace assignments from shared state at CC.EN time.
  • nvme: stall VCPU on VF_Enable clear until VF IOs drain — PollDevice-driven async drain with DeferredWrite completion.
  • nvme: add SR-IOV unit tests — VF lifecycle, config space routing, PF identify, BAR probing, MSE, reset.
  • nvme: self-decode VF BAR addresses for MMIO routing — O(1) address decode via cached base/shift instead of linear intercept scan.
  • nvme: add VF end-to-end IO test — full lifecycle: PF enable → VF enable → secondary online → namespace attach → VF controller enable → IO queue creation → READ command.
  • nvme: wire SR-IOV through resource layer, petri, and VMM tests — NvmeSriovConfig in resource handle, resolver validation, petri helper, guest-visible VMM test with nvme-cli.

Copilot AI review requested due to automatic review settings June 3, 2026 23:53

Copilot AI left a comment

Contributor

Pull request overview

This PR adds SR-IOV emulation support to the NVMe device model, enabling a PF to expose VFs that appear as independent NVMe controllers to the guest. It also introduces a reusable, device-agnostic SR-IOV PCIe extended capability emulator in pci_core, plus unit/integration coverage and resource-layer wiring for the new configuration.

Changes:

  • Implement generic PCIe SR-IOV extended capability emulation (including BAR probing, VF enable/memory decode notifications, and Save/Restore).
  • Extend the emulated NVMe controller to create/manage VF controller instances, route VF config/MMIO/MSI-X, and implement NVMe virtualization management + namespace attachment flows.
  • Add NVMe SR-IOV unit tests and a VMM integration test exercising VF lifecycle and I/O, plus propagate sriov configuration through resource/petri layers.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| vmm_tests/vmm_tests/tests/tests/x86_64/storage.rs | Plumbs new NvmeControllerHandle.sriov field into existing NVMe test device construction. |
| vmm_tests/vmm_tests/tests/tests/x86_64.rs | Updates NVMe handle initialization to include sriov: None. |
| vmm_tests/vmm_tests/tests/tests/multiarch/pcie.rs | Adds NVMe SR-IOV end-to-end guest test using nvme-cli from petritools EROFS. |
| vm/devices/storage/storage_tests/tests/scsidvd_nvme.rs | Updates NVMe test handle to include sriov: None. |
| vm/devices/storage/nvme/src/workers/coordinator.rs | Adds EnableStateKind and a non-blocking poll_drain() to support VF drain via PollDevice. |
| vm/devices/storage/nvme/src/workers/admin.rs | Adds PF/VF controller-ID handling plus SR-IOV admin commands (virt-mgmt + namespace attachment) and Identify CNS 0x14/0x15. |
| vm/devices/storage/nvme/src/workers.rs | Re-exports new SR-IOV/admin-related types and worker state kind. |
| vm/devices/storage/nvme/src/vf.rs | Introduces a VF NVMe controller implementation (config space, MSI-X, BAR handlers, worker lifecycle). |
| vm/devices/storage/nvme/src/tests/sriov_tests.rs | Adds NVMe SR-IOV unit tests (VF lifecycle, routing, MSE, reset, end-to-end VF I/O). |
| vm/devices/storage/nvme/src/tests/controller_tests.rs | Updates controller test setup to pass sriov: None. |
| vm/devices/storage/nvme/src/tests.rs | Registers the new SR-IOV test module. |
| vm/devices/storage/nvme/src/resolver.rs | Adds resource-level SR-IOV config validation and maps it into NVMe device caps. |
| vm/devices/storage/nvme/src/registers.rs | Extracts shared NVMe BAR0 register handling behind a NvmeRegisterIo trait for PF + VF reuse. |
| vm/devices/storage/nvme/src/pci.rs | Adds SR-IOV state management, VF creation/drain, VF BAR address decode and MMIO/config routing in the PF. |
| vm/devices/storage/nvme/src/namespace.rs | Exposes Namespace::disk() to clone disks for VF namespace attachment. |
| vm/devices/storage/nvme/src/lib.rs | Wires new modules and adds shared VF config structures; centralizes NVMe CAP constant. |
| vm/devices/storage/nvme_spec/src/lib.rs | Extends NVMe spec model for virtualization mgmt + namespace attachment and related Identify structures; adds typed Cmic. |
| vm/devices/storage/nvme_resources/src/lib.rs | Adds NvmeSriovConfig to the NVMe resource handle. |
| vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs | Updates test controller caps initialization to include sriov: None. |
| vm/devices/storage/disk_nvme/nvme_driver/fuzz/fuzz_nvme_driver.rs | Updates fuzzer controller caps initialization to include sriov: None. |
| vm/devices/pci/pci_core/src/spec.rs | Adds detailed SR-IOV extended capability register definitions/bitfields and removes older minimal header enum. |
| vm/devices/pci/pci_core/src/capabilities/extended/sriov.rs | Implements the device-agnostic SR-IOV extended capability emulator with Save/Restore and callback hooks. |
| vm/devices/pci/pci_core/src/capabilities/extended/mod.rs | Exposes the new sriov extended capability module. |
| petri/src/vm/openvmm/modify.rs | Adds petri helper to create an SR-IOV-enabled NVMe device and propagates new handle field. |
| petri/src/vm/openvmm/construct.rs | Updates NVMe device construction paths to include sriov: None. |
| openhcl/underhill_core/src/dispatch/vtl2_settings_worker.rs | Updates NVMe controller config creation to include sriov: None. |
| flowey/flowey_hvlite/src/pipelines/vmm_tests_run.rs | Ensures petritools EROFS artifacts are treated as always-available selections. |

Comment thread vm/devices/storage/nvme/src/pci.rs Outdated
Comment thread vmm_tests/vmm_tests/tests/tests/multiarch/pcie.rs
Comment thread vmm_tests/vmm_tests/tests/tests/multiarch/pcie.rs
Comment thread vm/devices/pci/pci_core/src/capabilities/extended/sriov.rs Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 8 comments.

Comment thread vm/devices/storage/nvme/src/pci.rs Outdated
Comment thread vm/devices/storage/nvme/src/pci.rs
Comment thread vm/devices/storage/nvme/src/resolver.rs
Comment thread vm/devices/storage/nvme/src/resolver.rs
Comment thread vm/devices/pci/pci_core/src/capabilities/extended/sriov.rs
Comment thread vm/devices/storage/nvme/src/workers/admin.rs
Comment thread vm/devices/storage/nvme/src/workers/admin.rs
Comment thread vmm_tests/vmm_tests/tests/tests/multiarch/pcie.rs Outdated
jstarks added 8 commits June 4, 2026 10:34

Add a generic, device-agnostic SR-IOV extended capability implementation
to pci_core that can be attached to any PF's extended capability list.
This is the foundation for emulating SR-IOV devices such as NVMe.

The new SriovExtendedCapability implements the PciExtendedCapability
trait and covers the full SR-IOV capability structure per PCIe Base
Specification Section 9.3.3: capabilities, control/status, InitialVFs,
TotalVFs, NumVFs, VF offset/stride, VF Device ID, page sizes, and
VF BARs 0-5 with size probing support.
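
A minimal sketch of how size probing on a 64-bit memory VF BAR can be modeled, assuming a power-of-two aperture. `VfBar` and `probe_size` are hypothetical names for illustration and are not taken from the pci_core implementation.

```rust
/// Hypothetical model of one 64-bit memory VF BAR register pair.
/// `size` is the per-VF aperture and must be a power of two.
pub struct VfBar {
    size: u64,
    value: u64,
}

impl VfBar {
    pub fn new(size: u64) -> Self {
        assert!(size.is_power_of_two() && size >= 16);
        Self { size, value: 0 }
    }

    /// Address bits below the aperture size are hardwired to zero.
    fn addr_mask(&self) -> u64 {
        !(self.size - 1)
    }

    /// Write the low or high dword; only address bits at or above the size stick.
    pub fn write_dword(&mut self, high: bool, data: u32) {
        let shift = if high { 32 } else { 0 };
        let merged = (self.value & !(0xffff_ffffu64 << shift)) | ((data as u64) << shift);
        self.value = merged & self.addr_mask();
    }

    /// Read a dword; the low dword reports the type bits (0b100 = 64-bit memory BAR).
    pub fn read_dword(&self, high: bool) -> u32 {
        if high {
            (self.value >> 32) as u32
        } else {
            (self.value as u32) | 0x4
        }
    }
}

/// The guest-side probe: write all-ones, read back, mask the type bits, invert, add one.
pub fn probe_size(bar: &mut VfBar) -> u64 {
    bar.write_dword(false, 0xffff_ffff);
    bar.write_dword(true, 0xffff_ffff);
    let lo = (bar.read_dword(false) & !0xf) as u64;
    let hi = bar.read_dword(true) as u64;
    (!((hi << 32) | lo)).wrapping_add(1)
}
```

Because the low address bits never latch, the same masking that answers the probe also keeps any programmed base aligned to the aperture size.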

A SriovCallback trait allows the owning device to be notified when
VF Enable changes, so it can create or tear down VF instances.
SaveRestore is implemented from day one.
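
The notification flow can be sketched roughly as follows. This is an illustration only: `VfEnableObserver`, `SriovControlModel`, and `Recorder` are hypothetical names, the real `SriovCallback` signature may differ, and clamping NumVFs to TotalVFs is an assumption. The bit positions are those of the SR-IOV Control register (VF Enable is bit 0, VF MSE is bit 3).

```rust
/// SR-IOV Control register bits.
pub const VF_ENABLE: u16 = 1 << 0;
pub const VF_MSE: u16 = 1 << 3;

/// Hypothetical observer; stands in for the owning device.
pub trait VfEnableObserver {
    fn vf_enable_changed(&mut self, enabled: bool, num_vfs: u16);
}

/// Hypothetical model of the control and NumVFs registers.
pub struct SriovControlModel {
    control: u16,
    num_vfs: u16,
    total_vfs: u16,
}

impl SriovControlModel {
    pub fn new(total_vfs: u16) -> Self {
        Self { control: 0, num_vfs: 0, total_vfs }
    }

    pub fn num_vfs(&self) -> u16 {
        self.num_vfs
    }

    /// NumVFs is writable only while VF Enable is clear.
    /// Clamping to TotalVFs is an assumption of this sketch.
    pub fn write_num_vfs(&mut self, value: u16) {
        if (self.control & VF_ENABLE) == 0 {
            self.num_vfs = value.min(self.total_vfs);
        }
    }

    /// Latch the supported bits and notify only on a VF Enable edge.
    pub fn write_control(&mut self, value: u16, observer: &mut dyn VfEnableObserver) {
        let old = self.control;
        self.control = value & (VF_ENABLE | VF_MSE);
        if ((old ^ self.control) & VF_ENABLE) != 0 {
            observer.vf_enable_changed((self.control & VF_ENABLE) != 0, self.num_vfs);
        }
    }
}

/// Test double that records every notification it receives.
#[derive(Default)]
pub struct Recorder {
    pub events: Vec<(bool, u16)>,
}

impl VfEnableObserver for Recorder {
    fn vf_enable_changed(&mut self, enabled: bool, num_vfs: u16) {
        self.events.push((enabled, num_vfs));
    }
}
```

Notifying on the edge rather than on every write means a guest that only toggles VF MSE does not cause VFs to be recreated.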

Register bitfield types (SriovCapabilities, SriovControl, SriovStatus)
and the SriovExtendedCapabilityHeader offset enum are added to
pci_core::spec::caps::sriov, following the existing acs/pci_express
patterns.

Add the NVMe-side wiring for SR-IOV virtual functions. When
NvmeControllerCaps includes an NvmeSriovCaps, the PF attaches the
Phase 1 SriovExtendedCapability to its extended capability list, sets
the multi-function bit, and overrides pci_cfg_read_with_routing /
pci_cfg_write_with_routing to dispatch config space accesses to VF
instances.

NvmeVirtualFunction (new nvme/src/vf.rs) owns a ConfigSpaceType0Emulator
with the VF device ID, MSI-X and PCI Express capabilities, and Dummy
BARs. VFs are created when VF Enable is set in the SR-IOV capability
and destroyed when it is cleared, using a shared callback state that
the PF drains after each config space write.

Per-VF MsiTarget is derived from the PF target via MsiTarget::with_devfn,
giving each VF a unique requester ID for interrupt delivery.
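
For reference, the per-VF function number follows the SR-IOV routing ID rule: PF routing ID plus First VF Offset plus the VF index times VF Stride. The sketch below is a standalone illustration with a hypothetical helper name (`vf_devfn`); it does not call `MsiTarget::with_devfn`, whose signature is not shown here.

```rust
/// Compute the devfn (device << 3 | function) of the VF at 0-based `vf_index`.
/// Returns None when the result would leave the PF's device number, which is
/// not reachable without ARI.
pub fn vf_devfn(pf_devfn: u8, first_vf_offset: u16, vf_stride: u16, vf_index: u16) -> Option<u8> {
    let rid = pf_devfn as u32 + first_vf_offset as u32 + vf_index as u32 * vf_stride as u32;
    let devfn = u8::try_from(rid).ok()?;
    // Without ARI, the VF must share the PF's 5-bit device number.
    ((devfn >> 3) == (pf_devfn >> 3)).then_some(devfn)
}
```

With the PF at function 0 and offset=1, stride=1, this leaves room for functions 1 through 7, so at most seven VFs fit.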

Actual MMIO intercepts and NVMe controller logic for VFs are deferred
to later phases.

VFs do not have their own BAR registers. BAR addresses are defined by
the PF's SR-IOV extended capability VF BAR registers, with each VF's
address computed as vf_bar_base + vf_index * bar_size.
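
A small sketch of that address arithmetic, with overflow checks added. The function names are hypothetical and are not the `vf_bar_address()` / `vf_bar_size()` accessors this commit adds.

```rust
use std::ops::Range;

/// Address of one VF's aperture: vf_bar_base + vf_index * bar_size.
/// Returns None if the arithmetic would overflow 64 bits.
pub fn per_vf_bar_address(vf_bar_base: u64, bar_size: u64, vf_index: u16) -> Option<u64> {
    (vf_index as u64).checked_mul(bar_size)?.checked_add(vf_bar_base)
}

/// The whole range claimed by a VF BAR: `num_vfs` contiguous apertures.
pub fn vf_bar_range(vf_bar_base: u64, bar_size: u64, num_vfs: u16) -> Option<Range<u64>> {
    let end = per_vf_bar_address(vf_bar_base, bar_size, num_vfs)?;
    Some(vf_bar_base..end)
}
```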

Changes:

- Remove BARs from VF config space (VFs use empty DeviceBars)
- Add vf_bar_address() and vf_bar_size() accessors to
  SriovExtendedCapability for computing per-VF BAR addresses
- Extend SriovCallback with vf_bar_changed() notification that
  includes the computed VF0 base address and per-VF bar size
- Move callback invocation into write_vf_bar() so both low and
  high 32-bit writes of 64-bit BARs trigger notifications
- Pre-allocate per-VF MMIO intercepts at PF construction time
  (bar0 + msix bar4 for each potential VF)
- PF maps/unmaps VF MMIO intercepts in response to SR-IOV VF BAR
  register changes and VF Enable transitions
- PF routes MMIO reads/writes to VFs using intercept offset_of()
- VF read_bar0/write_bar0 are stub implementations (phase 5)
- VF read_msix/write_msix route to the VF's MsixEmulator

Add NVMe admin command extensions to the PF for SR-IOV support:

- Virtualization Management command (0x1C): online/offline secondary
  controllers. VQ/VI flexible resource assignment is not supported
  (CRT=0 — all resources are private).
- Namespace Attachment command (0x15): attach/detach namespaces to
  secondary controllers.
- Extended Identify: PF reports cntlid, cmic, oacs bits for
  virtualization management and namespace management.
- CNS 0x14 (Primary Controller Capabilities): reports CRT=0 with
  private resource counts for VQ and VI.
- CNS 0x15 (Secondary Controller List): enumerates all secondary
  controllers with their state and VF number.
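
As an illustration of the Virtualization Management decoding, here is a hedged sketch based on the NVMe field layout (ACT in CDW10 bits 3:0, CNTLID in bits 31:16). The type and function names are hypothetical, and how the real handler reports the unsupported flexible-resource actions is an assumption.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum VirtMgmtAction {
    SecondaryOffline,
    SecondaryOnline,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum VirtMgmtError {
    /// Reserved or unknown action value.
    InvalidField,
    /// Flexible VQ/VI assignment requested, but all resources are private (CRT=0).
    FlexibleResourcesUnsupported,
}

/// Decode CDW10 into an action and the target secondary controller ID.
pub fn decode_virt_mgmt(cdw10: u32) -> Result<(VirtMgmtAction, u16), VirtMgmtError> {
    let cntlid = (cdw10 >> 16) as u16;
    match cdw10 & 0xf {
        0x7 => Ok((VirtMgmtAction::SecondaryOffline, cntlid)),
        0x9 => Ok((VirtMgmtAction::SecondaryOnline, cntlid)),
        // 0x1 = Primary Controller Flexible Allocation, 0x8 = Secondary Controller Assign.
        0x1 | 0x8 => Err(VirtMgmtError::FlexibleResourcesUnsupported),
        _ => Err(VirtMgmtError::InvalidField),
    }
}
```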

New spec types in nvme_spec: Cmic bitfield, PrimaryControllerCapabilities,
SecondaryControllerEntry/List, Cdw10/11 for Virtualization Management
and Namespace Attachment, ControllerList.

SriovAdminState in the admin handler tracks per-VF online/offline state
and namespace attachments.

Each VF is now a full NVMe controller with its own register state,
admin/IO workers, and independent CC.EN lifecycle. VFs read namespace
assignments from shared state written by the PF admin handler via
Namespace Attachment commands.

VF queue limits are fixed at construction time (CRT=0) — all resources
are private, not dynamically assignable via Virtualization Management.

VF teardown uses a two-phase async design: disable_vfs() unmaps MMIO
intercepts and moves VFs to a draining list, then stop()/reset() call
vf.drain().await to properly wait for in-flight IOs to complete.

VF Identify correctly reports cmic.sriov=1 to indicate the controller
is associated with an SR-IOV VF.

When VF_Enable is cleared, pci_cfg_write now returns IoResult::Defer,
stalling the writing VCPU until all VF in-flight IOs finish draining.
This is necessary because in-flight IO futures may hold references to
guest memory that must complete before the VF is destroyed.

The drain is driven by PollDevice::poll_device(), which polls each VF's
non-blocking poll_drain() method. When all VFs reach the Disabled state,
the DeferredWrite is completed and the VCPU resumes.
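
The shape of that drain loop can be sketched with plain `std::task` types. Everything here is a simplified stand-in: `DrainingVf`, `PfDrainState`, and the boolean "deferred write" flag are hypothetical, and a real `poll_drain()` would store the waker instead of ignoring it.

```rust
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a VF that still has IOs in flight.
pub struct DrainingVf {
    pub inflight: u32,
}

impl DrainingVf {
    /// Non-blocking: Ready once nothing is in flight.
    pub fn poll_drain(&mut self, _cx: &mut Context<'_>) -> Poll<()> {
        if self.inflight == 0 { Poll::Ready(()) } else { Poll::Pending }
    }
}

/// Hypothetical PF-side state: VFs being torn down plus the stalled write.
pub struct PfDrainState {
    pub draining: Vec<DrainingVf>,
    pub deferred_write_pending: bool,
}

impl PfDrainState {
    /// Returns true when this poll completed the deferred write.
    pub fn poll_device(&mut self, cx: &mut Context<'_>) -> bool {
        if !self.deferred_write_pending {
            return false;
        }
        // Poll every VF without short-circuiting, so a real implementation
        // could register the waker with each one that is still pending.
        let mut all_done = true;
        for vf in &mut self.draining {
            if vf.poll_drain(cx).is_pending() {
                all_done = false;
            }
        }
        if all_done {
            self.draining.clear();
            self.deferred_write_pending = false;
        }
        all_done
    }

    /// VF_Enable=1 writes are rejected while a drain is outstanding.
    pub fn can_enable_vfs(&self) -> bool {
        !self.deferred_write_pending
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// A waker that does nothing, enough to drive the sketch by hand.
pub fn noop_waker() -> Waker {
    Waker::from(Arc::new(NoopWake))
}
```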

A vf_drain flag prevents VF_Enable=1 writes while drain is in progress.
stop()/reset() also drain any pending VFs as a backstop for device
teardown.

Add 8 tests covering SR-IOV VF lifecycle, config space routing, and
PF identify:
- PF multi-function bit is set when SR-IOV is configured
- VF offset/stride are correct (offset=1, stride=1)
- VF_Enable creates VFs visible via config space routing
- VF_Enable clear removes VFs (with deferred drain)
- VF config space is accessible at the correct function numbers
- NumVFs is read-only while VF_Enable is set
- Device reset clears VFs and VF_Enable
- PF Identify reports cntlid=1 and oacs.virtualization_management

Replace the offset_of()-based MMIO dispatch (which relied on the MMIO
intercept infrastructure tracking mapped addresses) with direct address
computation from the cached VF BAR layout. The PF knows the VF BAR base
and size from the SR-IOV capability registers and computes
vf_index = (addr - base) / bar_size, offset = (addr - base) % bar_size.
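
A rough sketch of that decode using a cached base and shift, assuming the per-VF BAR size is a power of two so the divide and modulo reduce to a shift and a mask. `VfBarLayout` is a hypothetical name, not a type from this PR.

```rust
/// Hypothetical cached layout of one VF BAR across all enabled VFs.
pub struct VfBarLayout {
    base: u64,
    shift: u32,
    num_vfs: u16,
}

impl VfBarLayout {
    /// Returns None unless `bar_size` is a power of two.
    pub fn new(base: u64, bar_size: u64, num_vfs: u16) -> Option<Self> {
        bar_size.is_power_of_two().then(|| Self {
            base,
            shift: bar_size.trailing_zeros(),
            num_vfs,
        })
    }

    /// Map an MMIO address to (vf_index, offset within that VF's BAR),
    /// or None if the address is outside the enabled VFs' range.
    pub fn decode(&self, addr: u64) -> Option<(u16, u64)> {
        let rel = addr.checked_sub(self.base)?;
        let index = rel >> self.shift;
        if index >= self.num_vfs as u64 {
            return None;
        }
        let offset = rel & ((1u64 << self.shift) - 1);
        Some((index as u16, offset))
    }
}
```

The decode cost is constant regardless of how many VFs are enabled, which is the point of replacing the linear intercept scan.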

This is how real hardware works — the PF's address decoder knows the VF
BAR layout directly, rather than querying external infrastructure.

Add a test that exercises VF BAR0 MMIO routing end-to-end: reads CAP,
Version, and CSTS registers from two VFs via mmio_read through the PF's
MMIO dispatch.

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

jstarks added 2 commits June 4, 2026 10:36

Add a comprehensive test that exercises the full VF lifecycle:
1. Create PF with SR-IOV, enable PF controller, add namespace
2. Enable VFs, set VF BAR addresses via SR-IOV capability
3. PF admin: bring secondary controller online, attach namespace
4. VF: enable MSI-X, set admin queues, enable controller (CC.EN=1)
5. VF admin: Identify Controller — verify cntlid, cmic.vf
6. VF admin: Create IO CQ and SQ
7. VF IO: READ 1 sector from namespace 1, verify success

This test exercises MMIO routing to VF BAR0/BAR4, VF config space
routing, VF NVMe controller enable, admin/IO command processing,
and the shared VfControllerConfig namespace propagation.

Add NvmeSriovConfig to the NVMe resource handle so SR-IOV can be
configured through the standard resource resolution path. The resolver
validates total_vfs is in 1..=7 (no ARI support) and hardcodes
vf_msix_count to 4.

Add with_pcie_nvme_sriov() to petri for test construction, and add
sriov: None to all existing NvmeControllerHandle callsites.

Remove vf_device_id from NvmeSriovCaps — VFs always use the same
device ID as the PF (0x00a9).

Add pcie_nvme_sriov VMM test that exercises the full guest-visible
SR-IOV workflow: enable VFs, bring secondary controller online via
nvme-cli, attach namespaces, perform IO through a VF, verify data
via PF, then tear down.

Copilot AI left a comment

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.

Comment on lines +1249 to +1254
// Read the controller list from the data buffer.
let mut list_buf = [0u8; 4096];
PrpRange::parse(&self.config.mem, list_buf.len(), command.dptr)?
    .read(&self.config.mem, &mut list_buf)?;
let controller_list = spec::ControllerList::ref_from_bytes(&list_buf).unwrap();

Comment thread vm/devices/storage/nvme/src/pci.rs
Comment thread vm/devices/storage/nvme/src/pci.rs