nvme: add SR-IOV emulation #3650
Draft
jstarks wants to merge 10 commits into
Pull request overview
This PR adds SR-IOV emulation support to the NVMe device model, enabling a PF to expose VFs that appear as independent NVMe controllers to the guest. It also introduces a reusable, device-agnostic SR-IOV PCIe extended capability emulator in pci_core, plus unit/integration coverage and resource-layer wiring for the new configuration.
Changes:
- Implement generic PCIe SR-IOV extended capability emulation (including BAR probing, VF enable/memory decode notifications, and Save/Restore).
- Extend the emulated NVMe controller to create/manage VF controller instances, route VF config/MMIO/MSI-X, and implement NVMe virtualization management + namespace attachment flows.
- Add NVMe SR-IOV unit tests and a VMM integration test exercising VF lifecycle and I/O, plus propagate sriov configuration through resource/petri layers.
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vmm_tests/vmm_tests/tests/tests/x86_64/storage.rs | Plumbs new NvmeControllerHandle.sriov field into existing NVMe test device construction. |
| vmm_tests/vmm_tests/tests/tests/x86_64.rs | Updates NVMe handle initialization to include sriov: None. |
| vmm_tests/vmm_tests/tests/tests/multiarch/pcie.rs | Adds NVMe SR-IOV end-to-end guest test using nvme-cli from petritools EROFS. |
| vm/devices/storage/storage_tests/tests/scsidvd_nvme.rs | Updates NVMe test handle to include sriov: None. |
| vm/devices/storage/nvme/src/workers/coordinator.rs | Adds EnableStateKind and a non-blocking poll_drain() to support VF drain via PollDevice. |
| vm/devices/storage/nvme/src/workers/admin.rs | Adds PF/VF controller-ID handling plus SR-IOV admin commands (virt-mgmt + namespace attachment) and Identify CNS 0x14/0x15. |
| vm/devices/storage/nvme/src/workers.rs | Re-exports new SR-IOV/admin-related types and worker state kind. |
| vm/devices/storage/nvme/src/vf.rs | Introduces a VF NVMe controller implementation (config space, MSI-X, BAR handlers, worker lifecycle). |
| vm/devices/storage/nvme/src/tests/sriov_tests.rs | Adds NVMe SR-IOV unit tests (VF lifecycle, routing, MSE, reset, end-to-end VF I/O). |
| vm/devices/storage/nvme/src/tests/controller_tests.rs | Updates controller test setup to pass sriov: None. |
| vm/devices/storage/nvme/src/tests.rs | Registers the new SR-IOV test module. |
| vm/devices/storage/nvme/src/resolver.rs | Adds resource-level SR-IOV config validation and maps it into NVMe device caps. |
| vm/devices/storage/nvme/src/registers.rs | Extracts shared NVMe BAR0 register handling behind a NvmeRegisterIo trait for PF + VF reuse. |
| vm/devices/storage/nvme/src/pci.rs | Adds SR-IOV state management, VF creation/drain, VF BAR address decode and MMIO/config routing in the PF. |
| vm/devices/storage/nvme/src/namespace.rs | Exposes Namespace::disk() to clone disks for VF namespace attachment. |
| vm/devices/storage/nvme/src/lib.rs | Wires new modules and adds shared VF config structures; centralizes NVMe CAP constant. |
| vm/devices/storage/nvme_spec/src/lib.rs | Extends NVMe spec model for virtualization mgmt + namespace attachment and related Identify structures; adds typed Cmic. |
| vm/devices/storage/nvme_resources/src/lib.rs | Adds NvmeSriovConfig to the NVMe resource handle. |
| vm/devices/storage/disk_nvme/nvme_driver/src/tests.rs | Updates test controller caps initialization to include sriov: None. |
| vm/devices/storage/disk_nvme/nvme_driver/fuzz/fuzz_nvme_driver.rs | Updates fuzzer controller caps initialization to include sriov: None. |
| vm/devices/pci/pci_core/src/spec.rs | Adds detailed SR-IOV extended capability register definitions/bitfields and removes older minimal header enum. |
| vm/devices/pci/pci_core/src/capabilities/extended/sriov.rs | Implements the device-agnostic SR-IOV extended capability emulator with Save/Restore and callback hooks. |
| vm/devices/pci/pci_core/src/capabilities/extended/mod.rs | Exposes the new sriov extended capability module. |
| petri/src/vm/openvmm/modify.rs | Adds petri helper to create an SR-IOV-enabled NVMe device and propagates new handle field. |
| petri/src/vm/openvmm/construct.rs | Updates NVMe device construction paths to include sriov: None. |
| openhcl/underhill_core/src/dispatch/vtl2_settings_worker.rs | Updates NVMe controller config creation to include sriov: None. |
| flowey/flowey_hvlite/src/pipelines/vmm_tests_run.rs | Ensures petritools EROFS artifacts are treated as always-available selections. |
Add a generic, device-agnostic SR-IOV extended capability implementation to pci_core that can be attached to any PF's extended capability list. This is the foundation for emulating SR-IOV devices such as NVMe. The new SriovExtendedCapability implements the PciExtendedCapability trait and covers the full SR-IOV capability structure per PCIe Base Specification Section 9.3.3: capabilities, control/status, InitialVFs, TotalVFs, NumVFs, VF offset/stride, VF Device ID, page sizes, and VF BARs 0-5 with size probing support. A SriovCallback trait allows the owning device to be notified when VF Enable changes, so it can create or tear down VF instances. SaveRestore is implemented from day one. Register bitfield types (SriovCapabilities, SriovControl, SriovStatus) and the SriovExtendedCapabilityHeader offset enum are added to pci_core::spec::caps::sriov, following the existing acs/pci_express patterns.
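The "size probing support" mentioned above follows the usual PCI BAR sizing handshake: the guest writes all ones, reads the register back, and derives the aperture size from the bits that stayed zero. A minimal sketch of that behavior for a single 32-bit, non-prefetchable memory BAR is shown here. `VfBar32` and `probe_size` are hypothetical names for illustration and are not the types used in pci_core.

```rust
/// Hypothetical model of one 32-bit memory VF BAR register.
/// Not the pci_core implementation; an illustration of size probing only.
#[derive(Debug, Clone, Copy)]
pub struct VfBar32 {
    /// Per-VF aperture size in bytes (power of two, at least 16).
    size: u32,
    /// Currently programmed address bits.
    value: u32,
}

impl VfBar32 {
    pub fn new(size: u32) -> Self {
        assert!(size.is_power_of_two() && size >= 16);
        Self { size, value: 0 }
    }

    /// Guest write: address bits below the aperture size are hardwired to zero.
    pub fn write(&mut self, v: u32) {
        self.value = v & !(self.size - 1);
    }

    /// Guest read: address bits plus the read-only type bits, which are all
    /// zero in this sketch (memory space, 32-bit, non-prefetchable).
    pub fn read(&self) -> u32 {
        self.value
    }
}

/// What a guest does to discover the size: save the BAR, write all ones,
/// read back, mask off the low four type bits, and take the two's complement.
pub fn probe_size(bar: &mut VfBar32) -> u32 {
    let saved = bar.read();
    bar.write(0xffff_ffff);
    let mask = bar.read() & !0xf;
    bar.write(saved);
    (!mask).wrapping_add(1)
}
```

A 64-bit BAR would apply the same masking across the low and high dwords; that case is left out to keep the sketch short.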
Add the NVMe-side wiring for SR-IOV virtual functions. When NvmeControllerCaps includes an NvmeSriovCaps, the PF attaches the Phase 1 SriovExtendedCapability to its extended capability list, sets the multi-function bit, and overrides pci_cfg_read_with_routing / pci_cfg_write_with_routing to dispatch config space accesses to VF instances. NvmeVirtualFunction (new nvme/src/vf.rs) owns a ConfigSpaceType0Emulator with the VF device ID, MSI-X and PCI Express capabilities, and dummy BARs. VFs are created when VF Enable is set in the SR-IOV capability and destroyed when it is cleared, using a shared callback state that the PF drains after each config space write. Per-VF MsiTarget is derived from the PF target via MsiTarget::with_devfn, giving each VF a unique requester ID for interrupt delivery. Actual MMIO intercepts and NVMe controller logic for VFs are deferred to later phases.
VFs do not have their own BAR registers. BAR addresses are defined by the PF's SR-IOV extended capability VF BAR registers, with each VF's address computed as vf_bar_base + vf_index * bar_size.
Changes:
- Remove BARs from VF config space (VFs use empty DeviceBars)
- Add vf_bar_address() and vf_bar_size() accessors to SriovExtendedCapability for computing per-VF BAR addresses
- Extend SriovCallback with vf_bar_changed() notification that includes the computed VF0 base address and per-VF bar size
- Move callback invocation into write_vf_bar() so both low and high 32-bit writes of 64-bit BARs trigger notifications
- Pre-allocate per-VF MMIO intercepts at PF construction time (bar0 + msix bar4 for each potential VF)
- PF maps/unmaps VF MMIO intercepts in response to SR-IOV VF BAR register changes and VF Enable transitions
- PF routes MMIO reads/writes to VFs using intercept offset_of()
- VF read_bar0/write_bar0 are stub implementations (phase 5)
- VF read_msix/write_msix route to the VF's MsixEmulator
Add NVMe admin command extensions to the PF for SR-IOV support:
- Virtualization Management command (0x1C): online/offline secondary controllers. VQ/VI flexible resource assignment is not supported (CRT=0: all resources are private).
- Namespace Attachment command (0x15): attach/detach namespaces to secondary controllers.
- Extended Identify: PF reports cntlid, cmic, oacs bits for virtualization management and namespace management.
- CNS 0x14 (Primary Controller Capabilities): reports CRT=0 with private resource counts for VQ and VI.
- CNS 0x15 (Secondary Controller List): enumerates all secondary controllers with their state and VF number.
New spec types in nvme_spec: Cmic bitfield, PrimaryControllerCapabilities, SecondaryControllerEntry/List, Cdw10/11 for Virtualization Management and Namespace Attachment, ControllerList. SriovAdminState in the admin handler tracks per-VF online/offline state and namespace attachments.
Each VF is now a full NVMe controller with its own register state, admin/IO workers, and independent CC.EN lifecycle. VFs read namespace assignments from shared state written by the PF admin handler via Namespace Attachment commands. VF queue limits are fixed at construction time (CRT=0) — all resources are private, not dynamically assignable via Virtualization Management. VF teardown uses a two-phase async design: disable_vfs() unmaps MMIO intercepts and moves VFs to a draining list, then stop()/reset() call vf.drain().await to properly wait for in-flight IOs to complete. VF Identify correctly reports cmic.sriov=1 to indicate the controller is associated with an SR-IOV VF.
When VF_Enable is cleared, pci_cfg_write now returns IoResult::Defer, stalling the writing VCPU until all VF in-flight IOs finish draining. This is necessary because in-flight IO futures may hold references to guest memory that must complete before the VF is destroyed. The drain is driven by PollDevice::poll_device(), which polls each VF's non-blocking poll_drain() method. When all VFs reach the Disabled state, the DeferredWrite is completed and the VCPU resumes. A vf_drain flag prevents VF_Enable=1 writes while drain is in progress. stop()/reset() also drain any pending VFs as a backstop for device teardown.
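The deferred-drain flow described above can be sketched as a small state machine. This is an illustration only: `DrainingVf`, `Pf`, and their fields are hypothetical stand-ins, the deferred write is modeled as a plain `Option<()>` rather than the real DeferredWrite type, and the sketch omits the task context and waker that an actual PollDevice implementation would receive.

```rust
use std::task::Poll;

/// Hypothetical stand-in for a VF that still has IOs in flight.
pub struct DrainingVf {
    pub inflight: u32,
}

impl DrainingVf {
    /// Non-blocking check: ready once no IO remains in flight.
    pub fn poll_drain(&mut self) -> Poll<()> {
        if self.inflight == 0 {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }

    /// Models one in-flight IO finishing.
    pub fn complete_one(&mut self) {
        self.inflight = self.inflight.saturating_sub(1);
    }
}

/// Hypothetical stand-in for the PF's drain bookkeeping.
pub struct Pf {
    pub draining: Vec<DrainingVf>,
    /// `Some(())` while a VF_Enable=0 write is deferred and the vCPU is stalled.
    pub deferred_write: Option<()>,
}

impl Pf {
    /// Models the VF_Enable=0 config write: VFs move to the draining list and
    /// the write is parked until the drain completes.
    pub fn clear_vf_enable(&mut self, vfs: Vec<DrainingVf>) {
        self.draining.extend(vfs);
        self.deferred_write = Some(());
    }

    /// A VF_Enable=1 write is refused while a drain is in progress.
    pub fn can_enable_vfs(&self) -> bool {
        self.deferred_write.is_none()
    }

    /// Models one poll_device() pass. Returns true when the deferred write
    /// was completed on this pass, meaning the vCPU may resume.
    pub fn poll_device(&mut self) -> bool {
        // Drop every VF whose drain has finished; keep the rest.
        self.draining.retain_mut(|vf| vf.poll_drain().is_pending());
        if self.draining.is_empty() {
            self.deferred_write.take().is_some()
        } else {
            false
        }
    }
}
```

Keeping `poll_drain()` non-blocking lets a single device poll make progress on every VF in one pass instead of awaiting them one at a time.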
Add 8 tests covering SR-IOV VF lifecycle, config space routing, and PF identify:
- PF multi-function bit is set when SR-IOV is configured
- VF offset/stride are correct (offset=1, stride=1)
- VF_Enable creates VFs visible via config space routing
- VF_Enable clear removes VFs (with deferred drain)
- VF config space is accessible at the correct function numbers
- NumVFs is read-only while VF_Enable is set
- Device reset clears VFs and VF_Enable
- PF Identify reports cntlid=1 and oacs.virtualization_management
Replace the offset_of()-based MMIO dispatch (which relied on the MMIO intercept infrastructure tracking mapped addresses) with direct address computation from the cached VF BAR layout. The PF knows the VF BAR base and size from the SR-IOV capability registers and computes vf_index = (addr - base) / bar_size, offset = (addr - base) % bar_size. This is how real hardware works — the PF's address decoder knows the VF BAR layout directly, rather than querying external infrastructure. Add a test that exercises VF BAR0 MMIO routing end-to-end: reads CAP, Version, and CSTS registers from two VFs via mmio_read through the PF's MMIO dispatch.
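The address computation described above can be sketched as follows. `VfBarLayout` and its fields are hypothetical names for the cached layout; only the arithmetic (vf_index = (addr - base) / bar_size, offset = (addr - base) % bar_size) comes from the commit message.

```rust
/// Hypothetical cached view of one VF BAR as programmed in the PF's
/// SR-IOV capability. Field names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VfBarLayout {
    /// Address of VF0's aperture.
    pub base: u64,
    /// Per-VF aperture size in bytes.
    pub bar_size: u64,
    /// Number of currently enabled VFs.
    pub num_vfs: u64,
}

impl VfBarLayout {
    /// Address of the aperture for a zero-based VF index.
    pub fn vf_bar_address(&self, vf_index: u64) -> Option<u64> {
        if vf_index < self.num_vfs {
            Some(self.base + vf_index * self.bar_size)
        } else {
            None
        }
    }

    /// Map an MMIO address to (vf_index, offset within that VF's BAR).
    /// Returns None for addresses outside the enabled VF range.
    pub fn decode(&self, addr: u64) -> Option<(u64, u64)> {
        if self.bar_size == 0 {
            return None;
        }
        let rel = addr.checked_sub(self.base)?;
        let vf_index = rel / self.bar_size;
        if vf_index >= self.num_vfs {
            return None;
        }
        Some((vf_index, rel % self.bar_size))
    }
}
```

For example, with a 16 KiB aperture and two VFs, an access at base + 0x4000 + 0x1c decodes to VF index 1, offset 0x1c, which is the CSTS register in the NVMe BAR0 layout.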
Add a comprehensive test that exercises the full VF lifecycle:
1. Create PF with SR-IOV, enable PF controller, add namespace
2. Enable VFs, set VF BAR addresses via SR-IOV capability
3. PF admin: bring secondary controller online, attach namespace
4. VF: enable MSI-X, set admin queues, enable controller (CC.EN=1)
5. VF admin: Identify Controller, verify cntlid, cmic.vf
6. VF admin: Create IO CQ and SQ
7. VF IO: READ 1 sector from namespace 1, verify success
This test exercises MMIO routing to VF BAR0/BAR4, VF config space routing, VF NVMe controller enable, admin/IO command processing, and the shared VfControllerConfig namespace propagation.
Add NvmeSriovConfig to the NVMe resource handle so SR-IOV can be configured through the standard resource resolution path. The resolver validates total_vfs is in 1..=7 (no ARI support) and hardcodes vf_msix_count to 4. Add with_pcie_nvme_sriov() to petri for test construction, and add sriov: None to all existing NvmeControllerHandle callsites. Remove vf_device_id from NvmeSriovCaps — VFs always use the same device ID as the PF (0x00a9). Add pcie_nvme_sriov VMM test that exercises the full guest-visible SR-IOV workflow: enable VFs, bring secondary controller online via nvme-cli, attach namespaces, perform IO through a VF, verify data via PF, then tear down.
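The 1..=7 limit follows from the routing layout: without ARI a device has eight function numbers, and with first VF offset 1 and stride 1 the VFs occupy the functions after the PF. A minimal sketch of that check is below. The function names and the `String` error type are hypothetical, and the sketch assumes the PF sits at function 0; the resolver's actual signature and error handling are not shown in this PR text.

```rust
/// First VF offset and VF stride, as verified by the SR-IOV unit tests.
pub const FIRST_VF_OFFSET: u16 = 1;
pub const VF_STRIDE: u16 = 1;
/// Highest function number available without ARI.
pub const MAX_FUNCTION: u16 = 7;

/// Hypothetical validation helper: accept only 1..=7 VFs.
pub fn validate_total_vfs(total_vfs: u16) -> Result<u16, String> {
    if (1..=7).contains(&total_vfs) {
        Ok(total_vfs)
    } else {
        Err(format!("total_vfs must be in 1..=7, got {total_vfs}"))
    }
}

/// Function number for a zero-based VF index, assuming the PF is function 0.
/// Returns None if the index is out of range or would need ARI.
pub fn vf_function_number(vf_index: u8, total_vfs: u16) -> Option<u8> {
    if u16::from(vf_index) >= total_vfs {
        return None;
    }
    // Widened arithmetic so an unvalidated index cannot overflow.
    let func = FIRST_VF_OFFSET + u16::from(vf_index) * VF_STRIDE;
    if func <= MAX_FUNCTION {
        Some(func as u8)
    } else {
        None
    }
}
```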
Comment on lines +1249 to +1254

    // Read the controller list from the data buffer.
    let mut list_buf = [0u8; 4096];
    PrpRange::parse(&self.config.mem, list_buf.len(), command.dptr)?
        .read(&self.config.mem, &mut list_buf)?;
    let controller_list = spec::ControllerList::ref_from_bytes(&list_buf).unwrap();
Add optional SR-IOV support to the emulated NVMe controller. When configured, the PF exposes virtual functions that appear as independent NVMe controllers to the guest. Each VF has its own config space, MSI-X, register state, admin/IO queues, and worker tasks. The guest enables VFs through the standard PCIe SR-IOV mechanism, then uses the NVMe Virtualization Management command to bring secondary controllers online and attach namespaces.
The SR-IOV extended capability implementation in pci_core is device-agnostic and reusable for future SR-IOV devices.
VF resources (queues, interrupts) are all private (CRT=0); there is no flexible resource pool. VFs get fixed queue capacity at construction time.
It's best to review this by commit.
Commits
- SriovExtendedCapability implementing PciExtendedCapability, with VF BAR probing, control/status registers, callback on VF Enable, and SaveRestore.
- pci_cfg_read/write_with_routing.
- NvmeRegisterIo trait with the PF. VFs read namespace assignments from shared state at CC.EN time.
- PollDevice-driven async drain with DeferredWrite completion.
- NvmeSriovConfig in resource handle, resolver validation, petri helper, guest-visible VMM test with nvme-cli.