
Debug runtime issues rev 1 #6

Merged
miladhzzzz merged 27 commits into main from debug-runtime-issues-rev-1
Jun 1, 2026
Conversation

@miladhzzzz
Contributor

Summary

This release adds support for managed volume storage orchestration, enhanced workload telemetry, and improved error handling aligned with scheduler platform extensions. Compute-agent now supports NFS and Ceph-RBD managed volumes for containers and VMs, with proper lifecycle management and storage driver capability advertisement.

Major Features

  1. Managed Volume Support

    • Added ManagedVolumes []ManagedVolumeSpec to container and VM workload specs
    • Support for multiple storage drivers:
      • local - Host bind paths (existing behavior)
      • nfs - NFS server mounts
      • ceph-rbd - Ceph RBD block devices
    • Per-volume configuration: name, driver, size (GB), access mode, filesystem type, mount path, read-only, retain policy
    • Graceful fallback to host bind paths if managed volume provisioning fails
  2. Storage Driver Capability Advertisement

    • Agent advertises supported storage drivers during node registration
    • SupportedStorageDrivers[] field populated in heartbeat
    • Enables scheduler to make storage-aware placement decisions
  3. Enhanced Workload Telemetry

    • Added WorkloadUsage model with CPU%, memory, disk I/O, and network metrics
    • Per-workload usage collection and exposure in status responses
    • Enables performance correlation with placement decisions
  4. Improved Failure Diagnostics

    • Added WorkloadReason with structured code, message, last transition, and next retry metadata
    • Terminal failure detection: prevents reapply loops for non-retryable errors
    • Failure reason propagation: infrastructure vs runtime error classification
  5. Cloud-Init Enhancement (VM)

    • Support for structured CloudInitConfig with separate fields:
      • user_data - User-provided cloud-init script
      • meta_data - Cloud-init metadata
      • network_config - Network configuration (optional)
      • vendor_data - Vendor data (optional)
    • Faithful injection of all cloud-init fields into VM boot
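
As a rough illustration of the per-volume configuration listed under Managed Volume Support, the sketch below models a volume spec and a basic validation step. The field names, types, and the Validate method are assumptions made for illustration; the PR's actual ManagedVolumeSpec definition is not shown in this description.

```go
package main

import (
	"errors"
	"fmt"
)

// ManagedVolumeSpec mirrors the per-volume settings named in the summary.
// Field names and types are assumptions, not the PR's actual definition.
type ManagedVolumeSpec struct {
	Name         string
	Driver       string // "local", "nfs", or "ceph-rbd"
	SizeGB       int
	AccessMode   string
	FSType       string
	MountPath    string
	ReadOnly     bool
	RetainPolicy string
}

// supportedDrivers lists the three drivers named in the PR summary.
var supportedDrivers = map[string]bool{"local": true, "nfs": true, "ceph-rbd": true}

// Validate is a hypothetical helper that rejects specs the agent could not act on.
func (s ManagedVolumeSpec) Validate() error {
	if s.Name == "" {
		return errors.New("managed volume: name is required")
	}
	if !supportedDrivers[s.Driver] {
		return fmt.Errorf("managed volume %q: unsupported driver %q", s.Name, s.Driver)
	}
	if s.MountPath == "" {
		return fmt.Errorf("managed volume %q: mount path is required", s.Name)
	}
	return nil
}

func main() {
	spec := ManagedVolumeSpec{Name: "data", Driver: "nfs", SizeGB: 10, MountPath: "/var/lib/data"}
	fmt.Println("valid:", spec.Validate() == nil)
}
```

Validating up front would let the agent reject an unknown driver before any provisioning starts, rather than relying on the fallback to host bind paths.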

Breaking Changes

None. All changes are backward compatible.

Deprecations

  • Single-string cloudInit field deprecated in favor of structured CloudInitConfig
  • Legacy bind-path-only volume handling will be superseded by managed volume system

Changed Files

  1. api/proto/agent.proto (updated)

    • Added ManagedVolumeSpec message type
    • Added ManagedVolumes field to ContainerSpec
    • Added ManagedVolumes field to VMSpec
    • Added WorkloadUsage and WorkloadReason message types
    • Extended heartbeat to include workload usage snapshots
  2. internal/models/workload.go (updated)

    • Added ManagedVolumes []ManagedVolumeSpec to container spec
    • Added ManagedVolumes []ManagedVolumeSpec to VM spec
    • Added WorkloadUsage struct for telemetry
    • Added WorkloadReason struct for structured failures
  3. internal/control/client.go (updated)

    • Updated node registration to advertise SupportedStorageDrivers
    • Enhanced heartbeat to include per-workload usage snapshots
    • Improved failure reason propagation
  4. internal/runtime/docker.go (updated)

    • Added managed volume mount support
    • Fall back to host bind paths if managed volumes unavailable
    • Properly handle read-only and mount-path specifications
  5. internal/runtime/vm.go (updated)

    • Added managed volume disk attachment for Ceph/NFS backends
    • Enhanced cloud-init ISO builder with structured payload support
    • Proper handling of meta-data, network-config, vendor-data files
  6. internal/workload/manager.go (updated)

    • Added managed volume provisioning lifecycle
    • Pre-provision and attach volumes before runtime create
    • Cleanup volumes on workload deletion (respecting retain policy)
  7. pkg/api/v1/ (regenerated)

    • Protobuf code generation for new message types

Resource Impact

Storage Overhead:

  • Minimal: managed volume metadata tracked in control plane
  • Agent reports only node-level storage capabilities

Network:

  • Heartbeat size increases by ~200-500 bytes per workload (usage metrics)
  • One-time increase during node registration (~100 bytes)

Backward Compatibility:

  • Old workload specs without ManagedVolumes continue to work
  • Agent falls back to host bind paths automatically
  • Single-string CloudInit still supported alongside structured config

@miladhzzzz merged commit ca4cfd8 into main on Jun 1, 2026
6 of 7 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 366c97f9bb


if delErr := rt.Delete(ctx, workload.ID); delErr != nil {
    m.logger.Warnf("Failed to delete workload during cleanup: %v", delErr)
}
m.cleanupManagedStorageAllocations(ctx, managedAllocations, true)

P1: Keep managed storage usable for retry recreate

When rt.Start fails after rt.Create succeeded for a workload with managed volumes, this branch detaches/deletes the freshly provisioned allocations while leaving the workload persisted for reconciliation. The retry path later recreates from the saved spec without rerunning managed-storage preparation, so it can reuse stale host paths/devices that were just removed. For start failures on managed-volume workloads, either keep the allocations until retry or ensure the retry path reprovisions them.


Comment on lines +440 to +442
attachment, err := m.volumeMgr.Attach(ctx, platformSpec, handle, workload.ID)
if err != nil {
    return nil, fmt.Errorf("attach volume %q (%s): %w", spec.Name, spec.Driver, err)

P2: Roll back provisioned volumes when attach fails

If Attach fails after Provision has succeeded, this returns immediately without calling the rollback helper, leaving the saved handle and backing directory/image behind; with multiple managed volumes it also leaves any earlier attachments from the same loop. This can happen on stage-directory permission errors or later-volume failures, so the partial allocations should be cleaned up before returning the storage setup error.
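
A minimal sketch of the rollback being asked for, under stated assumptions: the volumeManager interface, prepareVolumes, and fakeManager below are hypothetical and much simpler than the PR's real volume manager (no context, spec, or workload ID). It uses a deferred cleanup keyed on the named error return, so any early return undoes earlier attachments and provisioned volumes in reverse order.

```go
package main

import (
	"errors"
	"fmt"
)

// volumeManager is a hypothetical, simplified stand-in for the PR's volume manager.
type volumeManager interface {
	Provision(name string) (handle string, err error)
	Attach(handle string) error
	Detach(handle string) error
	Delete(handle string) error
}

// prepareVolumes provisions and attaches each volume. On any failure it
// detaches and deletes everything this call created before returning.
func prepareVolumes(vm volumeManager, names []string) (handles []string, err error) {
	var provisioned, attached []string
	defer func() {
		if err == nil {
			return
		}
		for i := len(attached) - 1; i >= 0; i-- {
			_ = vm.Detach(attached[i]) // best-effort rollback
		}
		for i := len(provisioned) - 1; i >= 0; i-- {
			_ = vm.Delete(provisioned[i])
		}
	}()
	for _, name := range names {
		h, perr := vm.Provision(name)
		if perr != nil {
			return nil, fmt.Errorf("provision volume %q: %w", name, perr)
		}
		provisioned = append(provisioned, h)
		if aerr := vm.Attach(h); aerr != nil {
			return nil, fmt.Errorf("attach volume %q: %w", name, aerr)
		}
		attached = append(attached, h)
	}
	return attached, nil
}

// fakeManager records live volumes so the rollback can be observed.
type fakeManager struct {
	failAttach string
	live       map[string]bool
}

func (f *fakeManager) Provision(name string) (string, error) { f.live[name] = true; return name, nil }
func (f *fakeManager) Attach(h string) error {
	if h == f.failAttach {
		return errors.New("attach failed")
	}
	return nil
}
func (f *fakeManager) Detach(h string) error { return nil }
func (f *fakeManager) Delete(h string) error { delete(f.live, h); return nil }

func main() {
	fm := &fakeManager{failAttach: "b", live: map[string]bool{}}
	_, err := prepareVolumes(fm, []string{"a", "b"})
	fmt.Println(err != nil, len(fm.live)) // prints "true 0"
}
```

A real rollback would also need to skip volumes that existed before this call or that carry a retain policy; the sketch deletes everything it provisioned.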


if retainPolicy == "retain" {
    continue
}
if err := m.volumeMgr.Delete(ctx, handle); err != nil {

P1: Do not delete volumes still attached elsewhere

Because volume handles are keyed only by driver:name, two workloads that request the same managed volume share the same stored handle, but deleting one workload deletes the handle and backing storage immediately after removing only that workload's attachment. For the local provider this Delete removes the shared directory, which can destroy data for another still-running workload; check for other attachments to the same volume before deleting it.
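
The check suggested here could look something like the sketch below. The attachmentIndex type and its methods are hypothetical; the PR's actual attachment bookkeeping is not shown in this review. The index tracks which workloads use each driver:name key, and Detach reports whether the caller was the last user, which is the only case where deleting the backing storage is safe.

```go
package main

import "fmt"

// attachmentIndex is a hypothetical tracker of which workloads are attached
// to each volume key (for example "local:shared").
type attachmentIndex struct {
	byVolume map[string]map[string]bool
}

func newAttachmentIndex() *attachmentIndex {
	return &attachmentIndex{byVolume: map[string]map[string]bool{}}
}

// Attach records that workloadID uses the volume.
func (a *attachmentIndex) Attach(volumeKey, workloadID string) {
	if a.byVolume[volumeKey] == nil {
		a.byVolume[volumeKey] = map[string]bool{}
	}
	a.byVolume[volumeKey][workloadID] = true
}

// Detach removes the workload's attachment and reports whether no other
// workload still uses the volume, i.e. whether it is safe to delete.
func (a *attachmentIndex) Detach(volumeKey, workloadID string) (lastUser bool) {
	users := a.byVolume[volumeKey]
	delete(users, workloadID) // deleting from a nil map is a no-op
	if len(users) > 0 {
		return false
	}
	delete(a.byVolume, volumeKey)
	return true
}

func main() {
	idx := newAttachmentIndex()
	idx.Attach("local:shared", "w1")
	idx.Attach("local:shared", "w2")
	fmt.Println(idx.Detach("local:shared", "w1")) // false: w2 still attached
	fmt.Println(idx.Detach("local:shared", "w2")) // true: safe to delete
}
```

In the deletion loop, the Delete call would then run only when Detach returns true and the retain policy allows it.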


code = strings.TrimSpace(status.Metadata["failure_reason"])
message = strings.TrimSpace(status.Metadata["failure_message"])
retryable = strings.EqualFold(strings.TrimSpace(status.Metadata["retryable"]), "true")
if rawNext := strings.TrimSpace(status.Metadata["retry_next_at"]); rawNext != "" {

P2: Read the retry timestamp key that is written

The workload manager records retry delays under next_retry_time in the status metadata, but the heartbeat reason conversion looks for retry_next_at, so ReasonDetail.NextRetryAt is never populated for deferred retries. This makes the control plane miss the scheduled retry time even though it is present in the agent status metadata.
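
A possible shape for the fix is sketched below. The helper name nextRetryAt is hypothetical, and the sketch assumes the timestamp is stored in RFC 3339 format, which the review does not state. It reads next_retry_time first (the key the manager writes) and falls back to retry_next_at in case older status metadata used that name.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// nextRetryAt is a hypothetical helper that extracts the scheduled retry time
// from status metadata. The RFC 3339 layout is an assumption.
func nextRetryAt(metadata map[string]string) (time.Time, bool) {
	// Prefer the key the workload manager writes; keep the old key as a fallback.
	for _, key := range []string{"next_retry_time", "retry_next_at"} {
		raw := strings.TrimSpace(metadata[key])
		if raw == "" {
			continue
		}
		if ts, err := time.Parse(time.RFC3339, raw); err == nil {
			return ts, true
		}
	}
	return time.Time{}, false
}

func main() {
	meta := map[string]string{"next_retry_time": " 2026-06-01T12:00:00Z "}
	ts, ok := nextRetryAt(meta)
	fmt.Println(ts.UTC().Format(time.RFC3339), ok)
}
```

Defining the key once as a shared constant, used by both the writer and the reader, would prevent the two sides from drifting apart again.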



1 participant