Debug runtime issues rev 1 by miladhzzzz · Pull Request #6 · persys-dev/compute-agent

miladhzzzz · 2026-06-01T13:27:28Z

Summary

This release adds support for managed volume storage orchestration, enhanced workload telemetry, and improved error handling aligned with scheduler platform extensions. Compute-agent now supports NFS and Ceph-RBD managed volumes for containers and VMs, with proper lifecycle management and storage driver capability advertisement.

Major Features

Managed Volume Support
- Added ManagedVolumes []ManagedVolumeSpec to container and VM workload specs
- Support for multiple storage drivers:
  - local - Host bind paths (existing behavior)
  - nfs - NFS server mounts
  - ceph-rbd - Ceph RBD block devices
- Per-volume configuration: name, driver, size (GB), access mode, filesystem type, mount path, read-only, retain policy
- Graceful fallback to host bind paths if managed volume provisioning fails
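
As a rough illustration of the per-volume configuration listed above, the spec could look something like the following. This is a minimal sketch: the field names, types, and the Validate helper are assumptions made for the example and are not taken from the PR's actual schema.

```go
package main

import (
	"errors"
	"fmt"
)

// ManagedVolumeSpec mirrors the per-volume settings named in the description.
// Field names and types here are hypothetical, not the PR's real definition.
type ManagedVolumeSpec struct {
	Name         string
	Driver       string // "local", "nfs", or "ceph-rbd"
	SizeGB       int64
	AccessMode   string
	FSType       string
	MountPath    string
	ReadOnly     bool
	RetainPolicy string
}

// supportedDrivers lists the three drivers the description mentions.
var supportedDrivers = map[string]bool{"local": true, "nfs": true, "ceph-rbd": true}

// Validate is a hypothetical helper that rejects specs the agent could not act on.
func (s ManagedVolumeSpec) Validate() error {
	if s.Name == "" {
		return errors.New("managed volume: name is required")
	}
	if !supportedDrivers[s.Driver] {
		return fmt.Errorf("managed volume %q: unsupported driver %q", s.Name, s.Driver)
	}
	if s.MountPath == "" {
		return fmt.Errorf("managed volume %q: mount path is required", s.Name)
	}
	return nil
}
```

Rejecting an unknown driver before provisioning would keep the fallback to host bind paths for genuine provisioning failures rather than malformed specs.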
Storage Driver Capability Advertisement
- Agent advertises supported storage drivers during node registration
- SupportedStorageDrivers[] field populated in heartbeat
- Enables scheduler to make storage-aware placement decisions
Enhanced Workload Telemetry
- Added WorkloadUsage model with CPU%, memory, disk I/O, and network metrics
- Per-workload usage collection and exposure in status responses
- Enables performance correlation with placement decisions
Improved Failure Diagnostics
- Added WorkloadReason with structured code, message, last transition, and next retry metadata
- Terminal failure detection: prevents reapply loops for non-retryable errors
- Failure reason propagation: infrastructure vs runtime error classification
Cloud-Init Enhancement (VM)
- Support for structured CloudInitConfig with separate fields:
  - user_data - User-provided cloud-init script
  - meta_data - Cloud-init metadata
  - network_config - Network configuration (optional)
  - vendor_data - Vendor data (optional)
- Faithful injection of all cloud-init fields into VM boot
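
One way to picture the structured payload is as a mapping from the four fields to seed files for the cloud-init ISO. The sketch below assumes the conventional cloud-init file names (user-data, meta-data, network-config, vendor-data); the Go type, SeedFiles, and FromLegacy are hypothetical names, not the PR's implementation.

```go
package main

// CloudInitConfig is a hypothetical Go view of the structured cloud-init fields.
type CloudInitConfig struct {
	UserData      string
	MetaData      string
	NetworkConfig string // optional
	VendorData    string // optional
}

// SeedFiles maps seed file names to contents. The two optional files are
// omitted when empty so the guest does not see blank network or vendor data.
func SeedFiles(cfg CloudInitConfig) map[string]string {
	files := map[string]string{
		"user-data": cfg.UserData,
		"meta-data": cfg.MetaData,
	}
	if cfg.NetworkConfig != "" {
		files["network-config"] = cfg.NetworkConfig
	}
	if cfg.VendorData != "" {
		files["vendor-data"] = cfg.VendorData
	}
	return files
}

// FromLegacy wraps the deprecated single-string field as user data only
// (an assumed migration path, shown for illustration).
func FromLegacy(cloudInit string) CloudInitConfig {
	return CloudInitConfig{UserData: cloudInit}
}
```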

Breaking Changes

None. All changes are backward compatible.

Deprecations

- Single-string cloudInit field deprecated in favor of structured CloudInitConfig
- Legacy bind-path-only volume handling will be superseded by managed volume system

Changed Files

api/proto/agent.proto (updated)
- Added ManagedVolumeSpec message type
- Added ManagedVolumes field to ContainerSpec
- Added ManagedVolumes field to VMSpec
- Added WorkloadUsage and WorkloadReason message types
- Extended heartbeat to include workload usage snapshots
internal/models/workload.go (updated)
- Added ManagedVolumes []ManagedVolumeSpec to container spec
- Added ManagedVolumes []ManagedVolumeSpec to VM spec
- Added WorkloadUsage struct for telemetry
- Added WorkloadReason struct for structured failures
internal/control/client.go (updated)
- Updated node registration to advertise SupportedStorageDrivers
- Enhanced heartbeat to include per-workload usage snapshots
- Improved failure reason propagation
internal/runtime/docker.go (updated)
- Added managed volume mount support
- Fall back to host bind paths if managed volumes unavailable
- Properly handle read-only and mount-path specifications
internal/runtime/vm.go (updated)
- Added managed volume disk attachment for Ceph/NFS backends
- Enhanced cloud-init ISO builder with structured payload support
- Proper handling of meta-data, network-config, vendor-data files
internal/workload/manager.go (updated)
- Added managed volume provisioning lifecycle
- Pre-provision and attach volumes before runtime create
- Cleanup volumes on workload deletion (respecting retain policy)
pkg/api/v1/ (regenerated)
- Protobuf code generation for new message types

Resource Impact

Storage Overhead:

- Minimal: managed volume metadata tracked in control plane
- Agent reports only node-level storage capabilities

Network:

- Heartbeat size increases by ~200-500 bytes per workload (usage metrics)
- One-time increase during node registration (~100 bytes)

Backward Compatibility:

- Old workload specs without ManagedVolumes continue to work
- Agent falls back to host bind paths automatically
- Single-string CloudInit still supported alongside structured config


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 366c97f9bb


chatgpt-codex-connector · 2026-06-01T13:32:08Z

-			if delErr := rt.Delete(ctx, workload.ID); delErr != nil {
-				m.logger.Warnf("Failed to delete workload during cleanup: %v", delErr)
-			}
+			m.cleanupManagedStorageAllocations(ctx, managedAllocations, true)


Keep managed storage usable for retry recreate

When rt.Start fails after rt.Create succeeded for a workload with managed volumes, this branch detaches/deletes the freshly provisioned allocations while leaving the workload persisted for reconciliation. The retry path later recreates from the saved spec without rerunning managed-storage preparation, so it can reuse stale host paths/devices that were just removed. For start failures on managed-volume workloads, either keep the allocations until retry or ensure the retry path reprovisions them.


chatgpt-codex-connector · 2026-06-01T13:32:08Z

+		attachment, err := m.volumeMgr.Attach(ctx, platformSpec, handle, workload.ID)
+		if err != nil {
+			return nil, fmt.Errorf("attach volume %q (%s): %w", spec.Name, spec.Driver, err)


Roll back provisioned volumes when attach fails

If Attach fails after Provision has succeeded, this returns immediately without calling the rollback helper, leaving the saved handle and backing directory/image behind; with multiple managed volumes it also leaves any earlier attachments from the same loop. This can happen on stage-directory permission errors or later-volume failures, so the partial allocations should be cleaned up before returning the storage setup error.
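
The rollback this comment asks for could be sketched as below. Everything here is a simplification under stated assumptions: the volumeManager interface, prepareVolumes, and the in-memory fakeVolumes are hypothetical stand-ins, and the real Attach in the PR takes more arguments. The sketch also deletes every volume provisioned in the failed call, which a real implementation would need to reconcile with retain policies and pre-existing handles.

```go
package main

import (
	"errors"
	"fmt"
)

// volumeManager is a hypothetical, simplified stand-in for the PR's volume manager.
type volumeManager interface {
	Provision(name string) (string, error)
	Attach(handle string) error
	Detach(handle string) error
	Delete(handle string) error
}

// prepareVolumes provisions and attaches each volume. On any failure the
// deferred rollback detaches and deletes everything set up so far, in reverse order.
func prepareVolumes(vm volumeManager, names []string) (handles []string, err error) {
	var provisioned, attached []string
	defer func() {
		if err == nil {
			return
		}
		for i := len(attached) - 1; i >= 0; i-- {
			_ = vm.Detach(attached[i]) // best effort; a real agent would log failures
		}
		for i := len(provisioned) - 1; i >= 0; i-- {
			_ = vm.Delete(provisioned[i])
		}
	}()
	for _, name := range names {
		h, perr := vm.Provision(name)
		if perr != nil {
			return nil, fmt.Errorf("provision volume %q: %w", name, perr)
		}
		provisioned = append(provisioned, h)
		if aerr := vm.Attach(h); aerr != nil {
			return nil, fmt.Errorf("attach volume %q: %w", name, aerr)
		}
		attached = append(attached, h)
	}
	return attached, nil
}

// fakeVolumes is an in-memory manager used only to exercise the sketch.
type fakeVolumes struct {
	failAttach  string          // volume name whose Attach should fail
	provisioned map[string]bool // handles that currently exist
	attached    map[string]bool // handles that are currently attached
}

func newFakeVolumes(failAttach string) *fakeVolumes {
	return &fakeVolumes{failAttach: failAttach, provisioned: map[string]bool{}, attached: map[string]bool{}}
}

func (f *fakeVolumes) Provision(name string) (string, error) {
	h := "vol-" + name
	f.provisioned[h] = true
	return h, nil
}

func (f *fakeVolumes) Attach(h string) error {
	if f.failAttach != "" && h == "vol-"+f.failAttach {
		return errors.New("stage directory not writable")
	}
	f.attached[h] = true
	return nil
}

func (f *fakeVolumes) Detach(h string) error { delete(f.attached, h); return nil }
func (f *fakeVolumes) Delete(h string) error { delete(f.provisioned, h); return nil }
```

With the fake configured to fail the second attachment, a call such as prepareVolumes(newFakeVolumes("b"), []string{"a", "b"}) returns an error and leaves no provisioned or attached handles behind, including the earlier volume from the same loop.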


chatgpt-codex-connector · 2026-06-01T13:32:08Z

+		if retainPolicy == "retain" {
+			continue
+		}
+		if err := m.volumeMgr.Delete(ctx, handle); err != nil {


Do not delete volumes still attached elsewhere

Because volume handles are keyed only by driver:name, two workloads that request the same managed volume share the same stored handle, but deleting one workload deletes the handle and backing storage immediately after removing only that workload's attachment. For the local provider this Delete removes the shared directory, which can destroy data for another still-running workload; check for other attachments to the same volume before deleting it.
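
A minimal sketch of the suggested check, assuming attachments are tracked per driver:name key: the backing storage is deleted only when the last workload releases the volume. The attachmentTracker type and its methods are hypothetical and exist only for this example.

```go
package main

// attachmentTracker records which workloads are attached to each volume key.
// It is a hypothetical helper, not a type from the PR.
type attachmentTracker struct {
	byVolume map[string]map[string]bool
}

func newAttachmentTracker() *attachmentTracker {
	return &attachmentTracker{byVolume: make(map[string]map[string]bool)}
}

// volumeKey builds the driver:name key that the review comment describes.
func volumeKey(driver, name string) string { return driver + ":" + name }

// Attach records that workloadID uses the volume identified by key.
func (t *attachmentTracker) Attach(key, workloadID string) {
	if t.byVolume[key] == nil {
		t.byVolume[key] = make(map[string]bool)
	}
	t.byVolume[key][workloadID] = true
}

// Release drops the workload's attachment and reports whether the volume is
// now unused, meaning it is safe to delete the handle and backing storage.
func (t *attachmentTracker) Release(key, workloadID string) bool {
	users := t.byVolume[key]
	delete(users, workloadID) // deleting from a nil map is a no-op
	if len(users) > 0 {
		return false
	}
	delete(t.byVolume, key)
	return true
}
```

In the cleanup loop, the Delete call would then run only when Release returns true and the retain policy allows it.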


chatgpt-codex-connector · 2026-06-01T13:32:08Z

+		code = strings.TrimSpace(status.Metadata["failure_reason"])
+		message = strings.TrimSpace(status.Metadata["failure_message"])
+		retryable = strings.EqualFold(strings.TrimSpace(status.Metadata["retryable"]), "true")
+		if rawNext := strings.TrimSpace(status.Metadata["retry_next_at"]); rawNext != "" {


Read the retry timestamp key that is written

The workload manager records retry delays under next_retry_time in the status metadata, but the heartbeat reason conversion looks for retry_next_at, so ReasonDetail.NextRetryAt is never populated for deferred retries. This makes the control plane miss the scheduled retry time even though it is present in the agent status metadata.


miladhzzzz added 27 commits February 23, 2026 17:11

Update: Package Name to use correct persys-dev name scheme

d6ea8f9

Chore: Add Make Proto Target With output path update

fb406ba

Chore: Update Infra + go mod + readme

f7bca88

Update: CHANGELOG.md

b713f4f

Update: README.md

b4178e7

Feat: Add Managed Volumes to API schema

d3d6098

Feat: Add Opentelemetry/Fix Package name / Volume management

44b9fb3

Add: metrics port / storage configuration

184d0a5

Fix: Package name / Add OpenTelemetry / Add workload status

fd95575

Fix: Change Package Name / Add OTEL and Storage management

7a08fcc

Update: Metrics collect workload states

f3dbd6d

Add: platform specific interfaces for networking / storage

0bf1c5f

Update: policy test retry logic

41b0298

Feat: Handle Different failure modes and report

9b3fe47

Add: Docker compose tests

70561a0

Feat: Add compose status

0343f07

Feat: Add status metadata gathering

d6d0a93

Add: vm disk source test

7f9455c

Update: Add storage configuration / cloud-init injection + better workload status handling

e36da39

Add: Bolt manager

a04bbe0

Add: Managed volumes

2261a54

Feat: Add Storage Providers

9e36cf9

Feat: Add better observablity to queue

b8e8981

Add: OpenTelemetry Driver

e376ede

Feat: Add volume management + better reporting

68d0816

Add: Workload Retry Logic test

edd894a

Update: Schema to reflect latest proto changes

366c97f

miladhzzzz merged commit ca4cfd8 into main Jun 1, 2026
6 of 7 checks passed

chatgpt-codex-connector Bot reviewed Jun 1, 2026

miladhzzzz merged 27 commits into main from debug-runtime-issues-rev-1
