Debug runtime conditions rev 1 #22
…mentation + FIX etcd memory exceed bug
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: caa4d95da8
    Volumes: w.Volumes,
    Network: w.Network,
    RestartPolicy: w.RestartPolicy,
    VM: w.VM,
Persist top-level managed volumes
When a container workload includes ContainerSpec.managed_volumes, controlApplyToModel stores those specs in models.Workload.ManagedVolumes, but this new spec projection never serializes ManagedVolumes and never reconstructs it on read. After the first save/reload the scheduler loses the requested volume specs; because the metadata marker remains, later storage syncs see an empty required set and can release/delete the managed volume state for a still-running workload.
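A minimal sketch of the fix this comment asks for, assuming the spec projection is serialized as JSON. The types below are trimmed, hypothetical stand-ins for models.Workload and the PR's workloadSpec, and the helper names (specFromWorkload, applyTo, roundTrip) are illustrative only. The point is that ManagedVolumes has to appear in both the write path and the read path to survive a save/reload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ManagedVolumeSpec is a hypothetical stand-in; the real struct carries more fields.
type ManagedVolumeSpec struct {
	Name   string `json:"name"`
	Driver string `json:"driver"`
}

// Workload is a trimmed, hypothetical version of models.Workload.
type Workload struct {
	ID             string
	Volumes        []string
	ManagedVolumes []ManagedVolumeSpec
}

// workloadSpec is the persisted projection (field set assumed for illustration).
type workloadSpec struct {
	ID             string              `json:"id"`
	Volumes        []string            `json:"volumes,omitempty"`
	ManagedVolumes []ManagedVolumeSpec `json:"managed_volumes,omitempty"`
}

// specFromWorkload is the write path: every field that must survive goes here.
func specFromWorkload(w Workload) workloadSpec {
	return workloadSpec{ID: w.ID, Volumes: w.Volumes, ManagedVolumes: w.ManagedVolumes}
}

// applyTo is the read path: it must mirror specFromWorkload field for field.
func (s workloadSpec) applyTo(w *Workload) {
	w.ID = s.ID
	w.Volumes = s.Volumes
	w.ManagedVolumes = s.ManagedVolumes
}

// roundTrip simulates a save followed by a reload.
func roundTrip(w Workload) (Workload, error) {
	raw, err := json.Marshal(specFromWorkload(w))
	if err != nil {
		return Workload{}, err
	}
	var s workloadSpec
	if err := json.Unmarshal(raw, &s); err != nil {
		return Workload{}, err
	}
	var out Workload
	s.applyTo(&out)
	return out, nil
}

func main() {
	w := Workload{ID: "w1", ManagedVolumes: []ManagedVolumeSpec{{Name: "data", Driver: "nfs"}}}
	got, err := roundTrip(w)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(got.ManagedVolumes), got.ManagedVolumes[0].Driver)
}
```

A round-trip test of this shape would catch the regression described above, since dropping the field from either helper leaves the reloaded workload with an empty volume list.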
    replace github.com/lib/pq => ./third_party/libpq

    replace github.com/redis/go-redis/v9 => ./third_party/go-redis-v9
Remove missing local module replacements
A clean checkout cannot build persys-automation with these replace directives because neither persys-automation/third_party/libpq nor persys-automation/third_party/go-redis-v9 is committed; I verified go list ./... fails with missing replacement-directory errors. This blocks the new automation service before any code or Docker image can be built unless those directories are added or the replacements are removed.
Summary
This release adds Redis-backed telemetry storage to solve etcd space exhaustion, improved runtime failure handling, managed volume support, and workload utilization telemetry.
Major Features
etcd Space Optimization with Redis Telemetry Store
- … /workloads-spec/{id}) and mutable status (/workloads-status/{id})
- /workloads/{id} data continues to load via compatibility shim
- REDIS_ADDR, REDIS_PASSWORD, REDIS_DB, REDIS_RECONCILE_TTL, REDIS_EVENT_TTL, REDIS_EVENT_MAX_ENTRIES
Enhanced Runtime Failure Handling
- WorkloadReason containing code, message, last transition, next retry metadata
Managed Volume Support
- SupportedStorageDrivers[] field to Node model for storage capability advertising
- local - Host bind paths (existing behavior)
- nfs - NFS server mounts
- ceph-rbd - Ceph RBD block devices
- ManagedVolumeRecord and VolumeAttachmentRecord across lifecycle
Workload Utilization Telemetry
- WorkloadUsage model with CPU%, memory bytes, disk read/write bytes, network RX/TX bytes, collection timestamp
- persys_scheduler_state_store_writes_total{category} - Track storage operations by type
Breaking Changes
None. All changes are backward compatible.
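The compatibility claim rests on reads preferring the split keys and falling back to the legacy key. The sketch below illustrates that lookup order under stated assumptions: the store is an in-memory map standing in for etcd, values are JSON, and the workload type and loadWorkload function are hypothetical names, not the PR's code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

var errNotFound = errors.New("workload not found")

// kv is an in-memory stand-in for the etcd keyspace.
type kv map[string][]byte

// workload is a hypothetical, trimmed record merging spec and status fields.
type workload struct {
	ID     string `json:"id"`
	Image  string `json:"image,omitempty"`
	Status string `json:"status,omitempty"`
}

// loadWorkload prefers the split spec/status keys and falls back to the
// legacy /workloads/{id} key when no spec key exists.
func loadWorkload(store kv, id string) (workload, error) {
	var w workload
	if spec, ok := store["/workloads-spec/"+id]; ok {
		if err := json.Unmarshal(spec, &w); err != nil {
			return workload{}, err
		}
		// Status is optional; decoding into the same struct only overwrites
		// the fields present in the status document.
		if status, ok := store["/workloads-status/"+id]; ok {
			if err := json.Unmarshal(status, &w); err != nil {
				return workload{}, err
			}
		}
		return w, nil
	}
	if legacy, ok := store["/workloads/"+id]; ok {
		if err := json.Unmarshal(legacy, &w); err != nil {
			return workload{}, err
		}
		return w, nil
	}
	return workload{}, errNotFound
}

func main() {
	store := kv{
		"/workloads-spec/a":   []byte(`{"id":"a","image":"nginx"}`),
		"/workloads-status/a": []byte(`{"status":"Running"}`),
		"/workloads/b":        []byte(`{"id":"b","image":"redis","status":"Pending"}`),
	}
	for _, id := range []string{"a", "b", "c"} {
		w, err := loadWorkload(store, id)
		fmt.Println(id, w.Status, err)
	}
}
```

Checking the split keys first means a workload that has already been rewritten in the new layout never reads stale legacy data, while untouched records keep loading.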
Deprecations
- /workloads/{id} key is deprecated in favor of /workloads-spec/{id} + /workloads-status/{id} split
- cloud_init field deprecated in favor of structured CloudInitConfig with separate fields
- reapplyStillGuarded return signature changed to (bool, time.Time) to provide wait-until timestamp
Changed Files
internal/config/config.go (+36 lines)
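As an illustration of what loading the six REDIS_* variables could look like, here is a hedged sketch. The variable names come from this description; the struct, function names, default values, and the choice of Go duration strings (for example "30m") for the TTLs are assumptions, not the contents of config.go. An empty REDIS_ADDR is treated as etcd-only mode, matching the migration note.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// redisConfig is a hypothetical holder for the REDIS_* settings.
type redisConfig struct {
	Addr            string
	Password        string
	DB              int
	ReconcileTTL    time.Duration
	EventTTL        time.Duration
	EventMaxEntries int
}

// Enabled reports whether Redis telemetry is configured; empty means etcd-only.
func (c redisConfig) Enabled() bool { return c.Addr != "" }

func intFromEnv(getenv func(string) string, key string, def int) (int, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return n, nil
}

func durationFromEnv(getenv func(string) string, key string, def time.Duration) (time.Duration, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return d, nil
}

// loadRedisConfig reads the settings through getenv so it can be tested
// without touching the process environment. All defaults are assumed values.
func loadRedisConfig(getenv func(string) string) (redisConfig, error) {
	cfg := redisConfig{Addr: getenv("REDIS_ADDR"), Password: getenv("REDIS_PASSWORD")}
	var err error
	if cfg.DB, err = intFromEnv(getenv, "REDIS_DB", 0); err != nil {
		return redisConfig{}, err
	}
	if cfg.ReconcileTTL, err = durationFromEnv(getenv, "REDIS_RECONCILE_TTL", time.Hour); err != nil {
		return redisConfig{}, err
	}
	if cfg.EventTTL, err = durationFromEnv(getenv, "REDIS_EVENT_TTL", 24*time.Hour); err != nil {
		return redisConfig{}, err
	}
	if cfg.EventMaxEntries, err = intFromEnv(getenv, "REDIS_EVENT_MAX_ENTRIES", 1000); err != nil {
		return redisConfig{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := loadRedisConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid Redis config:", err)
		os.Exit(1)
	}
	if !cfg.Enabled() {
		fmt.Println("REDIS_ADDR empty: running etcd-only")
		return
	}
	fmt.Printf("Redis telemetry at %s (db %d)\n", cfg.Addr, cfg.DB)
}
```

Passing getenv as a parameter keeps the parser independent of the process environment, so malformed values can be tested directly.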
internal/scheduler/redis_store.go (NEW FILE, +92 lines)
- initRedisStore() - Initialize Redis with graceful degradation
- writeReconciliationTelemetry() - Store reconciliation metadata in Redis
- writeEventTelemetry() - Store bounded event history in Redis
internal/scheduler/workload_projection.go (NEW FILE, +83 lines)
- workloadSpec - Immutable specification data type
- workloadStatus - Mutable status and telemetry data type
internal/models/models.go (+183 lines)
- SupportedStorageDrivers []string to Node
- ManagedVolumes []ManagedVolumeSpec to Workload and VMSpec
- ManagedVolumeSpec struct with details
- WorkloadUsage struct for telemetry
- WorkloadReason struct for structured failures
- ManagedVolumeRecord and VolumeAttachmentRecord for control-plane state
internal/scheduler/state_store.go (+455 lines)
- saveWorkload() to split and persist spec/status separately
- writeReconciliationRecord() to use Redis telemetry
- emitEvent() to use Redis with etcd fallback
internal/scheduler/scheduler.go (+97 lines)
- redisClient *redis.Client field
- NewScheduler() to call initRedisStore()
- Close() to close Redis connection
- GetWorkloads() to load from split spec/status keys
- GetWorkloadByID() to load from split keys with legacy compatibility
- DeleteWorkloadWithContext() to delete split keys
- UpdateWorkloadRuntimeDetails() for storing structured reason + usage
- requiredStorageDrivers() and nodeSupportsStorageDrivers() for scheduling
internal/scheduler/reconciler.go (+286 lines)
- ReconcileWorkload() with exponential backoff logic
- shouldHaltReapply() for terminal failure detection
- nextReapplyAttempt() and resetReapplyBackoff() helpers
- getActualWorkloadState() to capture error metadata
- applyDesiredState() to track reapply attempts and reset on success
- updateWorkloadReconciliationStatus() to use Redis storage
internal/scheduler/workload_control.go (+93 lines)
- UpdateWorkloadRetryOnFailure() with grace window logic:
- UpdateWorkloadSpec() to detect actual spec changes
- applyFailureReason(), preferredRuntimeFailureReason(), isInfrastructureFailureReason()
internal/metrics/metrics.go (+9 lines)
- stateStoreWritesTotal counter with category label
- IncStateStoreWrite(category string) function
api/proto/control.proto (+34 lines)
- SubmitAutomationSuggestion RPC to AgentControl service
- AutomationActionType enum
go.mod (+1 line)
- github.com/redis/go-redis/v9 v9.19.0
sample.env (+6 lines)
Resource Impact
etcd Reduction (12-hour baseline: 100 workloads, 5s reconciliation):
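For a sense of scale, the baseline parameters alone fix the number of reconciliation passes in the window. The sketch below only does that arithmetic; the function name is illustrative, and how many etcd writes each pass produced before or after this change is not stated here, so no reduction figure is derived.

```go
package main

import "fmt"

// reconcilePasses counts reconciliation passes across all workloads in a
// window, assuming every workload is reconciled once per interval.
func reconcilePasses(workloads, windowSeconds, intervalSeconds int) int {
	if intervalSeconds <= 0 {
		return 0
	}
	return workloads * (windowSeconds / intervalSeconds)
}

func main() {
	// Baseline from the heading above: 100 workloads, 12 hours, 5 s interval.
	fmt.Println(reconcilePasses(100, 12*3600, 5)) // 100 * 8640 = 864000
}
```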
Redis Requirements:
Backward Compatibility:
- /workloads/{id} continue to load via compatibility shim
Migration Notes
- REDIS_ADDR set (or empty to use etcd-only)
Known Issues
None documented at this time.
Testing
All changes follow the runtime condition debugging and Redis optimization specification exactly.
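The reapply backoff listed under reconciler.go and the (bool, time.Time) guard signature noted under Deprecations can be illustrated with a small, testable sketch. This is not the PR's implementation: nextDelay and stillGuarded are hypothetical names, and the 5-second base and 5-minute cap are assumed values chosen only to make the doubling visible.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Second // assumed base delay
	maxDelay  = 5 * time.Minute // assumed cap
)

// nextDelay returns baseDelay * 2^(attempt-1), capped at maxDelay.
// Attempts below 1 are treated as the first attempt.
func nextDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// stillGuarded reports whether a reapply must still wait and, in the same
// (bool, time.Time) shape, the moment the wait ends.
func stillGuarded(lastAttempt time.Time, attempt int, now time.Time) (bool, time.Time) {
	until := lastAttempt.Add(nextDelay(attempt))
	return now.Before(until), until
}

func main() {
	last := time.Now()
	for attempt := 1; attempt <= 8; attempt++ {
		guarded, until := stillGuarded(last, attempt, last.Add(time.Second))
		fmt.Println(attempt, nextDelay(attempt), guarded, until.Sub(last))
	}
}
```

Returning the wait-until timestamp alongside the boolean lets a caller schedule the next attempt directly instead of polling the guard.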