
feat(client): self-upgrade CronJob (closes #69)#89

Open
saadqbal wants to merge 2 commits into develop from feat/auto-upgrade-cronjob

Conversation

Contributor

@saadqbal saadqbal commented Apr 29, 2026

Summary

  • Closes #69 (Helm chart auto-update — deployed clients stay frozen on installed version). Ships a <release>-auto-upgrade CronJob that polls https://tracebloc.github.io/client daily and runs helm upgrade --reuse-values when a newer chart version is published, so deployed clients no longer freeze on the version they first installed.
  • Implements option B (auto-upgrade) from the issue — picked over notify-and-approve / hybrid because the issue's whole point is "customers don't run upgrade manually".
  • Bumps chart 1.2.3 → 1.3.0.

What's in the chart now

| Resource | Name | Notes |
| --- | --- | --- |
| CronJob | `<release>-auto-upgrade` | `concurrencyPolicy: Forbid`, default schedule `23 2 * * *` (UTC), `backoffLimit: 2` |
| ConfigMap | `<release>-auto-upgrade` | Holds the upgrade shell script |
| ServiceAccount | `<release>-auto-upgrade` | In the release namespace |
| ClusterRoleBinding | `<release>-auto-upgrade` | Bound to the built-in `cluster-admin` ClusterRole |

The Pod satisfies PSA restricted: runAsNonRoot: true, runAsUser/Group: 1000, RuntimeDefault seccomp, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, capabilities.drop: [ALL]. HOME and HELM_*_HOME are redirected into a /tmp emptyDir so helm can write its caches.

The version comparison uses sort -V rather than a lexical string compare, so 1.10.0 correctly ranks above 1.9.0.
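The comparison can be sketched in plain shell (variable names here are illustrative, not the chart script's actual ones):

```shell
# sort -V orders version strings numerically per component, so 1.10.0
# sorts after 1.9.0; a plain string sort would misorder them.
current="1.9.0"
latest="1.10.0"
# The newer of the two versions is whichever sorts last under -V.
newest=$(printf '%s\n%s\n' "$current" "$latest" | sort -V | tail -n1)
if [ "$newest" != "$current" ]; then
  echo "upgrade available: $current -> $latest"
fi
# -> upgrade available: 1.9.0 -> 1.10.0
```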

Trade-offs surfaced for review

  1. Default enabled: true. The issue is "customers freeze on the version they installed". Default-off recreates the bug for everyone who installs 1.3.0 and never re-runs helm. Operator can opt out via autoUpgrade.enabled: false, or pause via autoUpgrade.suspend: true.
  2. cluster-admin not a narrow custom role. The chart already templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRole/Binding, optionally Namespace). A curated role would silently break the day a future chart version adds a new resource kind on already-deployed clients. Trust boundary is documented in values.yaml next to the autoUpgrade: block.
  3. Backend reporting deferred. The issue mentions reporting back to the tracebloc backend so the workspace UI knows what version each customer runs. No endpoint exists yet — will land as a follow-up once the contract is defined. Not blocking the upgrade loop.
  4. Image pinned to alpine/helm:3.16.4. No jq dependency — the script parses helm's YAML output with awk only. latest is rejected by the schema so behaviour can't drift out from under us.
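The awk-only field extraction can be sketched as follows (the sample output layout is an assumption based on `helm search repo -o yaml`; the PR only states that awk is used instead of jq):

```shell
# Hypothetical sample of what `helm search repo tracebloc/client -o yaml`
# might emit; only the `version:` field matters here.
sample='- app_version: 1.3.0
  description: tracebloc client
  name: tracebloc/client
  version: 1.3.0'

# Print the value of the first top-level-looking `version:` field and stop.
latest=$(printf '%s\n' "$sample" | awk '$1 == "version:" { print $2; exit }')
echo "$latest"   # -> 1.3.0
```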

Test plan

  • helm lint client clean (default + AKS + bare-metal value files)
  • helm template stg client … renders 4 new docs by default; renders 0 with autoUpgrade.enabled=false
  • helm unittest client — 12 suites, 116 tests pass (11 of those new in tests/auto_upgrade_test.yaml)
  • Reviewer to verify on a live cluster (out of scope for this PR but the natural next step):
    • helm install 1.3.0 fresh; observe <release>-auto-upgrade CronJob created
    • Manually trigger the Job (kubectl create job … --from=cronjob/<release>-auto-upgrade); confirm it logs already at latest; nothing to do
    • Publish a 1.3.1 to gh-pages; trigger the Job again; observe helm upgrade complete: 1.3.0 -> 1.3.1
    • Verify reused values: helm get values <release> still shows clientId / clientPassword (not stripped by --reuse-values)
    • Verify on a bare-metal cluster (no CSI) since the script's helm upgrade --wait --timeout 10m interacts with PVC bind timing

Follow-ups (separate tickets)

  • Backend reporting endpoint + auth wiring (chart side: optional autoUpgrade.reportUrl).
  • Optional per-customer override on what major-version bumps trigger auto-apply vs notify (option C / hybrid). Not needed for v1.

🤖 Generated with Claude Code


Note

High Risk
Introduces a default-on CronJob that can mutate cluster-scoped resources and is bound to cluster-admin, so a compromised pod or chart repo could lead to full cluster takeover.

Overview
Adds an auto-upgrade mechanism to the client Helm chart: when autoUpgrade.enabled (default true) it installs a <release>-auto-upgrade ConfigMap+CronJob that periodically checks the published chart repo and runs helm upgrade --reset-then-reuse-values to move the release to the latest chart version.

This also introduces the supporting ServiceAccount + ClusterRoleBinding (bound to built-in cluster-admin), new autoUpgrade values and schema validation (including rejecting image.tag: latest), updates install notes/migration docs, adds helm-unittest coverage, and bumps the chart/app version to 1.3.0.

Reviewed by Cursor Bugbot for commit b402aca.

Ship a chart-side CronJob that polls the published Helm repo daily and runs
`helm upgrade --reuse-values` when a newer chart version is available, so
deployed clients no longer freeze on the version they first installed and
miss security/stability fixes.

- New templates: auto-upgrade-cronjob.yaml (ConfigMap + CronJob),
  auto-upgrade-rbac.yaml (ServiceAccount + ClusterRoleBinding to the
  built-in cluster-admin ClusterRole).
- New values: autoUpgrade.{enabled, schedule, repoUrl, repoName, chartName,
  timeout, suspend, successfulJobsHistoryLimit, failedJobsHistoryLimit,
  startingDeadlineSeconds, image, resources}; default ON.
- Pod satisfies PSA restricted (runAsNonRoot, dropped caps, RO root,
  RuntimeDefault seccomp); HOME/HELM_*_HOME redirected to a tmp emptyDir.
- Version compare uses sort -V so 1.10 > 1.9.
- Bumps chart 1.2.3 -> 1.3.0; MIGRATION.md documents how to opt out.

Cluster-admin (rather than a curated narrow role) keeps the upgrader
robust: the chart already templates cluster-scoped resources
(PriorityClass, StorageClass, ClusterRole/Binding, optionally Namespace),
so a narrower role would silently break the day a future chart adds a new
resource kind. Operators who want tighter posture can disable the feature
and run `helm upgrade` manually.

Backend reporting from the issue is intentionally deferred — no endpoint
exists yet; will land as a follow-up once the contract is defined.
@saadqbal saadqbal self-assigned this Apr 29, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit fb5210b.

Comment thread on client/templates/auto-upgrade-cronjob.yaml (Outdated)
Caught during the #69 verification on tb-client-dev-templates: a dry-run
1.1.0 -> 1.3.0 helm upgrade with --reuse-values fails with

  template: client/templates/auto-upgrade-rbac.yaml:1:14:
  executing "..." at <.Values.autoUpgrade.enabled>:
  nil pointer evaluating interface {}.enabled

because --reuse-values reuses the previous release's COMPUTED values, not
just user overrides — so any top-level key added to values.yaml in the
upgraded chart (autoUpgrade in 1.3.0, anything similar in future bumps) is
absent from the merged values when rendering and the new templates blow up.

--reset-then-reuse-values (helm 3.14+, available in our pinned alpine/helm
3.16.4 image) resets to the new chart's defaults, then layers the customer's
user-supplied values on top — operator overrides like clientId,
dockerRegistry creds, or autoUpgrade.enabled=false are preserved while new
defaults flow through.

- Switch the in-chart upgrade script to --reset-then-reuse-values.
- Update the unit test to assert the corrected flag.
- MIGRATION.md: tell operators to use the same flag for the manual 1.x ->
  1.3.0 jump (subsequent chart bumps will go through the CronJob, which
  now uses the right flag itself).
@saadqbal
Contributor Author

Dev-cluster verification — tb-client-dev-templates (EKS, eu-central-1)

Verified end-to-end against the existing tracebloc release in tracebloc-templates (was on client-1.1.0).

One bug caught + fixed during verification

A dry-run of the manual 1.1.0 → 1.3.0 jump with plain --reuse-values failed:

template: client/templates/auto-upgrade-rbac.yaml:1:14:
executing "..." at <.Values.autoUpgrade.enabled>:
nil pointer evaluating interface {}.enabled

Root cause: --reuse-values reuses the previous release's computed values, so any new top-level key added in the upgraded chart (here, the whole autoUpgrade block) is missing from the merged values and templates that reference it blow up. Fix in commit b402aca:

  • In-chart upgrade script switched to --reset-then-reuse-values (helm 3.14+, available in our pinned alpine/helm:3.16.4)
  • MIGRATION.md tells operators to use the same flag for the manual 1.x → 1.3.0 jump
  • Subsequent chart bumps go through the CronJob, which now uses the right flag itself

Verification results (with the fix)

helm upgrade tracebloc ./client -n tracebloc-templates --reset-then-reuse-values --wait:

  • ✅ Release moved 1.1.0 → 1.3.0, status deployed (revision 8)
  • ✅ Pre-existing mysql-client and tracebloc-jobs-manager pods stayed Running, no restart (non-disruptive upgrade)
  • ✅ PVCs unchanged (client-pvc, client-logs-pvc, mysql-pvc all bound)
  • ✅ NOTES.txt prints the new Auto-upgrade: ON … line

Auto-upgrade resources:

cronjob.batch/tracebloc-auto-upgrade   23 2 * * *   <none>   False   0   <none>   2m26s
serviceaccount/tracebloc-auto-upgrade  0   2m38s
configmap/tracebloc-auto-upgrade       1   2m34s
clusterrolebinding/tracebloc-auto-upgrade   ClusterRole/cluster-admin   2m32s

Manually fired the Job (kubectl create job --from=cronjob/tracebloc-auto-upgrade …); Pod ran under PSA restricted (non-root, RO root, dropped caps, RuntimeDefault seccomp) and exited Succeeded with these logs:

[auto-upgrade] release=tracebloc namespace=tracebloc-templates repo=https://tracebloc.github.io/client
[auto-upgrade] current=1.3.0 latest=1.2.3
[auto-upgrade] deployed version is ahead of repo (current=1.3.0 > latest=1.2.3); skipping

That run exercises every code path the CronJob will execute against the public repo (helm repo add/update, search-repo YAML parse, helm-list YAML parse, semver compare via sort -V). The only branch-specific line for the "newer found" case is the helm upgrade --reset-then-reuse-values --version $LATEST invocation, which was just exercised manually for the bootstrap upgrade itself.
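The skip/upgrade branching those logs describe can be reconstructed as a small shell sketch (the function name and messages are illustrative, not the chart's actual script):

```shell
# decide CURRENT LATEST — print which branch the upgrade loop would take.
decide() {
  current=$1; latest=$2
  if [ "$current" = "$latest" ]; then
    echo "already at latest; nothing to do"
  elif [ "$(printf '%s\n%s\n' "$current" "$latest" | sort -V | tail -n1)" = "$current" ]; then
    # Deployed release is newer than anything published (the bootstrap case).
    echo "deployed version is ahead of repo (current=$current > latest=$latest); skipping"
  else
    echo "upgrade: $current -> $latest"
  fi
}

decide 1.3.0 1.2.3
# -> deployed version is ahead of repo (current=1.3.0 > latest=1.2.3); skipping
```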

Ready for review.
