Fix flaky pkg/controller/reconciler tests by awaiting manager shutdown#12296
Fix flaky pkg/controller/reconciler tests by awaiting manager shutdown#12296brooke-hamilton wants to merge 3 commits into
Conversation
The reconciler envtest suite intermittently failed at the package level with no named failing test (=== FAIL: pkg/controller/reconciler (0.00s)). Each test setup started a controller-runtime manager in a detached goroutine and only called t.Cleanup(cancel), which signals shutdown but never waits for mgr.Start to return. Manager goroutines therefore leaked past the test that owned them and raced with later tests and package teardown (env.Stop in TestMain). Because SetLogger was never called, a still-shutting-down manager also hit controller-runtime's eventuallyFulfillRoot fallback, printing the "log.SetLogger(...) was never called" warning with a goroutine stack dump after a ~30s delay -- exactly the output seen at the failure. Changes (test-only): - Add a shared startManager helper that runs the manager and, on t.Cleanup, cancels its context and blocks until the goroutine fully exits. - Use it in the five standard reconciler/webhook setups and apply the same await-shutdown pattern to the Flux controller test. - Call runtimelog.SetLogger(logr.Discard()) once in TestMain so controller-runtime always has a logger, removing the noisy, ~30s-delayed fallback path. Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com> Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Pull request overview
This PR addresses intermittent flakiness in the pkg/controller/reconciler envtest suite by ensuring controller-runtime managers are shut down deterministically (no leaked goroutines across tests) and by configuring a default controller-runtime logger in TestMain to avoid the delayed “log.SetLogger(...) was never called” fallback path.
Changes:
- Introduces a shared
startManagerhelper that starts a controller-runtime manager in a goroutine and waits for it to fully exit duringt.Cleanup. - Updates the standard reconciler/webhook test setups to use
startManager, and applies the same “await shutdown” pattern to the Flux controller test. - Sets a discard logger via
runtimelog.SetLogger(logr.Discard())inTestMainto eliminate the controller-runtime fallback warning/stack dump behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/controller/reconciler/shared_test.go | Adds startManager helper to start managers and block until shutdown during test cleanup. |
| pkg/controller/reconciler/recipe_webhook_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/recipe_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/deploymenttemplate_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/deploymentresource_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/deployment_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/flux_controller_test.go | Updates Flux test manager cleanup to cancel and wait for the manager goroutine to exit. |
| pkg/controller/reconciler/main_test.go | Sets controller-runtime logger to logr.Discard() in TestMain to avoid delayed fallback warning output. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #12296 +/- ##
==========================================
- Coverage 52.97% 52.96% -0.01%
==========================================
Files 754 754
Lines 48686 48686
==========================================
- Hits 25791 25787 -4
- Misses 20469 20471 +2
- Partials 2426 2428 +2 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…ment - Add managerShutdownTimeout and a shared waitForManagerShutdown helper so test cleanup fails with a clear message instead of hanging until the global go test timeout if a manager goroutine never exits after its context is cancelled. - Use the bounded wait in both startManager and the Flux controller test. - Reword the goroutine comment: the reason to avoid require/assert is that they may call t.FailNow/t.Fail, which must run on the test goroutine, not that testing.T is inherently racy. Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com> Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
Radius functional test overviewClick here to see the test run details
Test Status⌛ Building Radius and pushing container images for functional tests... |
Description
The
pkg/controller/reconcilerenvtest suite is intermittently flaky. It recently failed on an unrelated PR (failed run — Run Unit Tests, attempt 2) and then passed on re-run, with no code change in between.Root cause
The failure surfaced as a package-level failure with no named failing test:
Every test setup started a controller-runtime manager in a detached goroutine and only called
t.Cleanup(cancel):cancel()only signals shutdown — it never waits formgr.Startto return. So manager goroutines outlived the test that owned them and raced with subsequent tests and package teardown (env.StopinTestMain), which intermittently pushed the test binary to a non-zero exit.Because
SetLoggerwas never called, a still-shutting-down manager also hit controller-runtime'seventuallyFulfillRootfallback, which prints thelog.SetLogger(...) was never calledwarning with a goroutine stack dump after a ~30s delay — exactly the output captured at the failure point (note the 30.179s package time).This is test-harness flakiness, not a product bug: the triggering PR only touched Helm chart templates and had nothing to do with this package.
Changes (test-only)
startManagerhelper inshared_test.gothat runs the manager and, ont.Cleanup, cancels its context and blocks until the goroutine fully exits, so no manager leaks past its test.runtimelog.SetLogger(logr.Discard())once inTestMainso controller-runtime always has a logger, eliminating the noisy, ~30s-delayed fallback path. A discard logger is safe here because it never writes totesting.Tafter a test completes.Type of change
Contributor checklist
Please verify that the PR meets the following requirements, where applicable:
eng/design-notes/in this repository, if new APIs are being introduced.