Fix flaky pkg/controller/reconciler tests by awaiting manager shutdown#12296

Open
brooke-hamilton wants to merge 3 commits into main from fix-reconciler-test-manager-shutdown-flake
@brooke-hamilton

Member

Description

The pkg/controller/reconciler envtest suite is intermittently flaky. It recently failed on an unrelated PR (failed run — Run Unit Tests, attempt 2) and then passed on re-run, with no code change in between.

Root cause

The failure surfaced as a package-level failure with no named failing test:

=== FAIL: pkg/controller/reconciler  (0.00s)
FAIL
[controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
Detected at:
  > ...eventuallyFulfillRoot() ...
  > ...manager.(*controllerManager).engageStopProcedure...
FAIL	github.com/radius-project/radius/pkg/controller/reconciler	30.179s

Every test setup started a controller-runtime manager in a detached goroutine and only called t.Cleanup(cancel):

ctx, cancel := testcontext.NewWithCancel(t)
t.Cleanup(cancel)
...
go func() {
    if err := mgr.Start(ctx); err != nil && !errors.Is(err, context.Canceled) { panic(...) }
}()

cancel() only signals shutdown; it does not wait for mgr.Start to return. Manager goroutines could therefore outlive the test that owned them and race with subsequent tests and with package teardown (env.Stop in TestMain), which intermittently caused the test binary to exit with a non-zero status.
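
To illustrate why cancelling alone is not enough, here is a minimal, stdlib-only sketch. It does not use controller-runtime; runUntilCancelled is a hypothetical stand-in for mgr.Start, and the release channel models shutdown work that is still in progress after the context is cancelled.

```go
package main

import (
	"context"
	"fmt"
)

// runUntilCancelled is a hypothetical stand-in for mgr.Start: it blocks until
// ctx is cancelled, then waits on release (modelling shutdown work that is
// still in progress) before signalling that it has fully exited.
func runUntilCancelled(ctx context.Context, release <-chan struct{}, exited chan<- struct{}) {
	<-ctx.Done()
	<-release
	close(exited)
}

// stillRunningAfterCancel mimics the old cleanup, which only called cancel().
// It reports whether the goroutine was still running when cancel() returned.
func stillRunningAfterCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	release := make(chan struct{})
	exited := make(chan struct{})
	go runUntilCancelled(ctx, release, exited)

	cancel() // signals shutdown, but does not wait for the goroutine

	stillRunning := true
	select {
	case <-exited:
		stillRunning = false
	default:
	}

	// Let the goroutine finish so the sketch itself does not leak it.
	close(release)
	<-exited
	return stillRunning
}

func main() {
	// prints: still running after cancel(): true
	fmt.Println("still running after cancel():", stillRunningAfterCancel())
}
```

In a real test the goroutine would keep running into the next test or into package teardown, which is the race described above.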

Because SetLogger was never called, a still-shutting-down manager also hit controller-runtime's eventuallyFulfillRoot fallback, which prints the log.SetLogger(...) was never called warning with a goroutine stack dump after a ~30s delay — exactly the output captured at the failure point (note the 30.179s package time).

This is test-harness flakiness, not a product bug: the triggering PR only touched Helm chart templates and had nothing to do with this package.

Changes (test-only)

  • Add a shared startManager helper in shared_test.go that runs the manager and, on t.Cleanup, cancels its context and blocks until the goroutine fully exits, so no manager leaks past its test.
  • Use the helper in the five standard reconciler/webhook setups, and apply the same await-shutdown pattern to the Flux controller test.
  • Call runtimelog.SetLogger(logr.Discard()) once in TestMain so controller-runtime always has a logger, eliminating the noisy, ~30s-delayed fallback path. A discard logger is safe here because it never writes to testing.T after a test completes.
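
The await-shutdown pattern in the first bullet can be sketched roughly as follows. This is not the PR's startManager helper, whose exact signature is not shown here. The starter interface, startAndAwait, and fakeManager are hypothetical names, and the sketch uses only the standard library so that it runs on its own.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// starter is the only behaviour this sketch needs from a manager
// (hypothetical interface; a controller-runtime manager has a Start method
// of this shape).
type starter interface {
	Start(ctx context.Context) error
}

// startAndAwait runs mgr.Start in a goroutine and returns a stop function.
// stop cancels the context and blocks until Start has returned, so the
// goroutine cannot outlive the caller. stop is meant to be called once.
func startAndAwait(mgr starter) (stop func() error) {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go func() {
		done <- mgr.Start(ctx)
	}()
	return func() error {
		cancel()
		err := <-done // wait for the manager goroutine to exit
		if errors.Is(err, context.Canceled) {
			return nil // cancellation is the expected way to stop
		}
		return err
	}
}

// fakeManager is a hypothetical manager used only to exercise the sketch.
type fakeManager struct {
	exited chan struct{}
}

func newFakeManager() *fakeManager {
	return &fakeManager{exited: make(chan struct{})}
}

// Start blocks until ctx is cancelled and records that it has returned.
func (m *fakeManager) Start(ctx context.Context) error {
	defer close(m.exited)
	<-ctx.Done()
	return ctx.Err()
}

// hasExited reports whether Start has returned.
func (m *fakeManager) hasExited() bool {
	select {
	case <-m.exited:
		return true
	default:
		return false
	}
}

func main() {
	mgr := newFakeManager()
	stop := startAndAwait(mgr)
	err := stop()
	// prints: <nil> true
	fmt.Println(err, mgr.hasExited())
}
```

In a test, the stop function would typically be registered with t.Cleanup, so the wait happens on the test goroutine before the test is considered finished.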

Type of change

  • This pull request is a minor refactor, code cleanup, test improvement, or other maintenance task and doesn't change the functionality of Radius (issue link optional).

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
    • Not applicable
  • A design document is added or updated under eng/design-notes/ in this repository, if new APIs are being introduced.
    • Not applicable
  • The design document has been reviewed and approved by Radius maintainers/approvers.
    • Not applicable
  • A PR for resource-types-contrib is created, if resource types or recipes are affected by the changes in this PR.
    • Not applicable
  • A PR for dashboard is created, if the Radius Dashboard is affected by the changes in this PR.
    • Not applicable
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
    • Not applicable

The reconciler envtest suite intermittently failed at the package level with
no named failing test (=== FAIL: pkg/controller/reconciler (0.00s)). Each test
setup started a controller-runtime manager in a detached goroutine and only
called t.Cleanup(cancel), which signals shutdown but never waits for mgr.Start
to return. Manager goroutines therefore leaked past the test that owned them and
raced with later tests and package teardown (env.Stop in TestMain).

Because SetLogger was never called, a still-shutting-down manager also hit
controller-runtime's eventuallyFulfillRoot fallback, printing the
"log.SetLogger(...) was never called" warning with a goroutine stack dump after
a ~30s delay -- exactly the output seen at the failure.

Changes (test-only):
- Add a shared startManager helper that runs the manager and, on t.Cleanup,
  cancels its context and blocks until the goroutine fully exits.
- Use it in the five standard reconciler/webhook setups and apply the same
  await-shutdown pattern to the Flux controller test.
- Call runtimelog.SetLogger(logr.Discard()) once in TestMain so controller-runtime
  always has a logger, removing the noisy, ~30s-delayed fallback path.

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
@brooke-hamilton brooke-hamilton requested a review from a team as a code owner July 1, 2026 17:57
Copilot AI review requested due to automatic review settings July 1, 2026 17:57
@github-actions

github-actions Bot commented Jul 1, 2026


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses intermittent flakiness in the pkg/controller/reconciler envtest suite by ensuring controller-runtime managers are shut down deterministically (no leaked goroutines across tests) and by configuring a default controller-runtime logger in TestMain to avoid the delayed “log.SetLogger(...) was never called” fallback path.

Changes:

  • Introduces a shared startManager helper that starts a controller-runtime manager in a goroutine and waits for it to fully exit during t.Cleanup.
  • Updates the standard reconciler/webhook test setups to use startManager, and applies the same “await shutdown” pattern to the Flux controller test.
  • Sets a discard logger via runtimelog.SetLogger(logr.Discard()) in TestMain to eliminate the controller-runtime fallback warning/stack dump behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/controller/reconciler/shared_test.go | Adds startManager helper to start managers and block until shutdown during test cleanup. |
| pkg/controller/reconciler/recipe_webhook_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/recipe_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/deploymenttemplate_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/deploymentresource_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/deployment_reconciler_test.go | Switches manager startup/cleanup to use startManager. |
| pkg/controller/reconciler/flux_controller_test.go | Updates Flux test manager cleanup to cancel and wait for the manager goroutine to exit. |
| pkg/controller/reconciler/main_test.go | Sets controller-runtime logger to logr.Discard() in TestMain to avoid delayed fallback warning output. |

Comment thread pkg/controller/reconciler/shared_test.go
Comment thread pkg/controller/reconciler/shared_test.go Outdated
Comment thread pkg/controller/reconciler/flux_controller_test.go
@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.96%. Comparing base (bf1015c) to head (f48cef2).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12296      +/-   ##
==========================================
- Coverage   52.97%   52.96%   -0.01%     
==========================================
  Files         754      754              
  Lines       48686    48686              
==========================================
- Hits        25791    25787       -4     
- Misses      20469    20471       +2     
- Partials     2426     2428       +2     

@github-actions

github-actions Bot commented Jul 1, 2026


Unit Tests

    2 files  ±0    452 suites  ±0   7m 31s ⏱️ +2s
5 656 tests ±0  5 654 ✅ ±0  2 💤 ±0  0 ❌ ±0 
6 853 runs  ±0  6 851 ✅ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit f48cef2. ± Comparison against base commit bf1015c.

♻️ This comment has been updated with latest results.

…ment

- Add managerShutdownTimeout and a shared waitForManagerShutdown helper so test
  cleanup fails with a clear message instead of hanging until the global go test
  timeout if a manager goroutine never exits after its context is cancelled.
- Use the bounded wait in both startManager and the Flux controller test.
- Reword the goroutine comment: the reason to avoid require/assert is that they
  may call t.FailNow/t.Fail, which must run on the test goroutine, not that
  testing.T is inherently racy.
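
A bounded wait of the kind described in this commit could look roughly like the sketch below. It is not the PR's waitForManagerShutdown. The names shutdownTimeout, waitForShutdown, errShutdownTimeout, and the two demo functions are hypothetical, and the timeout value is an assumption. The manager goroutine only sends its result on a channel; any failure is then reported by the caller, which in a test would be the test goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// shutdownTimeout is an assumed value for illustration only.
const shutdownTimeout = 5 * time.Second

// errShutdownTimeout is returned when the manager goroutine does not exit in time.
var errShutdownTimeout = errors.New("manager did not shut down before the timeout")

// waitForShutdown waits for the manager goroutine's result on done, but gives
// up after timeout so that cleanup fails fast instead of hanging until the
// global go test timeout.
func waitForShutdown(done <-chan error, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errShutdownTimeout
	}
}

// demoPromptExit models a manager goroutine that exits cleanly and promptly.
func demoPromptExit() error {
	done := make(chan error, 1)
	go func() { done <- nil }()
	return waitForShutdown(done, shutdownTimeout)
}

// demoStuck models a manager goroutine that never reports back.
func demoStuck() error {
	done := make(chan error) // nothing ever sends on this channel
	return waitForShutdown(done, 10*time.Millisecond)
}

func main() {
	fmt.Println(demoPromptExit()) // prints: <nil>
	fmt.Println(demoStuck())      // prints: manager did not shut down before the timeout
}
```

Returning an error rather than asserting inside the goroutine keeps calls such as t.FailNow on the test goroutine, which matches the reasoning in the reworded comment.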

Co-authored-by: Copilot App <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
@radius-functional-tests

radius-functional-tests Bot commented Jul 1, 2026


Radius functional test overview

| Name | Value |
| --- | --- |
| Repository | radius-project/radius |
| Commit ref | f48cef2 |
| Unique ID | func4a81ff7536 |
| Image tag | pr-func4a81ff7536 |
  • Dapr: 1.14.4
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func4a81ff7536
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func4a81ff7536
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-func4a81ff7536
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-func4a81ff7536
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func4a81ff7536
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
❌ corerp-cloud functional test failed. Please check the logs for more details
⌛ Starting corerp-cloud functional tests...
✅ corerp-cloud functional tests succeeded

@github-actions

github-actions Bot commented Jul 1, 2026


Functional Tests - corerp-noncloud

158 tests  ±0   156 ✅ ±0   1h 15m 55s ⏱️ - 2m 51s
  3 suites ±0     2 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit f48cef2. ± Comparison against base commit bf1015c.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jul 1, 2026


Functional Tests - corerp-cloud

26 tests  ±0   25 ✅ ±0   11m 16s ⏱️ +4s
 2 suites ±0    1 💤 ±0 
 1 files   ±0    0 ❌ ±0 

Results for commit f48cef2. ± Comparison against base commit bf1015c.

♻️ This comment has been updated with latest results.
