Skip to content

OCPBUGS-84572: fix(cpo): generate EBS CSI driver operator serving cert in CPO#8357

Open
typeid wants to merge 1 commit intoopenshift:mainfrom
typeid:fix/ebs-csi-operator-serving-cert
Open

OCPBUGS-84572: fix(cpo): generate EBS CSI driver operator serving cert in CPO#8357
typeid wants to merge 1 commit intoopenshift:mainfrom
typeid:fix/ebs-csi-operator-serving-cert

Conversation

@typeid
Copy link
Copy Markdown
Member

@typeid typeid commented Apr 28, 2026

Summary

  • Generate the aws-ebs-csi-driver-operator-serving-cert TLS secret in the CPO instead of relying on the service-ca operator
  • Removes the service-ca annotation from the existing aws-ebs-csi-driver-operator-metrics service to prevent ownership conflicts with the service-ca operator on OpenShift management clusters
  • Follows the same pattern used for Azure Disk/File CSI driver operator certs and the existing AWS EBS controller metrics cert (acebcc919)

Problem

The aws-ebs-csi-driver-operator deployment mounts a TLS serving cert from secret aws-ebs-csi-driver-operator-serving-cert. On OpenShift, the service-ca operator generates this cert automatically for services annotated with service.beta.kubernetes.io/serving-cert-secret-name. On non-OpenShift management clusters (e.g. EKS), the service-ca operator does not exist, so the secret is never created and the operator pod fails to start with FailedMount.

Without the EBS CSI driver operator, no CSI driver pods are deployed on the data plane, blocking PersistentVolume support on the guest cluster.

The CPO already handles the analogous case for the EBS CSI driver controller metrics cert (commit acebcc919), but the operator serving cert was not covered.

Context

This is part of a broader effort to remove the CPO's dependency on the service-ca operator for TLS cert generation in hosted control plane namespaces. See OCPBUGS-34662 / #7538 for the full scope.

Bug: OCPBUGS-84572

Test plan

  • Unit test for cert generation (TestReconcileAWSEBSCsiDriverOperatorMetricsServingCertSecret)
  • Verify aws-ebs-csi-driver-operator pod starts successfully on a non-OpenShift management cluster
  • Verify PersistentVolumeClaim binding works on the guest cluster
  • E2E: ci/prow/e2e-aws (validates no reconciliation fight with service-ca on OpenShift management clusters)

Summary by CodeRabbit

  • New Features

    • Provisioning of an AWS EBS CSI driver operator metrics TLS certificate signed by the control-plane CA.
    • Added a headless metrics Service for the AWS EBS CSI driver operator.
    • Any existing service-ca–managed serving-cert annotation/secret for the operator metrics Service is cleared before provisioning; cleanup errors are logged, while certificate provisioning errors surface as reconciliation failures.
  • Tests

    • Unit tests added to verify the TLS secret contains key/cert and expected X.509 SANs and subject fields.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds PKI reconciliation for the AWS EBS CSI driver operator metrics endpoint: new manifest helpers for the metrics Service (aws-ebs-csi-driver-operator-metrics) and its serving-cert Secret (aws-ebs-csi-driver-operator-serving-cert), a PKI reconcile function that computes DNS SANs and signs a serving certificate, and controller logic that best-effort removes service-ca–managed annotation/secret content before creating/updating the serving-cert Secret. Cleanup errors are only logged; create/update errors are returned as reconciliation failures.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as HostedControlPlane Controller
    participant Manifest as Manifest Helpers
    participant PKI as PKI Reconciliation
    participant K8s as Kubernetes API

    Controller->>Manifest: Build AWS EBS CSI Driver Operator Service
    Manifest-->>K8s: Apply Service (name: aws-ebs-csi-driver-operator-metrics, ClusterIP=None)

    Controller->>K8s: removeServiceCAAnnotationAndSecret(service, serving-cert Secret) (best-effort)
    K8s-->>Controller: Ack or Error (logged)

    Controller->>PKI: ReconcileAWSEBSCsiDriverOperatorMetricsServingCertSecret(secret, ca, ownerRef)
    PKI->>PKI: Compute DNS SANs (service FQDNs, cluster-local, short name, localhost)
    PKI->>PKI: Call reconcileSignedCertWithAddresses(...)
    PKI->>K8s: Create/Update serving-cert Secret (aws-ebs-csi-driver-operator-serving-cert)
    K8s-->>PKI: Secret created/updated or error
    PKI-->>Controller: Return result (error if create/update failed)
Loading

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Ote Binary Stdout Contract ❌ Error The PR modifies AWS EBS CSI Driver Operator control plane code, not OTE test infrastructure. These changes are to hosted control plane operator business logic and unit tests, not JSON stdout test infrastructure used by OpenShift Tests Extension.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: generating the AWS EBS CSI driver operator serving cert in the Control Plane Operator, which is the core objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR only adds standard Go tests using testing.T, not Ginkgo-style tests. The custom check targets Ginkgo constructs which are not present.
Test Structure And Quality ✅ Passed The custom check targets Ginkgo test code, but this PR adds a standard Go unit test using the testing package, which is not in scope for this check.
Microshift Test Compatibility ✅ Passed The PR adds only a standard Go unit test, not a Ginkgo e2e test, so the MicroShift Test Compatibility check does not apply.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds only a standard Go unit test, not Ginkgo e2e tests. Unit tests do not depend on cluster topology.
Topology-Aware Scheduling Compatibility ✅ Passed Changes add PKI certificate reconciliation and metrics Service for AWS EBS CSI driver operator without introducing topology-dependent scheduling constraints.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR adds only a standard Go unit test using testing.T, not Ginkgo e2e tests. The unit test is IP-family agnostic and requires no external connectivity, so it is out of scope for this custom check.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Apr 28, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and csrwng April 28, 2026 10:51
@openshift-ci openshift-ci Bot added area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels Apr 28, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 1750-1753: The cleanup call removeServiceCAAnnotationAndSecret can
fail but current code only logs the error via r.Log.Error and continues, risking
leftover service-ca artifacts and ownership races; update the reconcile flow in
hostedcontrolplane_controller.go so that when
removeServiceCAAnnotationAndSecret(ctx, r.Client,
awsEBSCsiDriverOperatorService, awsEBSCsiDriverOperatorServingCert) returns a
non-nil error you immediately return that error (or requeue with it) instead of
just logging it—replace the r.Log.Error-only branch with a fast-fail return of
the error to stop further reconcile actions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: dcb4cec2-0d54-4a1e-a2f8-13f36aca9195

📥 Commits

Reviewing files that changed from the base of the PR and between 222a19f and aeeb73b.

📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 41.93548% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.47%. Comparing base (222a19f) to head (249d41d).
⚠️ Report is 59 commits behind head on main.

Files with missing lines Patch % Lines
...trolplane/manifests/aws_ebs_csi_driver_operator.go 0.00% 10 Missing ⚠️
...ostedcontrolplane/hostedcontrolplane_controller.go 45.45% 4 Missing and 2 partials ⚠️
...or/controllers/hostedcontrolplane/manifests/pki.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8357      +/-   ##
==========================================
+ Coverage   36.30%   36.47%   +0.16%     
==========================================
  Files         764      767       +3     
  Lines       93015    93287     +272     
==========================================
+ Hits        33772    34024     +252     
- Misses      56530    56547      +17     
- Partials     2713     2716       +3     
Files with missing lines Coverage Δ
...tedcontrolplane/pki/aws_ebs_csi_driver_operator.go 100.00% <100.00%> (ø)
...or/controllers/hostedcontrolplane/manifests/pki.go 0.00% <0.00%> (ø)
...ostedcontrolplane/hostedcontrolplane_controller.go 36.82% <45.45%> (+0.08%) ⬆️
...trolplane/manifests/aws_ebs_csi_driver_operator.go 0.00% <0.00%> (ø)

... and 45 files with indirect coverage changes

Flag Coverage Δ
cmd-support 30.34% <ø> (+0.32%) ⬆️
cpo-hostedcontrolplane 37.06% <41.93%> (+0.01%) ⬆️
cpo-other 35.69% <ø> (ø)
hypershift-operator 47.89% <ø> (+<0.01%) ⬆️
other 28.37% <ø> (+0.68%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@typeid typeid changed the title fix(cpo): generate EBS CSI driver operator serving cert in CPO OCPBUGS-84572: fix(cpo): generate EBS CSI driver operator serving cert in CPO Apr 28, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Apr 28, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This pull request references Jira Issue OCPBUGS-84572, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Generate the aws-ebs-csi-driver-operator-serving-cert TLS secret in the CPO instead of relying on the service-ca operator
  • Follows the same pattern used for Azure Disk/File CSI driver operator certs and the existing AWS EBS controller metrics cert (acebcc919)

Problem

The aws-ebs-csi-driver-operator deployment mounts a TLS serving cert from secret aws-ebs-csi-driver-operator-serving-cert. On OpenShift, the service-ca operator generates this cert automatically for services annotated with service.beta.kubernetes.io/serving-cert-secret-name. On non-OpenShift management clusters (e.g. EKS), the service-ca operator does not exist, so the secret is never created and the operator pod fails to start with FailedMount.

Without the EBS CSI driver operator, no CSI driver pods are deployed on the data plane, blocking PersistentVolume support on the guest cluster.

The CPO already handles the analogous case for the EBS CSI driver controller metrics cert (commit acebcc919), but the operator serving cert was not covered.

Context

This is part of a broader effort to remove the CPO's dependency on the service-ca operator for TLS cert generation in hosted control plane namespaces. See OCPBUGS-34662 / #7538 for the full scope.

Bug: OCPBUGS-84572

Test plan

  • Unit test for cert generation (TestReconcileAWSEBSCsiDriverOperatorMetricsServingCertSecret)
  • Verify aws-ebs-csi-driver-operator pod starts successfully on a non-OpenShift management cluster
  • Verify PersistentVolumeClaim binding works on the guest cluster

Summary by CodeRabbit

  • New Features

  • Added PKI reconciliation for AWS EBS CSI driver operator metrics serving certificates.

  • Introduced metrics service for AWS EBS CSI driver operator with proper certificate management.

  • Tests

  • Added unit tests validating certificate generation with X.509 properties and expected DNS names.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@typeid typeid force-pushed the fix/ebs-csi-operator-serving-cert branch from aeeb73b to 2b27a9d Compare April 28, 2026 12:06
@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This pull request references Jira Issue OCPBUGS-84572, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Generate the aws-ebs-csi-driver-operator-serving-cert TLS secret in the CPO instead of relying on the service-ca operator
  • Follows the same pattern used for Azure Disk/File CSI driver operator certs and the existing AWS EBS controller metrics cert (acebcc919)

Problem

The aws-ebs-csi-driver-operator deployment mounts a TLS serving cert from secret aws-ebs-csi-driver-operator-serving-cert. On OpenShift, the service-ca operator generates this cert automatically for services annotated with service.beta.kubernetes.io/serving-cert-secret-name. On non-OpenShift management clusters (e.g. EKS), the service-ca operator does not exist, so the secret is never created and the operator pod fails to start with FailedMount.

Without the EBS CSI driver operator, no CSI driver pods are deployed on the data plane, blocking PersistentVolume support on the guest cluster.

The CPO already handles the analogous case for the EBS CSI driver controller metrics cert (commit acebcc919), but the operator serving cert was not covered.

Context

This is part of a broader effort to remove the CPO's dependency on the service-ca operator for TLS cert generation in hosted control plane namespaces. See OCPBUGS-34662 / #7538 for the full scope.

Bug: OCPBUGS-84572

Test plan

  • Unit test for cert generation (TestReconcileAWSEBSCsiDriverOperatorMetricsServingCertSecret)
  • Verify aws-ebs-csi-driver-operator pod starts successfully on a non-OpenShift management cluster
  • Verify PersistentVolumeClaim binding works on the guest cluster

Summary by CodeRabbit

  • New Features

  • Added PKI reconciliation to create/update the AWS EBS CSI driver operator metrics serving certificate signed by the control plane CA.

  • Added a headless metrics Service manifest for the AWS EBS CSI driver operator.

  • Tests

  • Added unit tests verifying generated TLS serving secrets include key/cert and expected X.509 SANs and subject fields.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 28, 2026

/verified by @typeid

Tested on an EKS cluster without service-ca - the CPO generates the cert and the aws-ebs-csi-driver-operator pod starts successfully.

For OpenShift management clusters, this is validated by existing CI e2e tests:

  1. TestCreateCluster verifies that new clusters receive a valid cert.
  2. TestUpgradeControlPlane verifies the migration. It creates a cluster on PreviousReleaseImage (which doesn't have this change) then upgrades to the PR's CPO, exercising the service-ca → CPO cert handover. Both tests wait for all ClusterOperators (including storage) to reach a healthy state via WaitForDataPlaneRollout. If the CPO-generated cert is invalid, aws-ebs-csi-driver-operator would fail to start, storage would go degraded, and the CVO would never reach Completed.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 28, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This PR has been marked as verified by @typeid.

Details

In response to this:

/verified by @typeid

Tested on an EKS cluster without service-ca - the CPO generates the cert and the aws-ebs-csi-driver-operator pod starts successfully.

On OpenShift management clusters, the CPO takes over cert ownership from service-ca. This is validated by existing CI e2e tests: TestCreateCluster verifies new clusters. TestUpgradeControlPlane verifies the migration: it creates a cluster on PreviousReleaseImage (which doesn't have this change) then upgrades to the PR's CPO, exercising the service-ca → CPO cert handover. Both tests wait for all ClusterOperators (including storage) to reach a healthy state via WaitForDataPlaneRollout. If the CPO-generated cert is invalid, aws-ebs-csi-driver-operator would fail to start, storage would go degraded, and the CVO would never reach Completed. Therefore, it's covered by existing E2E.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 29, 2026

/test e2e-aws e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2049416847203241984 | Cost: $5.183902500000002 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot
Copy link
Copy Markdown

cwbotbot commented Apr 29, 2026

Test Results

e2e-aws

e2e-aks

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 29, 2026

/remove-label verified

Investigating failures caught in e2e.

@openshift-ci openshift-ci Bot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 29, 2026
@typeid typeid marked this pull request as draft April 29, 2026 13:01
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2026
@typeid typeid force-pushed the fix/ebs-csi-operator-serving-cert branch from 2b27a9d to a2013a1 Compare April 29, 2026 13:04
@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This pull request references Jira Issue OCPBUGS-84572, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Generate the aws-ebs-csi-driver-operator-serving-cert TLS secret in the CPO instead of relying on the service-ca operator
  • Removes the service-ca annotation from the existing aws-ebs-csi-driver-operator-metrics service to prevent ownership conflicts with the service-ca operator on OpenShift management clusters
  • Follows the same pattern used for Azure Disk/File CSI driver operator certs and the existing AWS EBS controller metrics cert (acebcc919)

Problem

The aws-ebs-csi-driver-operator deployment mounts a TLS serving cert from secret aws-ebs-csi-driver-operator-serving-cert. On OpenShift, the service-ca operator generates this cert automatically for services annotated with service.beta.kubernetes.io/serving-cert-secret-name. On non-OpenShift management clusters (e.g. EKS), the service-ca operator does not exist, so the secret is never created and the operator pod fails to start with FailedMount.

Without the EBS CSI driver operator, no CSI driver pods are deployed on the data plane, blocking PersistentVolume support on the guest cluster.

The CPO already handles the analogous case for the EBS CSI driver controller metrics cert (commit acebcc919), but the operator serving cert was not covered.

Context

This is part of a broader effort to remove the CPO's dependency on the service-ca operator for TLS cert generation in hosted control plane namespaces. See OCPBUGS-34662 / #7538 for the full scope.

Bug: OCPBUGS-84572

Test plan

  • Unit test for cert generation (TestReconcileAWSEBSCsiDriverOperatorMetricsServingCertSecret)
  • Verify aws-ebs-csi-driver-operator pod starts successfully on a non-OpenShift management cluster
  • Verify PersistentVolumeClaim binding works on the guest cluster
  • E2E: ci/prow/e2e-aws (validates no reconciliation fight with service-ca on OpenShift management clusters)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@typeid typeid marked this pull request as ready for review April 29, 2026 13:05
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2026
@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 29, 2026

/test e2e-aks e2e-aks-4-22 e2e-aws e2e-aws-4-22 e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws

@openshift-ci openshift-ci Bot requested review from Nirshal and clebs April 29, 2026 13:09
@typeid typeid force-pushed the fix/ebs-csi-operator-serving-cert branch from a2013a1 to 553763f Compare April 29, 2026 13:57
@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2049475649592102912 | Cost: $2.0726945000000003 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2049475649793429504 | Cost: $2.16588125 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go (1)

69-77: Prefer explicit “missing SAN” assertions for clearer failures.

Current checks catch extras and count mismatch, but adding explicit missing-name checks gives more actionable failures.

♻️ Suggested test assertion refinement
-	if len(cert.DNSNames) != len(expectedDNSNames) {
-		t.Errorf("expected %d DNS names, got %d", len(expectedDNSNames), len(cert.DNSNames))
-	}
-
-	for _, name := range cert.DNSNames {
-		if !expectedDNSNames[name] {
-			t.Errorf("unexpected DNS name: %s", name)
-		}
-	}
+	gotDNSNames := map[string]bool{}
+	for _, name := range cert.DNSNames {
+		gotDNSNames[name] = true
+		if !expectedDNSNames[name] {
+			t.Errorf("unexpected DNS name: %s", name)
+		}
+	}
+	for expected := range expectedDNSNames {
+		if !gotDNSNames[expected] {
+			t.Errorf("missing expected DNS name: %s", expected)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go`
around lines 69 - 77, The test currently only checks DNS count and unexpected
extras; update the assertion logic in the test that iterates cert.DNSNames to
also assert any expected names missing from cert.DNSNames: after building or
defining expectedDNSNames and cert.DNSNames, verify each key in expectedDNSNames
exists in cert.DNSNames (e.g., check expectedDNSNames[name] or presence) and
fail with a clear message like "missing DNS name: <name>" for each missing
entry; keep the existing unexpected-name check but add this explicit missing-SAN
check so failures clearly indicate which expected SANs are absent (operate on
the existing cert.DNSNames and expectedDNSNames variables).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 1751-1753: The log calls pass the error as the first argument to
r.Log.Error but include a "%w" in the message which is literal for logr; remove
the "%w" from the message string in each r.Log.Error invocation (e.g., the one
in hostedcontrolplane_controller.go around the service CA removal and the other
occurrences referenced in the review) so the call becomes r.Log.Error(err,
"descriptive message") and rely on logr's structured error argument instead of
printf formatting.

---

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go`:
- Around line 69-77: The test currently only checks DNS count and unexpected
extras; update the assertion logic in the test that iterates cert.DNSNames to
also assert any expected names missing from cert.DNSNames: after building or
defining expectedDNSNames and cert.DNSNames, verify each key in expectedDNSNames
exists in cert.DNSNames (e.g., check expectedDNSNames[name] or presence) and
fail with a clear message like "missing DNS name: <name>" for each missing
entry; keep the existing unexpected-name check but add this explicit missing-SAN
check so failures clearly indicate which expected SANs are absent (operate on
the existing cert.DNSNames and expectedDNSNames variables).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4b5f9fad-758f-4674-99b3-e12edc7637ee

📥 Commits

Reviewing files that changed from the base of the PR and between a2013a1 and 553763f.

📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • control-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.go

The aws-ebs-csi-driver-operator deployment mounts a TLS serving cert
from secret "aws-ebs-csi-driver-operator-serving-cert". On OpenShift,
the service-ca operator generates this cert automatically. On
non-OpenShift management clusters (e.g. EKS), the service-ca operator
does not exist, so the secret is never created and the operator pod
fails to start with a FailedMount error. This blocks PersistentVolume
support on the guest cluster.

Add CPO-managed cert generation for the operator serving cert,
following the same pattern used for Azure Disk/File CSI driver
operator certs and the existing AWS EBS controller metrics cert
(commit acebcc9).

Signed-off-by: Claudio Busse <cbusse@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@typeid typeid force-pushed the fix/ebs-csi-operator-serving-cert branch from 553763f to 249d41d Compare April 29, 2026 14:10
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go (1)

69-77: Make SAN assertion strict for missing-name detection.

Current checks can miss a missing expected SAN if a duplicate expected SAN appears. Consider asserting exact set equality.

Proposed assertion tightening
-	if len(cert.DNSNames) != len(expectedDNSNames) {
-		t.Errorf("expected %d DNS names, got %d", len(expectedDNSNames), len(cert.DNSNames))
-	}
-
-	for _, name := range cert.DNSNames {
-		if !expectedDNSNames[name] {
-			t.Errorf("unexpected DNS name: %s", name)
-		}
-	}
+	seen := map[string]bool{}
+	for _, name := range cert.DNSNames {
+		if !expectedDNSNames[name] {
+			t.Errorf("unexpected DNS name: %s", name)
+			continue
+		}
+		seen[name] = true
+	}
+	if len(seen) != len(expectedDNSNames) {
+		t.Errorf("expected DNS names %v, got %v", expectedDNSNames, cert.DNSNames)
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go`
around lines 69 - 77, The current SAN checks can miss a missing expected name if
a duplicate expected name exists; replace the loose validation with an exact set
equality check between cert.DNSNames and expectedDNSNames by converting
cert.DNSNames into a map (e.g. observed := map[string]bool{}) and then compare
keys: verify len(observed) == len(expectedDNSNames) and that for each key in
expectedDNSNames observed[key] is true (and vice-versa) so any missing or extra
SAN is detected; update the assertions around cert.DNSNames/expectedDNSNames
accordingly in the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go`:
- Around line 69-77: The current SAN checks can miss a missing expected name if
a duplicate expected name exists; replace the loose validation with an exact set
equality check between cert.DNSNames and expectedDNSNames by converting
cert.DNSNames into a map (e.g. observed := map[string]bool{}) and then compare
keys: verify len(observed) == len(expectedDNSNames) and that for each key in
expectedDNSNames observed[key] is true (and vice-versa) so any missing or extra
SAN is detected; update the assertions around cert.DNSNames/expectedDNSNames
accordingly in the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0185437e-5fd5-416a-b75e-7eff75f9c02a

📥 Commits

Reviewing files that changed from the base of the PR and between 553763f and 249d41d.

📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.go
  • control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go
✅ Files skipped from review due to trivial changes (2)
  • control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
  • control-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 29, 2026

/test e2e-aks e2e-aks-4-22 e2e-aws e2e-aws-4-22 e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 29, 2026

/retest

1 similar comment
@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 30, 2026

/retest

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2049724592120926208 | Cost: $4.333082999999999 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 30, 2026

/retest

@hypershift-jira-solve-ci
Copy link
Copy Markdown

Now I have all the evidence needed. Here's the complete analysis:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

TestCreateClusterPrivateWithRouteKAS/Teardown (1349.46s)
  fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
  fixture.go:340: Failed to clean up 3 remaining resources for guest cluster

Summary

The failure is exclusively in the teardown phase of TestCreateClusterPrivateWithRouteKAS — all functional subtests (ValidateHostedCluster, Main/SwitchFromPrivateToPublic, Main/SwitchFromPublicToPrivate, EnsureHostedCluster) passed. After the hosted cluster private-6zjhk was destroyed, the post-delete validation (validateAWSGuestResourcesDeletedFunc) polled AWS for 15 minutes waiting for orphaned guest resources to be cleaned up. Three AWS resources remained after the timeout: an NLB for openshift-ingress/router-default (which is the resource that triggers the hasGuestResources check to return true), and two node EC2 volumes (which are skipped by the check since they lack the kubernetes.io/created-for/pv/name PV tag). This is unrelated to the PR's changes, which only add EBS CSI driver operator serving cert generation to the CPO PKI reconciliation loop.

Root Cause

The root cause is a slow AWS resource cleanup for the hosted cluster private-6zjhk in namespace e2e-clusters-kfsbk. After the hosted cluster was destroyed, the validateAWSGuestResourcesDeletedFunc in test/e2e/util/fixture.go polls AWS every 20 seconds for up to 15 minutes, checking for resources tagged with kubernetes.io/cluster/private-6zjhk=owned of types elasticloadbalancing:loadbalancer, ec2:volume, and s3.

The blocking resource is the NLB (net/a60bee9d9b6b94b72880f6997e945d56/ed3ad43133889872) tagged as kubernetes.io/service-name=openshift-ingress/router-default. This is the ingress router's Network Load Balancer for the guest cluster. AWS NLB deletion can take significant time after the Kubernetes Service is deleted, as AWS must drain connections, deregister targets, and release ENIs.

The two remaining EC2 volumes (vol-00f69dbc322ea58d1, vol-0897f1bb3b0c29056) are node root volumes (tagged with MachineName=... and sigs.k8s.io/cluster-api-provider-aws/role=node) which do NOT have the kubernetes.io/created-for/pv/name tag, so they are explicitly skipped by the hasGuestResources function. Only the NLB blocks the check.

This is an infrastructure timing flake — the NLB did not finish deleting within the 15-minute timeout. Sippy data shows TestCreateClusterPrivateWithRouteKAS/Teardown has a 100% pass rate across 33 recent runs on the 4.22 release, confirming this is a rare, non-deterministic infrastructure cleanup timeout rather than a regression introduced by this PR.

The PR's changes (generating aws-ebs-csi-driver-operator-serving-cert in the CPO and removing the service-ca annotation from aws-ebs-csi-driver-operator-metrics) have zero relationship to NLB lifecycle management, ingress router teardown, or the AWS resource tagging API validation that failed.

Recommendations
  1. Retest the PR — this failure is an infrastructure timing flake unrelated to the PR's code changes. A /retest should pass.
  2. No code changes needed in this PR — the EBS CSI driver operator cert generation changes do not affect cluster teardown or AWS resource cleanup.
  3. Consider upstream fix for the test framework — the 15-minute timeout in validateAWSGuestResourcesDeletedFunc could be extended, or the NLB deletion could be explicitly forced before starting the polling loop. However, this is a test infrastructure improvement, not a blocker for this PR.
Evidence
Evidence Detail
Failed test TestCreateClusterPrivateWithRouteKAS/Teardown (1349.46s)
Failure type Teardown-only; all functional subtests PASSED
Error fixture.go:333: Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
Blocking resource NLB net/a60bee9d9b6b94b72880f6997e945d56/ed3ad43133889872 (openshift-ingress/router-default)
Other remaining resources 2 EC2 node volumes (skipped by hasGuestResources — no PV tag)
Timeout configuration 15 minutes polling every 20 seconds (fixture.go line 303)
Hosted cluster e2e-clusters-kfsbk/private-6zjhk (destroyed successfully per hypershift_framework.go:574)
Sippy flake rate 100% pass rate across 33 runs for this test (4.22 release)
PR files changed CPO PKI only: hostedcontrolplane_controller.go, manifests/aws_ebs_csi_driver_operator.go, manifests/pki.go, pki/aws_ebs_csi_driver_operator.go, pki/aws_ebs_csi_driver_operator_test.go
PR relevance None — PR modifies EBS CSI cert generation; failure is NLB cleanup timeout

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 30, 2026

/retest

@csrwng
Copy link
Copy Markdown
Contributor

csrwng commented Apr 30, 2026

/approve

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, typeid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2026
@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 30, 2026

/retest

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 30, 2026

@typeid: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@typeid
Copy link
Copy Markdown
Member Author

typeid commented Apr 30, 2026

/verified by @typeid

Tested on an EKS cluster without service-ca - the CPO generates the cert and the aws-ebs-csi-driver-operator pod starts successfully.

For OpenShift management clusters, this is validated by existing CI e2e tests:

  • TestCreateCluster verifies that new clusters receive a valid cert.
  • TestUpgradeControlPlane verifies the migration. It creates a cluster on PreviousReleaseImage (which doesn't have this change) then upgrades to the PR's CPO, exercising the service-ca → CPO cert handover. Both tests wait for all ClusterOperators (including storage) to reach a healthy state via WaitForDataPlaneRollout. If the CPO-generated cert is invalid, aws-ebs-csi-driver-operator would fail to start, storage would go degraded, and the CVO would never reach Completed.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 30, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@typeid: This PR has been marked as verified by @typeid.

Details

In response to this:

/verified by @typeid

Tested on an EKS cluster without service-ca - the CPO generates the cert and the aws-ebs-csi-driver-operator pod starts successfully.

For OpenShift management clusters, this is validated by existing CI e2e tests:

TestCreateCluster verifies that new clusters receive a valid cert.
TestUpgradeControlPlane verifies the migration. It creates a cluster on PreviousReleaseImage (which doesn't have this change) then upgrades to the PR's CPO, exercising the service-ca → CPO cert handover. Both tests wait for all ClusterOperators (including storage) to reach a healthy state via WaitForDataPlaneRollout. If the CPO-generated cert is invalid, aws-ebs-csi-driver-operator would fail to start, storage would go degraded, and the CVO would never reach Completed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cblecker
Copy link
Copy Markdown
Member

cblecker commented May 1, 2026

/uncc

@openshift-ci openshift-ci Bot removed the request for review from cblecker May 1, 2026 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants