OCPBUGS-84572: fix(cpo): generate EBS CSI driver operator serving cert in CPO#8357
OCPBUGS-84572: fix(cpo): generate EBS CSI driver operator serving cert in CPO#8357typeid wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds PKI reconciliation for the AWS EBS CSI driver operator metrics endpoint: new manifest helpers for the metrics Service ( Sequence Diagram(s)sequenceDiagram
participant Controller as HostedControlPlane Controller
participant Manifest as Manifest Helpers
participant PKI as PKI Reconciliation
participant K8s as Kubernetes API
Controller->>Manifest: Build AWS EBS CSI Driver Operator Service
Manifest-->>K8s: Apply Service (name: aws-ebs-csi-driver-operator-metrics, ClusterIP=None)
Controller->>K8s: removeServiceCAAnnotationAndSecret(service, serving-cert Secret) (best-effort)
K8s-->>Controller: Ack or Error (logged)
Controller->>PKI: ReconcileAWSEBSCsiDriverOperatorMetricsServingCertSecret(secret, ca, ownerRef)
PKI->>PKI: Compute DNS SANs (service FQDNs, cluster-local, short name, localhost)
PKI->>PKI: Call reconcileSignedCertWithAddresses(...)
PKI->>K8s: Create/Update serving-cert Secret (aws-ebs-csi-driver-operator-serving-cert)
K8s-->>PKI: Secret created/updated or error
PKI-->>Controller: Return result (error if create/update failed)
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (10 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 1750-1753: The cleanup call removeServiceCAAnnotationAndSecret can
fail but current code only logs the error via r.Log.Error and continues, risking
leftover service-ca artifacts and ownership races; update the reconcile flow in
hostedcontrolplane_controller.go so that when
removeServiceCAAnnotationAndSecret(ctx, r.Client,
awsEBSCsiDriverOperatorService, awsEBSCsiDriverOperatorServingCert) returns a
non-nil error you immediately return that error (or requeue with it) instead of
just logging it—replace the r.Log.Error-only branch with a fast-fail return of
the error to stop further reconcile actions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: dcb4cec2-0d54-4a1e-a2f8-13f36aca9195
📒 Files selected for processing (5)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/pki.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8357 +/- ##
==========================================
+ Coverage 36.30% 36.47% +0.16%
==========================================
Files 764 767 +3
Lines 93015 93287 +272
==========================================
+ Hits 33772 34024 +252
- Misses 56530 56547 +17
- Partials 2713 2716 +3
... and 45 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
@typeid: This pull request references Jira Issue OCPBUGS-84572, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
aeeb73b to
2b27a9d
Compare
|
@typeid: This pull request references Jira Issue OCPBUGS-84572, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified by @typeid Tested on an EKS cluster without For OpenShift management clusters, this is validated by existing CI e2e tests:
|
|
@typeid: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aws e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
Test Resultse2e-aws
e2e-aks
|
|
/remove-label verified Investigating failures caught in e2e. |
2b27a9d to
a2013a1
Compare
|
@typeid: This pull request references Jira Issue OCPBUGS-84572, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aks e2e-aks-4-22 e2e-aws e2e-aws-4-22 e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws |
a2013a1 to
553763f
Compare
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go (1)
69-77: Prefer explicit “missing SAN” assertions for clearer failures.Current checks catch extras and count mismatch, but adding explicit missing-name checks gives more actionable failures.
♻️ Suggested test assertion refinement
- if len(cert.DNSNames) != len(expectedDNSNames) { - t.Errorf("expected %d DNS names, got %d", len(expectedDNSNames), len(cert.DNSNames)) - } - - for _, name := range cert.DNSNames { - if !expectedDNSNames[name] { - t.Errorf("unexpected DNS name: %s", name) - } - } + gotDNSNames := map[string]bool{} + for _, name := range cert.DNSNames { + gotDNSNames[name] = true + if !expectedDNSNames[name] { + t.Errorf("unexpected DNS name: %s", name) + } + } + for expected := range expectedDNSNames { + if !gotDNSNames[expected] { + t.Errorf("missing expected DNS name: %s", expected) + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go` around lines 69 - 77, The test currently only checks DNS count and unexpected extras; update the assertion logic in the test that iterates cert.DNSNames to also assert any expected names missing from cert.DNSNames: after building or defining expectedDNSNames and cert.DNSNames, verify each key in expectedDNSNames exists in cert.DNSNames (e.g., check expectedDNSNames[name] or presence) and fail with a clear message like "missing DNS name: <name>" for each missing entry; keep the existing unexpected-name check but add this explicit missing-SAN check so failures clearly indicate which expected SANs are absent (operate on the existing cert.DNSNames and expectedDNSNames variables).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go`:
- Around line 1751-1753: The log calls pass the error as the first argument to
r.Log.Error but include a "%w" in the message which is literal for logr; remove
the "%w" from the message string in each r.Log.Error invocation (e.g., the one
in hostedcontrolplane_controller.go around the service CA removal and the other
occurrences referenced in the review) so the call becomes r.Log.Error(err,
"descriptive message") and rely on logr's structured error argument instead of
printf formatting.
---
Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go`:
- Around line 69-77: The test currently only checks DNS count and unexpected
extras; update the assertion logic in the test that iterates cert.DNSNames to
also assert any expected names missing from cert.DNSNames: after building or
defining expectedDNSNames and cert.DNSNames, verify each key in expectedDNSNames
exists in cert.DNSNames (e.g., check expectedDNSNames[name] or presence) and
fail with a clear message like "missing DNS name: <name>" for each missing
entry; keep the existing unexpected-name check but add this explicit missing-SAN
check so failures clearly indicate which expected SANs are absent (operate on
the existing cert.DNSNames and expectedDNSNames variables).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4b5f9fad-758f-4674-99b3-e12edc7637ee
📒 Files selected for processing (5)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/pki.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- control-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.go
- control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
- control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.go
The aws-ebs-csi-driver-operator deployment mounts a TLS serving cert from secret "aws-ebs-csi-driver-operator-serving-cert". On OpenShift, the service-ca operator generates this cert automatically. On non-OpenShift management clusters (e.g. EKS), the service-ca operator does not exist, so the secret is never created and the operator pod fails to start with a FailedMount error. This blocks PersistentVolume support on the guest cluster. Add CPO-managed cert generation for the operator serving cert, following the same pattern used for Azure Disk/File CSI driver operator certs and the existing AWS EBS controller metrics cert (commit acebcc9). Signed-off-by: Claudio Busse <cbusse@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
553763f to
249d41d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go (1)
69-77: Make SAN assertion strict for missing-name detection.Current checks can miss a missing expected SAN if a duplicate expected SAN appears. Consider asserting exact set equality.
Proposed assertion tightening
- if len(cert.DNSNames) != len(expectedDNSNames) { - t.Errorf("expected %d DNS names, got %d", len(expectedDNSNames), len(cert.DNSNames)) - } - - for _, name := range cert.DNSNames { - if !expectedDNSNames[name] { - t.Errorf("unexpected DNS name: %s", name) - } - } + seen := map[string]bool{} + for _, name := range cert.DNSNames { + if !expectedDNSNames[name] { + t.Errorf("unexpected DNS name: %s", name) + continue + } + seen[name] = true + } + if len(seen) != len(expectedDNSNames) { + t.Errorf("expected DNS names %v, got %v", expectedDNSNames, cert.DNSNames) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go` around lines 69 - 77, The current SAN checks can miss a missing expected name if a duplicate expected name exists; replace the loose validation with an exact set equality check between cert.DNSNames and expectedDNSNames by converting cert.DNSNames into a map (e.g. observed := map[string]bool{}) and then compare keys: verify len(observed) == len(expectedDNSNames) and that for each key in expectedDNSNames observed[key] is true (and vice-versa) so any missing or extra SAN is detected; update the assertions around cert.DNSNames/expectedDNSNames accordingly in the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@control-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go`:
- Around line 69-77: The current SAN checks can miss a missing expected name if
a duplicate expected name exists; replace the loose validation with an exact set
equality check between cert.DNSNames and expectedDNSNames by converting
cert.DNSNames into a map (e.g. observed := map[string]bool{}) and then compare
keys: verify len(observed) == len(expectedDNSNames) and that for each key in
expectedDNSNames observed[key] is true (and vice-versa) so any missing or extra
SAN is detected; update the assertions around cert.DNSNames/expectedDNSNames
accordingly in the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 0185437e-5fd5-416a-b75e-7eff75f9c02a
📒 Files selected for processing (5)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.gocontrol-plane-operator/controllers/hostedcontrolplane/manifests/pki.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator.gocontrol-plane-operator/controllers/hostedcontrolplane/pki/aws_ebs_csi_driver_operator_test.go
✅ Files skipped from review due to trivial changes (2)
- control-plane-operator/controllers/hostedcontrolplane/manifests/pki.go
- control-plane-operator/controllers/hostedcontrolplane/manifests/aws_ebs_csi_driver_operator.go
🚧 Files skipped from review as they are similar to previous changes (1)
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go
|
/test e2e-aks e2e-aks-4-22 e2e-aws e2e-aws-4-22 e2e-aws-upgrade-hypershift-operator e2e-azure-self-managed e2e-kubevirt-aws-ovn-reduced e2e-v2-aws |
|
/retest |
1 similar comment
|
/retest |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
/retest |
|
Now I have all the evidence needed. Here's the complete analysis: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe failure is exclusively in the teardown phase of Root CauseThe root cause is a slow AWS resource cleanup for the hosted cluster The blocking resource is the NLB ( The two remaining EC2 volumes ( This is an infrastructure timing flake — the NLB did not finish deleting within the 15-minute timeout. Sippy data shows The PR's changes (generating Recommendations
Evidence
|
|
/retest |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: csrwng, typeid The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
|
@typeid: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/verified by @typeid Tested on an EKS cluster without service-ca - the CPO generates the cert and the aws-ebs-csi-driver-operator pod starts successfully. For OpenShift management clusters, this is validated by existing CI e2e tests:
|
|
@typeid: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/uncc |
Summary
aws-ebs-csi-driver-operator-serving-certTLS secret in the CPO instead of relying on the service-ca operatoraws-ebs-csi-driver-operator-metricsservice to prevent ownership conflicts with the service-ca operator on OpenShift management clustersProblem
The
aws-ebs-csi-driver-operatordeployment mounts a TLS serving cert from secretaws-ebs-csi-driver-operator-serving-cert. On OpenShift, the service-ca operator generates this cert automatically for services annotated withservice.beta.kubernetes.io/serving-cert-secret-name. On non-OpenShift management clusters (e.g. EKS), the service-ca operator does not exist, so the secret is never created and the operator pod fails to start withFailedMount.Without the EBS CSI driver operator, no CSI driver pods are deployed on the data plane, blocking PersistentVolume support on the guest cluster.
The CPO already handles the analogous case for the EBS CSI driver controller metrics cert (commit acebcc919), but the operator serving cert was not covered.
Context
This is part of a broader effort to remove the CPO's dependency on the service-ca operator for TLS cert generation in hosted control plane namespaces. See OCPBUGS-34662 / #7538 for the full scope.
Bug: OCPBUGS-84572
Test plan
TestReconcileAWSEBSCsiDriverOperatorMetricsServingCertSecret)aws-ebs-csi-driver-operatorpod starts successfully on a non-OpenShift management clusterci/prow/e2e-aws(validates no reconciliation fight with service-ca on OpenShift management clusters)Summary by CodeRabbit
New Features
Tests