
OCPBUGS-62517: Set replicas=1, PDB, and pod anti-affinity for HA topology#709

Open
tmshort wants to merge 1 commit into openshift:main from tmshort:ocpbugs-62517-ha-replicas

Conversation

Contributor

@tmshort tmshort commented Apr 24, 2026

Rolling updates in HighlyAvailable clusters leave catalogd and operator-controller unavailable when the only running pod is evicted before its replacement is ready.

Fix by defaulting to replicas=1 with the PDB disabled in the static Helm values (safe for SNO/External topologies, and passes the SNO conformance test that asserts exactly one replica in SingleReplica topology mode). Add pod anti-affinity so replicas prefer to schedule on different nodes.

cluster-olm-operator detects the cluster's ControlPlaneTopology at startup and overrides these values to replicas=2 with the PDB enabled when a HighlyAvailable topology is detected, then re-renders the manifests before starting controllers. When a topology change is observed at runtime (exceedingly rare), the operator exits so its deployment controller restarts it, triggering a fresh Helm render with the correct values for the new topology.
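A minimal sketch of the topology-to-values mapping described above (the type and function names here are illustrative, not the actual cluster-olm-operator code):

```go
package main

import "fmt"

// HelmOverrides models the two values cluster-olm-operator injects before
// re-rendering the charts. The type and field names are hypothetical.
type HelmOverrides struct {
	Replicas   int
	PDBEnabled bool
}

// overridesFor maps infrastructure.Status.ControlPlaneTopology to overrides:
// the static defaults (replicas=1, PDB disabled) are kept for every topology
// except HighlyAvailable.
func overridesFor(topology string) HelmOverrides {
	if topology == "HighlyAvailable" {
		return HelmOverrides{Replicas: 2, PDBEnabled: true}
	}
	// SingleReplica (SNO) and External keep the safe static defaults.
	return HelmOverrides{Replicas: 1, PDBEnabled: false}
}

func main() {
	fmt.Println(overridesFor("HighlyAvailable")) // {2 true}
	fmt.Println(overridesFor("SingleReplica"))   // {1 false}
}
```

Keeping the static values safe for the most constrained topology and scaling up only on detection avoids breaking SNO when the operator has not yet observed the infrastructure status.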

Summary by CodeRabbit

  • Chores
    • Introduced preferred pod anti-affinity to bias controller-manager pods to different nodes for improved fault tolerance.
    • Set explicit replica counts in Helm values for catalog and operator controller deployments.
    • Removed outdated notes about replica increases and Pod Disruption Budget additions; no PDBs were added in these changes.

@openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) on Apr 24, 2026

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Add preferred pod anti-affinity (preferredDuringSchedulingIgnoredDuringExecution) to controller-manager Deployments for catalogd and operator-controller, and set explicit replicas: 1 in corresponding Helm values. No other functional changes shown.

Changes

Controller-manager pod anti-affinity + Helm replica values

Layer: Affinity Rule (manifests)
Files: openshift/catalogd/manifests.yaml, openshift/catalogd/manifests-experimental.yaml, openshift/operator-controller/manifests.yaml, openshift/operator-controller/manifests-experimental.yaml
Summary: Added spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution with a podAffinityTerm matching control-plane: <component>-controller-manager, topologyKey: kubernetes.io/hostname, weight: 100.

Layer: Helm values: replica counts
Files: openshift/helm/catalogd.yaml, openshift/helm/operator-controller.yaml
Summary: Added options.*.deployment.replicas: 1 entries in Helm values for the two deployments.

Layer: Helm values: affinity templates
Files: openshift/helm/catalogd.yaml, openshift/helm/operator-controller.yaml
Summary: Added corresponding deployments.templateSpec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution blocks to Helm templates mirroring the manifest affinity rules.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 3 warnings)

  • Stable And Deterministic Test Names: ❌ Error. Test titles use dynamic image names via fmt.Sprintf, violating stable and deterministic test name requirements. Resolution: replace dynamic image names in test titles with static descriptive strings and move specific image validation to test assertions.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 42.86%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Topology-Aware Scheduling Compatibility: ⚠️ Warning. The pull request hard-codes replicas: 1 globally, preventing HA scaling on multi-node clusters, and lacks topology-aware detection for SNO vs. multi-node topologies. Resolution: implement topology-aware replica counts by detecting infrastructure.Status.ControlPlaneTopology: set replicas to 1 for SingleReplica/SNO and 2+ for HighlyAvailable/DualReplica topologies; verify with SNO and HA multi-node CI tests.
  • Title check: ⚠️ Warning. The PR title mentions 'replicas=1' but the PR description explicitly states the change increases replicas from 1 to 2, contradicting the title claim. Resolution: correct the title to accurately reflect 'replicas=2', or clarify if the title refers to a different aspect of the changes.
✅ Passed checks (8 passed)
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Test Structure And Quality: ✅ Passed. The custom check is not applicable to this pull request. The PR exclusively modifies Kubernetes manifest files (.yaml) and Helm chart configuration files, containing no Ginkgo test code (Go files).
  • Microshift Test Compatibility: ✅ Passed. The pull request only modifies Kubernetes manifest files and Helm chart configurations without adding any new Ginkgo e2e tests.
  • Single Node Openshift (Sno) Test Compatibility: ✅ Passed. This PR only modifies Kubernetes manifest and Helm chart YAML files; no new Ginkgo e2e tests are added, making this check not applicable.
  • Ote Binary Stdout Contract: ✅ Passed. OTE stdout contract check verified. The PR contains only YAML configuration changes without modifications to Go source code, test code, or executable binaries.
  • Ipv6 And Disconnected Network Test Compatibility: ✅ Passed. The PR modifies only Kubernetes manifest and Helm chart YAML files with no new Ginkgo e2e tests, making the IPv6 and disconnected network compatibility check not applicable.
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from pedjak and perdasilva April 24, 2026 21:07
@openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 24, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/catalogd/serverutil/serverutil.go`:
- Around line 67-68: The comment for catalogServerRunnable is stale: it states
"leader-only Runnable" while NeedLeaderElection() returns false (so it runs on
all pods); update the comment above catalogServerRunnable to remove or correct
the "leader-only" wording and accurately describe that the runnable runs on all
pods and lazily binds the catalog HTTP port in Start(), ensuring the comment
matches the behavior of NeedLeaderElection() and Start().
- Around line 78-87: PopulateCache in
internal/operator-controller/catalogmetadata/client/client.go is caching non-200
responses (notably 404s from non-leader pods) which causes permanent failures
during rollouts; update PopulateCache to not cache error responses and to
implement retries for transient/non-leader 404s (use a small retry loop with
exponential backoff and/or attempt alternate endpoints), only writing to the
local cache when a 2xx response is received; keep existing behavior that
Start()/NeedLeaderElection() returns false but ensure PopulateCache retries
across requests before giving up and then surface a final error without
persisting the 404 into the cache.
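The retry behavior the prompt describes can be sketched as follows; fetchWithRetry is a hypothetical helper, not the actual PopulateCache signature. Only a 2xx body is ever returned to the caller (and would therefore be the only thing a cache could record); 404s and other failures are retried with exponential backoff before the final error is surfaced:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchWithRetry retries transient failures (including 404s from a
// non-leader pod during a rollout) with exponential backoff, and returns a
// body only on a 2xx response. The helper name, attempt count, and backoff
// values are illustrative.
func fetchWithRetry(url string, attempts int) ([]byte, error) {
	backoff := 50 * time.Millisecond
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err != nil {
			lastErr = err
		} else {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			switch {
			case readErr != nil:
				lastErr = readErr
			case resp.StatusCode >= 200 && resp.StatusCode < 300:
				return body, nil // the only path that hands data to a cache
			default:
				lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
			}
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	// Surface the final error without ever having cached a failure.
	return nil, fmt.Errorf("giving up after %d attempts without caching: %w", attempts, lastErr)
}

func main() {
	hits := 0
	// Simulate a rollout: the first two requests land on a "non-leader"
	// (404), the third hits the leader.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hits++
		if hits < 3 {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, "catalog-data")
	}))
	defer srv.Close()

	body, err := fetchWithRetry(srv.URL, 5)
	fmt.Println(string(body), err) // catalog-data <nil>
}
```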

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: def80a5a-2570-4718-a4c6-821b6a09c784

📥 Commits

Reviewing files that changed from the base of the PR and between 9b9eb03 and cf8b48a.

📒 Files selected for processing (7)
  • internal/catalogd/serverutil/serverutil.go
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml

Comment thread internal/catalogd/serverutil/serverutil.go Outdated
Comment thread internal/catalogd/serverutil/serverutil.go
Contributor Author

tmshort commented Apr 27, 2026

/test openshift-e2e-aws

Contributor Author

tmshort commented Apr 27, 2026

/test e2e-aws-upgrade-ovn-single-node

@tmshort tmshort force-pushed the ocpbugs-62517-ha-replicas branch from cf8b48a to dfec6bd on April 27, 2026 18:47

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
manifests/experimental.yaml (1)

2544-2544: Replica count alone does not guarantee node-level HA.

With only replicas: 2, both pods can still land on the same node. Consider adding pod anti-affinity or topology spread constraints so this HA change is resilient to single-node loss.

Also applies to: 2682-2682

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/experimental.yaml` at line 2544, Replica count set to 2 is
insufficient for node-level HA because both pods can schedule to the same node;
update the workload spec that contains `replicas: 2` to include scheduling
constraints such as a podAntiAffinity (e.g.,
preferredDuringSchedulingIgnoredDuringExecution with topologyKey
`kubernetes.io/hostname`) or topologySpreadConstraints so the two replicas are
spread across distinct nodes (apply the same change to the other occurrence
referencing `replicas: 2` at the noted location).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/operator-controller/catalogmetadata/client/client.go`:
- Around line 108-113: The code currently returns early on any non-200
resp.StatusCode which prevents cache.Put from recording the actual error;
instead, modify the branch so that only 404 bypasses the cache (treat as HA
leader-miss) while other non-200 statuses create an error value and call
cache.Put(key, errOrErrorResponse) before returning the error; specifically, in
the function handling the HTTP response (the block that checks resp.StatusCode
and currently returns fmt.Errorf...), change the logic to if resp.StatusCode ==
http.StatusNotFound then return without caching, else construct the appropriate
error (including status code and body if available), call cache.Put with the
same cache key used elsewhere, and then return that error.

In `@manifests/experimental-e2e.yaml`:
- Line 2624: The shared e2e-coverage PVC is defined as ReadWriteOnce but both
Deployments (catalogd and operator-controller) mount it and replicas were
increased to 2, causing pods on different nodes to hang on attach; update the
storage model to support multiple nodes by either changing the PVC named
e2e-coverage to accessMode: ReadWriteMany (if your storage class supports RWX)
or convert these Deployments to StatefulSets and use volumeClaimTemplates to
give each replica its own PVC; locate the e2e-coverage PVC definition and the
catalogd/operator-controller workload manifests and apply the RWX change or the
StatefulSet + volumeClaimTemplates approach accordingly.

In `@test/e2e/features/ha.feature`:
- Around line 15-19: Add a pre-failover assertion that the catalog is already
serving: before the step "When the catalogd leader pod is force-deleted" assert
that catalog "test" reports Serving as True with Reason Available (i.e., ensure
the step/check that verifies Serving=True/Reason=Available runs prior to
deleting the leader pod). Locate the scenario steps around "And catalogd is
ready to reconcile resources", "And catalog \"test\" is reconciled" and the
existing "And catalog \"test\" reports Serving as True with Reason Available"
and move or duplicate that Serving check so it executes immediately before the
force-delete action to guarantee the catalog was serving before failover.

In `@test/e2e/steps/ha_steps.go`:
- Around line 27-31: The code uses the cached value sc.leaderPods["catalogd"]
which can be stale; before deleting the pod, refresh the current lease holder
from the cluster instead of relying on sc.leaderPods. Replace the direct use of
sc.leaderPods["catalogd"] in the step (after obtaining sc via scenarioCtx(ctx))
with a fresh lookup of the catalogd leader (e.g., call the existing helper that
queries the lease/leader from the API or add a small helper like
refresh/getLeaseHolder for "catalogd"), then use that up-to-date leaderPod for
the deletion and for assertions in NewCatalogdLeaderIsElected. Ensure the
refreshed value overwrites or is compared against sc.leaderPods["catalogd"] so
you don't accidentally delete a follower.

In `@test/e2e/steps/hooks.go`:
- Around line 136-142: The HA gate is currently based on total node count;
change it to count only control-plane nodes that are schedulable and enable
featureGates[catalogdHAFeature] only when there are at least two such nodes. Use
the existing k8sClient call to list control-plane nodes (select by label
node-role.kubernetes.io/control-plane or equivalent), then for each returned
node check its spec.taints (or call k8sClient to get the node object) and
exclude nodes with a NoSchedule control-plane taint; finally set
featureGates[catalogdHAFeature] = true when the filtered count >= 2. Ensure you
reference the same featureGates map and catalogdHAFeature symbol and keep the
logic inside the same hook.

---

Nitpick comments:
In `@manifests/experimental.yaml`:
- Line 2544: Replica count set to 2 is insufficient for node-level HA because
both pods can schedule to the same node; update the workload spec that contains
`replicas: 2` to include scheduling constraints such as a podAntiAffinity (e.g.,
preferredDuringSchedulingIgnoredDuringExecution with topologyKey
`kubernetes.io/hostname`) or topologySpreadConstraints so the two replicas are
spread across distinct nodes (apply the same change to the other occurrence
referencing `replicas: 2` at the noted location).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c19769a4-b068-4be4-bc2a-b40decbe7fad

📥 Commits

Reviewing files that changed from the base of the PR and between cf8b48a and dfec6bd.

📒 Files selected for processing (17)
  • Makefile
  • helm/experimental.yaml
  • internal/catalogd/serverutil/serverutil.go
  • internal/operator-controller/catalogmetadata/client/client.go
  • internal/operator-controller/catalogmetadata/client/client_test.go
  • manifests/experimental-e2e.yaml
  • manifests/experimental.yaml
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml
  • test/e2e/features/ha.feature
  • test/e2e/steps/ha_steps.go
  • test/e2e/steps/hooks.go
  • test/e2e/steps/steps.go
💤 Files with no reviewable changes (1)
  • internal/operator-controller/catalogmetadata/client/client_test.go
✅ Files skipped from review due to trivial changes (2)
  • openshift/helm/operator-controller.yaml
  • helm/experimental.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • openshift/helm/catalogd.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml

Comment on lines 108 to 113

 if resp.StatusCode != http.StatusOK {
-	errToCache := fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
-	return c.cache.Put(catalog.Name, catalog.Status.ResolvedSource.Image.Ref, nil, errToCache)
+	// Do not cache non-200 responses (e.g. 404 from a non-leader catalogd pod).
+	// Returning the error directly lets the next reconcile retry a fresh HTTP
+	// request and eventually hit the leader.
+	return nil, fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
 }

⚠️ Potential issue | 🟠 Major

Non-200 responses are now hidden from cache consumers

On Line 112, returning directly skips cache.Put, so later reads can surface cache ... not found instead of the real upstream failure. That’s risky for persistent non-200s (e.g., 401/403/5xx). Consider bypassing cache only for the HA race case (404), and caching other non-200 errors.

Proposed scoped fix
 if resp.StatusCode != http.StatusOK {
-	// Do not cache non-200 responses (e.g. 404 from a non-leader catalogd pod).
-	// Returning the error directly lets the next reconcile retry a fresh HTTP
-	// request and eventually hit the leader.
-	return nil, fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
+	err := fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
+	// Avoid pinning transient HA misses, but preserve error visibility for other failures.
+	if resp.StatusCode == http.StatusNotFound {
+		return nil, err
+	}
+	return c.cache.Put(catalog.Name, catalog.Status.ResolvedSource.Image.Ref, nil, err)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the block after applying the fix):

	if resp.StatusCode != http.StatusOK {
		err := fmt.Errorf("error: received unexpected response status code %d", resp.StatusCode)
		// Avoid pinning transient HA misses, but preserve error visibility for other failures.
		if resp.StatusCode == http.StatusNotFound {
			return nil, err
		}
		return c.cache.Put(catalog.Name, catalog.Status.ResolvedSource.Image.Ref, nil, err)
	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operator-controller/catalogmetadata/client/client.go` around lines
108 - 113, The code currently returns early on any non-200 resp.StatusCode which
prevents cache.Put from recording the actual error; instead, modify the branch
so that only 404 bypasses the cache (treat as HA leader-miss) while other
non-200 statuses create an error value and call cache.Put(key,
errOrErrorResponse) before returning the error; specifically, in the function
handling the HTTP response (the block that checks resp.StatusCode and currently
returns fmt.Errorf...), change the logic to if resp.StatusCode ==
http.StatusNotFound then return without caching, else construct the appropriate
error (including status code and body if available), call cache.Put with the
same cache key used elsewhere, and then return that error.

 spec:
   minReadySeconds: 5
-  replicas: 1
+  replicas: 2

⚠️ Potential issue | 🟠 Major

The shared e2e-coverage PVC does not scale with these replica bumps.

Both Deployments still mount the same e2e-coverage claim, but this file defines that PVC as ReadWriteOnce at Line 180-Line 185. With 2 catalogd pods and 2 operator-controller pods in the 2-node experimental suite, a replica scheduled onto a different node can stay Pending on volume attach, which defeats the HA rollout this change is trying to exercise.

Also applies to: 2775-2775

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifests/experimental-e2e.yaml` at line 2624, The shared e2e-coverage PVC is
defined as ReadWriteOnce but both Deployments (catalogd and operator-controller)
mount it and replicas were increased to 2, causing pods on different nodes to
hang on attach; update the storage model to support multiple nodes by either
changing the PVC named e2e-coverage to accessMode: ReadWriteMany (if your
storage class supports RWX) or convert these Deployments to StatefulSets and use
volumeClaimTemplates to give each replica its own PVC; locate the e2e-coverage
PVC definition and the catalogd/operator-controller workload manifests and apply
the RWX change or the StatefulSet + volumeClaimTemplates approach accordingly.

Comment on lines +15 to +19
And catalogd is ready to reconcile resources
And catalog "test" is reconciled
When the catalogd leader pod is force-deleted
Then a new catalogd leader is elected
And catalog "test" reports Serving as True with Reason Available

⚠️ Potential issue | 🟡 Minor

Assert the catalog is serving before killing the leader.

Right now this scenario only proves eventual availability after the delete. If the catalog never became Serving=True until after failover, the test would still pass, which weakens the regression signal for “resume serving” behavior.

Suggested tweak
     And catalogd is ready to reconcile resources
     And catalog "test" is reconciled
+    And catalog "test" reports Serving as True with Reason Available
     When the catalogd leader pod is force-deleted
     Then a new catalogd leader is elected
     And catalog "test" reports Serving as True with Reason Available
📝 Committable suggestion

Suggested change (the scenario after applying the fix):

	And catalogd is ready to reconcile resources
	And catalog "test" is reconciled
	And catalog "test" reports Serving as True with Reason Available
	When the catalogd leader pod is force-deleted
	Then a new catalogd leader is elected
	And catalog "test" reports Serving as True with Reason Available
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/features/ha.feature` around lines 15 - 19, Add a pre-failover
assertion that the catalog is already serving: before the step "When the
catalogd leader pod is force-deleted" assert that catalog "test" reports Serving
as True with Reason Available (i.e., ensure the step/check that verifies
Serving=True/Reason=Available runs prior to deleting the leader pod). Locate the
scenario steps around "And catalogd is ready to reconcile resources", "And
catalog \"test\" is reconciled" and the existing "And catalog \"test\" reports
Serving as True with Reason Available" and move or duplicate that Serving check
so it executes immediately before the force-delete action to guarantee the
catalog was serving before failover.

Comment on lines +27 to +31
	sc := scenarioCtx(ctx)
	leaderPod := sc.leaderPods["catalogd"]
	if leaderPod == "" {
		return fmt.Errorf("catalogd leader pod not found in scenario context; run 'catalogd is ready to reconcile resources' first")
	}

⚠️ Potential issue | 🟠 Major

Refresh the lease holder immediately before deleting the pod.

sc.leaderPods["catalogd"] can be stale by the time this step runs. If leadership already changed during catalog "test" is reconciled, this step deletes a follower, and NewCatalogdLeaderIsElected can then pass without any real failover because the holder already differs from the cached value.

Suggested fix
 func CatalogdLeaderPodIsForceDeleted(ctx context.Context) error {
 	sc := scenarioCtx(ctx)
-	leaderPod := sc.leaderPods["catalogd"]
-	if leaderPod == "" {
-		return fmt.Errorf("catalogd leader pod not found in scenario context; run 'catalogd is ready to reconcile resources' first")
+	holder, err := k8sClient("get", "lease", leaseNames["catalogd"], "-n", olmNamespace,
+		"-o", "jsonpath={.spec.holderIdentity}")
+	if err != nil || strings.TrimSpace(holder) == "" {
+		return fmt.Errorf("failed to resolve current catalogd leader from lease: %w", err)
 	}
+	leaderPod := strings.Split(strings.TrimSpace(holder), "_")[0]
+	sc.leaderPods["catalogd"] = leaderPod
 
 	logger.Info("Force-deleting catalogd leader pod", "pod", leaderPod)
 	if _, err := k8sClient("delete", "pod", leaderPod, "-n", olmNamespace,
 		"--force", "--grace-period=0"); err != nil {
📝 Committable suggestion

Suggested change (the block after applying the fix):

	sc := scenarioCtx(ctx)
	holder, err := k8sClient("get", "lease", leaseNames["catalogd"], "-n", olmNamespace,
		"-o", "jsonpath={.spec.holderIdentity}")
	if err != nil || strings.TrimSpace(holder) == "" {
		return fmt.Errorf("failed to resolve current catalogd leader from lease: %w", err)
	}
	leaderPod := strings.Split(strings.TrimSpace(holder), "_")[0]
	sc.leaderPods["catalogd"] = leaderPod

	logger.Info("Force-deleting catalogd leader pod", "pod", leaderPod)
	if _, err := k8sClient("delete", "pod", leaderPod, "-n", olmNamespace,
		"--force", "--grace-period=0"); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/steps/ha_steps.go` around lines 27 - 31, The code uses the cached
value sc.leaderPods["catalogd"] which can be stale; before deleting the pod,
refresh the current lease holder from the cluster instead of relying on
sc.leaderPods. Replace the direct use of sc.leaderPods["catalogd"] in the step
(after obtaining sc via scenarioCtx(ctx)) with a fresh lookup of the catalogd
leader (e.g., call the existing helper that queries the lease/leader from the
API or add a small helper like refresh/getLeaseHolder for "catalogd"), then use
that up-to-date leaderPod for the deletion and for assertions in
NewCatalogdLeaderIsElected. Ensure the refreshed value overwrites or is compared
against sc.leaderPods["catalogd"] so you don't accidentally delete a follower.
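The holder-parsing step in the suggested fix assumes the common client-go leader-election identity format "<pod-name>_<random-suffix>". A small, hedged helper for that extraction might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// leaderPodFromHolder extracts the pod name from a Lease's
// spec.holderIdentity. It assumes the "<pod-name>_<random-suffix>" identity
// format commonly produced by client-go leader election; when no underscore
// is present, the trimmed string is returned unchanged.
func leaderPodFromHolder(holder string) string {
	holder = strings.TrimSpace(holder)
	if i := strings.Index(holder, "_"); i >= 0 {
		return holder[:i]
	}
	return holder
}

func main() {
	fmt.Println(leaderPodFromHolder("catalogd-7c9f_3f2a1b")) // catalogd-7c9f
}
```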

Comment thread test/e2e/steps/hooks.go
Comment on lines +136 to +142
	// Enable HA scenarios when the cluster has at least 2 nodes. This runs
	// unconditionally so that upgrade scenarios (which install OLM in a Background
	// step and return early below) still get the gate set correctly.
	if out, err := k8sClient("get", "nodes", "--no-headers", "-o", "name"); err == nil &&
		len(strings.Fields(strings.TrimSpace(out))) >= 2 {
		featureGates[catalogdHAFeature] = true
	}

⚠️ Potential issue | 🟡 Minor

HA detection should use eligible control-plane nodes, not total nodes.

This currently enables the HA gate based on all nodes. On clusters with only one control-plane node, HA scenarios may be enabled incorrectly and fail for the wrong reason. Please gate on schedulable control-plane capacity for these deployments.

💡 Minimal adjustment
- if out, err := k8sClient("get", "nodes", "--no-headers", "-o", "name"); err == nil &&
+ if out, err := k8sClient("get", "nodes", "-l", "node-role.kubernetes.io/control-plane", "--no-headers", "-o", "name"); err == nil &&
 		len(strings.Fields(strings.TrimSpace(out))) >= 2 {
 		featureGates[catalogdHAFeature] = true
 	}
📝 Committable suggestion

Suggested change (the block after applying the fix):

	// Enable HA scenarios when the cluster has at least 2 nodes. This runs
	// unconditionally so that upgrade scenarios (which install OLM in a Background
	// step and return early below) still get the gate set correctly.
	if out, err := k8sClient("get", "nodes", "-l", "node-role.kubernetes.io/control-plane", "--no-headers", "-o", "name"); err == nil &&
		len(strings.Fields(strings.TrimSpace(out))) >= 2 {
		featureGates[catalogdHAFeature] = true
	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/steps/hooks.go` around lines 136 - 142, The HA gate is currently
based on total node count; change it to count only control-plane nodes that are
schedulable and enable featureGates[catalogdHAFeature] only when there are at
least two such nodes. Use the existing k8sClient call to list control-plane
nodes (select by label node-role.kubernetes.io/control-plane or equivalent),
then for each returned node check its spec.taints (or call k8sClient to get the
node object) and exclude nodes with a NoSchedule control-plane taint; finally
set featureGates[catalogdHAFeature] = true when the filtered count >= 2. Ensure
you reference the same featureGates map and catalogdHAFeature symbol and keep
the logic inside the same hook.
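
The taint-aware filtering described in the prompt above can be sketched as follows. This is an illustrative stand-in only: the `node` struct and its fields approximate the information the hook would read from real `corev1.Node` objects, and none of this is the actual hook code.

```go
package main

import "fmt"

// node is a minimal stand-in for the fields the hook would inspect
// on each cluster node (illustrative, not the real corev1.Node type).
type node struct {
	controlPlane bool // has the node-role.kubernetes.io/control-plane label
	noSchedule   bool // carries a NoSchedule control-plane taint
}

// schedulableControlPlanes counts control-plane nodes without a
// NoSchedule taint, mirroring the gating logic suggested above.
func schedulableControlPlanes(nodes []node) int {
	n := 0
	for _, nd := range nodes {
		if nd.controlPlane && !nd.noSchedule {
			n++
		}
	}
	return n
}

func main() {
	nodes := []node{
		{controlPlane: true},
		{controlPlane: true, noSchedule: true}, // tainted, excluded
		{controlPlane: true},
		{controlPlane: false}, // worker, excluded
	}
	// Enable the HA gate only with >= 2 schedulable control-plane nodes.
	haEnabled := schedulableControlPlanes(nodes) >= 2
	fmt.Println(haEnabled)
}
```

With this shape, a single-control-plane cluster with many workers no longer enables the gate, which is the failure mode the comment calls out.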

@tmshort
Contributor Author

tmshort commented Apr 28, 2026

/test e2e-aws-upgrade-ovn-single-node
/test openshift-e2e-aws

@tmshort tmshort force-pushed the ocpbugs-62517-ha-replicas branch from dfec6bd to c55a3f0 Compare May 1, 2026 18:56

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
openshift/operator-controller/manifests-experimental.yaml (1)

1227-1227: Add pod anti-affinity to ensure replicas spread across different nodes.

With only 2 replicas and no podAntiAffinity or topologySpreadConstraints, both pods can still co-locate on a single control-plane node, compromising HA during node failure. Adding preferred pod anti-affinity alongside the existing nodeAffinity section will distribute replicas across nodes.

♻️ Suggested manifest adjustment
      affinity:
+        podAntiAffinity:
+          preferredDuringSchedulingIgnoredDuringExecution:
+            - weight: 100
+              podAffinityTerm:
+                topologyKey: kubernetes.io/hostname
+                labelSelector:
+                  matchLabels:
+                    control-plane: operator-controller-controller-manager
         nodeAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openshift/operator-controller/manifests-experimental.yaml` at line 1227, Add
a preferred pod anti-affinity rule to the deployment/statefulset spec near the
existing replicas: 2 entry so pods prefer not to co-locate on the same node;
modify the pod template's spec.affinity by adding
podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution matching the
podLabel (use the same label selector as the workload) with topologyKey:
"kubernetes.io/hostname", and merge it with the existing nodeAffinity block (do
not remove nodeAffinity). Optionally consider adding topologySpreadConstraints
as an alternative, but implement the preferred podAntiAffinity under the same
affinity section for immediate replica spreading.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@openshift/operator-controller/manifests-experimental.yaml`:
- Line 1227: Add a preferred pod anti-affinity rule to the
deployment/statefulset spec near the existing replicas: 2 entry so pods prefer
not to co-locate on the same node; modify the pod template's spec.affinity by
adding podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution matching
the podLabel (use the same label selector as the workload) with topologyKey:
"kubernetes.io/hostname", and merge it with the existing nodeAffinity block (do
not remove nodeAffinity). Optionally consider adding topologySpreadConstraints
as an alternative, but implement the preferred podAntiAffinity under the same
affinity section for immediate replica spreading.
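
For reference, a sketch of the merged affinity stanza the comments above describe. The `control-plane: operator-controller-controller-manager` label selector is taken from the suggested diff; the rest is an illustrative fragment, not the rendered manifest.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              control-plane: operator-controller-controller-manager
  # the existing nodeAffinity block stays in place under this key, unchanged
```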

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bb752058-a2e9-40c1-997d-9d29533d7c3d

📥 Commits

Reviewing files that changed from the base of the PR and between dfec6bd and c55a3f0.

📒 Files selected for processing (6)
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml
✅ Files skipped from review due to trivial changes (2)
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • openshift/catalogd/manifests.yaml
  • openshift/catalogd/manifests-experimental.yaml

@tmshort tmshort force-pushed the ocpbugs-62517-ha-replicas branch from c55a3f0 to 238bbc8 Compare May 1, 2026 19:29
@tmshort tmshort changed the title WIP: OCPBUGS-62517 ha replicas OCPBUGS-62517: Add ha replicas May 1, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 1, 2026
@openshift-ci-robot

@tmshort: This pull request references Jira Issue OCPBUGS-62517, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiazha@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary by CodeRabbit

  • Chores
  • Increased controller manager deployment replicas from 1 to 2 for improved availability.
  • Added Pod Disruption Budgets for catalogd and operator-controller to keep at least one pod available during maintenance.
  • Introduced preferred pod anti-affinity to bias replicas onto different nodes for better fault tolerance.
  • Updated Helm chart values to reflect these deployment and scheduling changes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 1, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2026
@tmshort tmshort changed the title OCPBUGS-62517: Add ha replicas OCPBUGS-62517: Set replicas=2 and enable PodDisruptionBudget May 1, 2026
@tmshort
Contributor Author

tmshort commented May 1, 2026

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-ovn-upgrade 10

@tmshort
Contributor Author

tmshort commented May 1, 2026

/payload-aggregate periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node 10

@openshift-ci
Contributor

openshift-ci Bot commented May 1, 2026

@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-5.0-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/234b3ed0-459a-11f1-9727-76273af6616d-0

@openshift-ci
Contributor

openshift-ci Bot commented May 1, 2026

@tmshort: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-5.0-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2a784040-459a-11f1-9da3-e06aa30947a7-0

@perdasilva
Contributor

/retest

@tmshort
Contributor Author

tmshort commented May 4, 2026

/retest
Also looks as though the aggregate jobs succeeded.
EDIT: Single node test is failing.

@tmshort
Contributor Author

tmshort commented May 4, 2026

The single-node test is failing in aggregate also:


: [sig-arch] Cluster topology single node tests Verify that OpenShift components deploy one replica in SingleReplica topology mode [Suite:openshift/conformance/parallel] (1s)

fail [github.com/openshift/origin/test/extended/single_node/topology.go:97]: catalogd-controller-manager in openshift-catalogd namespace expected to have 1 replica but got 2
Expected
    <int>: 2
to equal
    <int>: 1

@tmshort
Contributor Author

tmshort commented May 4, 2026

It appears that this fix is incompatible with SNO; a fix will need to be made in cluster-olm-operator.
/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2026
…ffinity for HA topology

Rolling updates in HighlyAvailable clusters leave catalogd and
operator-controller unavailable when the only running pod is evicted
before its replacement is ready.

Fix by defaulting replicas=1 and PDB disabled in the static Helm values
(safe for SNO/External topologies, passes the SNO conformance test that
asserts exactly one replica in SingleReplica topology mode). Add pod
anti-affinity to prefer scheduling replicas on different nodes.

cluster-olm-operator detects the cluster's ControlPlaneTopology at
startup and overrides these values to replicas=2 and PDB enabled when a
HighlyAvailable topology is detected, then re-renders the manifests
before starting controllers. When a topology change is observed at
runtime (exceedingly rare), the operator exits so its deployment
controller restarts it, triggering a fresh Helm render with the correct
values for the new topology.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
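
The topology override the commit message describes can be sketched roughly like this. The function name and return shape are illustrative only; the real cluster-olm-operator code reads `ControlPlaneTopology` from the cluster Infrastructure resource and feeds the result into the Helm render.

```go
package main

import "fmt"

// haValues mirrors the override described in the commit message:
// a HighlyAvailable control-plane topology gets 2 replicas with the
// PDB enabled, while every other topology (SingleReplica, External)
// keeps the static Helm defaults of 1 replica and no PDB.
func haValues(topology string) (replicas int, pdbEnabled bool) {
	if topology == "HighlyAvailable" {
		return 2, true
	}
	return 1, false
}

func main() {
	for _, t := range []string{"HighlyAvailable", "SingleReplica", "External"} {
		r, pdb := haValues(t)
		fmt.Printf("%s: replicas=%d pdb=%v\n", t, r, pdb)
	}
}
```

Because the static defaults are the SNO-safe values, a cluster whose topology cannot be determined still renders manifests that pass the SingleReplica conformance test.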
@tmshort tmshort force-pushed the ocpbugs-62517-ha-replicas branch from 238bbc8 to 4c63a27 Compare May 4, 2026 19:48
@tmshort tmshort changed the title OCPBUGS-62517: Set replicas=2 and enable PodDisruptionBudget OCPBUGS-62517: Set replicas=1, PDB, and pod anti-affinity for HA topology May 4, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@openshift/helm/catalogd.yaml`:
- Line 11: The manifest hard-codes "replicas: 1" which prevents HA; replace that
literal with a Helm-driven value so replica count follows cluster topology—e.g.,
remove the fixed "replicas: 1" in catalogd.yaml and reference a chart value such
as .Values.replicaCount (or use a conditional on .Values.topology /
.Values.highlyAvailable to choose 1 vs N) and add the corresponding default in
values.yaml; ensure the template and values are used everywhere catalogd is
deployed so multi-node clusters can scale replicas appropriately.

In `@openshift/helm/operator-controller.yaml`:
- Line 11: Replace the hard-coded replicas: 1 with a topology-aware
Helm/template-driven value and pair it with PDB enablement; specifically, change
the replicas field to use a templated value (e.g., replicas: {{
.Values.topology.replicas }} or a helper like {{ include "operator.replicas" .
}}), implement a helper/template (operator.replicas or values.topology.mode)
that returns 1 for SingleReplica and 2 (or configurable HA count) for
HighlyAvailable, and ensure the same topology flag drives creation/enabling of a
PodDisruptionBudget resource so disruption protection is applied in HA mode.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9277e43b-f0b3-4431-abfa-7af3287dbac9

📥 Commits

Reviewing files that changed from the base of the PR and between 238bbc8 and 4c63a27.

📒 Files selected for processing (6)
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml

enabled: true
deployment:
  image: ${CATALOGD_IMAGE}
  replicas: 1

@coderabbitai coderabbitai Bot May 4, 2026


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid hard-coding catalogd to a single replica.

Line 11 fixes replicas at 1, which prevents HA rollout protection on multi-node clusters. This should follow cluster topology (SingleReplica vs HighlyAvailable) rather than being globally pinned.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openshift/helm/catalogd.yaml` at line 11, The manifest hard-codes "replicas:
1" which prevents HA; replace that literal with a Helm-driven value so replica
count follows cluster topology—e.g., remove the fixed "replicas: 1" in
catalogd.yaml and reference a chart value such as .Values.replicaCount (or use a
conditional on .Values.topology / .Values.highlyAvailable to choose 1 vs N) and
add the corresponding default in values.yaml; ensure the template and values are
used everywhere catalogd is deployed so multi-node clusters can scale replicas
appropriately.

Contributor Author


This is updated via cluster-olm-operator; so this value is appropriate here.



@tmshort, understood — the replicas: 1 in the Helm chart is intentionally a conservative default that gets overridden by cluster-olm-operator based on the actual cluster topology (e.g. HA vs. SNO). I'll withdraw my concern. Thanks for the clarification!


📝 Learnings were identified and not saved (knowledge base disabled). Enable

enabled: true
deployment:
  image: ${OPERATOR_CONTROLLER_IMAGE}
  replicas: 1

@coderabbitai coderabbitai Bot May 4, 2026


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make replica count topology-aware instead of hard-coding 1.

Line 11 pins replicas to 1, so the HA path (multi-node clusters running 2 replicas with disruption protection) cannot be realized from this config. This should be topology-aware (for example, SingleReplica=1 and HighlyAvailable=2) and paired with PDB enablement in the same flow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@openshift/helm/operator-controller.yaml` at line 11, Replace the hard-coded
replicas: 1 with a topology-aware Helm/template-driven value and pair it with
PDB enablement; specifically, change the replicas field to use a templated value
(e.g., replicas: {{ .Values.topology.replicas }} or a helper like {{ include
"operator.replicas" . }}), implement a helper/template (operator.replicas or
values.topology.mode) that returns 1 for SingleReplica and 2 (or configurable HA
count) for HighlyAvailable, and ensure the same topology flag drives
creation/enabling of a PodDisruptionBudget resource so disruption protection is
applied in HA mode.

Contributor Author


This is updated via cluster-olm-operator; so this value is appropriate here.



@tmshort, understood — thanks for the clarification! Since cluster-olm-operator overrides the replica count based on topology at a higher level, replicas: 1 here is the correct baseline default. I'll note this for future reviews.


📝 Learnings were identified and not saved (knowledge base disabled). Enable

@tmshort
Contributor Author

tmshort commented May 4, 2026

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2026
@tmshort
Contributor Author

tmshort commented May 5, 2026

/test e2e-aws-upgrade-ovn-single-node

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

@tmshort: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants