feat(alerts): make nova alerts region- and value-aware #902
Inline alert rules into each bundle's templates/alerts.yaml so they can be gated on Helm values. Nova: severity of CortexNovaSchedulingDown depends on kvm.enabled, CortexNovaDoesntFindValidKVMHosts only renders when KVM is enabled, memory and reconcile-duration thresholds are configurable via .Values.alerts.thresholds. Other bundles: structural relocation only with Style-B escaping of Prometheus directives. Ironcore: empty rules removed.
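For orientation, the values that the nova template reads could look roughly like the excerpt below. This is only a sketch: the key names are taken from the template in this PR, while the concrete values are assumptions (the two thresholds reuse the numbers that were previously hard-coded in the rule expressions, and `my-prometheus` is a placeholder, not a real instance name).

```yaml
# Hypothetical values.yaml excerpt for the cortex-nova bundle.
# Key names come from templates/alerts.yaml; all values are illustrative.
kvm:
  # When false, CortexNovaSchedulingDown is rendered with severity "warning"
  # and CortexNovaDoesntFindValidKVMHosts is not rendered at all.
  enabled: true
alerts:
  # Gates rendering of the whole PrometheusRule resource.
  enabled: true
  # Placeholder name; the template wraps this key in `required`.
  prometheus: my-prometheus
  thresholds:
    # Previously hard-coded as 6000 in the CortexNovaHighMemoryUsage expr.
    highMemoryMiB: 6000
    # Previously hard-coded as 600 in the CortexNovaReconcileDurationHigher10Min expr.
    reconcileDurationSeconds: 600
```

Overriding `kvm.enabled` or either threshold per region should then change the rendered rules without touching the template itself.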
Walkthrough: This PR consolidates Prometheus alert rule definitions across four Cortex bundles by moving them from dedicated `alerts/*.alerts.yaml` files into each bundle's `templates/alerts.yaml`.
To get a useful diff, run something like: git --no-pager diff --no-index -w <(git show HEAD~1:helm/bundles/cortex-nova/alerts/nova.alerts.yaml) helm/bundles/cortex-nova/templates/alerts.yaml
diff --git a/proc/self/fd/16 b/helm/bundles/cortex-nova/templates/alerts.yaml
index 00000000..6f3fabef 100644
--- a/proc/self/fd/16
+++ b/helm/bundles/cortex-nova/templates/alerts.yaml
@@ -1,3 +1,19 @@
+# Copyright SAP SE
+# SPDX-License-Identifier: Apache-2.0
+
+# NOTE: This file is rendered by Helm. Prometheus templating directives
+# (e.g. {{ "{{" }} $labels.foo {{ "}}" }}) must be escaped using Style B:
+# replace the outer `{{` and `}}` with `{{ "{{" }}` and `{{ "}}" }}`.
+
+{{- if .Values.alerts.enabled }}
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+ name: cortex-nova-alerts
+ labels:
+ type: alerting-rules
+ prometheus: {{ required ".Values.alerts.prometheus missing" .Values.alerts.prometheus | quote }}
+spec:
groups:
- name: cortex-nova-alerts
rules:
@@ -10,7 +26,7 @@ groups:
context: liveness
dashboard: cortex-status-dashboard/cortex-status-dashboard
service: cortex
- severity: critical
+ severity: {{ if .Values.kvm.enabled }}critical{{ else }}warning{{ end }}
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/down
annotations:
@@ -93,7 +109,7 @@ groups:
Thus, no immediate action is needed.
- alert: CortexNovaHighMemoryUsage
- expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > 6000 * 1024 * 1024
+ expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > {{ .Values.alerts.thresholds.highMemoryMiB }} * 1024 * 1024
for: 5m
labels:
context: memory
@@ -103,9 +119,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/deployment
annotations:
- summary: "`{{$labels.component}}` uses too much memory"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` uses too much memory"
description: >
- `{{$labels.component}}` should not be using more than 6000 MiB of memory. Usually it
+ `{{ "{{" }} $labels.component {{ "}}" }}` should not be using more than {{ .Values.alerts.thresholds.highMemoryMiB }} MiB of memory. Usually it
should use much less, so there may be a memory leak or other changes
that are causing the memory usage to increase significantly.
@@ -120,9 +136,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/deployment
annotations:
- summary: "`{{$labels.component}}` uses too much CPU"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` uses too much CPU"
description: >
- `{{$labels.component}}` should not be using more than 50% of a single CPU core. Usually
+ `{{ "{{" }} $labels.component {{ "}}" }}` should not be using more than 50% of a single CPU core. Usually
it should use much less, so there may be a CPU leak or other changes
that are causing the CPU usage to increase significantly.
@@ -137,9 +153,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/database
annotations:
- summary: "`{{$labels.component}}` is trying to connect to the database too often"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` is trying to connect to the database too often"
description: >
- `{{$labels.component}}` is trying to connect to the database too often. This may happen
+ `{{ "{{" }} $labels.component {{ "}}" }}` is trying to connect to the database too often. This may happen
when the database is down or the connection parameters are misconfigured.
- alert: CortexNovaSyncNotSuccessful
@@ -153,9 +169,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "`{{$labels.component}}` Sync not successful"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` Sync not successful"
description: >
- `{{$labels.component}}` experienced an issue syncing data from the datasource `{{$labels.datasource}}`. This may
+ `{{ "{{" }} $labels.component {{ "}}" }}` experienced an issue syncing data from the datasource `{{ "{{" }} $labels.datasource {{ "}}" }}`. This may
happen when the datasource (OpenStack, Prometheus, etc.) is down or
the sync module is misconfigured. No immediate action is needed, since
the sync module will retry the sync operation and the currently synced
@@ -173,9 +189,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "`{{$labels.component}}` is not syncing any new data from `{{$labels.datasource}}`"
+ summary: "`{{ "{{" }} $labels.component {{ "}}" }}` is not syncing any new data from `{{ "{{" }} $labels.datasource {{ "}}" }}`"
description: >
- `{{$labels.component}}` is not syncing any objects from the datasource `{{$labels.datasource}}`. This may happen
+ `{{ "{{" }} $labels.component {{ "}}" }}` is not syncing any objects from the datasource `{{ "{{" }} $labels.datasource {{ "}}" }}`. This may happen
when the datasource (OpenStack, Prometheus, etc.) is down or the sync
module is misconfigured. No immediate action is needed, since the sync
module will retry the sync operation and the currently synced data will
@@ -193,7 +209,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "Datasource `{{$labels.datasource}}` is in `{{$labels.state}}` state"
+ summary: "Datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the datasource
connectivity or configuration. It is recommended to investigate the
@@ -210,7 +226,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "Knowledge `{{$labels.knowledge}}` is in `{{$labels.state}}` state"
+ summary: "Knowledge `{{ "{{" }} $labels.knowledge {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the knowledge
configuration. It is recommended to investigate the
@@ -226,7 +242,7 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Some decisions are in error state for operator `{{$labels.operator}}`"
+ summary: "Some decisions are in error state for operator `{{ "{{" }} $labels.operator {{ "}}" }}`"
description: >
The cortex scheduling pipeline generated decisions that are in error state.
This may indicate issues with the decision logic or the underlying infrastructure.
@@ -243,7 +259,7 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Too many decisions are in waiting state for operator `{{$labels.operator}}`"
+ summary: "Too many decisions are in waiting state for operator `{{ "{{" }} $labels.operator {{ "}}" }}`"
description: >
The cortex scheduling pipeline has a high number of decisions for which
no target host has been assigned yet.
@@ -264,7 +280,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "KPI `{{$labels.kpi}}` is in `{{$labels.state}}` state"
+ summary: "KPI `{{ "{{" }} $labels.kpi {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the KPI
configuration. It is recommended to investigate the
@@ -281,12 +297,13 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
- summary: "Pipeline `{{$labels.pipeline}}` is in `{{$labels.state}}` state"
+ summary: "Pipeline `{{ "{{" }} $labels.pipeline {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the pipeline
configuration. It is recommended to investigate the
pipeline status and logs for more details.
+ {{- if .Values.kvm.enabled }}
- alert: CortexNovaDoesntFindValidKVMHosts
expr: sum by (az, hvtype) (increase(cortex_vm_faults{hvtype=~"CH|QEMU",faultmsg=~".*No valid host was found.*",faultmsg!~".*No such host.*"}[5m])) > 0
for: 5m
@@ -300,10 +317,11 @@ groups:
annotations:
summary: "Nova scheduling cannot find valid KVM hosts"
description: >
- Cortex is seeing new faulty vms in `{{$labels.az}}` where Nova scheduling
- failed to find a valid `{{$labels.hvtype}}` host. This may indicate
+ Cortex is seeing new faulty vms in `{{ "{{" }} $labels.az {{ "}}" }}` where Nova scheduling
+ failed to find a valid `{{ "{{" }} $labels.hvtype {{ "}}" }}` host. This may indicate
capacity issues, misconfigured filters, or resource constraints in the
datacenter. Investigate the affected VMs and hypervisor availability.
+ {{- end }}
- alert: CortexNovaNewDatasourcesNotReconciling
expr: count by(datasource) (cortex_datasource_seconds_until_reconcile{queued="false",domain="nova"}) > 0
@@ -316,9 +334,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "New datasource `{{$labels.datasource}}` has not reconciled"
+ summary: "New datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` has not reconciled"
description: >
- A new datasource `{{$labels.datasource}}` has been added but has not
+ A new datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` has been added but has not
completed its first reconciliation yet. This may indicate issues with
the datasource controller's workqueue overprioritizing other datasources.
@@ -335,9 +353,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "Existing datasource `{{$labels.datasource}}` is lacking behind"
+ summary: "Existing datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` is lacking behind"
description: >
- An existing datasource `{{$labels.datasource}}` has been queued for
+ An existing datasource `{{ "{{" }} $labels.datasource {{ "}}" }}` has been queued for
reconciliation for more than 10 minutes. This may indicate issues with
the datasource controller's workqueue or that this or another datasource
is taking an unusually long time to reconcile.
@@ -365,7 +383,7 @@ groups:
- alert: CortexNovaReconcileDurationHigher10Min
expr: |
(sum by (controller) (rate(controller_runtime_reconcile_time_seconds_sum{service="cortex-nova-metrics"}[5m])))
- / (sum by (controller) (rate(controller_runtime_reconcile_time_seconds_count{service="cortex-nova-metrics"}[5m]))) > 600
+ / (sum by (controller) (rate(controller_runtime_reconcile_time_seconds_count{service="cortex-nova-metrics"}[5m]))) > {{ .Values.alerts.thresholds.reconcileDurationSeconds }}
for: 15m
labels:
context: controller-duration
@@ -375,8 +393,8 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/reconciles
annotations:
- summary: "Controller reconciliation takes longer than ({{ $value | humanizeDuration }})"
- description: "Reconcile duration higher than 10m while reconciling {{ $labels.controller }}"
+ summary: "Controller reconciliation takes longer than ({{ "{{" }} $value | humanizeDuration {{ "}}" }})"
+ description: "Reconcile duration higher than 10m while reconciling {{ "{{" }} $labels.controller {{ "}}" }}"
- alert: CortexNovaWorkqueueNotDrained
expr: |
@@ -390,9 +408,9 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/datasources
annotations:
- summary: "Controller {{ $labels.name }}'s backlog is not being drained."
+ summary: "Controller {{ "{{" }} $labels.name {{ "}}" }}'s backlog is not being drained."
description: >
- The workqueue for controller {{ $labels.name }} has a backlog that is
+ The workqueue for controller {{ "{{" }} $labels.name {{ "}}" }} has a backlog that is
not being drained. This may indicate that the controller is overwhelmed
with work or is stuck on certain resources. Check the controller logs
and the state of the resources it manages for more details.
@@ -408,9 +426,9 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Controller webhook {{ $labels.webhook }} latency is high"
+ summary: "Controller webhook {{ "{{" }} $labels.webhook {{ "}}" }} latency is high"
description: >
- The latency for webhook {{ $labels.webhook }} is higher than expected (p90 > 200ms).
+ The latency for webhook {{ "{{" }} $labels.webhook {{ "}}" }} is higher than expected (p90 > 200ms).
This may indicate performance issues with the webhook server or the logic it executes.
Check the webhook server logs and monitor its resource usage for more insights.
@@ -426,9 +444,9 @@ groups:
severity: warning
support_group: workload-management
annotations:
- summary: "Controller webhook {{ $labels.webhook }} is experiencing errors"
+ summary: "Controller webhook {{ "{{" }} $labels.webhook {{ "}}" }} is experiencing errors"
description: >
- The webhook {{ $labels.webhook }} has experienced errors in the last 5 minutes.
+ The webhook {{ "{{" }} $labels.webhook {{ "}}" }} has experienced errors in the last 5 minutes.
This may indicate issues with the webhook logic, connectivity problems, or
external factors causing failures. Check the webhook server logs for error
details and investigate the affected resources.
@@ -489,7 +507,7 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/committed-resource-performance
annotations:
- summary: "Committed Resource rejection rate too high ({{ $value | humanizePercentage }})"
+ summary: "Committed Resource rejection rate too high ({{ "{{" }} $value | humanizePercentage {{ "}}" }})"
description: >
More than 30% of commitment changes have been rejected over the last 15 minutes.
This may indicate insufficient capacity to fulfill new commitments. Rejected
@@ -563,10 +581,10 @@ groups:
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/committed-resource-capacity
annotations:
- summary: "Committed Resource capacity for {{ $labels.resource }} in {{ $labels.az }} dropped to zero"
+ summary: "Committed Resource capacity for {{ "{{" }} $labels.resource {{ "}}" }} in {{ "{{" }} $labels.az {{ "}}" }} dropped to zero"
description: >
- The reported capacity for committed resource {{ $labels.resource }} in
- availability zone {{ $labels.az }} has dropped from a positive value to zero.
+ The reported capacity for committed resource {{ "{{" }} $labels.resource {{ "}}" }} in
+ availability zone {{ "{{" }} $labels.az {{ "}}" }} has dropped from a positive value to zero.
This may mean hypervisors in that AZ are fully utilized for the corresponding
flavor group and no further committed resources can be placed there.
@@ -607,3 +625,4 @@ groups:
The committed resource quota API (Limes LIQUID integration) is returning
HTTP 5xx errors. This indicates internal problems computing or applying
quota. Limes may not be able to enforce committed resource quotas.
+{{- end }}
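To make the effect of the Style-B escaping and the value gating concrete, the fragment below sketches what two of the rules might look like after Helm has rendered the template. It assumes `kvm.enabled=false` and `alerts.thresholds.highMemoryMiB=6000`; only the fields touched by this diff are shown, so it is not a complete rule definition.

```yaml
# Illustrative rendered output (fields abbreviated), assuming
# kvm.enabled=false and alerts.thresholds.highMemoryMiB=6000.
- alert: CortexNovaSchedulingDown
  labels:
    # {{ if .Values.kvm.enabled }}critical{{ else }}warning{{ end }}
    severity: warning
- alert: CortexNovaHighMemoryUsage
  # The Helm value is substituted; the arithmetic is left for Prometheus.
  expr: process_resident_memory_bytes{service="cortex-nova-metrics"} > 6000 * 1024 * 1024
  annotations:
    # {{ "{{" }} and {{ "}}" }} render as literal braces, so Prometheus
    # receives its own templating directive unchanged.
    summary: "`{{ $labels.component }}` uses too much memory"
```

With `kvm.enabled=false`, the `CortexNovaDoesntFindValidKVMHosts` rule would be absent from the rendered output entirely, since it sits inside the `{{- if .Values.kvm.enabled }}` block.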
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@helm/bundles/cortex-manila/templates/alerts.yaml`:
- Around line 236-251: The alert definition for CortexManilaPipelineUnready has
the wrong context label; update the labels block in the
CortexManilaPipelineUnready alert (alert name: CortexManilaPipelineUnready,
expr: cortex_pipeline_state{domain="manila",state!="ready"}) to change context:
kpis to context: pipelines so the alert is correctly categorized under pipeline
alerts.
📒 Files selected for processing (12)
- docs/reservations/committed-resource-reservations.md
- helm/bundles/cortex-cinder/alerts/cinder.alerts.yaml
- helm/bundles/cortex-cinder/templates/alerts.yaml
- helm/bundles/cortex-ironcore/alerts/ironcore.alerts.yaml
- helm/bundles/cortex-ironcore/templates/alerts.yaml
- helm/bundles/cortex-manila/alerts/manila.alerts.yaml
- helm/bundles/cortex-manila/templates/alerts.yaml
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
- helm/bundles/cortex-nova/templates/alerts.yaml
- helm/bundles/cortex-nova/values.yaml
- helm/bundles/cortex-placement-shim/alerts/placement-shim.alerts.yaml
- helm/bundles/cortex-placement-shim/templates/alerts.yaml
💤 Files with no reviewable changes (6)
- helm/bundles/cortex-placement-shim/alerts/placement-shim.alerts.yaml
- helm/bundles/cortex-ironcore/alerts/ironcore.alerts.yaml
- helm/bundles/cortex-manila/alerts/manila.alerts.yaml
- helm/bundles/cortex-cinder/alerts/cinder.alerts.yaml
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
- helm/bundles/cortex-ironcore/templates/alerts.yaml
- alert: CortexManilaPipelineUnready
expr: cortex_pipeline_state{domain="manila",state!="ready"} != 0
for: 5m
labels:
context: kpis
dashboard: cortex-status-dashboard/cortex-status-dashboard
service: cortex
severity: warning
support_group: workload-management
playbook: docs/support/playbook/cortex/alerts/unready
annotations:
summary: "Pipeline `{{ "{{" }} $labels.pipeline {{ "}}" }}` is in `{{ "{{" }} $labels.state {{ "}}" }}` state"
description: >
This may indicate issues with the pipeline
configuration. It is recommended to investigate the
pipeline status and logs for more details.
Fix incorrect context label.
The CortexManilaPipelineUnready alert uses context: kpis on line 240, but this should be context: pipelines to correctly reflect that it monitors pipeline state, not KPI state.
🔧 Proposed fix
- alert: CortexManilaPipelineUnready
expr: cortex_pipeline_state{domain="manila",state!="ready"} != 0
for: 5m
labels:
- context: kpis
+ context: pipelines
dashboard: cortex-status-dashboard/cortex-status-dashboard
service: cortex
severity: warning
Test Coverage Report: 69.6%