Skip to content

feat: pluggable per-project Slack notifications with KPI comparison#103

Merged
kpouget merged 12 commits into
openshift-psap:mainfrom
ashtarkb:Slack-notificaions
Jul 3, 2026
Merged

feat: pluggable per-project Slack notifications with KPI comparison#103
kpouget merged 12 commits into
openshift-psap:mainfrom
ashtarkb:Slack-notificaions

Conversation

@ashtarkb

@ashtarkb ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Add a pluggable notification provider system allowing each FORGE project to define its own Slack notification behavior. Implement MCPGatewaySlackProvider with structured messages including KPI comparison against previous MLflow runs.

  • Add SlackNotificationProvider ABC and NotificationContext dataclass
  • Auto-discover provider via config.yaml (notifications.slack.provider_module)
  • Parameterize Slack API helpers to accept arbitrary channel_id
  • MCP Gateway provider: header, metadata, KPI table, artifact links
  • Reuse get_ocpci_link from core notification system for link generation

Summary by CodeRabbit

  • New Features

    • Added Slack notifications for MCP Gateway runs.
    • Notifications now include run duration, key KPI summaries, MLflow links, and structured failure details.
    • Messages are posted in threads for easier follow-up and are routed to the configured Slack channel.
  • Bug Fixes

    • Reduced duplicate notifications by suppressing repeats for matching runs.
    • Improved notification delivery handling when Slack setup is unavailable or incomplete.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2026
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kpouget for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@ashtarkb, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 15 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b309782-08c9-48c7-af8c-2a7c580082b5

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab7b36 and 50c2fee.

📒 Files selected for processing (2)
  • projects/core/notifications/provider.py
  • projects/mcp_gateway/orchestration/notifications.py
📝 Walkthrough

Walkthrough

Introduces a pluggable per-project Slack notification provider framework: a NotificationContext dataclass and abstract SlackNotificationProvider base class, dynamic provider resolution in CI app, wiring into export notification dispatch, extended Slack API functions with channel_id support, shared notification helper utilities, and a concrete MCP Gateway Slack provider with KPI comparison via MLflow.

Changes

Slack Notification Provider Framework

Layer / File(s) Summary
NotificationContext and provider base class
projects/core/notifications/provider.py
Defines NotificationContext dataclass and abstract SlackNotificationProvider with token resolution via vault, thread anchor logic, and notify() orchestration.
Slack API channel_id parameterization
projects/core/notifications/slack/api.py
search_channel_message and send_message accept an optional channel_id, defaulting to module CHANNEL_ID.
Shared notification helpers
projects/core/notifications/helpers.py
New module resolving artifact roots, reading test duration/labels, discovering KPI files, extracting MLflow URLs, and formatting KPI tables/lists.
CI app and export wiring
projects/agentic_tools/ci_base.py, projects/core/library/export.py
CIApp._resolve_notification_provider dynamically imports a configured provider and stores it on ctx.obj; send_notification accepts and uses notification_provider to build a NotificationContext and call notify().
MCP Gateway Slack provider
projects/mcp_gateway/orchestration/config.yaml, projects/mcp_gateway/orchestration/notifications.py
Adds MCPGatewaySlackProvider, formatting message sections, loading current/previous KPIs via kpis.jsonl and MLflow, run-name parsing, and duplicate-notification suppression.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant ExportPipeline
  participant MCPGatewaySlackProvider
  participant MLflow
  participant SlackAPI

  ExportPipeline->>MCPGatewaySlackProvider: notify(context)
  MCPGatewaySlackProvider->>MCPGatewaySlackProvider: format_message(context)
  MCPGatewaySlackProvider->>MCPGatewaySlackProvider: _load_current_kpis(context)
  MCPGatewaySlackProvider->>MLflow: _load_previous_kpis_from_mlflow()
  MLflow-->>MCPGatewaySlackProvider: previous run metrics
  MCPGatewaySlackProvider->>MCPGatewaySlackProvider: build_comparison_table / _skip_notification
  MCPGatewaySlackProvider->>SlackAPI: search_channel_message(anchor)
  SlackAPI-->>MCPGatewaySlackProvider: existing message or none
  MCPGatewaySlackProvider->>SlackAPI: send_message(formatted message)
  SlackAPI-->>MCPGatewaySlackProvider: success/failure
Loading

Possibly related PRs

  • openshift-psap/forge#2: Extends the same Slack API functions (`search_channel_message`/`send_message`) integrated here into the new provider-based flow.
  • openshift-psap/forge#97: Also modifies `send_notification(...)` in `projects/core/library/export.py`, directly overlapping with the notification-provider wiring in this PR.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: pluggable per-project Slack notifications with KPI comparison.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway smoke 🟢

Execution Engine Configuration

forge:
  args:
  - smoke
  configOverrides:
    infrastructure.mcp_gateway_version: 0.7.0
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 1 second

01 Prepare 2 minutes, 25 seconds

02 Preflight 51 seconds

03 Test 1 minute, 40 seconds

04 Post-Cleanup 1 minute, 45 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline prepare-only
/version 0.7.0

1 similar comment
@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline prepare-only
/version 0.7.0

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown
🔴 Submission of mcp_gateway smoke failed after 5 seconds 🔴
/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline prepare-only
/test fournos mcp_gateway smoke
/version 0.7.0

Comment thread projects/core/notifications/provider.py Outdated
DEFAULT_SLACK_TOKEN_FILE = "topsail-bot.slack-token"
DEFAULT_SECRET_ENV_KEYS = (
"PSAP_FORGE_NOTIFICATIONS_SECRET_PATH",
"PSAP_FORGE_JUMP_CI_SECRET_PATH",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't need that (the jump-ci is deprecated, it was before Fournos got ready)

Comment thread projects/mcp_gateway/orchestration/notifications.py Outdated
Comment thread projects/core/notifications/provider.py Outdated
@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

1 similar comment
@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway smoke 🟢

Execution Engine Configuration

forge:
  args:
  - smoke
  configOverrides:
    infrastructure.mcp_gateway_version: 0.7.0
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 1 second

01 Prepare 2 minutes, 47 seconds

02 Preflight 53 seconds

03 Test 1 minute, 40 seconds

04 Post-Cleanup 1 minute, 46 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway smoke 🟢

Execution Engine Configuration

forge:
  args:
  - smoke
  configOverrides:
    infrastructure.mcp_gateway_version: 0.7.0
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 1 second

01 Prepare 2 minutes, 30 seconds

02 Preflight 1 minute, 16 seconds

03 Test 1 minute, 46 seconds

04 Post-Cleanup 2 minutes, 9 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

1 similar comment
@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway smoke 🟢

Execution Engine Configuration

forge:
  args:
  - smoke
  configOverrides:
    infrastructure.mcp_gateway_version: 0.7.0
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 1 second

01 Prepare 2 minutes, 37 seconds

02 Preflight 1 minute, 20 seconds

03 Test 1 minute, 39 seconds

04 Post-Cleanup 1 minute, 50 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

3 similar comments
@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

@ashtarkb

ashtarkb commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway smoke
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 0.7.0

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway smoke 🟢

Execution Engine Configuration

forge:
  args:
  - smoke
  configOverrides:
    infrastructure.mcp_gateway_version: 0.7.0
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 1 second

01 Prepare 2 minutes, 33 seconds

02 Preflight 1 minute, 16 seconds

03 Test 1 minute, 50 seconds

04 Post-Cleanup 1 minute, 39 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway demo 🟢

Execution Engine Configuration

forge:
  args:
  - demo
  configOverrides:
    infrastructure.mcp_gateway_version: 4b62869534684bb98ace76c235be8bde1b88c9ea
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 1 second

01 Prepare 3 minutes, 4 seconds

02 Preflight 28 seconds

03 Test 14 minutes, 42 seconds

04 Post-Cleanup 4 minutes, 42 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

ashtarkb and others added 11 commits July 2, 2026 16:35
Add a pluggable notification provider system allowing each FORGE project
to define its own Slack notification behavior. Implement
MCPGatewaySlackProvider with structured messages including KPI comparison
against previous MLflow runs.

- Add SlackNotificationProvider ABC and NotificationContext dataclass
- Auto-discover provider via config.yaml (notifications.slack.provider_module)
- Parameterize Slack API helpers to accept arbitrary channel_id
- MCP Gateway provider: header, metadata, KPI table, artifact links
- Reuse get_ocpci_link from core notification system for link generation

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Read duration from 000__ci_metadata/test_duration.yaml instead of
  status dict (which is the caliper export result, not test metadata)
- Parse nested test labels format {"version": "1", "labels": {...}}
- Skip CI artifact links when not in OpenShift CI (avoid broken URLs)
- Add MLflow run link from caliper export status when available
- Guard against status being non-dict in link extraction

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The jump-ci secret path is deprecated (pre-Fournos), only the
forge notifications secret path is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
…esolution

- Create projects/core/notifications/helpers.py with framework-wide
  utilities (get_test_artifacts_root, read_test_duration, get_label_value,
  find_kpis_jsonl, extract_mlflow_url, KPI table formatting)
- Use env.BASE_ARTIFACT_DIR to correctly resolve test artifacts root
  instead of the export phase's ARTIFACT_DIR
- Parse nested __test_labels__.yaml format (version/labels structure)
- Skip CI links when not in OpenShift CI, show MLflow link from status

Co-authored-by: Cursor <cursoragent@cursor.com>
In Fournos, each step runs as a separate container with its own
ARTIFACT_DIR and BASE_ARTIFACT_DIR. The correct root containing all
test artifacts is caliper.export.from, which the export entrypoint
sets to the shared /workspace/artifacts volume.

Co-authored-by: Cursor <cursoragent@cursor.com>
MLflow stores metrics from metrics.json (e.g. requests_per_second,
p95_ms, failure_rate) while kpis.jsonl uses prefixed IDs
(mcp_gw_requests_per_second, mcp_gw_p95_ms, etc). Add mapping
to translate between the two when querying previous run data.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace manual env-var/path secret lookup in provider.py with
  vault.get_vault_content_path for the Slack token.
- Replace broken OCP CI links with MLflow run URL in notifications.
- Add run name parsing and smart comparison matching: same load config,
  same version type (SHA vs release), and same preset.
- Skip duplicate notifications when version/SHA is unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ashtarkb ashtarkb force-pushed the Slack-notificaions branch from 9dad6a8 to 8ab7b36 Compare July 2, 2026 13:45
@ashtarkb

ashtarkb commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway demo
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 4b62869534684bb98ace76c235be8bde1b88c9ea

@psap-forge-bot

psap-forge-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway demo 🟢

Execution Engine Configuration

forge:
  args:
  - demo
  configOverrides:
    infrastructure.mcp_gateway_version: 4b62869534684bb98ace76c235be8bde1b88c9ea
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 1 second

01 Prepare 2 minutes, 13 seconds

02 Preflight 29 seconds

03 Test 14 minutes, 39 seconds

04 Post-Cleanup 5 minutes, 2 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

@ashtarkb ashtarkb marked this pull request as ready for review July 2, 2026 15:02
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
projects/core/notifications/helpers.py (2)

141-147: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Unguarded chained .get() can raise if a nested key is explicitly None.

If context.status["caliper_artifacts_export"] or its "backends" key is present but explicitly None (rather than omitted), backends.get("mlflow", {}) / .get("backends", {}) will raise AttributeError since .get() is called on None. This propagates out of format_message() uncaught within this function.

🛡️ Proposed fix
-    backends = context.status.get("caliper_artifacts_export", {}).get("backends", {})
-    mlflow_info = backends.get("mlflow", {})
+    backends = (context.status.get("caliper_artifacts_export") or {}).get("backends") or {}
+    mlflow_info = backends.get("mlflow") or {}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/notifications/helpers.py` around lines 141 - 147, The chained
lookups in extract_mlflow_url can fail when caliper_artifacts_export or backends
is explicitly None. Update extract_mlflow_url to guard each nested value from
context.status before calling .get(), defaulting non-dict values to {} so
mlflow_info lookup stays safe. Keep the same return behavior for run_url and
experiment_url, but make the nested access in extract_mlflow_url resilient to
None.

76-81: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Non-deterministic "first match" selection from glob results.

read_test_duration (candidates[0]), get_label_value (labels_glob[0]), and find_kpis_jsonl (for f in artifact_dir.glob(...): return f) all pick an arbitrary match when multiple candidate files exist under the artifact tree. Python's pathlib.Path.glob() returns paths in unspecified filesystem order, so with multi-step Fournos pipelines (prepare/test/export each with their own subdirs, as noted in get_test_artifacts_root's docstring), this can non-deterministically surface duration/label/KPI data from the wrong step or a stale run.

♻️ Suggested fix (deterministic ordering)
-        candidates = list(root.glob("**/000__ci_metadata/test_duration.yaml"))
+        candidates = sorted(root.glob("**/000__ci_metadata/test_duration.yaml"))
         if not candidates:
             return ""
         timing_file = candidates[0]

Apply the analogous sorted(...) fix to get_label_value and find_kpis_jsonl.

Also applies to: 104-109, 125-133

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/notifications/helpers.py` around lines 76 - 81, The helper
functions are choosing an arbitrary file when multiple matches exist, which
makes `read_test_duration`, `get_label_value`, and `find_kpis_jsonl`
non-deterministic. Update the glob-based selection in these helpers to use a
deterministic order by sorting the matches before picking the first one, and
keep the same selection pattern consistent across the `read_test_duration`,
`get_label_value`, and `find_kpis_jsonl` logic.
projects/core/notifications/provider.py (1)

114-125: 🩺 Stability & Availability | 🔵 Trivial | ⚖️ Poor tradeoff

TOCTOU risk on thread-anchor creation.

search_channel_message then send_message (create anchor) is not atomic; if two notifications for the same anchor race (e.g., concurrent CI jobs finishing near-simultaneously, combined with Slack search indexing lag), both can fail to find an existing anchor and each posts a duplicate channel header message. This same pattern is duplicated in the MCP Gateway provider override. Not blocking, but worth being aware of as usage scales.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/notifications/provider.py` around lines 114 - 125, The
thread-anchor creation flow in provider.py is subject to a TOCTOU race because
search_channel_message and then send_message are not atomic, so concurrent
notifications can create duplicate anchor messages. Update the anchor-creation
logic around slack_api.search_channel_message and slack_api.send_message to use
a safer single-creator pattern (or tolerate duplicates by rechecking/handling
conflicts after send), and apply the same fix in the MCP Gateway provider
override where this pattern is duplicated.
projects/agentic_tools/ci_base.py (1)

186-197: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider resolving the notification provider lazily, only when needed.

_resolve_notification_provider() runs in the group's main() callback for every subcommand invocation (test, prepare, export, etc.), not just when a notification will actually be sent. If provider_module is configured, this triggers a dynamic import/instantiation — and any resolution failure logs a warning — on every CI phase, even ones unrelated to notifications.

Moving this call into the export/notification path (e.g., inside caliper_export_entrypoint or send_notification) would scope the work and any failure warnings to where they're relevant.

♻️ Suggested approach
-            app.init_vaults(ctx.invoked_subcommand)
-
-            ctx.obj.notification_provider = app._resolve_notification_provider()
+            app.init_vaults(ctx.invoked_subcommand)
+
+            ctx.obj.resolve_notification_provider = app._resolve_notification_provider

Then call ctx.obj.resolve_notification_provider() from within caliper_export_entrypoint right before sending notifications.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/agentic_tools/ci_base.py` around lines 186 - 197, The notification
provider is being resolved eagerly in the group-level main() callback, which
causes unnecessary imports and warning logs for every CI subcommand. Move the
_resolve_notification_provider() call out of main() and resolve it lazily only
in the notification/export flow, such as within caliper_export_entrypoint or
send_notification, using ctx.obj.resolve_notification_provider() right before a
notification is actually sent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/notifications/provider.py`:
- Around line 91-134: The base notify() flow in Provider should honor the
post-format skip flag instead of forcing subclasses to duplicate the whole
dispatch path. After format_message(context) in notify(), check context.extra
for _skip_notification and return True early when it is set, so
MCPGatewaySlackProvider.notify() no longer needs its custom override. Keep the
existing behavior around should_notify, token/client setup, dry_run, and
slack_api.send_message intact while adding the skip hook in Provider.notify().

In `@projects/mcp_gateway/orchestration/notifications.py`:
- Line 201: The KPI comparison in notifications is not bound to the current
MLflow run, so `_load_previous_kpis_from_mlflow()` can pick the newest unrelated
run when jobs overlap. Update the call site in `notifications.py` to pass the
current run identity from `NotificationContext` or artifact metadata, and change
`_load_previous_kpis_from_mlflow` to locate that exact run first before looking
for older matching runs. Keep the duplicate-suppression and comparison logic
tied to the resolved run so it uses the correct version/SHA.
- Around line 120-136: The duplicate-run suppression in notify() is checked too
late because get_slack_token() runs before format_message() sets
context.extra["_skip_notification"]. Move the duplicate-skip path earlier in
notify() so format_message() (and the _skip_notification check) happens before
any Slack credential or channel lookup, and keep should_notify() as the first
gate in this flow.
- Around line 76-86: The version parsing in the notification path returns a dict
even when a malformed vsha-* or versioned name cannot be parsed, which conflicts
with the documented behavior. Update the parsing logic in the function that
builds the {"config", "is_sha", "version"} result so that if the expected regex
match fails for either vsha-... or the non-SHA version format, it returns None
instead of a dict with version=None. Keep the valid-path behavior unchanged for
successfully parsed names.

---

Nitpick comments:
In `@projects/agentic_tools/ci_base.py`:
- Around line 186-197: The notification provider is being resolved eagerly in
the group-level main() callback, which causes unnecessary imports and warning
logs for every CI subcommand. Move the _resolve_notification_provider() call out
of main() and resolve it lazily only in the notification/export flow, such as
within caliper_export_entrypoint or send_notification, using
ctx.obj.resolve_notification_provider() right before a notification is actually
sent.

In `@projects/core/notifications/helpers.py`:
- Around line 141-147: The chained lookups in extract_mlflow_url can fail when
caliper_artifacts_export or backends is explicitly None. Update
extract_mlflow_url to guard each nested value from context.status before calling
.get(), defaulting non-dict values to {} so mlflow_info lookup stays safe. Keep
the same return behavior for run_url and experiment_url, but make the nested
access in extract_mlflow_url resilient to None.
- Around line 76-81: The helper functions are choosing an arbitrary file when
multiple matches exist, which makes `read_test_duration`, `get_label_value`, and
`find_kpis_jsonl` non-deterministic. Update the glob-based selection in these
helpers to use a deterministic order by sorting the matches before picking the
first one, and keep the same selection pattern consistent across the
`read_test_duration`, `get_label_value`, and `find_kpis_jsonl` logic.

In `@projects/core/notifications/provider.py`:
- Around line 114-125: The thread-anchor creation flow in provider.py is subject
to a TOCTOU race because search_channel_message and then send_message are not
atomic, so concurrent notifications can create duplicate anchor messages. Update
the anchor-creation logic around slack_api.search_channel_message and
slack_api.send_message to use a safer single-creator pattern (or tolerate
duplicates by rechecking/handling conflicts after send), and apply the same fix
in the MCP Gateway provider override where this pattern is duplicated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 50cbac82-fc19-4d82-bb22-3260e5474e0a

📥 Commits

Reviewing files that changed from the base of the PR and between 0ab215a and 8ab7b36.

📒 Files selected for processing (7)
  • projects/agentic_tools/ci_base.py
  • projects/core/library/export.py
  • projects/core/notifications/helpers.py
  • projects/core/notifications/provider.py
  • projects/core/notifications/slack/api.py
  • projects/mcp_gateway/orchestration/config.yaml
  • projects/mcp_gateway/orchestration/notifications.py

Comment on lines +91 to +134
def notify(self, context: NotificationContext, *, dry_run: bool = False) -> bool:
"""Execute the full notification flow.

Returns True on success, False on failure.
"""
if not self.should_notify(context):
logger.info("Provider %s: should_notify returned False, skipping", type(self).__name__)
return True

token = self.get_slack_token()
if not token:
logger.warning("Provider %s: no Slack token available", type(self).__name__)
return False

channel_id = self.get_channel_id()
message = self.format_message(context)
anchor = self.get_thread_anchor(context)

client = slack_api.init_client(token)
if not client:
logger.error("Provider %s: failed to init Slack client", type(self).__name__)
return False

channel_msg_ts, _ = slack_api.search_channel_message(client, anchor, channel_id=channel_id)

if not channel_msg_ts:
channel_message = f"🧵 {anchor}"
if dry_run:
logger.info("Would post channel message: %s", channel_message)
else:
channel_msg_ts, ok = slack_api.send_message(
client, message=channel_message, channel_id=channel_id
)
if not ok:
return False

if dry_run:
logger.info("Would post thread message:\n%s", message)
return True

_, ok = slack_api.send_message(
client, message=message, main_ts=channel_msg_ts, channel_id=channel_id
)
return ok

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Base notify() doesn't support post-format skip, forcing full duplication downstream.

MCPGatewaySlackProvider.notify() (see projects/mcp_gateway/orchestration/notifications.py) overrides this entire ~40-line method just to add one check — if context.extra.get("_skip_notification"): return True — after format_message(). Since format_message() is where _skip_notification gets set (via _format_kpi_table), the base class should support this hook natively instead of forcing subclasses to copy the whole dispatch flow.

♻️ Proposed fix
         channel_id = self.get_channel_id()
         message = self.format_message(context)
+        if context.extra.get("_skip_notification"):
+            logger.info("Provider %s: skip_notification flag set, skipping", type(self).__name__)
+            return True
         anchor = self.get_thread_anchor(context)

Downstream MCPGatewaySlackProvider.notify() could then be removed entirely.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def notify(self, context: NotificationContext, *, dry_run: bool = False) -> bool:
"""Execute the full notification flow.
Returns True on success, False on failure.
"""
if not self.should_notify(context):
logger.info("Provider %s: should_notify returned False, skipping", type(self).__name__)
return True
token = self.get_slack_token()
if not token:
logger.warning("Provider %s: no Slack token available", type(self).__name__)
return False
channel_id = self.get_channel_id()
message = self.format_message(context)
anchor = self.get_thread_anchor(context)
client = slack_api.init_client(token)
if not client:
logger.error("Provider %s: failed to init Slack client", type(self).__name__)
return False
channel_msg_ts, _ = slack_api.search_channel_message(client, anchor, channel_id=channel_id)
if not channel_msg_ts:
channel_message = f"🧵 {anchor}"
if dry_run:
logger.info("Would post channel message: %s", channel_message)
else:
channel_msg_ts, ok = slack_api.send_message(
client, message=channel_message, channel_id=channel_id
)
if not ok:
return False
if dry_run:
logger.info("Would post thread message:\n%s", message)
return True
_, ok = slack_api.send_message(
client, message=message, main_ts=channel_msg_ts, channel_id=channel_id
)
return ok
def notify(self, context: NotificationContext, *, dry_run: bool = False) -> bool:
"""Execute the full notification flow.
Returns True on success, False on failure.
"""
if not self.should_notify(context):
logger.info("Provider %s: should_notify returned False, skipping", type(self).__name__)
return True
token = self.get_slack_token()
if not token:
logger.warning("Provider %s: no Slack token available", type(self).__name__)
return False
channel_id = self.get_channel_id()
message = self.format_message(context)
if context.extra.get("_skip_notification"):
logger.info("Provider %s: skip_notification flag set, skipping", type(self).__name__)
return True
anchor = self.get_thread_anchor(context)
client = slack_api.init_client(token)
if not client:
logger.error("Provider %s: failed to init Slack client", type(self).__name__)
return False
channel_msg_ts, _ = slack_api.search_channel_message(client, anchor, channel_id=channel_id)
if not channel_msg_ts:
channel_message = f"🧵 {anchor}"
if dry_run:
logger.info("Would post channel message: %s", channel_message)
else:
channel_msg_ts, ok = slack_api.send_message(
client, message=channel_message, channel_id=channel_id
)
if not ok:
return False
if dry_run:
logger.info("Would post thread message:\n%s", message)
return True
_, ok = slack_api.send_message(
client, message=message, main_ts=channel_msg_ts, channel_id=channel_id
)
return ok
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/notifications/provider.py` around lines 91 - 134, The base
notify() flow in Provider should honor the post-format skip flag instead of
forcing subclasses to duplicate the whole dispatch path. After
format_message(context) in notify(), check context.extra for _skip_notification
and return True early when it is set, so MCPGatewaySlackProvider.notify() no
longer needs its custom override. Keep the existing behavior around
should_notify, token/client setup, dry_run, and slack_api.send_message intact
while adding the skip hook in Provider.notify().

Comment on lines +76 to +86
if rest.startswith("vsha-"):
is_sha = True
sha_match = re.match(r"vsha-([a-f0-9]+)-\d{8}-\d{6}$", rest)
if sha_match:
version = sha_match.group(1)
else:
version_match = re.match(r"v?(.+?)-\d{8}-\d{6}$", rest)
if version_match:
version = version_match.group(1)

return {"config": config, "is_sha": is_sha, "version": version}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Return None when the version cannot be parsed.

The docstring says unparseable names return None, but malformed vsha-... names currently return a dict with version=None. That can later make two malformed runs compare equal and suppress a notification.

Proposed fix
     else:
         version_match = re.match(r"v?(.+?)-\d{8}-\d{6}$", rest)
         if version_match:
             version = version_match.group(1)
 
+    if version is None:
+        return None
+
     return {"config": config, "is_sha": is_sha, "version": version}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if rest.startswith("vsha-"):
is_sha = True
sha_match = re.match(r"vsha-([a-f0-9]+)-\d{8}-\d{6}$", rest)
if sha_match:
version = sha_match.group(1)
else:
version_match = re.match(r"v?(.+?)-\d{8}-\d{6}$", rest)
if version_match:
version = version_match.group(1)
return {"config": config, "is_sha": is_sha, "version": version}
if rest.startswith("vsha-"):
is_sha = True
sha_match = re.match(r"vsha-([a-f0-9]+)-\d{8}-\d{6}$", rest)
if sha_match:
version = sha_match.group(1)
else:
version_match = re.match(r"v?(.+?)-\d{8}-\d{6}$", rest)
if version_match:
version = version_match.group(1)
if version is None:
return None
return {"config": config, "is_sha": is_sha, "version": version}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/mcp_gateway/orchestration/notifications.py` around lines 76 - 86,
The version parsing in the notification path returns a dict even when a
malformed vsha-* or versioned name cannot be parsed, which conflicts with the
documented behavior. Update the parsing logic in the function that builds the
{"config", "is_sha", "version"} result so that if the expected regex match fails
for either vsha-... or the non-SHA version format, it returns None instead of a
dict with version=None. Keep the valid-path behavior unchanged for successfully
parsed names.

Comment on lines +120 to +136
def notify(self, context: NotificationContext, *, dry_run: bool = False) -> bool:
"""Override to skip notification for duplicate version/SHA runs."""
if not self.should_notify(context):
logger.info("Provider %s: should_notify returned False, skipping", type(self).__name__)
return True

token = self.get_slack_token()
if not token:
logger.warning("Provider %s: no Slack token available", type(self).__name__)
return False

channel_id = self.get_channel_id()
message = self.format_message(context)

if context.extra.get("_skip_notification"):
logger.info("Skipping notification: duplicate version/SHA already notified")
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Evaluate duplicate suppression before requiring Slack credentials.

format_message() is what sets context.extra["_skip_notification"], but the current order returns False on a missing Slack token before that skip can be detected. Duplicate runs should be skipped without needing Slack token/channel resolution.

Proposed fix
         if not self.should_notify(context):
             logger.info("Provider %s: should_notify returned False, skipping", type(self).__name__)
             return True
 
+        message = self.format_message(context)
+
+        if context.extra.get("_skip_notification"):
+            logger.info("Skipping notification: duplicate version/SHA already notified")
+            return True
+
         token = self.get_slack_token()
         if not token:
             logger.warning("Provider %s: no Slack token available", type(self).__name__)
             return False
 
         channel_id = self.get_channel_id()
-        message = self.format_message(context)
-
-        if context.extra.get("_skip_notification"):
-            logger.info("Skipping notification: duplicate version/SHA already notified")
-            return True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def notify(self, context: NotificationContext, *, dry_run: bool = False) -> bool:
"""Override to skip notification for duplicate version/SHA runs."""
if not self.should_notify(context):
logger.info("Provider %s: should_notify returned False, skipping", type(self).__name__)
return True
token = self.get_slack_token()
if not token:
logger.warning("Provider %s: no Slack token available", type(self).__name__)
return False
channel_id = self.get_channel_id()
message = self.format_message(context)
if context.extra.get("_skip_notification"):
logger.info("Skipping notification: duplicate version/SHA already notified")
return True
def notify(self, context: NotificationContext, *, dry_run: bool = False) -> bool:
"""Override to skip notification for duplicate version/SHA runs."""
if not self.should_notify(context):
logger.info("Provider %s: should_notify returned False, skipping", type(self).__name__)
return True
message = self.format_message(context)
if context.extra.get("_skip_notification"):
logger.info("Skipping notification: duplicate version/SHA already notified")
return True
token = self.get_slack_token()
if not token:
logger.warning("Provider %s: no Slack token available", type(self).__name__)
return False
channel_id = self.get_channel_id()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/mcp_gateway/orchestration/notifications.py` around lines 120 - 136,
The duplicate-run suppression in notify() is checked too late because
get_slack_token() runs before format_message() sets
context.extra["_skip_notification"]. Move the duplicate-skip path earlier in
notify() so format_message() (and the _skip_notification check) happens before
any Slack credential or channel lookup, and keep should_notify() as the first
gate in this flow.

if not current_kpis:
return ""

previous_kpis, previous_run_name, skip = _load_previous_kpis_from_mlflow()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Bind the comparison to this notification’s MLflow run.

_load_previous_kpis_from_mlflow() has no context, so it treats the newest MLflow run as “current”. Parallel or delayed jobs can make runs[0] belong to another run, causing the KPI comparison and duplicate suppression to target the wrong version/SHA. Pass the current run identity from NotificationContext/artifact metadata and find that run explicitly before scanning older matches.

Also applies to: 276-340

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/mcp_gateway/orchestration/notifications.py` at line 201, The KPI
comparison in notifications is not bound to the current MLflow run, so
`_load_previous_kpis_from_mlflow()` can pick the newest unrelated run when jobs
overlap. Update the call site in `notifications.py` to pass the current run
identity from `NotificationContext` or artifact metadata, and change
`_load_previous_kpis_from_mlflow` to locate that exact run first before looking
for older matching runs. Keep the duplicate-suppression and comparison logic
tied to the resolved run so it uses the correct version/SHA.

- Add _skip_notification check in base SlackNotificationProvider.notify()
  so subclasses don't need to duplicate the entire dispatch flow.
- Remove notify() override from MCPGatewaySlackProvider.
- Accept current_run_id in _load_previous_kpis_from_mlflow() and locate
  the current run by ID to avoid races with parallel/delayed jobs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ashtarkb

ashtarkb commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos mcp_gateway demo
/cluster agentic-cpt-8xa100
/pipeline forge-full
/version 4b62869534684bb98ace76c235be8bde1b88c9ea

@psap-forge-bot

psap-forge-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

🟢 Execution of mcp_gateway demo 🟢

Execution Engine Configuration

forge:
  args:
  - demo
  configOverrides:
    infrastructure.mcp_gateway_version: 4b62869534684bb98ace76c235be8bde1b88c9ea
  project: mcp_gateway

Artifact Links

Test Logs

00 Pre-Cleanup 10 seconds

01 Prepare 2 minutes, 26 seconds

02 Preflight 35 seconds

03 Test 14 minutes, 42 seconds

04 Post-Cleanup 4 minutes, 33 seconds

🔄 05 Export-Artifacts

@psap-forge-bot

psap-forge-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

Comment on lines +150 to +153
# ---------------------------------------------------------------------------
# KPI comparison table formatting
# ---------------------------------------------------------------------------

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in another PR, this will have to move to the caliper project (as helper functions just like that)

Comment on lines +120 to +133
# ---------------------------------------------------------------------------
# KPI file discovery
# ---------------------------------------------------------------------------


def find_kpis_jsonl(artifact_dir: Path) -> Path | None:
"""Find kpis.jsonl in artifact directory tree."""
direct = artifact_dir / "kpis.jsonl"
if direct.exists():
return direct

for f in artifact_dir.glob("**/kpis.jsonl"):
return f
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, will have to move to caliper

Comment on lines +34 to +39
TARGET_KPIS = [
("mcp_gw_requests_per_second", "RPS", "req/s"),
("mcp_gw_p95_ms", "P95 latency", "ms"),
("mcp_gw_p99_ms", "P99 latency", "ms"),
("mcp_gw_failure_rate", "Failure rate", "%"),
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this file will break when I merge the PR refining the KPI structure

@kpouget

kpouget commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

thanks Avi, let's merge this. There will be some work to do:

  • when I merge the KPI refine PR
  • when we'll want to integrate the notifications for other projects
    but we'll do that in due time :)

@kpouget

kpouget commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@kpouget kpouget merged commit 138b14f into openshift-psap:main Jul 3, 2026
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants