fix(db): unify state history tables by osu · Pull Request #2957 · NVIDIA/infra-controller

osu · 2026-06-28T18:40:04Z

Description

Normalize all eight state-history tables around a common five-column layout with an object_id TEXT resource key.

This change:

makes timestamps mandatory and backfills existing NULL values;
removes cascading resource foreign keys so history survives deletion;
updates shared accessors, direct joins, retention triggers, indexes, and machine cleanup;
retains power-shelf, rack, switch, DPA-interface, and machine-cleanup history;
adds schema, 250-record retention, deletion-retention, and DPA history-query coverage.

Related issues

Closes #1136

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Completed validation:

cargo test --locked -p carbide-api-db -- --test-threads=1 — 217 passed; doc tests 3 passed, 1 ignored.
cargo test --locked -p carbide-api-db dpa_interface::test::deleting_ -- --nocapture --test-threads=1 — 2 passed.
Linux: cargo test --locked -p carbide-api-core test_force_delete_ -- --nocapture --test-threads=1 — 9 passed.
Linux: machine_history filter — 2 passed.
Linux: exact VPC-prefix renamed-column test — 1 passed.
Linux: exact network-segment retention test — 1 passed.
Linux: cargo clippy --locked -p carbide-api-db -p carbide-api-core --all-targets --all-features — passed.
Pinned nightly cargo fmt --all -- --check — passed.
Populated-upgrade rehearsal from origin/main: seeded all eight legacy tables, including NULL power-shelf/switch timestamps, then applied this migration. All eight rows were preserved; resource IDs became TEXT; timestamps were non-null; every table had the common five-column layout and zero foreign keys.

Additional Notes

The migration renames internal history columns in place. As with existing direct-rename migrations in this repository, old application binaries are not compatible with the post-migration column names during the rollout window.

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai · 2026-06-28T18:40:21Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5466c580-c592-4b4a-9c5e-86f89845948e

📥 Commits

Reviewing files that changed from the base of the PR and between 4717b8c and 8dfbe81.

📒 Files selected for processing (1)

crates/api-core/tests/integration/rack_find.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/api-core/tests/integration/rack_find.rs

Summary by CodeRabbit

Bug Fixes
- Hard deletes for racks, switches, and power shelves now keep their state history instead of removing it.
- History lookups and snapshots now consistently return the expected records after deletions.
- Cleanup flows now preserve relevant state-history entries while removing the associated resource.

Walkthrough

The PR unifies state-history tables around object_id, updates Rust query/build helpers and SQL joins to use that column, preserves history during force-delete flows, and expands tests to verify retention and the revised SQL shapes.

Changes

State History Unification

Layer / File(s)	Summary
DB migration: schema unification and retention triggers `crates/api-db/migrations/20260628120000_unify_state_history_tables.sql`	Drops cascading foreign keys, normalizes `timestamp` to `NOT NULL`, renames entity-specific IDs to `object_id`, standardizes indexes, recreates per-object retention triggers, and adds `cleanup_machine_by_id` without deleting state history.
state_history Rust API: unified object_id queries `crates/api-db/src/state_history.rs`	Removes per-table object-id helpers and `delete_by_object_id`, rewrites lookup/persist/update paths to use fixed `object_id` bindings, and adds integration tests for schema, persistence, retention, and concurrency.
History joins updated to object_id across entities `crates/api-db/src/dpa_interface.rs`, `crates/api-db/src/network_segment.rs`, `crates/api-db/src/sql/machine_snapshot_history_join.snippet`, `crates/api-db/src/sql/managed_host_history_join.snippet`	History aggregation queries switch to `object_id`; DPA interface deletion no longer pre-removes state-history rows, and DPA interface retention tests are added.
Force-delete handlers: history retention `crates/api-core/src/handlers/power_shelf.rs`, `crates/api-core/src/handlers/rack.rs`, `crates/api-core/src/handlers/switch.rs`	Force-delete handlers stop deleting state-history records and update doc comments to describe retained history.
Test updates: retention assertions and object_id SQL `crates/api-core/src/tests/power_shelf.rs`, `crates/api-core/src/tests/switch.rs`, `crates/api-core/tests/integration/rack_find.rs`, `crates/api-core/src/tests/machine_history.rs`, `crates/api-core/src/tests/vpc_prefix.rs`	Force-delete tests now seed and verify retained history, while machine history and VPC prefix tests use `object_id`-based SQL.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the main database normalization change.
Description check	✅ Passed	The description directly matches the changeset and the stated schema and history-retention goals.
Linked Issues check	✅ Passed	The PR satisfies `#1136` by standardizing object_id, enforcing timestamps, removing cascading FKs, and preserving 250-row retention.
Out of Scope Changes check	✅ Passed	No clear unrelated code changes are present; the cleanup and query updates support the state-history unification work.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-db/migrations/20260628120000_unify_state_history_tables.sql`:
- Around line 80-158: The retention triggers in
machine_state_history_keep_limit, network_segment_state_history_keep_limit,
vpc_prefix_state_history_keep_limit, dpa_interface_state_history_keep_limit,
ib_partition_state_history_keep_limit, power_shelf_state_history_keep_limit,
rack_state_history_keep_limit, and switch_state_history_keep_limit can race
under concurrent inserts for the same object_id. Add per-object serialization
inside each trigger body before the DELETE/overflow calculation, using a lock or
equivalent mechanism keyed by NEW.object_id, so only one transaction performs
retention cleanup for a given object at a time. Keep the existing 250-row cap
logic, but ensure it runs after the lock is acquired.

In `@crates/api-db/src/dpa_interface.rs`:
- Around line 309-310: The history JSON built in the DPA interface query is not
ordered deterministically, so align the aggregate in the history_agg subquery
with state_history::for_object by preserving a stable sort before json_agg.
Update the query around dpa_interface_state_history and the
json_agg/json_build_object expression so include_history=true returns history
entries in the same id ASC order every time.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d93c359a-59de-4c3f-a8d7-095f96b2f97d

📥 Commits

Reviewing files that changed from the base of the PR and between d3395d8 and 9fd867f.

📒 Files selected for processing (14)

crates/api-core/src/handlers/power_shelf.rs
crates/api-core/src/handlers/rack.rs
crates/api-core/src/handlers/switch.rs
crates/api-core/src/tests/machine_history.rs
crates/api-core/src/tests/power_shelf.rs
crates/api-core/src/tests/rack_find.rs
crates/api-core/src/tests/switch.rs
crates/api-core/src/tests/vpc_prefix.rs
crates/api-db/migrations/20260628120000_unify_state_history_tables.sql
crates/api-db/src/dpa_interface.rs
crates/api-db/src/network_segment.rs
crates/api-db/src/sql/machine_snapshot_history_join.snippet
crates/api-db/src/sql/managed_host_history_join.snippet
crates/api-db/src/state_history.rs

Signed-off-by: Hasan Khan <hasank@nvidia.com>

osu · 2026-06-28T19:19:36Z

@coderabbitai review

coderabbitai · 2026-06-28T19:19:42Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

🧹 Nitpick comments (1)

crates/api-db/src/state_history.rs (1)

236-288: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Join the spawned writer even when the wait assertion fails.

timeout(...)? can return before second_insert.await, which drops the JoinHandle and detaches the DB task on failure. Capture the wait result, release the first transaction, await the spawned writer, then propagate the original error.

Proposed cleanup ordering

-        tokio::time::timeout(std::time::Duration::from_secs(5), async {
+        let wait_result = tokio::time::timeout(std::time::Duration::from_secs(5), async {
             loop {
                 let waiting: bool = sqlx::query_scalar(
@@
-        })
-        .await
-        .map_err(|_| {
+        })
+        .await
+        .map_err(|_| {
             std::io::Error::other(format!(
                 "second writer did not wait for {table_name} retention lock",
             ))
-        })??;
+        });
         drop(observer);
 
         first_txn.commit().await?;
-        second_insert.await?.map_err(std::io::Error::other)?;
+        let second_result = second_insert.await?.map_err(std::io::Error::other);
+        wait_result??;
+        second_result?;

As per coding guidelines, avoid spawning background tasks without joining them; as per path instructions, prioritize concurrency and resource-lifetime findings.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-db/src/state_history.rs` around lines 236 - 288, The spawned
writer task in the state history test is being detached when the wait assertion
times out, because `timeout(...)?` returns before `second_insert.await` and the
`JoinHandle` is dropped. Update the flow around `second_insert`, `pid_receiver`,
and `first_txn.commit` so you capture the wait result first, always
release/commit the first transaction, then await `second_insert` before
returning or propagating the original error. Keep the original timeout/assertion
error, but ensure the spawned DB task is joined on both success and failure
paths.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@crates/api-db/src/state_history.rs`:
- Around line 236-288: The spawned writer task in the state history test is
being detached when the wait assertion times out, because `timeout(...)?`
returns before `second_insert.await` and the `JoinHandle` is dropped. Update the
flow around `second_insert`, `pid_receiver`, and `first_txn.commit` so you
capture the wait result first, always release/commit the first transaction, then
await `second_insert` before returning or propagating the original error. Keep
the original timeout/assertion error, but ensure the spawned DB task is joined
on both success and failure paths.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 14ac5f81-b88e-419a-8f42-4d898de2967e

📥 Commits

Reviewing files that changed from the base of the PR and between 9fd867f and c46d99e.

📒 Files selected for processing (3)

crates/api-db/migrations/20260628120000_unify_state_history_tables.sql
crates/api-db/src/dpa_interface.rs
crates/api-db/src/state_history.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/api-db/src/dpa_interface.rs
crates/api-db/migrations/20260628120000_unify_state_history_tables.sql

osu · 2026-06-28T19:26:05Z

@coderabbitai review

coderabbitai · 2026-06-28T19:26:11Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Signed-off-by: Hasan Khan <hasank@nvidia.com>

github-actions · 2026-06-28T20:03:22Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	302	6	30	114	8	144
machine-validation-runner	765	30	194	283	37	221
machine_validation	765	30	194	283	37	221
machine_validation-aarch64	765	30	194	283	37	221
nvmetal-carbide	765	30	194	283	37	221
TOTAL	3368	126	806	1252	156	1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

ajf · 2026-06-29T20:09:34Z

@kensimon can you review this please.

Signed-off-by: Hasan Khan <hasank@nvidia.com>

github-actions · 2026-06-30T18:41:17Z

🌿 Preview your docs: https://nvidia-preview-pull-request-2957.docs.buildwithfern.com/infra-controller

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/api-core/tests/integration/rack_find.rs (1)

172-205: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Cover history retention in the already-soft-deleted branch.

This test exercises a distinct admin_force_delete_rack path, but it never verifies the PR’s retention contract there. Seed a rack state-history row before delete_rack() and assert it is still returned after the hard delete, otherwise a branch-specific regression can slip through unnoticed.

💡 Suggested test extension

 async fn test_force_delete_rack_already_soft_deleted(
     pool: PgPool,
 ) -> Result<(), Box<dyn std::error::Error>> {
     let env = TestHarness::builder(pool).build().await;
     let TestRack { id: rack_id } = env.create_rack().await;
 
+    let mut txn = env.pool.begin().await?;
+    db::state_history::persist(
+        &mut txn,
+        db::state_history::StateHistoryTableId::Rack,
+        &rack_id,
+        &"retained-through-soft-delete",
+        config_version::ConfigVersion::initial(),
+    )
+    .await?;
+    txn.commit().await?;
+
     env.api()
         .delete_rack(tonic::Request::new(DeleteRackRequest {
             id: rack_id.to_string(),
         }))
         .await?;
@@
     assert!(racks.is_empty(), "Rack should be hard-deleted");
+
+    let mut conn = env.pool.acquire().await?;
+    let history = db::state_history::for_object(
+        &mut conn,
+        db::state_history::StateHistoryTableId::Rack,
+        &rack_id,
+    )
+    .await?;
+    assert!(
+        history
+            .iter()
+            .any(|record| record.state == r#""retained-through-soft-delete""#),
+        "Rack state history should be retained",
+    );
 
     Ok(())
 }

As per coding guidelines, "Add or update focused tests for bug fixes, shared behavior, API contracts, migrations, and cross-module changes."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/tests/integration/rack_find.rs` around lines 172 - 205, The
`test_force_delete_rack_already_soft_deleted` integration test covers
`admin_force_delete_rack` but does not verify retention of rack history in this
branch. Before calling `delete_rack()`, seed a rack state-history record for the
created rack, then after the force delete assert that the history is still
retrievable while the rack itself is gone. Use the existing `TestHarness`,
`delete_rack`, and `admin_force_delete_rack` flow in this test to add the
branch-specific retention check.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/api-core/tests/integration/rack_find.rs`:
- Around line 172-205: The `test_force_delete_rack_already_soft_deleted`
integration test covers `admin_force_delete_rack` but does not verify retention
of rack history in this branch. Before calling `delete_rack()`, seed a rack
state-history record for the created rack, then after the force delete assert
that the history is still retrievable while the rack itself is gone. Use the
existing `TestHarness`, `delete_rack`, and `admin_force_delete_rack` flow in
this test to add the branch-specific retention check.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bc0639f2-9924-4714-9f1f-b772c3615eee

📥 Commits

Reviewing files that changed from the base of the PR and between 219e815 and 4717b8c.

📒 Files selected for processing (3)

crates/api-core/src/tests/vpc_prefix.rs
crates/api-core/tests/integration/rack_find.rs
crates/api-db/src/dpa_interface.rs

🚧 Files skipped from review as they are similar to previous changes (2)

crates/api-core/src/tests/vpc_prefix.rs
crates/api-db/src/dpa_interface.rs

copy-pr-bot · 2026-06-30T20:17:08Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Hasan Khan <hasank@nvidia.com>

osu · 2026-07-01T00:35:44Z

@coderabbitai review

coderabbitai · 2026-07-01T00:35:49Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

osu · 2026-07-01T01:25:48Z

/ok to test 8dfbe81

fix(db): unify state history tables

9fd867f

Signed-off-by: Hasan Khan <hasank@nvidia.com>

osu requested a review from a team as a code owner June 28, 2026 18:40

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

Comment thread crates/api-db/migrations/20260628120000_unify_state_history_tables.sql

Comment thread crates/api-db/src/dpa_interface.rs Outdated

fix(db): serialize state history retention

c46d99e

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

test(db): clean up retention concurrency test

219e815

Signed-off-by: Hasan Khan <hasank@nvidia.com>

osu self-assigned this Jun 28, 2026

ajf requested a review from kensimon June 29, 2026 20:09

Merge branch 'main' into fix/1136-unify-state-history-tables

4717b8c

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai Bot reviewed Jun 30, 2026

View reviewed changes

osu marked this pull request as draft June 30, 2026 20:17

test(api-core): fix rack history regression setup

8dfbe81

Signed-off-by: Hasan Khan <hasank@nvidia.com>

Uh oh!

Conversation

osu commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues

Type of Change

Breaking Changes

Testing

Additional Notes

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Summary by CodeRabbit

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

osu commented Jun 28, 2026

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

osu commented Jun 28, 2026

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔍 Container Scan Summary

Uh oh!

ajf commented Jun 29, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

copy-pr-bot Bot commented Jun 30, 2026

Uh oh!

osu commented Jul 1, 2026

Uh oh!

coderabbitai Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

osu commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

osu commented Jun 28, 2026 •

edited

Loading

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

github-actions Bot commented Jun 28, 2026 •

edited

Loading

coderabbitai Bot commented Jul 1, 2026 •

edited

Loading