fix(db): unify state history tables#2957
Conversation
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Summary by CodeRabbit
WalkthroughThe PR unifies state-history tables around ChangesState History Unification
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-db/migrations/20260628120000_unify_state_history_tables.sql`:
- Around line 80-158: The retention triggers in
machine_state_history_keep_limit, network_segment_state_history_keep_limit,
vpc_prefix_state_history_keep_limit, dpa_interface_state_history_keep_limit,
ib_partition_state_history_keep_limit, power_shelf_state_history_keep_limit,
rack_state_history_keep_limit, and switch_state_history_keep_limit can race
under concurrent inserts for the same object_id. Add per-object serialization
inside each trigger body before the DELETE/overflow calculation, using a lock or
equivalent mechanism keyed by NEW.object_id, so only one transaction performs
retention cleanup for a given object at a time. Keep the existing 250-row cap
logic, but ensure it runs after the lock is acquired.
In `@crates/api-db/src/dpa_interface.rs`:
- Around line 309-310: The history JSON built in the DPA interface query is not
ordered deterministically, so align the aggregate in the history_agg subquery
with state_history::for_object by preserving a stable sort before json_agg.
Update the query around dpa_interface_state_history and the
json_agg/json_build_object expression so include_history=true returns history
entries in the same id ASC order every time.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d93c359a-59de-4c3f-a8d7-095f96b2f97d
📒 Files selected for processing (14)
crates/api-core/src/handlers/power_shelf.rscrates/api-core/src/handlers/rack.rscrates/api-core/src/handlers/switch.rscrates/api-core/src/tests/machine_history.rscrates/api-core/src/tests/power_shelf.rscrates/api-core/src/tests/rack_find.rscrates/api-core/src/tests/switch.rscrates/api-core/src/tests/vpc_prefix.rscrates/api-db/migrations/20260628120000_unify_state_history_tables.sqlcrates/api-db/src/dpa_interface.rscrates/api-db/src/network_segment.rscrates/api-db/src/sql/machine_snapshot_history_join.snippetcrates/api-db/src/sql/managed_host_history_join.snippetcrates/api-db/src/state_history.rs
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/api-db/src/state_history.rs (1)
236-288: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winJoin the spawned writer even when the wait assertion fails.
timeout(...)?can return beforesecond_insert.await, which drops theJoinHandleand detaches the DB task on failure. Capture the wait result, release the first transaction, await the spawned writer, then propagate the original error.Proposed cleanup ordering
- tokio::time::timeout(std::time::Duration::from_secs(5), async { + let wait_result = tokio::time::timeout(std::time::Duration::from_secs(5), async { loop { let waiting: bool = sqlx::query_scalar( @@ - }) - .await - .map_err(|_| { + }) + .await + .map_err(|_| { std::io::Error::other(format!( "second writer did not wait for {table_name} retention lock", )) - })??; + }); drop(observer); first_txn.commit().await?; - second_insert.await?.map_err(std::io::Error::other)?; + let second_result = second_insert.await?.map_err(std::io::Error::other); + wait_result??; + second_result?;As per coding guidelines, avoid spawning background tasks without joining them; as per path instructions, prioritize concurrency and resource-lifetime findings.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-db/src/state_history.rs` around lines 236 - 288, The spawned writer task in the state history test is being detached when the wait assertion times out, because `timeout(...)?` returns before `second_insert.await` and the `JoinHandle` is dropped. Update the flow around `second_insert`, `pid_receiver`, and `first_txn.commit` so you capture the wait result first, always release/commit the first transaction, then await `second_insert` before returning or propagating the original error. Keep the original timeout/assertion error, but ensure the spawned DB task is joined on both success and failure paths.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/api-db/src/state_history.rs`:
- Around line 236-288: The spawned writer task in the state history test is
being detached when the wait assertion times out, because `timeout(...)?`
returns before `second_insert.await` and the `JoinHandle` is dropped. Update the
flow around `second_insert`, `pid_receiver`, and `first_txn.commit` so you
capture the wait result first, always release/commit the first transaction, then
await `second_insert` before returning or propagating the original error. Keep
the original timeout/assertion error, but ensure the spawned DB task is joined
on both success and failure paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 14ac5f81-b88e-419a-8f42-4d898de2967e
📒 Files selected for processing (3)
crates/api-db/migrations/20260628120000_unify_state_history_tables.sqlcrates/api-db/src/dpa_interface.rscrates/api-db/src/state_history.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/api-db/src/dpa_interface.rs
- crates/api-db/migrations/20260628120000_unify_state_history_tables.sql
|
@coderabbitai review |
✅ Action performedReview finished.
|
Signed-off-by: Hasan Khan <hasank@nvidia.com>
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
|
@kensimon can you review this please. |
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2957.docs.buildwithfern.com/infra-controller |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/api-core/tests/integration/rack_find.rs (1)
172-205: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winCover history retention in the already-soft-deleted branch.
This test exercises a distinct
admin_force_delete_rackpath, but it never verifies the PR’s retention contract there. Seed a rack state-history row beforedelete_rack()and assert it is still returned after the hard delete, otherwise a branch-specific regression can slip through unnoticed.💡 Suggested test extension
async fn test_force_delete_rack_already_soft_deleted( pool: PgPool, ) -> Result<(), Box<dyn std::error::Error>> { let env = TestHarness::builder(pool).build().await; let TestRack { id: rack_id } = env.create_rack().await; + let mut txn = env.pool.begin().await?; + db::state_history::persist( + &mut txn, + db::state_history::StateHistoryTableId::Rack, + &rack_id, + &"retained-through-soft-delete", + config_version::ConfigVersion::initial(), + ) + .await?; + txn.commit().await?; + env.api() .delete_rack(tonic::Request::new(DeleteRackRequest { id: rack_id.to_string(), })) .await?; @@ assert!(racks.is_empty(), "Rack should be hard-deleted"); + + let mut conn = env.pool.acquire().await?; + let history = db::state_history::for_object( + &mut conn, + db::state_history::StateHistoryTableId::Rack, + &rack_id, + ) + .await?; + assert!( + history + .iter() + .any(|record| record.state == r#""retained-through-soft-delete""#), + "Rack state history should be retained", + ); Ok(()) }As per coding guidelines, "Add or update focused tests for bug fixes, shared behavior, API contracts, migrations, and cross-module changes."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/tests/integration/rack_find.rs` around lines 172 - 205, The `test_force_delete_rack_already_soft_deleted` integration test covers `admin_force_delete_rack` but does not verify retention of rack history in this branch. Before calling `delete_rack()`, seed a rack state-history record for the created rack, then after the force delete assert that the history is still retrievable while the rack itself is gone. Use the existing `TestHarness`, `delete_rack`, and `admin_force_delete_rack` flow in this test to add the branch-specific retention check.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/api-core/tests/integration/rack_find.rs`:
- Around line 172-205: The `test_force_delete_rack_already_soft_deleted`
integration test covers `admin_force_delete_rack` but does not verify retention
of rack history in this branch. Before calling `delete_rack()`, seed a rack
state-history record for the created rack, then after the force delete assert
that the history is still retrievable while the rack itself is gone. Use the
existing `TestHarness`, `delete_rack`, and `admin_force_delete_rack` flow in
this test to add the branch-specific retention check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bc0639f2-9924-4714-9f1f-b772c3615eee
📒 Files selected for processing (3)
crates/api-core/src/tests/vpc_prefix.rscrates/api-core/tests/integration/rack_find.rscrates/api-db/src/dpa_interface.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/api-core/src/tests/vpc_prefix.rs
- crates/api-db/src/dpa_interface.rs
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
/ok to test 8dfbe81 |
Description
Normalize all eight state-history tables around a common five-column layout with an
object_id TEXTresource key.This change:
Related issues
Closes #1136
Type of Change
Breaking Changes
Testing
Completed validation:
cargo test --locked -p carbide-api-db -- --test-threads=1— 217 passed; doc tests 3 passed, 1 ignored.cargo test --locked -p carbide-api-db dpa_interface::test::deleting_ -- --nocapture --test-threads=1— 2 passed.cargo test --locked -p carbide-api-core test_force_delete_ -- --nocapture --test-threads=1— 9 passed.machine_historyfilter — 2 passed.cargo clippy --locked -p carbide-api-db -p carbide-api-core --all-targets --all-features— passed.cargo fmt --all -- --check— passed.origin/main: seeded all eight legacy tables, including NULL power-shelf/switch timestamps, then applied this migration. All eight rows were preserved; resource IDs becameTEXT; timestamps were non-null; every table had the common five-column layout and zero foreign keys.Additional Notes
The migration renames internal history columns in place. As with existing direct-rename migrations in this repository, old application binaries are not compatible with the post-migration column names during the rollout window.