[#11194] feat(maintenance): Add builtin-iceberg-expire-snapshots job #11206

Open
laserninja wants to merge 1 commit into apache:main from laserninja:feat/11194-iceberg-expire-snapshots
@laserninja
Collaborator

What changes were proposed in this pull request?

Add a new built-in Iceberg maintenance job builtin-iceberg-expire-snapshots that expires old snapshots from Iceberg tables via Spark's expire_snapshots procedure.

Changes:

  • New IcebergExpireSnapshotsJob class in maintenance/jobs following the same pattern as IcebergRewriteDataFilesJob
  • Supports configurable parameters: older_than (timestamp), retain_last (number of snapshots to keep), stream_results (boolean)
  • SQL injection protection via escapeSqlString() and escapeSqlIdentifier()
  • Input validation for retain_last (must be a positive integer) and stream_results (must be true or false)
  • Registered in BuiltInJobTemplateProvider
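
The call-building, escaping, and validation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name, the method signatures, and the bodies of escapeSqlString() and escapeSqlIdentifier() are assumptions (the PR only names those helpers), and the quote-doubling rule used here is a placeholder, since the escaping that is actually safe depends on how the Spark SQL parser is configured. The named arguments (table, older_than, retain_last, stream_results) follow Iceberg's Spark expire_snapshots procedure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; names and escaping rules are assumptions, not the PR's code.
final class ExpireSnapshotsCallSketch {

  // Assumed stand-in for escapeSqlString(): doubles single quotes in a literal.
  static String escapeSqlString(String value) {
    return value.replace("'", "''");
  }

  // Assumed stand-in for escapeSqlIdentifier(): backtick-quotes the identifier
  // and doubles any embedded backtick.
  static String escapeSqlIdentifier(String identifier) {
    return "`" + identifier.replace("`", "``") + "`";
  }

  // retain_last must parse as an integer greater than zero.
  static int parseRetainLast(String raw) {
    int value;
    try {
      value = Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("retain_last must be a positive integer: " + raw, e);
    }
    if (value <= 0) {
      throw new IllegalArgumentException("retain_last must be a positive integer: " + raw);
    }
    return value;
  }

  // stream_results accepts only "true" or "false" (case-insensitive here, by assumption).
  static boolean parseStreamResults(String raw) {
    String v = raw.trim().toLowerCase(Locale.ROOT);
    if (v.equals("true")) {
      return true;
    }
    if (v.equals("false")) {
      return false;
    }
    throw new IllegalArgumentException("stream_results must be true or false: " + raw);
  }

  private static boolean present(String s) {
    return s != null && !s.trim().isEmpty();
  }

  // Builds the CALL statement; optional parameters are skipped when null or empty.
  static String buildCall(
      String catalog, String table, String olderThan, String retainLast, String streamResults) {
    List<String> args = new ArrayList<>();
    args.add("table => '" + escapeSqlString(table) + "'");
    if (present(olderThan)) {
      args.add("older_than => TIMESTAMP '" + escapeSqlString(olderThan) + "'");
    }
    if (present(retainLast)) {
      args.add("retain_last => " + parseRetainLast(retainLast));
    }
    if (present(streamResults)) {
      args.add("stream_results => " + parseStreamResults(streamResults));
    }
    return "CALL " + escapeSqlIdentifier(catalog) + ".system.expire_snapshots("
        + String.join(", ", args) + ")";
  }

  public static void main(String[] unused) {
    System.out.println(buildCall("my_catalog", "db.events", "2024-01-01 00:00:00", "5", "true"));
  }
}
```

Validating retain_last and stream_results before they are concatenated means only a parsed int or boolean ever reaches the SQL text, so those two parameters need no string escaping at all.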

Why are the changes needed?

Without periodic snapshot expiration, Iceberg table metadata grows indefinitely, accumulating snapshot JSON files and manifest lists that slow down table operations and waste storage. The existing built-in jobs (builtin-iceberg-rewrite-data-files and builtin-iceberg-update-stats) cover data compaction and metrics but do not address metadata cleanup.

This is one of the most critical Iceberg housekeeping operations. PR #10500 added Trino-side delegation for expire_snapshots as a procedure, but there is no server-side built-in job that can be triggered automatically via the Optimizer.

Fix: #11194

Note: This PR covers the job layer. The end-to-end policy/strategy integration (e.g. IcebergSnapshotExpirationContent, SnapshotExpirationStrategyHandler) can be added as a follow-up, as discussed in the issue.

Does this PR introduce any user-facing change?

No user-facing API changes. Adds a new built-in job template builtin-iceberg-expire-snapshots that will be available for maintenance job scheduling.

How was this patch tested?

  • Added TestIcebergExpireSnapshotsJob with 40 unit tests covering:
    • Job template metadata (name, comment, executable, mainClass, arguments, sparkConfigs, version)
    • Argument parsing (required, optional, empty values, missing values, all options, order independence)
    • Procedure call building (minimal, with older-than, retain-last, stream-results, all params, empty params)
    • SQL escaping and injection prevention
    • Input validation for retain-last and stream-results
    • Custom Spark config parsing (valid JSON, numeric values, empty, null, invalid JSON)
  • All 142 tests in maintenance:jobs module pass with 0 failures
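
For readers unfamiliar with what the argument-parsing cases above exercise, here is a small sketch of a `--key value` parser with the same properties (order independence, missing values, empty values). It is illustrative only: the class and method names are hypothetical, and the `--table` flag is an assumption; only the older-than, retain-last, and stream-results option names appear in this PR's description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "--key value" parsing; not the PR's implementation.
final class ArgParsingSketch {

  // Collects "--key value" pairs into a map. A flag followed by another flag
  // (missing value) or by a blank string (empty value) is left out of the map,
  // so callers can treat it as "not provided".
  static Map<String, String> parseArgs(String[] args) {
    Map<String, String> parsed = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      String token = args[i];
      if (!token.startsWith("--")) {
        continue; // stray positional token: ignored in this sketch
      }
      String key = token.substring(2);
      boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("--");
      if (hasValue) {
        String value = args[++i].trim();
        if (!value.isEmpty()) {
          parsed.put(key, value);
        }
      }
    }
    return parsed;
  }

  public static void main(String[] unused) {
    String[] sample = {"--table", "db.events", "--retain-last", "5"};
    System.out.println(parseArgs(sample));
  }
}
```

Because the result is a map keyed by option name, two invocations that list the same options in a different order produce equal results, which is the property an order-independence test would check.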

…s job

Add IcebergExpireSnapshotsJob that expires old Iceberg table snapshots
via Spark's expire_snapshots procedure. Supports configurable parameters:
older_than, retain_last, and stream_results.

Register the new job in BuiltInJobTemplateProvider.
@github-actions
Code Coverage Report

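| Scope           | Coverage         | Status |
|-----------------|------------------|--------|
| Overall Project | 72.22% (-0.52%)  | 🟢     |
| Files changed   | 60.26%           | 🟢     |

| Module        | Coverage        | Status |
|---------------|-----------------|--------|
| jobs          | 64.09% (-1.3%)  | 🟢     |
| optimizer     | 82.95%          | 🟢     |
| optimizer-api | 21.95%          | 🔴     |

Files

| Module | File                            | Coverage | Status |
|--------|---------------------------------|----------|--------|
| jobs   | BuiltInJobTemplateProvider.java | 81.82%   | 🟢     |
| jobs   | IcebergExpireSnapshotsJob.java  | 56.59%   | 🔴     |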

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add builtin-iceberg-expire-snapshots maintenance job
