Skip to content

SOLR-17949: Add Azure Blob Storage backup repository module#3750

Open
prateeksinghalgit wants to merge 17 commits into
apache:mainfrom
prateeksinghalgit:SOLR-17949-azure-blob-repository
Open

SOLR-17949: Add Azure Blob Storage backup repository module#3750
prateeksinghalgit wants to merge 17 commits into
apache:mainfrom
prateeksinghalgit:SOLR-17949-azure-blob-repository

Conversation

@prateeksinghalgit

@prateeksinghalgit prateeksinghalgit commented Oct 9, 2025

Copy link
Copy Markdown

Implements Azure Blob Storage backup repository as discussed in SOLR-17949.

Description

Adds a new BackupRepository implementation for Azure Blob Storage, enabling Solr collections to be backed up to and restored from Azure. The module follows the structure and patterns of the existing S3 and GCS repository modules.

Highlights

  • Full and incremental backup/restore, including the Install Shard API.
  • Four authentication methods: connection string, account key, SAS token, and Azure Identity (Managed Identity / Service Principal / Azure CLI).
  • Checksum verification on index-file copy, consistent with the S3/GCS modules.
  • Uses azure-core-http-okhttp instead of the default Netty transport, for Security Manager compatibility and a smaller dependency footprint.
  • Secrets can be supplied via system properties / environment variables instead of solr.xml (documented in the Reference Guide).

Main classes

  • AzureBlobBackupRepository — the BackupRepository implementation.
  • AzureBlobStorageClient — Azure SDK wrapper.
  • AzureBlobIndexInput / AzureBlobOutputStream — Lucene BufferedIndexInput and block-blob upload stream.

All dependencies are Apache 2.0 or MIT licensed (Azure SDK and msal4j are MIT).

Tests

The module runs the same shared SolrCloud test suites used by the S3 and GCS modules, backed by an Azurite emulator via Testcontainers (tests skip cleanly when Docker is unavailable):

  • AbstractBackupRepositoryTest — repository contract compliance, exercised through the real init() / configuration path.
  • AbstractIncrementalBackupTest — full incremental backup/restore against a SolrCloud cluster.
  • AbstractInstallShardTest — Install Shard API flows.

Plus module-specific tests for the storage client, output stream (block staging/commit, failure handling), index input (slices, clones, connection-loss resume via ResumableInputStream), and path handling.

Connection-string authentication is exercised in CI against Azurite. Account key, SAS token, and Azure Identity (Service Principal) were verified manually against a real Azure Blob Storage account.

Testing Instructions:
Can be tested locally with Azurite emulator (no Azure account needed) or with real Azure Blob Storage. See solr/modules/azure-blob-repository/README.md for detailed setup instructions.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew :solr:modules:azure-blob-repository:check (module-specific check passed).
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@github-actions github-actions Bot added documentation Improvements or additions to documentation dependencies Dependency upgrades tool:build tests labels Oct 9, 2025
@pratsgit

Copy link
Copy Markdown

Quick update: following up on the dev@ thread.

To clarify scope — although the diff is large, the actual implementation is centered in 8 main files and 8 test files; the rest are license header updates.

Happy to split the PR into smaller logical chunks (core module / tests / docs) if that helps with review.

Thanks everyone!

@janhoy

janhoy commented Nov 20, 2025

Copy link
Copy Markdown
Contributor

Don’t have time to review but asked copilot for an opinion 😉

@janhoy janhoy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only immediate comment is naming. «Blob-repository» is too generic. Should contain word «azure»?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements Azure Blob Storage backup repository support for Apache Solr, enabling collections to be backed up to and restored from Microsoft Azure. The implementation follows established patterns from existing S3 and GCS modules, providing 4 authentication methods (Connection String, Account Key, SAS Token, Azure Identity), incremental backup support, and comprehensive documentation with 76 passing unit tests.

Reviewed Changes

Copilot reviewed 70 out of 70 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
backup-restore.adoc Adds comprehensive documentation for BlobBackupRepository with authentication methods and configuration
BlobBackupRepository.java Main repository implementation following Solr's BackupRepository interface
BlobStorageClient.java Azure SDK wrapper providing blob storage operations
BlobOutputStream.java Custom output stream for block-based blob uploads
BlobIndexInput.java Lucene IndexInput implementation with page caching
Test files 8 test classes with 76 passing tests covering all functionality
build.gradle Module build configuration with Azure SDK dependencies
License files License and notice files for Azure and Netty dependencies

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread solr/licenses/azure-NOTICE.txt Outdated
Comment thread solr/licenses/msal4j-NOTICE.txt Outdated
Comment thread solr/licenses/reactor-NOTICE.txt Outdated
Comment thread gradle/libs.versions.toml Outdated
This commit adds support for backing up and restoring Solr collections
to Azure Blob Storage with multiple authentication options.

Features:
- Full backup/restore functionality to Azure Blob Storage
- Support for 4 authentication methods:
  * Connection String (for development)
  * Account Name + Key (for simple production)
  * SAS Token (recommended for production)
  * Azure Identity (Managed Identity, Service Principal, Azure CLI)
- Incremental backup support with versioning
- Data integrity verification (checksum validation)
- Compatible with Azurite emulator for local testing
- Comprehensive documentation and 76 passing unit tests

Implementation:
- 8 implementation files (1,606 LOC)
- 8 test files (2,180 LOC)
- All dependencies Apache 2.0 licensed
- Follows Solr's backup repository patterns
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch 3 times, most recently from 32efbbd to 5462186 Compare November 21, 2025 23:10
- Renamed module from blob-repository to azure-blob-repository
- Renamed all classes from Blob* to AzureBlob* for clarity
- Updated package from org.apache.solr.blob to org.apache.solr.azureblob
- Added Azure SDK dependencies (azure-storage-blob, azure-identity)
- Updated Solr Reference Guide with Azure Blob Storage documentation
- Added .gitignore entries for Azurite test infrastructure

All authentication methods tested successfully with real Azure Blob Storage:
- Connection String authentication
- Account Name + Key authentication
- SAS Token authentication
- Service Principal (Azure Identity) authentication

Testing completed with 100% success rate on backup/restore operations.
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch from 5462186 to 47f456a Compare November 22, 2025 02:35
@prateeksinghalgit

Copy link
Copy Markdown
Author

Only immediate comment is naming. «Blob-repository» is too generic. Should contain word «azure»?

made the change to azure-blob-repository

Comment thread solr/licenses/msal4j-NOTICE.txt Outdated
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch from 24daaaa to e6d9a64 Compare November 24, 2025 09:22
Comment thread solr/solr-ref-guide/modules/deployment-guide/pages/backup-restore.adoc Outdated
- Switch from Netty to OkHttp for better Security Manager compatibility
- Use static shared HttpClient for better resource management
- Fix licenses: msal4j and Azure SDK are MIT licensed
- Add changelog entry
- Add JFR permissions for Reactor
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch from e6d9a64 to f7adb05 Compare November 24, 2025 23:43
@janhoy

janhoy commented Nov 25, 2025

Copy link
Copy Markdown
Contributor

I'll leave the rest of the review to others more proficient in Azure Blob than me (have never used it). I'd love to test it with a read AzBlob but I'll leave that to someone who already have an active account.

I have a concern that the PR feels largely AI generated(?), lacking the care to details that we require for contributions. Have me excused @prateeksinghalgit if this is not correct, it was just a hunch I got while reviewing. I'd rather have a short well thought out README with useful advice for users than three pages of detailed step by step instructions for testing and developing the feature.

For other reviewers to pick up where I left, consider in partifular the HTTP client choice, correct licensing and avoiding hard coded ports in tests. I have not done a complete review, not read the ref-guide part at all, just commented on things me and Copilot saw immediately.

@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch 3 times, most recently from 8e812b1 to 3545833 Compare November 26, 2025 07:25
- Replace fully-qualified class names with imports in test files
  (Error Prone UnnecessarilyFullyQualified under -Werror)
- Remove invalid 'description' key from changelog entry
- Regenerate dependency lockfiles after merging upstream/main

Co-authored-by: Cursor <cursoragent@cursor.com>
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch from 3944ad0 to 75a5db7 Compare May 28, 2026 22:29
@janhoy janhoy removed their request for review May 29, 2026 12:15
@prateeksinghalgit

Copy link
Copy Markdown
Author

@janhoy could you please approve workflow runs on this fork PR again. I did run the ci jobs locally with all the flags used in CI to make sure .

@prateeksinghalgit

Copy link
Copy Markdown
Author

Status update: CI is now fully green (Gradle Precommit, Crave tests, and changelog validation all passing on the latest commit). All of @psalagnac's feedback is addressed.

@psalagnac, when you have a chance, could you please review the pr . @HoustonPutman, your feedback is welcome too. Thanks!

…) no-op

Verify Lucene checksums when copying index files to the Azure Blob
backup repository (mirroring the S3 repository), rewriting the footer
after validation and rejecting too-small or corrupt files. Clarify that
AzureBlobOutputStream.flush() is intentionally a no-op so frequent
flushes cannot exhaust Azure's committed-block limit, and document the
pathExists/isDirectory asymmetry. Add covering tests and group test
helpers and lifecycle methods consistently.

Co-authored-by: Cursor <cursoragent@cursor.com>
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch from c56699e to c7ffc1a Compare June 17, 2026 19:16
@prateeksinghalgit

Copy link
Copy Markdown
Author

@janhoy could you please approve workflow runs on this fork PR again. I had to make updates to the pr. Also I did run the ci jobs locally with all the flags to make sure .

…ob-repository

Co-authored-by: Cursor <cursoragent@cursor.com>
@prateeksinghalgit

Copy link
Copy Markdown
Author

@janhoy could you please approve the workflow runs on this PR again? The previous CI failure was because my branch hadn't merged the latest main

@prateeksinghalgit

Copy link
Copy Markdown
Author

@janhoy could you please trigger a CI re-run, seems like a unrelated flaky test in the CI run. I ran CI locally, it did run successfully.

@janhoy janhoy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed through and found some questions and suggestsions.

Comment thread solr/solr-ref-guide/modules/deployment-guide/pages/backup-restore.adoc Outdated
Comment thread solr/core/gradle.lockfile
Comment thread solr/solrj/gradle.lockfile
Comment thread solr/modules/azure-blob-repository/README.md Outdated
Comment thread solr/modules/azure-blob-repository/gradle.lockfile Outdated
Comment thread solr/modules/azure-blob-repository/build.gradle
prateeksinghalgit and others added 2 commits June 23, 2026 16:47
- Exclude msal4j-persistence-extension and jna-platform from azure-identity
  (drops 3 unused native credential-cache jars and resolves the jna/jna-platform
  version mismatch); remove their orphaned license files.
- Revert core/solrj/solrj-zookeeper/test-framework/solr-ref-guide lockfiles to
  match main (drop spurious resolveAndLockAll *Copy/spotless config tags); only
  the new azure-blob-repository lockfile changes.
- Trim the module README to a developer pointer and move user-facing content
  (auth methods, usage, troubleshooting) into the Backup/Restore reference guide.
- Clarify the <backup> snippet belongs in solr.xml, and document verified Azure
  Identity behavior under the Security Manager with a minimal CLI-credential grant.

Co-authored-by: Cursor <cursoragent@cursor.com>
@prateeksinghalgit

Copy link
Copy Markdown
Author

@janhoy thank you for the review. I have addressed all review feedback.

@prateeksinghalgit prateeksinghalgit requested a review from janhoy June 24, 2026 18:51
@janhoy

janhoy commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Thanks for the latest iteration, to me this starts to look quite mature. But as said earlier I'll defer to someone who have an actual AZ environment at hand to test it and do the last blessing.

@prateeksinghalgit

Copy link
Copy Markdown
Author

Thanks @janhoy. I've addressed all your points (dependency trim, lockfile cleanup, README, solr.xml, and the verified security.policy example). If you are ok with the updates could you please clear the "changes requested" status so it's not blocking the final reviewer.

@prateeksinghalgit

Copy link
Copy Markdown
Author

To make the final review easier: testing doesn't actually require an Azure account — the integration tests run against the Azurite emulator (Docker, free)- https://hub.docker.com/r/microsoft/azure-storage-azurite. I've separately verified backup/restore against a real Azure Storage account, including under the Java Security Manager.

prateeksinghalgit and others added 2 commits July 1, 2026 13:55
… backup suites

Align the Azure Blob backup repository with S3/GCS conventions:
- resolve()/getBlobPath() now mirror S3 exactly (fold URI host into the path),
  fixing exists()/getPathType() on virtual directories
- listDir() strips trailing delimiters so it returns bare child names per the
  BackupRepository contract
- delete() tolerates already-absent files (lenient, like local/S3)
- commitBlockList(..., true) for overwrite semantics on retried uploads
- handleBlobException() logs HTTP 404 at debug, other failures at error
- fail fast with a clear message when the container name is missing
- drop the unused CHUNK_SIZE constant and unused throws on sanitizedPath()
- document the ~195 GiB single-file block-count ceiling

Rewrite the tests to extend the shared abstract suites (AbstractBackupRepositoryTest,
AbstractIncrementalBackupTest, AbstractInstallShardTest) against Azurite via a new
AzuriteTestContainer helper. Recommend supplying secrets via sysprops/env vars in the
reference guide.

Co-authored-by: Cursor <cursoragent@cursor.com>
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch from 3b74e10 to 53037d6 Compare July 2, 2026 18:07
…ob-repository

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	gradle/libs.versions.toml
@prateeksinghalgit prateeksinghalgit force-pushed the SOLR-17949-azure-blob-repository branch from 203f4ff to 1e832a7 Compare July 2, 2026 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Dependency upgrades documentation Improvements or additions to documentation jetty-server tests tool:build

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants