SOLR-17949: Add Azure Blob Storage backup repository module#3750
SOLR-17949: Add Azure Blob Storage backup repository module#3750prateeksinghalgit wants to merge 17 commits into
Conversation
|
Quick update: following up on the dev@ thread. To clarify scope — although the diff is large, the actual implementation is centered in 8 main files and 8 test files; the rest are license header updates. Happy to split the PR into smaller logical chunks (core module / tests / docs) if that helps with review. Thanks everyone! |
|
Don’t have time to review but asked copilot for an opinion 😉 |
janhoy
left a comment
There was a problem hiding this comment.
Only immediate comment is naming. «Blob-repository» is too generic. Should contain word «azure»?
There was a problem hiding this comment.
Pull Request Overview
This PR implements Azure Blob Storage backup repository support for Apache Solr, enabling collections to be backed up to and restored from Microsoft Azure. The implementation follows established patterns from existing S3 and GCS modules, providing 4 authentication methods (Connection String, Account Key, SAS Token, Azure Identity), incremental backup support, and comprehensive documentation with 76 passing unit tests.
Reviewed Changes
Copilot reviewed 70 out of 70 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| backup-restore.adoc | Adds comprehensive documentation for BlobBackupRepository with authentication methods and configuration |
| BlobBackupRepository.java | Main repository implementation following Solr's BackupRepository interface |
| BlobStorageClient.java | Azure SDK wrapper providing blob storage operations |
| BlobOutputStream.java | Custom output stream for block-based blob uploads |
| BlobIndexInput.java | Lucene IndexInput implementation with page caching |
| Test files | 8 test classes with 76 passing tests covering all functionality |
| build.gradle | Module build configuration with Azure SDK dependencies |
| License files | License and notice files for Azure and Netty dependencies |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This commit adds support for backing up and restoring Solr collections to Azure Blob Storage with multiple authentication options. Features: - Full backup/restore functionality to Azure Blob Storage - Support for 4 authentication methods: * Connection String (for development) * Account Name + Key (for simple production) * SAS Token (recommended for production) * Azure Identity (Managed Identity, Service Principal, Azure CLI) - Incremental backup support with versioning - Data integrity verification (checksum validation) - Compatible with Azurite emulator for local testing - Comprehensive documentation and 76 passing unit tests Implementation: - 8 implementation files (1,606 LOC) - 8 test files (2,180 LOC) - All dependencies Apache 2.0 licensed - Follows Solr's backup repository patterns
32efbbd to
5462186
Compare
- Renamed module from blob-repository to azure-blob-repository - Renamed all classes from Blob* to AzureBlob* for clarity - Updated package from org.apache.solr.blob to org.apache.solr.azureblob - Added Azure SDK dependencies (azure-storage-blob, azure-identity) - Updated Solr Reference Guide with Azure Blob Storage documentation - Added .gitignore entries for Azurite test infrastructure All authentication methods tested successfully with real Azure Blob Storage: - Connection String authentication - Account Name + Key authentication - SAS Token authentication - Service Principal (Azure Identity) authentication Testing completed with 100% success rate on backup/restore operations.
5462186 to
47f456a
Compare
made the change to azure-blob-repository |
24daaaa to
e6d9a64
Compare
- Switch from Netty to OkHttp for better Security Manager compatibility - Use static shared HttpClient for better resource management - Fix licenses: msal4j and Azure SDK are MIT licensed - Add changelog entry - Add JFR permissions for Reactor
e6d9a64 to
f7adb05
Compare
|
I'll leave the rest of the review to others more proficient in Azure Blob than me (have never used it). I'd love to test it with a read AzBlob but I'll leave that to someone who already have an active account. I have a concern that the PR feels largely AI generated(?), lacking the care to details that we require for contributions. Have me excused @prateeksinghalgit if this is not correct, it was just a hunch I got while reviewing. I'd rather have a short well thought out README with useful advice for users than three pages of detailed step by step instructions for testing and developing the feature. For other reviewers to pick up where I left, consider in partifular the HTTP client choice, correct licensing and avoiding hard coded ports in tests. I have not done a complete review, not read the ref-guide part at all, just commented on things me and Copilot saw immediately. |
8e812b1 to
3545833
Compare
- Replace fully-qualified class names with imports in test files (Error Prone UnnecessarilyFullyQualified under -Werror) - Remove invalid 'description' key from changelog entry - Regenerate dependency lockfiles after merging upstream/main Co-authored-by: Cursor <cursoragent@cursor.com>
3944ad0 to
75a5db7
Compare
|
@janhoy could you please approve workflow runs on this fork PR again. I did run the ci jobs locally with all the flags used in CI to make sure . |
|
Status update: CI is now fully green (Gradle Precommit, Crave tests, and changelog validation all passing on the latest commit). All of @psalagnac's feedback is addressed. @psalagnac, when you have a chance, could you please review the pr . @HoustonPutman, your feedback is welcome too. Thanks! |
…) no-op Verify Lucene checksums when copying index files to the Azure Blob backup repository (mirroring the S3 repository), rewriting the footer after validation and rejecting too-small or corrupt files. Clarify that AzureBlobOutputStream.flush() is intentionally a no-op so frequent flushes cannot exhaust Azure's committed-block limit, and document the pathExists/isDirectory asymmetry. Add covering tests and group test helpers and lifecycle methods consistently. Co-authored-by: Cursor <cursoragent@cursor.com>
c56699e to
c7ffc1a
Compare
|
@janhoy could you please approve workflow runs on this fork PR again. I had to make updates to the pr. Also I did run the ci jobs locally with all the flags to make sure . |
…ob-repository Co-authored-by: Cursor <cursoragent@cursor.com>
|
@janhoy could you please approve the workflow runs on this PR again? The previous CI failure was because my branch hadn't merged the latest main |
|
@janhoy could you please trigger a CI re-run, seems like a unrelated flaky test in the CI run. I ran CI locally, it did run successfully. |
janhoy
left a comment
There was a problem hiding this comment.
Skimmed through and found some questions and suggestsions.
- Exclude msal4j-persistence-extension and jna-platform from azure-identity (drops 3 unused native credential-cache jars and resolves the jna/jna-platform version mismatch); remove their orphaned license files. - Revert core/solrj/solrj-zookeeper/test-framework/solr-ref-guide lockfiles to match main (drop spurious resolveAndLockAll *Copy/spotless config tags); only the new azure-blob-repository lockfile changes. - Trim the module README to a developer pointer and move user-facing content (auth methods, usage, troubleshooting) into the Backup/Restore reference guide. - Clarify the <backup> snippet belongs in solr.xml, and document verified Azure Identity behavior under the Security Manager with a minimal CLI-credential grant. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@janhoy thank you for the review. I have addressed all review feedback. |
|
Thanks for the latest iteration, to me this starts to look quite mature. But as said earlier I'll defer to someone who have an actual AZ environment at hand to test it and do the last blessing. |
|
Thanks @janhoy. I've addressed all your points (dependency trim, lockfile cleanup, README, solr.xml, and the verified security.policy example). If you are ok with the updates could you please clear the "changes requested" status so it's not blocking the final reviewer. |
|
To make the final review easier: testing doesn't actually require an Azure account — the integration tests run against the Azurite emulator (Docker, free)- https://hub.docker.com/r/microsoft/azure-storage-azurite. I've separately verified backup/restore against a real Azure Storage account, including under the Java Security Manager. |
… backup suites Align the Azure Blob backup repository with S3/GCS conventions: - resolve()/getBlobPath() now mirror S3 exactly (fold URI host into the path), fixing exists()/getPathType() on virtual directories - listDir() strips trailing delimiters so it returns bare child names per the BackupRepository contract - delete() tolerates already-absent files (lenient, like local/S3) - commitBlockList(..., true) for overwrite semantics on retried uploads - handleBlobException() logs HTTP 404 at debug, other failures at error - fail fast with a clear message when the container name is missing - drop the unused CHUNK_SIZE constant and unused throws on sanitizedPath() - document the ~195 GiB single-file block-count ceiling Rewrite the tests to extend the shared abstract suites (AbstractBackupRepositoryTest, AbstractIncrementalBackupTest, AbstractInstallShardTest) against Azurite via a new AzuriteTestContainer helper. Recommend supplying secrets via sysprops/env vars in the reference guide. Co-authored-by: Cursor <cursoragent@cursor.com>
3b74e10 to
53037d6
Compare
…ob-repository Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # gradle/libs.versions.toml
203f4ff to
1e832a7
Compare
Implements Azure Blob Storage backup repository as discussed in SOLR-17949.
Description
Adds a new
BackupRepositoryimplementation for Azure Blob Storage, enabling Solr collections to be backed up to and restored from Azure. The module follows the structure and patterns of the existing S3 and GCS repository modules.Highlights
azure-core-http-okhttpinstead of the default Netty transport, for Security Manager compatibility and a smaller dependency footprint.solr.xml(documented in the Reference Guide).Main classes
AzureBlobBackupRepository— theBackupRepositoryimplementation.AzureBlobStorageClient— Azure SDK wrapper.AzureBlobIndexInput/AzureBlobOutputStream— LuceneBufferedIndexInputand block-blob upload stream.All dependencies are Apache 2.0 or MIT licensed (Azure SDK and msal4j are MIT).
Tests
The module runs the same shared SolrCloud test suites used by the S3 and GCS modules, backed by an Azurite emulator via Testcontainers (tests skip cleanly when Docker is unavailable):
AbstractBackupRepositoryTest— repository contract compliance, exercised through the realinit()/ configuration path.AbstractIncrementalBackupTest— full incremental backup/restore against a SolrCloud cluster.AbstractInstallShardTest— Install Shard API flows.Plus module-specific tests for the storage client, output stream (block staging/commit, failure handling), index input (slices, clones, connection-loss resume via
ResumableInputStream), and path handling.Connection-string authentication is exercised in CI against Azurite. Account key, SAS token, and Azure Identity (Service Principal) were verified manually against a real Azure Blob Storage account.
Testing Instructions:
Can be tested locally with Azurite emulator (no Azure account needed) or with real Azure Blob Storage. See
solr/modules/azure-blob-repository/README.mdfor detailed setup instructions.Checklist
Please review the following and check all that apply:
mainbranch../gradlew :solr:modules:azure-blob-repository:check(module-specific check passed).