fix: retry transient per-object delete failures in removeObjects() cleanup#1700
fix: retry transient per-object delete failures in removeObjects() cleanup#1700allanrogerr wants to merge 10 commits intominio:masterfrom
Conversation
The removeObjects() cleanup helper silently ignored all DeleteErrors via ignore(r.get()), masking transient server-side failures. When the batch delete of 1050 objects returned partial errors (e.g. SlowDown, InternalError), those objects remained in the bucket and the subsequent removeBucket() call failed with BucketNotEmpty — surfacing a cleanup race as a false test failure. Replace the single ignore() call with two focused helpers: - retryRemoveObject(): retries a single object with exponential backoff (500ms base) up to MAX_DELETE_RETRIES attempts. Accepts NoSuchKey/NoSuchVersion silently (already gone), retries SlowDown/InternalError/RequestTimeout/ServiceUnavailable, and propagates anything else immediately. - removeObjects(): fully consumes the batch-delete iterator first (so all 1000-object HTTP batches are sent), then delegates each DeleteResult.Error to retryRemoveObject() individually. DeleteResult.Error only exposes objectName() via ErrorResponse (no versionId), so a HashMap<String,List<ObjectWriteResponse>> index preserves correct versioned-delete semantics by storing all responses keyed by name and retrying every (name,versionId) pair on a miss.
Switch Result<?> to Result<DeleteResult.Error> to restore compile-time type safety and eliminate the unchecked cast. Drain the full server response before throwing on a non-transient error so OkHttp can reuse the connection slot. Restore the thread interrupt flag if Thread.sleep is interrupted. Wrap TRANSIENT_DELETE_CODES in an unmodifiable set. Guard the testRemoveObjects finally block so the redundant cleanup batch is only sent on failure. Add bucket name to error messages and document the retry-by-name re-queueing trade-off.
There was a problem hiding this comment.
Pull request overview
Improves functional-test cleanup reliability by adding retry logic for transient per-object failures during multi-object delete, to prevent false BucketNotEmpty failures during post-test teardown.
Changes:
- Add transient-error classification and retry constants for multi-object deletion.
- Rework
removeObjects()to retry deletion based onDeleteResult.Errorresponses with exponential backoff. - Avoid double-invoking
removeObjects()intestRemoveObjects()when the first call succeeds.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
MinioAsyncClient.removeObjects() sets completed=true as soon as any 1000-object chunk returns errors, silently dropping all subsequent chunks in that call. When toDelete exceeds 1000 entries and a transient error hits the first chunk, objects in later chunks are never attempted and not carried into the retry set, leaving the bucket non-empty. Fix by iterating toDelete in DELETE_BATCH_SIZE slices and issuing a separate removeObjects() call per slice so each slice is always fully processed regardless of errors in other slices.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
After capturing a non-transient error, the previous code continued issuing removeObjects() calls for all remaining chunks before throwing. Move the nonTransientErr check to after each individual chunk's response is drained so no further delete RPCs are made once a fatal error is seen.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Rename MAX_DELETE_RETRIES to MAX_DELETE_ATTEMPTS to better reflect that the constant bounds attempts, not retries. Include err.message() in the non-transient IOException for richer diagnostics. Track failedNames across iterations and include a 5-object sample in the exhaustion IOException to aid debugging when all retries are consumed.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
If removeObjects() in the try block throws, handleException() rethrows it. A second throw from the finally cleanup would then mask the original exception, making the test failure harder to diagnose. Wrap the cleanup call in try/catch so the original exception propagates unmasked.
The cleanup catch block in the finally clause intentionally discards the exception so the original test failure propagates unmasked. Suppress the DE_MIGHT_IGNORE finding at the method level.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| result -> { | ||
| return new DeleteRequest.Object(result.object(), result.versionId()); | ||
| }) | ||
| .map(r -> new DeleteRequest.Object(r.object(), r.versionId())) |
There was a problem hiding this comment.
Why not retry entire results on failure?
There was a problem hiding this comment.
Retrying all results on any transient failure would re-send the full ~1050-object set each time — up to 4 extra full-batch passes — when in practice O(1–2) objects actually returned SlowDown/InternalError. The byName index exists specifically to enable cheap, targeted retries: narrow to just the failing names, re-queue all their (name, versionId) pairs (because DeleteResult.Error omits versionId), and only delete what still needs deleting.
A second reason: non-transient codes throw immediately after draining the current chunk, without entering the retry loop at all. Retrying the full set on any error would wrap a non-transient failure in the same retry path, burning all 5 attempts on 1050+ objects before surfacing it.
There was a problem hiding this comment.
Correction on the second point I raised — that argument doesn't hold. throw nonTransientErr (line 1090) fires immediately after the current chunk is drained, before toDelete is ever reassigned to retryNames. The retry loop never gets a second attempt when a non-transient error occurs regardless of whether toDelete holds the full set or the narrowed names. So non-transient fast-fail is identical either way.
The only real argument for narrowing is efficiency: avoiding re-sending already-deleted objects in the retry batch.
There was a problem hiding this comment.
Why not retry entire
resultson failure?
1.- retrying the entire results re- sends ~1050 objects per attempt when only just a couple may have actually failed transiently. I would prefer to avoid this. Your thoughts @balamurugana ?
There was a problem hiding this comment.
It doesn't matter for tests.
WRT public void removeObjectst(String bucketName, List<ObjectWriteResponse> results) throws Exception,
listObjects()test is the only method creates 1050 objects to makeremoveObjects()on cleanup. Rest of the usages deal with fewer objects.removeObjects()test sends 3005 objects on which 3004 objects are non-existent.
We just need plain/simple code.
…ailure Replace byName index, retryNames set, and toDelete rebuild loop with a single anyTransient boolean flag. Retry the full object list on transient errors rather than narrowing to failing names — already-deleted objects are silently ignored by S3 and MinioAsyncClient filters NoSuchVersion.
6c2be27 to
039724d
Compare
|
@balamurugana PTAL |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem
The
removeObjects()cleanup helper inTestMinioClient.javasilently discarded all deletion errors viaignore(r.get()). On transient server-side errors (e.g.SlowDown,InternalError) during a batch delete of 1050 objects, the failing objects remain in the bucket. The subsequentremoveBucket()call then returns409 BucketNotEmpty, causing the test to log a false FAIL from the cleanup path — even when the test logic itself passed.This failure was observed in 5 distinct CI runs across the week ending 2026-04-24, always in
testListObjectswith 1050 objects, always as the sole minio-java failure while 13 of 14 other mint suites passed. Tracked in miniohq/eos#4608.CI evidence (all show the same
BucketNotEmptyonlistObjects()/ 1050 objects):Repeated error fragment across all logs: