fix: retry transient per-object delete failures in removeObjects() cleanup by allanrogerr · Pull Request #1700 · minio/minio-java

allanrogerr · 2026-04-27T13:16:12Z

Problem

The removeObjects() cleanup helper in TestMinioClient.java silently discarded all deletion errors via ignore(r.get()). On transient server-side errors (e.g. SlowDown, InternalError) during a batch delete of 1050 objects, the failing objects remain in the bucket. The subsequent removeBucket() call then returns 409 BucketNotEmpty, causing the test to log a false FAIL from the cleanup path — even when the test logic itself passed.

This failure was observed in 5 distinct CI runs across the week ending 2026-04-24, always in testListObjects with 1050 objects, always as the sole minio-java failure while 13 of 14 other mint suites passed. Tracked in miniohq/eos#4608.

CI evidence (all show the same BucketNotEmpty on listObjects() / 1050 objects):

Repeated error fragment across all logs:

"name":"minio-java","function":"listObjects()","args":"[bucket, prefix, recursive, 1050 objects]",
"status":"FAIL","error":"S3 operation failed; ErrorResponseException{errorResponse=
ErrorResponse{code='BucketNotEmpty', message='The bucket you tried to delete is not empty',
bucketName='minio-java-test-<random>'}}`
``

## Fix

`removeObjects()` now **detects deletion errors and retries transient ones**:

- **Read errors**: Process each `DeleteResult.Error` instead of ignoring via `ignore(r.get())`
- **Non-transient errors** (`!TRANSIENT_DELETE_CODES`): throw `IOException` immediately (preserves actual failures)
- **Transient errors** (`InternalError`, `RequestTimeout`, `ServiceUnavailable`, `SlowDown`): retry the full set up to `MAX_DELETE_ATTEMPTS` (5) times with exponential backoff (1s / 2s / 4s / 8s)
- **Success**: break early if no errors returned

The `testRemoveObjects` wrapper's `finally` block is guarded by a `succeeded` flag so the cleanup is only re-attempted on test failure, preventing spurious timeouts on success.

The removeObjects() cleanup helper silently ignored all DeleteErrors via ignore(r.get()), masking transient server-side failures. When the batch delete of 1050 objects returned partial errors (e.g. SlowDown, InternalError), those objects remained in the bucket and the subsequent removeBucket() call failed with BucketNotEmpty — surfacing a cleanup race as a false test failure. Replace the single ignore() call with two focused helpers: - retryRemoveObject(): retries a single object with exponential backoff (500ms base) up to MAX_DELETE_RETRIES attempts. Accepts NoSuchKey/NoSuchVersion silently (already gone), retries SlowDown/InternalError/RequestTimeout/ServiceUnavailable, and propagates anything else immediately. - removeObjects(): fully consumes the batch-delete iterator first (so all 1000-object HTTP batches are sent), then delegates each DeleteResult.Error to retryRemoveObject() individually. DeleteResult.Error only exposes objectName() via ErrorResponse (no versionId), so a HashMap<String,List<ObjectWriteResponse>> index preserves correct versioned-delete semantics by storing all responses keyed by name and retrying every (name,versionId) pair on a miss.

Switch Result<?> to Result<DeleteResult.Error> to restore compile-time type safety and eliminate the unchecked cast. Drain the full server response before throwing on a non-transient error so OkHttp can reuse the connection slot. Restore the thread interrupt flag if Thread.sleep is interrupted. Wrap TRANSIENT_DELETE_CODES in an unmodifiable set. Guard the testRemoveObjects finally block so the redundant cleanup batch is only sent on failure. Add bucket name to error messages and document the retry-by-name re-queueing trade-off.

Copilot

Pull request overview

Improves functional-test cleanup reliability by adding retry logic for transient per-object failures during multi-object delete, to prevent false BucketNotEmpty failures during post-test teardown.

Changes:

Add transient-error classification and retry constants for multi-object deletion.
Rework removeObjects() to retry deletion based on DeleteResult.Error responses with exponential backoff.
Avoid double-invoking removeObjects() in testRemoveObjects() when the first call succeeds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

MinioAsyncClient.removeObjects() sets completed=true as soon as any 1000-object chunk returns errors, silently dropping all subsequent chunks in that call. When toDelete exceeds 1000 entries and a transient error hits the first chunk, objects in later chunks are never attempted and not carried into the retry set, leaving the bucket non-empty. Fix by iterating toDelete in DELETE_BATCH_SIZE slices and issuing a separate removeObjects() call per slice so each slice is always fully processed regardless of errors in other slices.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

After capturing a non-transient error, the previous code continued issuing removeObjects() calls for all remaining chunks before throwing. Move the nonTransientErr check to after each individual chunk's response is drained so no further delete RPCs are made once a fatal error is seen.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Rename MAX_DELETE_RETRIES to MAX_DELETE_ATTEMPTS to better reflect that the constant bounds attempts, not retries. Include err.message() in the non-transient IOException for richer diagnostics. Track failedNames across iterations and include a 5-object sample in the exhaustion IOException to aid debugging when all retries are consumed.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

If removeObjects() in the try block throws, handleException() rethrows it. A second throw from the finally cleanup would then mask the original exception, making the test failure harder to diagnose. Wrap the cleanup call in try/catch so the original exception propagates unmasked.

The cleanup catch block in the finally clause intentionally discards the exception so the original test failure propagates unmasked. Suppress the DE_MIGHT_IGNORE finding at the method level.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

balamurugana · 2026-04-28T01:13:53Z

-                result -> {
-                  return new DeleteRequest.Object(result.object(), result.versionId());
-                })
+            .map(r -> new DeleteRequest.Object(r.object(), r.versionId()))


Why not retry entire results on failure?

Retrying all results on any transient failure would re-send the full ~1050-object set each time — up to 4 extra full-batch passes — when in practice O(1–2) objects actually returned SlowDown/InternalError. The byName index exists specifically to enable cheap, targeted retries: narrow to just the failing names, re-queue all their (name, versionId) pairs (because DeleteResult.Error omits versionId), and only delete what still needs deleting.

A second reason: non-transient codes throw immediately after draining the current chunk, without entering the retry loop at all. Retrying the full set on any error would wrap a non-transient failure in the same retry path, burning all 5 attempts on 1050+ objects before surfacing it.

Correction on the second point I raised — that argument doesn't hold. throw nonTransientErr (line 1090) fires immediately after the current chunk is drained, before toDelete is ever reassigned to retryNames. The retry loop never gets a second attempt when a non-transient error occurs regardless of whether toDelete holds the full set or the narrowed names. So non-transient fast-fail is identical either way.

The only real argument for narrowing is efficiency: avoiding re-sending already-deleted objects in the retry batch.

Why not retry entire results on failure?

1.- retrying the entire results re- sends ~1050 objects per attempt when only just a couple may have actually failed transiently. I would prefer to avoid this. Your thoughts @balamurugana ?

It doesn't matter for tests.
WRT public void removeObjectst(String bucketName, List<ObjectWriteResponse> results) throws Exception,

listObjects() test is the only method creates 1050 objects to make removeObjects() on cleanup. Rest of the usages deal with fewer objects.

removeObjects() test sends 3005 objects on which 3004 objects are non-existent.

We just need plain/simple code.

…ailure Replace byName index, retryNames set, and toDelete rebuild loop with a single anyTransient boolean flag. Retry the full object list on transient errors rather than narrowing to failing names — already-deleted objects are silently ignored by S3 and MinioAsyncClient filters NoSuchVersion.

allanrogerr · 2026-05-02T15:53:59Z

@balamurugana PTAL

…ailure

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

allanrogerr added 2 commits April 27, 2026 13:01

style: apply Google Java Format (spotless)

30b5c35

allanrogerr self-assigned this Apr 27, 2026

balamurugana reviewed Apr 27, 2026

View reviewed changes

Comment thread functional/TestMinioClient.java Outdated

Comment thread functional/TestMinioClient.java Outdated

allanrogerr requested a review from Copilot April 27, 2026 20:39

Copilot started reviewing on behalf of allanrogerr April 27, 2026 20:40 View session