Skip to content

fix: retry transient per-object delete failures in removeObjects() cleanup#1700

Open
allanrogerr wants to merge 10 commits intominio:masterfrom
allanrogerr:fix/retry-transient-delete-errors
Open

fix: retry transient per-object delete failures in removeObjects() cleanup#1700
allanrogerr wants to merge 10 commits intominio:masterfrom
allanrogerr:fix/retry-transient-delete-errors

Conversation

@allanrogerr
Copy link
Copy Markdown

@allanrogerr allanrogerr commented Apr 27, 2026

Problem

The removeObjects() cleanup helper in TestMinioClient.java silently discarded all deletion errors via ignore(r.get()). On transient server-side errors (e.g. SlowDown, InternalError) during a batch delete of 1050 objects, the failing objects remain in the bucket. The subsequent removeBucket() call then returns 409 BucketNotEmpty, causing the test to log a false FAIL from the cleanup path — even when the test logic itself passed.

This failure was observed in 5 distinct CI runs across the week ending 2026-04-24, always in testListObjects with 1050 objects, always as the sole minio-java failure while 13 of 14 other mint suites passed. Tracked in miniohq/eos#4608.

CI evidence (all show the same BucketNotEmpty on listObjects() / 1050 objects):

Repeated error fragment across all logs:

"name":"minio-java","function":"listObjects()","args":"[bucket, prefix, recursive, 1050 objects]",
"status":"FAIL","error":"S3 operation failed; ErrorResponseException{errorResponse=
ErrorResponse{code='BucketNotEmpty', message='The bucket you tried to delete is not empty',
bucketName='minio-java-test-<random>'}}`
``

## Fix

`removeObjects()` now **detects deletion errors and retries transient ones**:

- **Read errors**: Process each `DeleteResult.Error` instead of ignoring via `ignore(r.get())`
- **Non-transient errors** (`!TRANSIENT_DELETE_CODES`): throw `IOException` immediately (preserves actual failures)
- **Transient errors** (`InternalError`, `RequestTimeout`, `ServiceUnavailable`, `SlowDown`): retry the full set up to `MAX_DELETE_ATTEMPTS` (5) times with exponential backoff (1s / 2s / 4s / 8s)
- **Success**: break early if no errors returned

The `testRemoveObjects` wrapper's `finally` block is guarded by a `succeeded` flag so the cleanup is only re-attempted on test failure, preventing spurious timeouts on success.

The removeObjects() cleanup helper silently ignored all DeleteErrors via
ignore(r.get()), masking transient server-side failures. When the batch
delete of 1050 objects returned partial errors (e.g. SlowDown,
InternalError), those objects remained in the bucket and the subsequent
removeBucket() call failed with BucketNotEmpty — surfacing a cleanup
race as a false test failure.

Replace the single ignore() call with two focused helpers:

- retryRemoveObject(): retries a single object with exponential backoff
  (500ms base) up to MAX_DELETE_RETRIES attempts. Accepts
  NoSuchKey/NoSuchVersion silently (already gone), retries
  SlowDown/InternalError/RequestTimeout/ServiceUnavailable, and
  propagates anything else immediately.
- removeObjects(): fully consumes the batch-delete iterator first (so
  all 1000-object HTTP batches are sent), then delegates each
  DeleteResult.Error to retryRemoveObject() individually.

DeleteResult.Error only exposes objectName() via ErrorResponse (no
versionId), so a HashMap<String,List<ObjectWriteResponse>> index
preserves correct versioned-delete semantics by storing all responses
keyed by name and retrying every (name,versionId) pair on a miss.
@allanrogerr allanrogerr self-assigned this Apr 27, 2026
Comment thread functional/TestMinioClient.java Outdated
Comment thread functional/TestMinioClient.java Outdated
Switch Result<?> to Result<DeleteResult.Error> to restore compile-time
type safety and eliminate the unchecked cast. Drain the full server
response before throwing on a non-transient error so OkHttp can reuse
the connection slot. Restore the thread interrupt flag if
Thread.sleep is interrupted. Wrap TRANSIENT_DELETE_CODES in an
unmodifiable set. Guard the testRemoveObjects finally block so the
redundant cleanup batch is only sent on failure. Add bucket name to
error messages and document the retry-by-name re-queueing trade-off.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves functional-test cleanup reliability by adding retry logic for transient per-object failures during multi-object delete, to prevent false BucketNotEmpty failures during post-test teardown.

Changes:

  • Add transient-error classification and retry constants for multi-object deletion.
  • Rework removeObjects() to retry deletion based on DeleteResult.Error responses with exponential backoff.
  • Avoid double-invoking removeObjects() in testRemoveObjects() when the first call succeeds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread functional/TestMinioClient.java
Comment thread functional/TestMinioClient.java
Comment thread functional/TestMinioClient.java Outdated
MinioAsyncClient.removeObjects() sets completed=true as soon as any
1000-object chunk returns errors, silently dropping all subsequent
chunks in that call. When toDelete exceeds 1000 entries and a transient
error hits the first chunk, objects in later chunks are never attempted
and not carried into the retry set, leaving the bucket non-empty.

Fix by iterating toDelete in DELETE_BATCH_SIZE slices and issuing a
separate removeObjects() call per slice so each slice is always fully
processed regardless of errors in other slices.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread functional/TestMinioClient.java Outdated
After capturing a non-transient error, the previous code continued
issuing removeObjects() calls for all remaining chunks before throwing.
Move the nonTransientErr check to after each individual chunk's response
is drained so no further delete RPCs are made once a fatal error is seen.
@allanrogerr allanrogerr requested a review from Copilot April 27, 2026 21:14
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread functional/TestMinioClient.java Outdated
Comment thread functional/TestMinioClient.java Outdated
Comment thread functional/TestMinioClient.java
Rename MAX_DELETE_RETRIES to MAX_DELETE_ATTEMPTS to better reflect
that the constant bounds attempts, not retries. Include err.message()
in the non-transient IOException for richer diagnostics. Track
failedNames across iterations and include a 5-object sample in the
exhaustion IOException to aid debugging when all retries are consumed.
@allanrogerr allanrogerr requested a review from Copilot April 27, 2026 21:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread functional/TestMinioClient.java Outdated
Comment thread functional/TestMinioClient.java
If removeObjects() in the try block throws, handleException() rethrows
it. A second throw from the finally cleanup would then mask the original
exception, making the test failure harder to diagnose. Wrap the cleanup
call in try/catch so the original exception propagates unmasked.
The cleanup catch block in the finally clause intentionally discards
the exception so the original test failure propagates unmasked.
Suppress the DE_MIGHT_IGNORE finding at the method level.
@allanrogerr allanrogerr requested a review from Copilot April 27, 2026 21:39
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

result -> {
return new DeleteRequest.Object(result.object(), result.versionId());
})
.map(r -> new DeleteRequest.Object(r.object(), r.versionId()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not retry entire results on failure?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retrying all results on any transient failure would re-send the full ~1050-object set each time — up to 4 extra full-batch passes — when in practice O(1–2) objects actually returned SlowDown/InternalError. The byName index exists specifically to enable cheap, targeted retries: narrow to just the failing names, re-queue all their (name, versionId) pairs (because DeleteResult.Error omits versionId), and only delete what still needs deleting.

A second reason: non-transient codes throw immediately after draining the current chunk, without entering the retry loop at all. Retrying the full set on any error would wrap a non-transient failure in the same retry path, burning all 5 attempts on 1050+ objects before surfacing it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction on the second point I raised — that argument doesn't hold. throw nonTransientErr (line 1090) fires immediately after the current chunk is drained, before toDelete is ever reassigned to retryNames. The retry loop never gets a second attempt when a non-transient error occurs regardless of whether toDelete holds the full set or the narrowed names. So non-transient fast-fail is identical either way.

The only real argument for narrowing is efficiency: avoiding re-sending already-deleted objects in the retry batch.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not retry entire results on failure?

1.- retrying the entire results re- sends ~1050 objects per attempt when only just a couple may have actually failed transiently. I would prefer to avoid this. Your thoughts @balamurugana ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter for tests.
WRT public void removeObjectst(String bucketName, List<ObjectWriteResponse> results) throws Exception,

  1. listObjects() test is the only method creates 1050 objects to make removeObjects() on cleanup. Rest of the usages deal with fewer objects.
  2. removeObjects() test sends 3005 objects on which 3004 objects are non-existent.

We just need plain/simple code.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

…ailure

Replace byName index, retryNames set, and toDelete rebuild loop with a
single anyTransient boolean flag. Retry the full object list on transient
errors rather than narrowing to failing names — already-deleted objects
are silently ignored by S3 and MinioAsyncClient filters NoSuchVersion.
@allanrogerr allanrogerr force-pushed the fix/retry-transient-delete-errors branch from 6c2be27 to 039724d Compare April 28, 2026 12:34
@allanrogerr
Copy link
Copy Markdown
Author

@balamurugana PTAL

harshavardhana
harshavardhana previously approved these changes May 2, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants