Skip to content

[pull] main from containerd:main#368

Merged
pull[bot] merged 5 commits into
meonBot:mainfrom
containerd:main
Jun 3, 2026
Merged

[pull] main from containerd:main#368
pull[bot] merged 5 commits into
meonBot:mainfrom
containerd:main

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented Jun 3, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

dmcgowan and others added 5 commits April 28, 2026 17:25
The pull progress reporter resets lastSeenTimestamp on every tick where
activeReqs == 0, but never on the transition to a non-zero count. When a
pull is held in content.OpenWriter (idle in HTTP terms) and then
unblocks, the next request can be cancelled less than `timeout` after it
was actually issued — its first byte must arrive within whatever fraction
of `timeout` remains on the timer captured during the previous idle
tick.

Track the previous tick's activeReqs and reset the timer on the 0→1
transition so a newly-issued request always gets a full timeout window
to produce its first byte. This deflakes
TestCRIImagePullTimeout/HoldingContentOpenWriterWithLocalPull, which
hits ghcr.io directly and can exceed the shrunken window during
auth handshakes in CI.

Signed-off-by: Derek McGowan <derek@mcg.dev>
…-reset

cri: reset pull progress timer on idle→active transition
The CRI progress reporter cancels an image pull if it sees no progress
for 5 seconds. It tracks this through active HTTP requests. During
remote fetches, the HTTP response reader is closed via a deferred
call after `content.Copy` completes.

Diagnosis:
`content.Copy` handles both downloading the stream and committing
the writer to the content store. Any delays during the database
commit phase (e.g. from database locks, slow disk syncs, or concurrent
pull deduplication blocks) keep the HTTP connection open. The progress
reporter sees the request is still active (`activeReqs = 1`) but no new
bytes are coming in, leading to a premature timeout cancellation.

Reproduction:
We reproduced this flakiness deterministically on a GCE VM under a
simulated 2 Mbps ingress bandwidth limit using Linux traffic control
ingress policing (`tc filter ... action police rate 2mbit`). Under this
slowness, the download took longer than the progress timeout during the
slow commit phase, triggering context cancellation and failing the
`TestCRIImagePullTimeout/HoldingContentOpenWriterWithLocalPull` test.

Solution:
To fix this, we wrap the HTTP reader in a `closeOnEOFReader` or
`closeOnEOFReadSeeker` before handing it to `content.Copy`. If the
underlying connection reader implements `io.Seeker`, it is dynamically
wrapped in `closeOnEOFReadSeeker` to forward `Seek` operations. This
ensures that O(1) Range seeks are fully preserved during network
resumes or retries. The wrappers automatically close the underlying
network stream as soon as `Read()` returns `io.EOF` (when the download
completes, before the database commit begins). This drops `activeReqs`
to `0` early, freeing the socket and preventing progress timeouts
during commits. A `sync.Once` ensures that subsequent deferred
`Close()` calls do not double-decrement the reporter.

How it was tested:
Verified the fix on a GCE VM under a simulated 2 Mbps ingress
bandwidth limit. Verified seeker safety via standalone logic audits
and trace proofs.

Assisted-by: Antigravity
Signed-off-by: Samuel Karp <samuelkarp@google.com>
The TestCRIImagePullTimeout test case "NoDataTransferred" flaked under
constrained networks because the test proxy mirror registry used a
blocking ReadAtLeast call to forward bytes to containerd.

This blocking wait (up to 4KB) meant the mirror registry server
completely stopped forwarding data during network slowness, triggering
containerd's aggressive 5-second progress timeout and canceling the
pull before it could reach its 3MB circuit-breaker limit.

This is resolved by changing the proxy's custom copy loop from
io.ReadAtLeast(src, buf, len(buf)) to standard src.Read(buf). This
streams network chunks to containerd immediately as they arrive,
preventing false timeout cancellations while maintaining correct
circuit-breaker byte tracking.

Assisted-by: Antigravity
Signed-off-by: Samuel Karp <samuelkarp@google.com>
remotes: close fetch reader immediately on EOF
@pull pull Bot locked and limited conversation to collaborators Jun 3, 2026
@pull pull Bot added the ⤵️ pull label Jun 3, 2026
@pull pull Bot merged commit 9eed061 into meonBot:main Jun 3, 2026
1 check passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants