Batch prefetching #2089

Open
newren wants to merge 3 commits into gitgitgadget:master from newren:batch-prefetching

Conversation


@newren newren commented Apr 15, 2026

Partial clones offer users a trade-off: avoid downloading blobs upfront, at the expense of needing to download them later while running other commands. That trade-off can sometimes incur a more severe cost than expected, particularly when needed blobs are discovered only as they are accessed, resulting in blobs being downloaded one at a time. Some commands, such as checkout, diff, and merge, do batch prefetches of the necessary blobs, since that can dramatically reduce the pain of on-demand loading. Extend this ability to two more commands: cherry and grep.

This series was spurred by a report where git cherry jobs were each doing hundreds of single-blob fetches, at a cost of 3s each. Batching those downloads should dramatically speed up those jobs. (And I decided to fix up git grep similarly while I was at it.)

I'll also note that git backfill with revisions and/or pathspecs could also help these users. However, backfill is a manual command: users would have to remember to run it, and would have to figure out which data is needed (a challenge in the case of cherry). So it still makes sense to provide smarter behavior for folks who don't choose to run backfill by hand.

Also, correct a documentation typo I noticed in patch-ids.h (related to code I was using for the git cherry fixes) as a preparatory fixup.

newren added 3 commits April 15, 2026 07:25
Signed-off-by: Elijah Newren <newren@gmail.com>
In partial clones, `git cherry` fetches necessary blobs on-demand one
at a time, which can be very slow.  We would like to prefetch all
necessary blobs upfront.  To do so, we need to be able to first figure
out which blobs are needed.

`git cherry` does its work in a two-phase approach: first computing
header-only IDs (based on file paths and modes), then falling back to
full content-based IDs only when header-only IDs collide -- or, more
accurately, whenever the oidhash() values of the header-only object_ids
collide.

patch-ids.c handles this by creating an ids->patches hashmap that has
all the data we need, but the problem is that any attempt to query the
hashmap will invoke the patch_id_neq() function on any colliding objects,
which causes the on-demand fetching.

Insert a new prefetch_cherry_blobs() function before checking for
collisions.  It temporarily replaces ids->patches.cmpfn in order to
enumerate the blobs that would be needed without yet fetching them,
fetches them all at once, and then restores the old
ids->patches.cmpfn.

Signed-off-by: Elijah Newren <newren@gmail.com>
In partial clones, `git grep` fetches necessary blobs on-demand one
at a time, which can be very slow.  Add an extra preliminary walk
over the tree, similar to grep_tree(), which collects the blobs of
interest and then prefetches them.

Signed-off-by: Elijah Newren <newren@gmail.com>

newren commented Apr 16, 2026

/submit


gitgitgadget bot commented Apr 16, 2026

Submitted as pull.2089.git.1776379694.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2089/newren/batch-prefetching-v1

To fetch this version to local tag pr-2089/newren/batch-prefetching-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2089/newren/batch-prefetching-v1

