
Add gRPC message size limit handling to query execution #116

Open
btipling wants to merge 2 commits into main from claude/slack-session-MvKQ6

Conversation

btipling (Contributor) commented Apr 8, 2026

Description

  • New feature

This PR adds size limit handling to query execution, preventing responses from exceeding gRPC's 4MB message size limit. The changes introduce size-aware query execution that tracks cumulative response size and terminates queries early when approaching the limit.

Changes

Core Changes:

  • Added runQueryWithSizeLimit() method that executes queries with size-based early termination
  • Introduced queryResult struct to track pagination tokens, total size, item count, and whether size limits were hit
  • Added estimateRowSize() function to estimate serialized size of database rows
  • Defined constants for gRPC limits (3.5MB conservative threshold, size estimate overhead, etc.)

Size Estimation Functions:

  • estimateResourceSize() - estimates serialized size of Resource objects
  • estimateEntitlementSize() - estimates serialized size of Entitlement objects
  • estimateGrantSize() - estimates serialized size of Grant objects
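
These estimators are plain field-length sums plus a fixed per-item overhead. Below is a minimal, self-contained sketch of the shape of one such estimator; the field names mirror the snippets quoted in the review threads further down, but the type definitions here are illustrative stand-ins for the connector SDK types, not the PR's actual code.

package bsql

// Illustrative stand-in types. The real Grant/Resource types come from the
// connector SDK and carry many more fields.
type ResourceId struct {
    ResourceType string
    Resource     string
}

type Resource struct {
    Id *ResourceId
}

type Grant struct {
    Principal *Resource
}

const sizeEstimateOverhead = 500 // fixed per-item allowance for protobuf framing

// estimateGrantSize sums the lengths of the string fields expected to be
// serialized, then adds the fixed protobuf-encoding overhead.
func estimateGrantSize(g *Grant) int64 {
    if g == nil {
        return 0
    }
    var size int64
    if g.Principal != nil && g.Principal.Id != nil {
        size += int64(len(g.Principal.Id.Resource))
        size += int64(len(g.Principal.Id.ResourceType))
    }
    return size + sizeEstimateOverhead
}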

Updated Query Execution:

  • Modified List() in resources.go to use size-aware query execution
  • Modified dynamicEntitlements() in entitlements.go to use size-aware query execution
  • Modified listGrants() in grants.go to use size-aware query execution
  • All callbacks now return item size information for accurate tracking
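
A plausible shape for that callback contract, sketched here as an assumption (the actual signature in the PR may differ):

// Each row callback converts a scanned row into its output object and reports
// an estimated serialized size, letting the query loop track cumulative size.
type rowCallback func(rowMap map[string]interface{}) (itemSize int64, err error)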

How It Works

The new runQueryWithSizeLimit() method:

  1. Executes queries with standard pagination limits
  2. Estimates the serialized size of each row before processing
  3. Stops early if adding the next item would exceed the 3.5MB threshold
  4. Generates a next page token when either pagination or size limits are hit
  5. Allows callbacks to provide actual item sizes for more accurate tracking

This ensures that responses stay well below the 4MB gRPC limit while maintaining pagination functionality.
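
Condensed into code, the loop looks roughly like the self-contained sketch below. The names queryResult, estimateRowSize, and maxResponseSizeBytes follow this PR; the scanRow helper, the callback signature, and the trivial row estimator are assumptions made so the sketch compiles on its own.

package bsql

import "database/sql"

type queryResult struct {
    NextPageToken string
    TotalSize     int64
    ItemCount     int
    HitSizeLimit  bool
}

const maxResponseSizeBytes = 3 * 1024 * 1024

// estimateRowSize: trivial stand-in that sums the byte lengths of string and
// []byte column values in a scanned row.
func estimateRowSize(rowMap map[string]interface{}) int64 {
    var size int64
    for _, v := range rowMap {
        switch t := v.(type) {
        case string:
            size += int64(len(t))
        case []byte:
            size += int64(len(t))
        }
    }
    return size
}

// runQueryWithSizeLimitSketch shows the early-termination control flow only;
// pagination token generation and cursor bookkeeping are elided.
func runQueryWithSizeLimitSketch(
    rows *sql.Rows,
    scanRow func(*sql.Rows) (map[string]interface{}, error),
    callback func(map[string]interface{}) (int64, error),
) (*queryResult, error) {
    result := &queryResult{}
    for rows.Next() {
        rowMap, err := scanRow(rows)
        if err != nil {
            return nil, err
        }
        rowSize := estimateRowSize(rowMap)
        // Stop BEFORE processing if this row would push the response past the
        // threshold. The TotalSize > 0 guard guarantees every page returns at
        // least one item, avoiding infinite pagination loops.
        if result.TotalSize > 0 && result.TotalSize+rowSize > maxResponseSizeBytes {
            result.HitSizeLimit = true
            break
        }
        itemSize, err := callback(rowMap)
        if err != nil {
            return nil, err
        }
        if itemSize == 0 {
            itemSize = rowSize // fall back to the raw-row estimate
        }
        result.TotalSize += itemSize
        result.ItemCount++
    }
    return result, rows.Err()
}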

Test Plan

Existing tests pass. The size limit handling is transparent to callers: pagination behavior remains unchanged, except that queries now terminate early when the size limit is approached.

https://claude.ai/code/session_01Gg185kHrjNU2aznuRQ7CsG

Add size-aware query execution that tracks approximate result sizes and
automatically stops fetching early when approaching the 4MB gRPC message
size limit. This prevents "message too big" errors by:

- Adding runQueryWithSizeLimit() that estimates serialized size during query
- Stopping early when cumulative size approaches 3MB (conservative threshold)
- Returning with a next page token so subsequent calls continue from where we stopped
- Adding size estimation functions for resources, entitlements, and grants

Fixes CXH-1300

https://claude.ai/code/session_01Gg185kHrjNU2aznuRQ7CsG
@btipling btipling requested a review from a team April 8, 2026 15:17

github-actions Bot commented Apr 8, 2026

Claude finished @btipling's task in 2m 15s


PR Review: gRPC message size limit handling

  • Gather context and read changed files
  • Analyze code changes for bugs, security, and best practices
  • Post review feedback

Overall: Good idea to add size-based early termination to prevent hitting the 4MB gRPC limit. The approach of tracking cumulative size and breaking early with a next-page token is sound. However, there are two critical pagination bugs that will cause data loss when the size limit is triggered.

Critical Issues

  1. Cursor-based pagination skips rows on size limit (query.go:588-600) — When the size limit is hit, lastRowID has already been updated to the current (unprocessed) row's primary key. The next page query uses WHERE id > lastRowID, permanently skipping that row. Fix: track lastProcessedRowID separately, only updating it after the callback succeeds.

  2. Offset-based pagination skips rows on size limit (query.go:641-650) — nextPageToken computes offset + limit, but when the size limit stops processing early (e.g., 30 of 100 rows), the next page skips ahead by the full limit, losing rows 30-99. Fix: use actual processed item count for the offset calculation.

Minor Issues

  1. Misleading comment (query.go:30-35) — Comment says "3.5MB" but constant is 3MB. PR description also says 3.5MB. Two unused constants (minPageSizeForSizeLimit, pageSizeReductionDivisor) suggest an incomplete page-size reduction strategy.

  2. Inconsistent size estimation (query.go:606-616) — Pre-check uses estimateRowSize(rowMap) (raw row data) but accumulated size uses callback's itemSize (structured object estimate). These can diverge, making the guard unreliable.

  3. Potential nil pointer (grants.go:119-122) — g.Principal.Id could be nil even when g.Principal is not. Same pattern in estimateResourceSize with r.Id.

What looks good

  • The queryResult struct cleanly encapsulates size tracking state
  • The result.TotalSize > 0 guard ensures at least one item is always returned (avoids infinite pagination loops)
  • Preserving the original runQuery avoids breaking provisioning code that doesn't need size limits
  • Size estimation functions are straightforward and easy to tune

Fix the pagination bugs.

Comment thread on pkg/bsql/query.go (Outdated), lines +30 to +35
// gRPC message size limits - we use a conservative threshold to avoid hitting the 4MB limit
// The actual limit is 4MB, but we use 3.5MB to leave room for overhead and metadata
maxResponseSizeBytes = 3 * 1024 * 1024 // 3MB conservative limit
sizeEstimateOverhead = 500 // Overhead per item for protobuf encoding
minPageSizeForSizeLimit = 1 // Minimum page size when reducing due to size limits
pageSizeReductionDivisor = 2 // How much to reduce page size when hitting limits

Bug: Misleading comment + unused constants

  1. The comment says "we use 3.5MB" but the constant is 3 * 1024 * 1024 = 3MB, not 3.5MB. The PR description also says 3.5MB. One of them is wrong.

  2. minPageSizeForSizeLimit and pageSizeReductionDivisor are defined but never used anywhere in this PR. They suggest an incomplete "reduce page size" strategy that was never implemented. These should be removed to avoid confusion.

Suggested change:

- // gRPC message size limits - we use a conservative threshold to avoid hitting the 4MB limit
- // The actual limit is 4MB, but we use 3.5MB to leave room for overhead and metadata
- maxResponseSizeBytes = 3 * 1024 * 1024 // 3MB conservative limit
- sizeEstimateOverhead = 500 // Overhead per item for protobuf encoding
- minPageSizeForSizeLimit = 1 // Minimum page size when reducing due to size limits
- pageSizeReductionDivisor = 2 // How much to reduce page size when hitting limits
+ // gRPC message size limits - we use a conservative threshold to avoid hitting the 4MB limit
+ // The actual limit is 4MB, but we use 3MB to leave room for overhead and metadata
+ maxResponseSizeBytes = 3 * 1024 * 1024 // 3MB conservative limit
+ sizeEstimateOverhead = 500 // Overhead per item for protobuf encoding

Comment thread on pkg/bsql/query.go, lines +588 to +600
return nil, err
}

foundPaginationKey := false
rowMap := make(map[string]interface{})
for i, colName := range columns {
rowMap[colName] = values[i]
if pCtx != nil && pCtx.PrimaryKey == colName {
lastRowID = values[i]
foundPaginationKey = true
}
}


Bug (Critical): Data loss on size-limit-triggered pagination with cursor strategy

When the size limit is hit (line 611), lastRowID has already been updated to the current (unprocessed) row's primary key here at line 595. The current row is then skipped (we break without calling the callback).

For cursor-based pagination, nextPageToken returns lastRowID, and the next query uses WHERE id > lastRowID. This means the unprocessed row is permanently skipped — data loss.

For offset-based pagination, nextPageToken computes offset + limit, but we processed fewer rows than limit. This also skips rows.

Fix: track lastProcessedRowID separately and only update it after the callback succeeds.

Suggested change:

- return nil, err
- }
- foundPaginationKey := false
- rowMap := make(map[string]interface{})
- for i, colName := range columns {
- rowMap[colName] = values[i]
- if pCtx != nil && pCtx.PrimaryKey == colName {
- lastRowID = values[i]
- foundPaginationKey = true
- }
- }
+ if err := rows.Scan(scanArgs...); err != nil {
+ return nil, err
+ }
+ foundPaginationKey := false
+ var currentRowID any
+ rowMap := make(map[string]interface{})
+ for i, colName := range columns {
+ rowMap[colName] = values[i]
+ if pCtx != nil && pCtx.PrimaryKey == colName {
+ currentRowID = values[i]
+ foundPaginationKey = true
+ }
+ }

Then after the size check passes and the callback succeeds, add lastRowID = currentRowID. And for offset-based pagination, you'll need to track actual items processed to compute the correct next offset.
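
In code, the corrected bookkeeping would look roughly like this fragment (a sketch reusing the loop variables above, not the actual diff):

// Advance the cursor only once the row is actually part of the response.
itemSize, err := callback(rowMap)
if err != nil {
    return nil, err
}
lastRowID = currentRowID // safe to advance: this row will be returned
result.ItemCount++       // the offset strategy derives its next offset from this count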

Comment thread on pkg/bsql/query.go (Outdated), lines +641 to +650
// Determine if we need a next page token
// We need one if: we hit the page limit, OR we hit the size limit
if pCtx != nil && (hitPageLimit || result.HitSizeLimit) {
result.NextPageToken, err = s.nextPageToken(pCtx, lastRowID)
if err != nil {
return nil, err
}
}

return result, nil

Bug (Critical): Offset-based pagination skips rows when size limit is hit

For offset strategy, nextPageToken computes offset + pageSize (where pageSize = pCtx.Limit). But when the size limit causes early termination, the actual number of processed items is less than pCtx.Limit.

Example: offset=0, limit=100, but size limit stops after 30 items. Next token = 0 + 100 = 100, skipping rows 30-99.

The offset-based next page token must use the actual item count:

// For offset strategy when hitting size limits, adjust the pagination context
// so the next page starts at the correct offset
if result.HitSizeLimit && pCtx.Strategy == "offset" {
    pCtx.Limit = int64(result.ItemCount)
}

This adjustment should be applied before calling s.nextPageToken().

Comment thread on pkg/bsql/grants.go (Outdated), lines +119 to +122
if g.Principal != nil {
size += int64(len(g.Principal.Id.Resource))
size += int64(len(g.Principal.Id.ResourceType))
}

Nit: Potential nil pointer dereference

If g.Principal is non-nil but g.Principal.Id is nil, this will panic. Same pattern exists in estimateResourceSize with r.Id. Consider guarding:

Suggested change:

- if g.Principal != nil {
+ if g.Principal != nil && g.Principal.Id != nil {
  size += int64(len(g.Principal.Id.Resource))
  size += int64(len(g.Principal.Id.ResourceType))
  }

Comment thread on pkg/bsql/query.go (Outdated), lines +606 to +616
rowSize := estimateRowSize(rowMap)

// Check if adding this item would exceed size limits
// We check BEFORE processing to avoid returning partial results
if result.TotalSize > 0 && result.TotalSize+rowSize > maxResponseSizeBytes {
result.HitSizeLimit = true
l.Info("stopping query early due to response size limit",
zap.Int64("current_size", result.TotalSize),
zap.Int64("row_size", rowSize),
zap.Int("items_returned", result.ItemCount))
break

Design concern: Pre-check uses different estimate than post-check

The size limit pre-check uses estimateRowSize(rowMap) (raw row data), but the actual accumulated size uses itemSize from the callback (which calls estimateResourceSize/estimateEntitlementSize/estimateGrantSize). These two estimates can diverge significantly — raw row data is just column key/values, while the callbacks account for protobuf structure, nested objects, annotations, etc.

This means the pre-check guard could allow a row through that actually pushes the total well over the limit, or block a row that would have been fine. Consider using a consistent estimation approach, or at minimum adding a post-callback size check that sets HitSizeLimit if the actual accumulated size exceeds the threshold after processing.
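
A minimal sketch of such a post-callback check, reusing the loop variables from the quoted snippet (illustrative, not the actual diff):

// After accumulating the callback's itemSize, re-check the running total so a
// divergent pre-check estimate cannot let the response grow past the cap.
result.TotalSize += itemSize
result.ItemCount++
if result.TotalSize > maxResponseSizeBytes {
    result.HitSizeLimit = true
    break
}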

- Fix misleading comment (3.5MB -> 3MB) and remove unused constants
- Fix critical data loss: track lastRowID separately from currentRowID
  so unprocessed rows aren't skipped on cursor-based pagination
- Fix offset pagination: adjust limit to actual item count when size
  limit is hit, so next offset is computed correctly
- Add post-callback size check for consistent size enforcement
- Guard against nil pointer dereference in estimateGrantSize
- Fix godot lint issue (comment period)

https://claude.ai/code/session_01Gg185kHrjNU2aznuRQ7CsG

github-actions Bot commented Apr 8, 2026

Claude finished @btipling's task in 2m 24s


PR Review: gRPC message size limit handling (post-fix)

  • Gather context and read changed files
  • Verify critical pagination bugs were fixed
  • Analyze code changes for remaining issues
  • Post review feedback

Overall: The fix commit (fc6c302) correctly addresses all critical and minor issues from the previous review. The code is in good shape.

Critical bugs — verified fixed

1. Cursor-based pagination data loss — ✅ Fixed. currentRowID is now tracked separately (query.go:590), and lastRowID is only updated after the callback succeeds (query.go:630). When the size limit breaks early, lastRowID still points to the last processed row, so the unprocessed row is correctly included in the next page.

2. Offset-based pagination data loss — ✅ Fixed. When HitSizeLimit is true with offset strategy, pCtx.Limit is adjusted to result.ItemCount before computing the next token (query.go:656-658). This ensures the next page offset advances by the actual number of processed items, not the full page size.

Minor issues — verified fixed

3. Misleading comment + unused constants — ✅ Comment now accurately says "3MB" (query.go:31). Unused constants minPageSizeForSizeLimit and pageSizeReductionDivisor removed.

4. Nil pointer guard — ✅ grants.go:119 now checks g.Principal != nil && g.Principal.Id != nil.

5. Inconsistent size estimation — ✅ Addressed by adding a post-callback size check (query.go:634-640). Even if the pre-check (using raw row estimate) allows a row through, the post-callback check catches it if the actual accumulated size exceeds the threshold.

Remaining observations (non-blocking)

  1. Double estimateRowSize call (query.go:607,624) — When itemSize from the callback is 0, estimateRowSize(rowMap) is called twice for the same row (once for the pre-check, once for accumulation). Could cache the result in a local variable. Trivial inefficiency, not a bug.

  2. First item always processed regardless of size — The pre-check guard result.TotalSize > 0 (query.go:607) ensures at least one item is always returned. This is the correct design — it prevents infinite pagination loops on oversized items. A single item exceeding 3MB would still land under the 4MB hard limit given the 1MB headroom.

  3. No tests for the new size-limit path — The PR description says "existing tests pass," but there are no dedicated tests for the early-termination behavior. Consider adding a test that verifies: (a) when a batch approaches 3MB, a next page token is generated; (b) cursor-based pagination resumes at the correct row; (c) offset-based pagination resumes at the correct offset.

Verdict: The pagination fixes are correct. This is ready to merge pending the author's judgment on whether to add targeted tests for the size-limit path.
