
feat(bulk-import): add client-side parquet validation #638

Open
tim-pinecone wants to merge 3 commits into pinecone-io:main from tim-pinecone:feat/bulk-import-parquet-validation

Conversation


tim-pinecone commented Apr 22, 2026

Summary

  • Adds index.bulk_import.validate(uri) and the standalone pinecone.validate_bulk_import(uri) to check parquet files for Pinecone bulk import compatibility before sending them to the server
  • Schema validation reads only the parquet file footer (no data downloaded) — cheap even for large remote files
  • Optional data sampling (sample_rows=100 by default) checks for null IDs, non-finite vector values, metadata JSON validity, and the 40 KB metadata size limit
  • Returns BulkImportValidationResult with .is_valid, .errors, .warnings, .files_checked, .rows_sampled; the .uri field can be passed directly to index.bulk_import.start()
  • Verbose mode prints per-file OK/BAD progress and a final summary
  • Supports single .parquet files and directory/prefix URIs; handles s3://, gs://, and az:// via the pyarrow filesystem abstraction
  • pyarrow is an optional dependency: pip install 'pinecone[parquet]'
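The result object described in the bullets above might look roughly like this. This is a sketch only: the field names come from this PR's description, but the dataclass layout itself is an assumption, not the SDK's actual definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BulkImportValidationResult:
    """Sketch of the result shape described in this PR; the real class
    lives in the pinecone SDK and may differ in detail."""

    uri: str  # normalized URI, can be passed to index.bulk_import.start()
    is_valid: bool
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    files_checked: int = 0
    rows_sampled: int = 0


# A passing validation over 50 files would carry no errors:
ok = BulkImportValidationResult(
    uri="s3://my-bucket/vectors/",
    is_valid=True,
    files_checked=50,
    rows_sampled=5000,
)
```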

Usage

# Standalone (no index needed)
from pinecone import validate_bulk_import

result = validate_bulk_import("s3://my-bucket/vectors/")
if result.is_valid:
    index.bulk_import.start(uri=result.uri)
else:
    for error in result.errors:
        print(error)

# Via the index object
result = index.bulk_import.validate(
    "s3://my-bucket/vectors/",
    dimension=1024,
    sample_rows=0,   # schema-only, no data download
)

# Verbose progress across many files
result = index.bulk_import.validate("s3://my-bucket/vectors/", verbose=True)
# [  1/50] OK   s3://my-bucket/vectors/part-00000.parquet
# [  2/50] OK   s3://my-bucket/vectors/part-00001.parquet
# ...
# Total: 50  OK: 50  BAD: 0  rows sampled: 5000

Test plan

  • uv run pytest tests/unit/data/test_bulk_import_validator.py — 40 unit tests (schema validation, data sampling, end-to-end file I/O with real parquet files on disk)
  • Validated manually against a real S3 bucket with 50 sparse parquet files
  • uv run mypy pinecone passes clean

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new file/URI inspection logic using pyarrow (including remote filesystem access) and surfaces it via public SDK APIs; failures or edge cases could affect user workflows, but it is additive and optional.

Overview
Adds a client-side parquet compatibility validator for bulk import, exposed as index.bulk_import.validate(...) (sync and asyncio) and the top-level pinecone.validate_bulk_import(...) helper.

The validator inspects parquet schema (footer-only by default) and optionally samples rows to catch common import blockers (missing/typed columns, dimension mismatches, null/empty IDs, non-finite vector values, and metadata JSON/size issues), returning a new BulkImportValidationResult with errors/warnings and counts.
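The schema-level part of such a check reduces to comparing the footer's column names against an allowed set. The sketch below is a stand-in, not the PR's code: a plain dict stands in for the pyarrow schema, and the column names ("id", "values", "sparse_values", "metadata") are assumptions inferred from this PR's discussion.

```python
from typing import Dict, List

# Assumed column sets for Pinecone bulk-import parquet files; taken from
# this PR's discussion, not from official documentation.
REQUIRED = {"id"}
ALLOWED = {"id", "values", "sparse_values", "metadata"}


def check_schema(columns: Dict[str, str]) -> List[str]:
    """Footer-only schema check: validate top-level column names.

    `columns` maps column name -> type string, standing in for a
    pyarrow schema. Dimension mismatches cannot be caught here, since
    the footer records list<float> without a fixed length; that
    requires sampling rows.
    """
    errors = []
    for col in REQUIRED - columns.keys():
        errors.append(f"missing required column: {col}")
    for col in columns.keys() - ALLOWED:
        errors.append(f"unexpected column: {col}")
    return errors
```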

Introduces a new optional extra pinecone[parquet] (adds pyarrow) plus mypy overrides for pyarrow, and adds a comprehensive unit test suite covering schema, sampling, and end-to-end parquet files.

Reviewed by Cursor Bugbot for commit 322b522.

Adds `index.bulk_import.validate(uri)` and the top-level
`pinecone.validate_bulk_import(uri)` helper so users can check parquet
files for schema and data correctness before sending them to the server.

- Reads only the parquet footer (schema) by default — no vector data
  downloaded even for large remote files
- Optionally samples up to N rows to detect null IDs, non-finite values,
  metadata JSON errors, and the 40 KB metadata size limit
- Supports single files and directories via the pyarrow filesystem
  abstraction (s3://, gs://, az:// URIs work automatically)
- Returns BulkImportValidationResult with .is_valid, .errors, .warnings,
  .files_checked, .rows_sampled; the .uri field can be passed directly
  to index.bulk_import.start()
- Verbose mode prints per-file OK/BAD lines and a final summary
- pyarrow is an optional dependency: pip install 'pinecone[parquet]'
- 40 unit tests covering schema validation, data sampling, and
  end-to-end file I/O via real parquet files on disk
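The 40 KB metadata check mentioned above amounts to measuring the serialized JSON for each sampled row. A minimal sketch follows; the exact counting rule (UTF-8 bytes of the serialized string) is an assumption here, as is the error-message wording.

```python
import json

METADATA_LIMIT_BYTES = 40 * 1024  # 40 KB limit cited in the PR description


def metadata_errors(row_id: str, metadata_json: str) -> list:
    """Return per-row metadata problems: invalid JSON or oversized payload.

    Loosely mirrors the sampling checks described above; not the PR's
    actual implementation.
    """
    errors = []
    try:
        json.loads(metadata_json)
    except (TypeError, ValueError):
        errors.append(f"row {row_id}: metadata is not valid JSON")
        return errors  # size check is meaningless for unparseable input
    if len(metadata_json.encode("utf-8")) > METADATA_LIMIT_BYTES:
        errors.append(f"row {row_id}: metadata exceeds 40 KB")
    return errors
```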

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sparse_indices is a sub-field of the sparse_values struct, not a valid
top-level column. Listing it as allowed in the error message contradicted
the validation logic and would confuse users trying to fix their files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
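The fix above changes only the error message; the underlying rule, per this thread, is that sparse_indices lives inside the sparse_values struct and must never appear as a top-level column. A minimal stand-in for that check (column names assumed from the thread, not copied from the validator):

```python
def top_level_column_errors(columns):
    """Flag sparse_indices appearing as a top-level parquet column.

    Per the discussion above, sparse_indices is only valid as a
    sub-field of the sparse_values struct; this sketch checks the
    top-level names only.
    """
    errors = []
    if "sparse_indices" in columns:
        errors.append(
            "sparse_indices must be a sub-field of the sparse_values "
            "struct, not a top-level column"
        )
    return errors
```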

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit d125b19.

- Add `continue` after metadata size error to prevent double-reporting the same row
- Apply consistent error prefix (respects multi-file flag) to schema-read failures
- Remove quoted return-type annotations on validate() — class is imported at module level
- Add BulkImportValidationResult and validate_bulk_import to __init__.pyi __all__
- Use explicit re-export pattern (import X as X) in __init__.pyi to satisfy ruff F401
- Remove unused TYPE_CHECKING import of pyarrow.parquet in bulk_import_validator.py
- Remove unused imports and variables in test_bulk_import_validator.py (ruff F841/F401)
- Add mypy overrides for pyarrow optional dependency to silence import-not-found errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
