
feat(bulk-import): add client-side parquet validation #638

Open
tim-pinecone wants to merge 3 commits into pinecone-io:main from tim-pinecone:feat/bulk-import-parquet-validation

Conversation


tim-pinecone commented Apr 22, 2026

Summary

  • Adds index.bulk_import.validate(uri) and the standalone pinecone.validate_bulk_import(uri) to check parquet files for Pinecone bulk import compatibility before sending them to the server
  • Schema validation reads only the parquet file footer (no data downloaded) — cheap even for large remote files
  • Optional data sampling (sample_rows=100 by default) checks for null IDs, non-finite vector values, metadata JSON validity, and the 40 KB metadata size limit
  • Returns BulkImportValidationResult with .is_valid, .errors, .warnings, .files_checked, .rows_sampled; the .uri field can be passed directly to index.bulk_import.start()
  • Verbose mode prints per-file OK/BAD progress and a final summary
  • Supports single .parquet files and directory/prefix URIs; handles s3://, gs://, and az:// via the pyarrow filesystem abstraction
  • pyarrow is an optional dependency: pip install 'pinecone[parquet]'
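The result object described in the bullets above might look roughly like this. This is a sketch only: the field names come from this PR's description, but the dataclass layout itself is an assumption, not the SDK's actual definition.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BulkImportValidationResult:
    """Sketch of the result shape described in this PR; the real class
    lives in the pinecone SDK and may differ in detail."""

    uri: str  # normalized URI, can be passed to index.bulk_import.start()
    is_valid: bool
    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    files_checked: int = 0
    rows_sampled: int = 0


# A passing validation over 50 files would carry no errors:
ok = BulkImportValidationResult(
    uri="s3://my-bucket/vectors/",
    is_valid=True,
    files_checked=50,
    rows_sampled=5000,
)
```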

Usage

# Standalone (no index needed)
from pinecone import validate_bulk_import

result = validate_bulk_import("s3://my-bucket/vectors/")
if result.is_valid:
    index.bulk_import.start(uri=result.uri)
else:
    for error in result.errors:
        print(error)

# Via the index object
result = index.bulk_import.validate(
    "s3://my-bucket/vectors/",
    dimension=1024,
    sample_rows=0,   # schema-only, no data download
)

# Verbose progress across many files
result = index.bulk_import.validate("s3://my-bucket/vectors/", verbose=True)
# [  1/50] OK   s3://my-bucket/vectors/part-00000.parquet
# [  2/50] OK   s3://my-bucket/vectors/part-00001.parquet
# ...
# Total: 50  OK: 50  BAD: 0  rows sampled: 5000

Test plan

  • uv run pytest tests/unit/data/test_bulk_import_validator.py — 40 unit tests (schema validation, data sampling, end-to-end file I/O with real parquet files on disk)
  • Validated manually against a real S3 bucket with 50 sparse parquet files
  • uv run mypy pinecone passes clean

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new file/URI inspection logic using pyarrow (including remote filesystem access) and surfaces it via public SDK APIs; failures or edge cases could affect user workflows, but it is additive and optional.

Overview
Adds a client-side parquet compatibility validator for bulk import, exposed as index.bulk_import.validate(...) (sync and asyncio) and the top-level pinecone.validate_bulk_import(...) helper.

The validator inspects parquet schema (footer-only by default) and optionally samples rows to catch common import blockers (missing/typed columns, dimension mismatches, null/empty IDs, non-finite vector values, and metadata JSON/size issues), returning a new BulkImportValidationResult with errors/warnings and counts.
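The schema-level part of such a check reduces to comparing the footer's column names against an allowed set. The sketch below is a stand-in, not the PR's code: a plain dict stands in for the pyarrow schema, and the column names ("id", "values", "sparse_values", "metadata") are assumptions inferred from this PR's discussion.

```python
from typing import Dict, List

# Assumed column sets for Pinecone bulk-import parquet files; taken from
# this PR's discussion, not from official documentation.
REQUIRED = {"id"}
ALLOWED = {"id", "values", "sparse_values", "metadata"}


def check_schema(columns: Dict[str, str]) -> List[str]:
    """Footer-only schema check: validate top-level column names.

    `columns` maps column name -> type string, standing in for a
    pyarrow schema. Dimension mismatches cannot be caught here, since
    the footer records list<float> without a fixed length; that
    requires sampling rows.
    """
    errors = []
    for col in REQUIRED - columns.keys():
        errors.append(f"missing required column: {col}")
    for col in columns.keys() - ALLOWED:
        errors.append(f"unexpected column: {col}")
    return errors
```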

Introduces a new optional extra pinecone[parquet] (adds pyarrow) plus mypy overrides for pyarrow, and adds a comprehensive unit test suite covering schema, sampling, and end-to-end parquet files.

Reviewed by Cursor Bugbot for commit 322b522.

Adds `index.bulk_import.validate(uri)` and the top-level
`pinecone.validate_bulk_import(uri)` helper so users can check parquet
files for schema and data correctness before sending them to the server.

- Reads only the parquet footer (schema) by default — no vector data
  downloaded even for large remote files
- Optionally samples up to N rows to detect null IDs, non-finite values,
  metadata JSON errors, and the 40 KB metadata size limit
- Supports single files and directories via the pyarrow filesystem
  abstraction (s3://, gs://, az:// URIs work automatically)
- Returns BulkImportValidationResult with .is_valid, .errors, .warnings,
  .files_checked, .rows_sampled; the .uri field can be passed directly
  to index.bulk_import.start()
- Verbose mode prints per-file OK/BAD lines and a final summary
- pyarrow is an optional dependency: pip install 'pinecone[parquet]'
- 40 unit tests covering schema validation, data sampling, and
  end-to-end file I/O via real parquet files on disk
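The 40 KB metadata check mentioned above amounts to measuring the serialized JSON for each sampled row. A minimal sketch follows; the exact counting rule (UTF-8 bytes of the serialized string) is an assumption here, as is the error-message wording.

```python
import json

METADATA_LIMIT_BYTES = 40 * 1024  # 40 KB limit cited in the PR description


def metadata_errors(row_id: str, metadata_json: str) -> list:
    """Return per-row metadata problems: invalid JSON or oversized payload.

    Loosely mirrors the sampling checks described above; not the PR's
    actual implementation.
    """
    errors = []
    try:
        json.loads(metadata_json)
    except (TypeError, ValueError):
        errors.append(f"row {row_id}: metadata is not valid JSON")
        return errors  # size check is meaningless for unparseable input
    if len(metadata_json.encode("utf-8")) > METADATA_LIMIT_BYTES:
        errors.append(f"row {row_id}: metadata exceeds 40 KB")
    return errors
```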

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sparse_indices is a sub-field of the sparse_values struct, not a valid
top-level column. Listing it as allowed in the error message contradicted
the validation logic and would confuse users trying to fix their files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
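The fix above changes only the error message; the underlying rule, per this thread, is that sparse_indices lives inside the sparse_values struct and must never appear as a top-level column. A minimal stand-in for that check (column names assumed from the thread, not copied from the validator):

```python
def top_level_column_errors(columns):
    """Flag sparse_indices appearing as a top-level parquet column.

    Per the discussion above, sparse_indices is only valid as a
    sub-field of the sparse_values struct; this sketch checks the
    top-level names only.
    """
    errors = []
    if "sparse_indices" in columns:
        errors.append(
            "sparse_indices must be a sub-field of the sparse_values "
            "struct, not a top-level column"
        )
    return errors
```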

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit d125b19.

- Add `continue` after metadata size error to prevent double-reporting the same row
- Apply consistent error prefix (respects multi-file flag) to schema-read failures
- Remove quoted return-type annotations on validate() — class is imported at module level
- Add BulkImportValidationResult and validate_bulk_import to __init__.pyi __all__
- Use explicit re-export pattern (import X as X) in __init__.pyi to satisfy ruff F401
- Remove unused TYPE_CHECKING import of pyarrow.parquet in bulk_import_validator.py
- Remove unused imports and variables in test_bulk_import_validator.py (ruff F841/F401)
- Add mypy overrides for pyarrow optional dependency to silence import-not-found errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
