feat(bulk-import): add client-side parquet validation #638
Open
tim-pinecone wants to merge 3 commits into pinecone-io:main from
Conversation
Adds `index.bulk_import.validate(uri)` and the top-level `pinecone.validate_bulk_import(uri)` helper so users can check parquet files for schema and data correctness before sending them to the server.

- Reads only the parquet footer (schema) by default — no vector data is downloaded, even for large remote files
- Optionally samples up to N rows to detect null IDs, non-finite values, metadata JSON errors, and the 40 KB metadata size limit
- Supports single files and directories via the pyarrow filesystem abstraction (`s3://`, `gs://`, and `az://` URIs work automatically)
- Returns `BulkImportValidationResult` with `.is_valid`, `.errors`, `.warnings`, `.files_checked`, `.rows_sampled`; the `.uri` field can be passed directly to `index.bulk_import.start()`
- Verbose mode prints per-file OK/BAD lines and a final summary
- pyarrow is an optional dependency: `pip install 'pinecone[parquet]'`
- 40 unit tests covering schema validation, data sampling, and end-to-end file I/O via real parquet files on disk

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
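The per-row sampling checks the commit describes can be sketched in plain Python. This is a minimal illustration, not the SDK's implementation; the exact limit constant (40 * 1024 bytes) and the error strings are assumptions based on the PR text:

```python
import json
import math

# Assumed constant: the PR cites a 40 KB metadata cap; the exact byte
# count (40 * 1024) and the error wording below are illustrative.
MAX_METADATA_BYTES = 40 * 1024


def check_row(row_id, values, metadata_json):
    """Return error strings for one sampled row (sketch, not the SDK's code)."""
    errors = []
    if row_id is None or row_id == "":
        errors.append("null or empty id")
    if any(not math.isfinite(v) for v in values):
        errors.append("non-finite vector value")
    if metadata_json is not None:
        try:
            json.loads(metadata_json)
        except (TypeError, ValueError):
            errors.append("metadata is not valid JSON")
        else:
            # Size is checked on the UTF-8 encoded JSON string
            if len(metadata_json.encode("utf-8")) > MAX_METADATA_BYTES:
                errors.append("metadata exceeds 40 KB limit")
    return errors


print(check_row("vec-1", [0.1, 0.2], '{"genre": "doc"}'))  # []
print(check_row("", [float("nan")], None))
```

Reporting at most one metadata error per row (the `continue` fix in the third commit) falls out naturally here from the `try/except/else` structure.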
…message

`sparse_indices` is a sub-field of the `sparse_values` struct, not a valid top-level column. Listing it as allowed in the error message contradicted the validation logic and would confuse users trying to fix their files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit d125b19.
- Add `continue` after metadata size error to prevent double-reporting the same row
- Apply consistent error prefix (respects multi-file flag) to schema-read failures
- Remove quoted return-type annotations on `validate()` — class is imported at module level
- Add `BulkImportValidationResult` and `validate_bulk_import` to `__init__.pyi` `__all__`
- Use explicit re-export pattern (`import X as X`) in `__init__.pyi` to satisfy ruff F401
- Remove unused TYPE_CHECKING import of `pyarrow.parquet` in bulk_import_validator.py
- Remove unused imports and variables in test_bulk_import_validator.py (ruff F841/F401)
- Add mypy overrides for the pyarrow optional dependency to silence import-not-found errors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
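The explicit re-export pattern mentioned above is how stub files mark names as intentional public exports so ruff's F401 (unused import) check stays quiet. A hypothetical `__init__.pyi` fragment — the module path is illustrative, not the SDK's actual layout:

```python
# __init__.pyi (fragment). `import X as X` marks X as a deliberate
# re-export, so ruff does not flag it as an unused import (F401).
from pinecone.db_data.bulk_import_validator import (
    BulkImportValidationResult as BulkImportValidationResult,
    validate_bulk_import as validate_bulk_import,
)

__all__ = [
    "BulkImportValidationResult",
    "validate_bulk_import",
]
```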

Summary
- `index.bulk_import.validate(uri)` and the standalone `pinecone.validate_bulk_import(uri)` check parquet files for Pinecone bulk import compatibility before sending them to the server
- Row sampling (`sample_rows=100` by default) checks for null IDs, non-finite vector values, metadata JSON validity, and the 40 KB metadata size limit
- Returns `BulkImportValidationResult` with `.is_valid`, `.errors`, `.warnings`, `.files_checked`, `.rows_sampled`; the `.uri` field can be passed directly to `index.bulk_import.start()`
- Supports single `.parquet` files and directory/prefix URIs; handles `s3://`, `gs://`, and `az://` via the pyarrow filesystem abstraction
- `pyarrow` is an optional dependency: `pip install 'pinecone[parquet]'`

Usage
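A sketch of the intended call pattern, assembled from the API names this PR describes; the bucket URI, index name, and API key are placeholders, and the exact signatures are not verified against the final implementation:

```python
# Assumes `pip install 'pinecone[parquet]'`; URI and index name are placeholders.
from pinecone import Pinecone, validate_bulk_import

result = validate_bulk_import("s3://my-bucket/vectors/", sample_rows=100)

if result.is_valid:
    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("my-index")
    # The validated .uri can be passed straight to the server-side import
    index.bulk_import.start(uri=result.uri)
else:
    for err in result.errors:
        print(err)
```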
Test plan
- `uv run pytest tests/unit/data/test_bulk_import_validator.py` — 40 unit tests (schema validation, data sampling, end-to-end file I/O with real parquet files on disk)
- `uv run mypy pinecone` passes clean

🤖 Generated with Claude Code
Note
Medium Risk
Introduces new file/URI inspection logic using pyarrow (including remote filesystem access) and surfaces it via public SDK APIs; failures or edge cases could affect user workflows, but it is additive and optional.

Overview

Adds a client-side parquet compatibility validator for bulk import, exposed as `index.bulk_import.validate(...)` (sync and asyncio) and the top-level `pinecone.validate_bulk_import(...)` helper.

The validator inspects the parquet schema (footer-only by default) and optionally samples rows to catch common import blockers (missing/mistyped columns, dimension mismatches, null/empty IDs, non-finite vector values, and metadata JSON/size issues), returning a new `BulkImportValidationResult` with errors/warnings and counts.

Introduces a new optional extra `pinecone[parquet]` (adds `pyarrow`) plus mypy overrides for pyarrow, and adds a comprehensive unit test suite covering schema, sampling, and end-to-end parquet files.

Reviewed by Cursor Bugbot for commit 322b522. Bugbot is set up for automated code reviews on this repo.
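Footer-only validation amounts to comparing the file's column names (and types) against the expected bulk-import schema. Below is a stdlib-only sketch of the name comparison; the required/optional column sets follow Pinecone's documented parquet format but should be treated as assumptions:

```python
# Assumed column sets, based on Pinecone's documented bulk-import
# parquet format (id, values, optional sparse_values and metadata).
REQUIRED_COLUMNS = {"id", "values"}
OPTIONAL_COLUMNS = {"sparse_values", "metadata"}


def check_schema(column_names):
    """Return error strings for a parquet file's top-level columns (sketch)."""
    present = set(column_names)
    errors = []
    for col in sorted(REQUIRED_COLUMNS - present):
        errors.append(f"missing required column: {col}")
    for col in sorted(present - REQUIRED_COLUMNS - OPTIONAL_COLUMNS):
        # e.g. sparse_indices belongs inside the sparse_values struct,
        # not at the top level (the subject of the second commit above)
        errors.append(f"unexpected top-level column: {col}")
    return errors


print(check_schema(["id", "values", "metadata"]))  # []
print(check_schema(["id", "sparse_indices"]))
```

With pyarrow installed, the column names can come from `pyarrow.parquet.ParquetFile(path).schema_arrow.names`, which reads only the file footer — no row data is fetched, matching the "footer-only by default" behavior described above.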