docs: SERIALIZATIONS.md — catalog the ~11 parquet files in flight by rdhyee · Pull Request #143 · isamplesorg/isamplesorg.github.io

rdhyee · 2026-04-24T14:34:35Z

Summary

Adds SERIALIZATIONS.md at the repo root: a cross-substrate catalog of the parquet files that together constitute the iSamples query substrate.
Organized by tier (source of truth / graph / derived aggregates / display / facet caches / source-specific variants), with a per-file detail section, DAG diagram, URL convention, and cross-links to query-spec.qmd, ZENODO_DEPOSITION_PLAN.md, and pqg/docs/PQG_SPECIFICATION.md.
All sizes, row counts, and schemas were verified by DuckDB DESCRIBE + COUNT(*) against https://data.isamples.org/ on 2026-04-24.
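The verification step described above can be sketched as a pair of SQL statements per file. This is a minimal illustration, not the script actually used for the PR: the helper name `verification_queries` is hypothetical, and the example URL assumes the file sits directly under the `https://data.isamples.org/` host, which the PR text does not confirm.

```python
def verification_queries(parquet_url: str) -> dict[str, str]:
    """Build the DESCRIBE and COUNT(*) statements for one remote parquet file."""
    # Double any single quotes so the URL is a safe SQL string literal.
    quoted = "'" + parquet_url.replace("'", "''") + "'"
    return {
        "schema": f"DESCRIBE SELECT * FROM read_parquet({quoted})",
        "row_count": f"SELECT COUNT(*) AS n_rows FROM read_parquet({quoted})",
    }


# Assumption: the host comes from the PR text; the path layout is a guess.
EXAMPLE_URL = "https://data.isamples.org/isamples_202601_wide.parquet"

queries = verification_queries(EXAMPLE_URL)
print(queries["schema"])
print(queries["row_count"])
```

Each string can then be handed to any DuckDB client, for example `duckdb.sql(...)` in the Python package, and the returned schema and count compared against the catalog entry.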

Why

Raymond has observed that ~10+ parquet serializations are in use across the web Explorer, the Python notebook, the progressive globe, the PQG conformance work, and various archival/caching tiers — but no single document catalogs them with role, upstream, and downstream consumers. This fills that gap as a top-level cross-cutting reference (not a tutorial).

Notable findings during verification

Narrow row count: 101.4 M on R2, not the 106 M referenced in the monorepo CLAUDE.md and ZENODO_DEPOSITION_PLAN. (Used the verified number.)
Lite / facets_v2 row count: both are 5.98 M (not 6.7 M); they are already filtered to MaterialSampleRecord-with-coordinates.
facet_summaries has 4 columns (facet_type, facet_value, scheme, count) and 56 rows; the scheme column isn't in the how-to-use description.
h3_summary_* has 7 columns including a resolution column, not the 6 in the how-to-use description.
No OpenContext sidecar file exists yet — thumbnails are currently merged into isamples_202604_wide.parquet via scripts/enrich_wide_with_oc_thumbnails.py. The sidecar pattern is noted as planned.
OC variants (oc_isamples_pqg.parquet, oc_isamples_pqg_wide.parquet) live on GCS, not data.isamples.org — catalogued as a separate tier.

Test plan

Review the DAG diagram for accuracy vs. how you actually build these files
Confirm narrow-label mismatch framing (202512 narrow vs 202601 wide) matches the plan
Spot-check 2-3 DuckDB recipes against live data
Confirm SERIALIZATIONS.md at repo root is the right home (vs. docs/ or a section of query-spec.qmd)
Reconcile any remaining row-count / column-count drift with how-to-use.qmd once this merges

🤖 Generated with Claude Code

Document the ~11 parquet files in flight across the iSamples query substrate: source of truth (Zenodo export), graph (narrow), entity (wide), aggregates (H3 res4/6/8, wide_h3), display projections (lite), facet caches (summaries, cross-filter, facets_v2), and OpenContext source-specific variants.

For each: role, upstream, consumers, size, row count (verified against data.isamples.org on 2026-04-24), and headline schema.

Cross-links to query-spec.qmd (dimension bindings), ZENODO_DEPOSITION_PLAN.md (archival scope), and pqg/docs/PQG_SPECIFICATION.md (format semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rdhyee · 2026-04-24T14:40:39Z

Coverage gap flagged (from reconciliation with Codex's parallel draft)

A parallel draft covered an additional tier of serializations this PR doesn't mention. Consider adding a small commit before merge:

Raw export tier upstream of the narrow: JSONL / CSV / GeoParquet outputs from export_client/isample export -f {jsonl,csv,geoparquet} + the STAC stac.json / manifest.json sidecars the client emits.
Solr indexed documents as a binding target (not a serialization), per QUERY_SPEC §5.3 — worth naming so readers understand the query dimensions have a Solr mapping even though the portable archive is parquet.
CSV twins for H3 / lite (excluded from the Zenodo deposition but present on R2, ~640 MB aggregate) — worth one row noting "convenience only, not authoritative."

None of these change the primary catalog; they extend the "source" and "legacy binding" edges of the DAG. The additions total about 10 lines.

Addresses the review comment on PR isamplesorg#143 flagging coverage gaps found during the parallel-draft reconciliation:

- **Alternative export formats tier**: documents the JSONL/CSV flavors emitted by `isamplesorg/export_client` alongside the GeoParquet that lands in Zenodo. Includes the `stac.json` / `manifest.json` sidecars the client writes next to local exports.
- **Legacy bindings and convenience copies tier**: adds a row for the Solr indexed documents (legacy binding, not a serialization; QUERY_SPEC §5.3 keeps this precedent alive even though iSamples Central is offline) and a row for the ~640 MB of CSV twins for H3 + lite files on R2 (convenience only, excluded from the Zenodo deposition).

No changes to the existing catalog content; purely additive so the original PR reviewer's flow is preserved.

…NS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`. They don't; they ship `h3_cell` (UBIGINT) + `resolution` (INTEGER) and filter by resolution. Corrected the callout and the §5.1 DuckDB binding row to show the actual form (`h3_cell IN (...) AND resolution = 6`).
2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is `isamplesorg#143`. Fixed.
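The corrected binding form (`h3_cell IN (...) AND resolution = 6`) can be generated rather than hand-written. The sketch below is an illustration only: the helper name `h3_summary_filter` is hypothetical, the cell IDs are placeholders rather than real H3 indexes, and the allowed resolutions are taken from the `h3_summary_res{4,6,8}` file names mentioned in this PR.

```python
ALLOWED_RESOLUTIONS = (4, 6, 8)  # from the h3_summary_res{4,6,8} file names


def h3_summary_filter(cells: list[int], resolution: int) -> str:
    """Build a WHERE-clause fragment for an h3_summary_* file."""
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
    if not cells:
        raise ValueError("at least one H3 cell is required")
    # h3_cell is UBIGINT, so cells are emitted as bare integer literals.
    cell_list = ", ".join(str(int(c)) for c in cells)
    return f"h3_cell IN ({cell_list}) AND resolution = {resolution}"


# Placeholder cell IDs for illustration; not real H3 indexes.
print(h3_summary_filter([1001, 1002], 6))
```

Rejecting resolutions outside 4/6/8 up front turns a silent empty result into an explicit error, which may be preferable when the caller's zoom-to-resolution mapping is still in flux.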

…arrays

Codex review: live DESCRIBE shows material/context/object_type on sample_facets_v2 are VARCHAR scalars, not VARCHAR[] arrays. Previous catalog text described them as "string arrays of URIs" and the example query used `ANY(material)`, which fails against scalar columns.

Corrected:

- §3 catalog row now reads "VARCHAR scalars; each facet column is a single URI per sample (not an array)"
- §4.8 per-file detail corrected + query pattern updated to `material = '<uri>'` or `material ILIKE '%substring%'`
- Noted that samples tagged with multiple material URIs are represented by a single chosen URI at this grain; for multi-material accuracy readers should JOIN back to wide.p__has_material_category

No column shape change; just a documentation fix to match the live file.
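The two scalar query patterns named in the fix (`material = '<uri>'` and `material ILIKE '%substring%'`) can be sketched as a small predicate builder. The function name `material_predicate` and the example URI are hypothetical and used only for illustration.

```python
def material_predicate(value: str, exact: bool = True) -> str:
    """Build a predicate for the scalar VARCHAR `material` column."""
    # Double single quotes so the value is a safe SQL string literal.
    literal = value.replace("'", "''")
    if exact:
        # Exact match against the single URI stored per sample.
        return f"material = '{literal}'"
    # Case-insensitive substring match.
    return f"material ILIKE '%{literal}%'"


# Hypothetical URI, for illustration only.
print(material_predicate("https://example.org/material/rock"))
print(material_predicate("rock", exact=False))
```

Neither form uses `ANY(...)`, which only applies to list-typed columns and is what made the earlier example fail.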

Codex round-2 review caught two claims on §4.3 (`isamples_202601_wide.parquet`):

1. **DuckDB example fails to execute**. The query `SELECT source, COUNT(*) FROM wide ...` references a `source` column that doesn't exist; wide uses PQG's `n` for the source dimension. Corrected to `SELECT n AS source, COUNT(*) ... GROUP BY n`. Verified: returns SESAR=4.69M, OPENCONTEXT=1.06M, GEOME=606K, SMITHSONIAN=322K.
2. **"each an INT32[]"** understates the actual mixed types. Live DESCRIBE shows some p__* columns are `INTEGER[]` (e.g. p__produced_by, p__sample_location, p__sampling_site, p__site_location, p__registrant, p__curation) and others are `BIGINT[]` (p__has_material_category, p__has_context_category, p__has_sample_object_type, p__keywords, p__responsibility, p__related_resource). Softened to "integer array" and listed the exact types.

Also added a "Column name gotcha" bullet flagging the wide/narrow `n` vs lite/facets `source` column-name split, so readers know to alias when moving between files.
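One way to avoid tripping on the `n` vs `source` split is to look the column up by file kind before building the query. This is a sketch under stated assumptions: the mapping follows the column names reported in this PR, while the helper name `count_by_source` and the idea that `table` is an already-registered view name are hypothetical.

```python
# Source-dimension column per file kind, as reported in this PR.
SOURCE_COLUMN = {
    "wide": "n",
    "narrow": "n",
    "lite": "source",
    "sample_facets_v2": "source",
}


def count_by_source(table: str, kind: str) -> str:
    """Build a per-source count query that always exposes a `source` column."""
    col = SOURCE_COLUMN[kind]
    # Alias only when the physical column is not already called `source`.
    select = "source" if col == "source" else f"{col} AS source"
    return f"SELECT {select}, COUNT(*) AS n_samples FROM {table} GROUP BY {col}"


print(count_by_source("wide", "wide"))
print(count_by_source("lite", "lite"))
```

Because both queries return a column named `source`, downstream code can treat results from wide and lite files uniformly.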

…rix) (#145)

* Add QUERY_SPEC.md v0.1 (draft)

Substrate-neutral query contract spanning DuckDB-WASM (web), DuckDB/Ibis (Python), and Apache Solr (legacy). Names mirror the Solr schema vocabulary (authoritative precedent) with substrate-specific aliases provided in §5.

Scope:

- Canonical facet / filter dimensions (§2)
- Abstract filter grammar (§3)
- Full-text search semantics (§3.2, the 16-field Solr searchText target)
- Sample-card projection (§4.2)
- Substrate binding tables (§5)
- Open questions for v0.2 (§7)

Out of scope: PQG graph traversal (see QUERY_COMPARISON.md), bulk export, ingestion.

Refs isamplesorg.github.io#138.

* Apply QUERY_SPEC v0.2 amendments from PQG conformance matrix

Amendments informed by isamplesorg/pqg#22 (conformance_matrix.md §4-§5), which audited which shipped parquet files actually carry which spec dimensions:

1. Rename `specimen` → `objectType` (§2.2). Every shipped parquet uses `object_type` / `hasSampleObjectType`; adopt the data-side name as canonical, keep `hasSpecimenCategory` as Solr alias.
2. Drop ghosts: `informalClassification` (§2.2) and `resultTimeRange` (§2.3); both were in Solr but never migrated to any parquet. Also drop `time_range OVERLAPS` from §3.1 grammar and §5.3 Solr binding.
3. Add `thumbnailURL` to §2.1 as optional (ships in `wide` today for OpenContext only; moving to per-source sidecars, issue #131).
4. Update §5.1 `time BETWEEN` binding from "TBD" to real DuckDB cast: `TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2`. `result_time` IS in lite (as VARCHAR).
5. Document H3 column availability in §2.4: `wide_h3` and `h3_summary_res{4,6,8}` carry res 4/6/8; `lite` has res 8 only; plain `wide` / `narrow` carry no H3 columns.
6. Pick `tmodified` (INTEGER epoch) over `last_modified_time` (VARCHAR) for `sourceUpdatedTime` in §2.1; alias the VARCHAR as deprecated.
7. Bump version callout to v0.2.
8. §7 open questions: close Q2 (time filter in lite, now resolved); reframe Q1 around the new `objectType` naming.
9. Appendix B: reference conformance_matrix.md and SERIALIZATIONS.md (pqg#143) as companion documents.

Refs isamplesorg/pqg#22, isamplesorg.github.io#138.

* fix(query-spec): Codex review - h3_summary column names, SERIALIZATIONS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`. They don't; they ship `h3_cell` (UBIGINT) + `resolution` (INTEGER) and filter by resolution. Corrected the callout and the §5.1 DuckDB binding row to show the actual form (`h3_cell IN (...) AND resolution = 6`).
2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is `#143`. Fixed.

* fix(query-spec): source dimension column is 'n' on wide/narrow

Codex round-2: §5.1 DuckDB binding claimed `source IN (…)` binds to `source IN (…) on wide / lite parquet`. Wrong for wide: it uses `n` (PQG convention), not `source`. The query as written fails with "Referenced column source not found". Updated the binding row to distinguish:

- wide / narrow: `WHERE n IN (…)`
- lite / sample_facets_v2: `WHERE source IN (…)` (alias already exposed)
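The `time BETWEEN` binding from amendment 4 can be illustrated with a small builder that renders both bounds as timestamp literals. This is a sketch, not the spec's reference implementation: the function name `time_between_predicate` is hypothetical, and the only detail taken from the text is that `result_time` is a VARCHAR that needs `TRY_CAST(... AS TIMESTAMP)`.

```python
from datetime import datetime


def time_between_predicate(t1: datetime, t2: datetime) -> str:
    """Build the DuckDB predicate for filtering lite rows by result_time."""
    if t1 > t2:
        raise ValueError("t1 must not be later than t2")
    # Render as 'YYYY-MM-DD HH:MM:SS' so each bound is a TIMESTAMP literal.
    lo = t1.isoformat(sep=" ", timespec="seconds")
    hi = t2.isoformat(sep=" ", timespec="seconds")
    # TRY_CAST yields NULL for unparseable strings, so those rows are
    # filtered out instead of aborting the query.
    return (
        "TRY_CAST(result_time AS TIMESTAMP) "
        f"BETWEEN TIMESTAMP '{lo}' AND TIMESTAMP '{hi}'"
    )


print(time_between_predicate(datetime(2020, 1, 1), datetime(2020, 12, 31, 23, 59, 59)))
```

Using `TRY_CAST` rather than `CAST` matters here because the column is free-text VARCHAR: a single malformed value would otherwise fail the whole scan.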

A public-facing companion to SERIALIZATIONS.md (PR #143). Where the catalog is internal reference ("every file with role, size, upstream, consumers"), this page is the researcher/developer landing:

- Quick-pick table mapping "if you want to do X → use file Y"
- Five copy-pasteable DuckDB snippets (every one executed clean against live R2 URLs during authoring)
- H3 tier breakpoint reference for map authors
- Cross-links to SERIALIZATIONS, QUERY_SPEC, PQG spec, conformance matrix
- Data-source + licensing paragraph pointing to the Zenodo community (without speculating on specific license terms)

Lands at the site root alongside pubs.qmd and query-spec.qmd.

Note on column naming in snippets: the wide parquet uses `n` for the source column (PQG convention); lite and sample_facets_v2 use the friendlier alias `source`. Flagged inline in the snippet comment so Binder/Colab first-timers don't trip on it.

Verified on 2026-04-24: all 6 snippets (incl. the callout quick-start) execute against data.isamples.org, returning non-empty results.

rdhyee mentioned this pull request Apr 24, 2026

Backfill project board 7 with active QUERY_SPEC, Zenodo, H3, samples, and sidecar threads #141

Open

rdhyee mentioned this pull request Apr 24, 2026

docs(pubs): expand GitHub Repositories with pipeline diagram + name reconciliation note #144

Merged

rdhyee mentioned this pull request Apr 24, 2026

docs: add user-facing data catalog page (data.qmd) #146

Merged

rdhyee added 2 commits April 24, 2026 08:32

rdhyee mentioned this pull request Apr 24, 2026

QUERY_SPEC.md v0.1 + v0.2 amendments (informed by PQG conformance matrix) #145

Merged

4 tasks

rdhyee merged commit 1932cd8 into isamplesorg:main Apr 24, 2026
1 check passed

rdhyee merged 4 commits into isamplesorg:main from rdhyee:docs/serializations-catalog
