docs: SERIALIZATIONS.md — catalog the ~11 parquet files in flight #143

Merged
rdhyee merged 4 commits into isamplesorg:main from rdhyee:docs/serializations-catalog
Apr 24, 2026

@rdhyee rdhyee commented Apr 24, 2026

Summary

  • Adds SERIALIZATIONS.md at the repo root: a cross-substrate catalog of the parquet files that together constitute the iSamples query substrate.
  • Organized by tier (source of truth / graph / derived aggregates / display / facet caches / source-specific variants), with a per-file detail section, DAG diagram, URL convention, and cross-links to query-spec.qmd, ZENODO_DEPOSITION_PLAN.md, and pqg/docs/PQG_SPECIFICATION.md.
  • All sizes, row counts, and schemas were verified by DuckDB DESCRIBE + COUNT(*) against https://data.isamples.org/ on 2026-04-24.
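The verification step above can be sketched as a small helper that builds the two DuckDB statements (`DESCRIBE` and `COUNT(*)`) for one file. This is a minimal sketch, not the script used for the PR: the helper name is hypothetical, and it assumes each file sits directly under `https://data.isamples.org/`, which the catalog's URL convention may refine.

```python
# Base URL taken from the PR text; the flat path layout under it is an assumption.
BASE_URL = "https://data.isamples.org/"


def verification_queries(filename, base_url=BASE_URL):
    """Return DuckDB SQL strings that check a remote parquet file's schema and row count."""
    url = base_url.rstrip("/") + "/" + filename.lstrip("/")
    # Escape single quotes so the URL is a valid SQL string literal.
    quoted = "'" + url.replace("'", "''") + "'"
    relation = f"read_parquet({quoted})"
    return {
        "schema": f"DESCRIBE SELECT * FROM {relation}",
        "row_count": f"SELECT COUNT(*) AS n_rows FROM {relation}",
    }


queries = verification_queries("isamples_202601_wide.parquet")
for name, sql in queries.items():
    print(f"-- {name}")
    print(sql + ";")
```

The printed statements can be pasted into a DuckDB shell; reading over HTTPS relies on DuckDB's `httpfs` extension being available.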

Why

Raymond has observed that ~10+ parquet serializations are in use across the web Explorer, the Python notebook, the progressive globe, the PQG conformance work, and various archival/caching tiers — but no single document catalogs them with role, upstream, and downstream consumers. This fills that gap as a top-level cross-cutting reference (not a tutorial).

Notable findings during verification

  • Narrow row count: 101.4 M on R2, not the 106 M referenced in the monorepo CLAUDE.md and ZENODO_DEPOSITION_PLAN. (Used the verified number.)
  • Lite / facets_v2 row count: both are 5.98 M (not 6.7 M); they are already filtered to MaterialSampleRecord-with-coordinates.
  • facet_summaries has 4 columns (facet_type, facet_value, scheme, count) and 56 rows; the scheme column isn't in the how-to-use description.
  • h3_summary_* has 7 columns including a resolution column, not the 6 in the how-to-use description.
  • No OpenContext sidecar file exists yet — thumbnails are currently merged into isamples_202604_wide.parquet via scripts/enrich_wide_with_oc_thumbnails.py. The sidecar pattern is noted as planned.
  • OC variants (oc_isamples_pqg.parquet, oc_isamples_pqg_wide.parquet) live on GCS, not data.isamples.org — catalogued as a separate tier.

Test plan

  • Review the DAG diagram for accuracy vs. how you actually build these files
  • Confirm narrow-label mismatch framing (202512 narrow vs 202601 wide) matches the plan
  • Spot-check 2-3 DuckDB recipes against live data
  • Confirm SERIALIZATIONS.md at repo root is the right home (vs. docs/ or a section of query-spec.qmd)
  • Reconcile any remaining row-count / column-count drift with how-to-use.qmd once this merges

🤖 Generated with Claude Code

Document the ~11 parquet files in flight across the iSamples query
substrate: source of truth (Zenodo export), graph (narrow), entity
(wide), aggregates (H3 res4/6/8, wide_h3), display projections (lite),
facet caches (summaries, cross-filter, facets_v2), and OpenContext
source-specific variants. For each: role, upstream, consumers, size,
row count (verified against data.isamples.org on 2026-04-24), and
headline schema.

Cross-links to query-spec.qmd (dimension bindings),
ZENODO_DEPOSITION_PLAN.md (archival scope), and
pqg/docs/PQG_SPECIFICATION.md (format semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rdhyee commented Apr 24, 2026

Coverage gap flagged (from reconciliation with Codex's parallel draft)

A parallel draft covered an additional tier of serializations this PR doesn't mention. Consider adding a small commit before merge:

  • Raw export tier upstream of the narrow: JSONL / CSV / GeoParquet outputs from export_client/isample export -f {jsonl,csv,geoparquet} + the STAC stac.json / manifest.json sidecars the client emits.
  • Solr indexed documents as a binding target (not a serialization), per QUERY_SPEC §5.3 — worth naming so readers understand the query dimensions have a Solr mapping even though the portable archive is parquet.
  • CSV twins for H3 / lite (excluded from the Zenodo deposition but present on R2, ~640 MB aggregate) — worth one row noting "convenience only, not authoritative."

None of these change the primary catalog; they extend the "source" and "legacy binding" edges of the DAG. ~10 lines of additions.

Addresses the review comment on PR isamplesorg#143 flagging coverage gaps found
during the parallel-draft reconciliation:

- **Alternative export formats tier** — documents the JSONL/CSV flavors
  emitted by `isamplesorg/export_client` alongside the GeoParquet that
  lands in Zenodo. Includes the `stac.json` / `manifest.json` sidecars
  the client writes next to local exports.
- **Legacy bindings and convenience copies tier** — adds a row for the
  Solr indexed documents (legacy binding, not a serialization — QUERY_SPEC
  §5.3 keeps this precedent alive even though iSamples Central is offline)
  and a row for the ~640 MB of CSV twins for H3 + lite files on R2
  (convenience only, excluded from the Zenodo deposition).

No changes to the existing catalog content — purely additive so the
original PR reviewer's flow is preserved.
rdhyee added a commit to rdhyee/isamplesorg.github.io that referenced this pull request Apr 24, 2026
…NS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text
   said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`.
   They don't — they ship `h3_cell` (UBIGINT) + `resolution` (INTEGER)
   and filter by resolution. Corrected the callout and the §5.1
   DuckDB binding row to show the actual form
   (`h3_cell IN (...) AND resolution = 6`).

2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference
   pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is
   `isamplesorg#143`. Fixed.
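
The corrected binding form (`h3_cell IN (...) AND resolution = 6`) could be generated as follows. This is an illustrative sketch only: the function name is hypothetical, and the cell values in the example are arbitrary placeholder integers, not real H3 cell IDs.

```python
# Resolutions shipped by the summary tier, per the PR text (res 4, 6, 8).
SUMMARY_RESOLUTIONS = (4, 6, 8)


def h3_summary_filter(cells, resolution):
    """Build a WHERE-clause fragment for the h3_summary files.

    The files carry h3_cell (UBIGINT) and resolution (INTEGER), so cells are
    rendered as plain integer literals rather than per-resolution columns.
    """
    if resolution not in SUMMARY_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {SUMMARY_RESOLUTIONS}")
    cell_list = ", ".join(str(int(c)) for c in cells)
    if not cell_list:
        raise ValueError("at least one H3 cell is required")
    return f"h3_cell IN ({cell_list}) AND resolution = {resolution}"


# Placeholder cell IDs for illustration only.
print(h3_summary_filter([111, 222], 6))
```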
rdhyee added 2 commits April 24, 2026 08:32
…arrays

Codex review: live DESCRIBE shows material/context/object_type on
sample_facets_v2 are VARCHAR scalars, not VARCHAR[] arrays. Previous
catalog text described them as "string arrays of URIs" and the
example query used `ANY(material)` which fails against scalar columns.

Corrected:
- §3 catalog row now reads "VARCHAR scalars; each facet column is a
  single URI per sample (not an array)"
- §4.8 per-file detail corrected + query pattern updated to
  `material = '<uri>'` or `material ILIKE '%substring%'`
- Noted that samples tagged with multiple material URIs are represented
  by a single chosen URI at this grain; for multi-material accuracy
  readers should JOIN back to wide.p__has_material_category

No column shape change — just documentation fix to match the live file.
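
A sketch of the corrected query pattern for the scalar facet columns, covering the exact-match and substring forms named above. The helper names and the example URI are hypothetical; only the column names and the `=` / `ILIKE` forms come from the commit message.

```python
# Facet columns reported as VARCHAR scalars on sample_facets_v2.
FACET_COLUMNS = ("material", "context", "object_type")


def sql_literal(value):
    """Quote a Python string as a SQL string literal (doubles single quotes)."""
    return "'" + value.replace("'", "''") + "'"


def facet_filter(column, value, substring=False):
    """Build a predicate on a scalar facet column.

    Exact match uses `=`; substring match uses ILIKE with % wildcards.
    Note that % or _ inside `value` also act as wildcards in the ILIKE form.
    """
    if column not in FACET_COLUMNS:
        raise ValueError(f"unknown facet column: {column}")
    if substring:
        return f"{column} ILIKE {sql_literal('%' + value + '%')}"
    return f"{column} = {sql_literal(value)}"


# The URI below is a made-up example, not a real vocabulary term.
print(facet_filter("material", "https://example.org/vocab/material/rock"))
print(facet_filter("material", "rock", substring=True))
```

Because each column holds a single URI per sample, no `ANY(...)` or list function is needed at this grain.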
Codex round-2 review caught two incorrect claims in §4.3 (`isamples_202601_wide.parquet`):

1. **DuckDB example fails to execute**. The query `SELECT source, COUNT(*)
   FROM wide ...` references a `source` column that doesn't exist — wide
   uses PQG's `n` for the source dimension. Corrected to
   `SELECT n AS source, COUNT(*) ... GROUP BY n`. Verified: returns
   SESAR=4.69M, OPENCONTEXT=1.06M, GEOME=606K, SMITHSONIAN=322K.

2. **"each an INT32[]"** understates the actual mixed types. Live
   DESCRIBE shows some p__* columns are `INTEGER[]` (e.g. p__produced_by,
   p__sample_location, p__sampling_site, p__site_location, p__registrant,
   p__curation) and others are `BIGINT[]` (p__has_material_category,
   p__has_context_category, p__has_sample_object_type, p__keywords,
   p__responsibility, p__related_resource). Softened to "integer array"
   and listed the exact types.

Also added a "Column name gotcha" bullet flagging the wide/narrow `n`
vs lite/facets `source` column-name split — so readers know to alias
when moving between files.
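
The aliasing described in the gotcha could be centralized in one lookup, sketched below. The mapping reflects the split stated above (`n` on wide/narrow, `source` on lite/facets); the function name and the `relation` argument are hypothetical, and `relation` stands for whatever table, view, or `read_parquet(...)` expression the reader already has.

```python
# Column that holds the source dimension in each file family, per the commit message.
SOURCE_COLUMN = {
    "wide": "n",
    "narrow": "n",
    "lite": "source",
    "sample_facets_v2": "source",
}


def count_by_source(file_kind, relation):
    """Build a per-source row count that always exposes the column as `source`."""
    column = SOURCE_COLUMN[file_kind]
    # Alias only when the physical column name differs from the friendly one.
    select = column if column == "source" else f"{column} AS source"
    return (
        f"SELECT {select}, COUNT(*) AS n_rows "
        f"FROM {relation} GROUP BY {column} ORDER BY n_rows DESC"
    )


print(count_by_source("wide", "wide"))
print(count_by_source("lite", "lite"))
```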
@rdhyee rdhyee merged commit 1932cd8 into isamplesorg:main Apr 24, 2026
1 check passed
rdhyee added a commit that referenced this pull request Apr 24, 2026
…rix) (#145)

* Add QUERY_SPEC.md v0.1 (draft)

Substrate-neutral query contract spanning DuckDB-WASM (web), DuckDB/Ibis
(Python), and Apache Solr (legacy). Names mirror the Solr schema
vocabulary (authoritative precedent) with substrate-specific aliases
provided in §5.

Scope:
- Canonical facet / filter dimensions (§2)
- Abstract filter grammar (§3)
- Full-text search semantics (§3.2, the 16-field Solr searchText target)
- Sample-card projection (§4.2)
- Substrate binding tables (§5)
- Open questions for v0.2 (§7)

Out of scope: PQG graph traversal (see QUERY_COMPARISON.md), bulk
export, ingestion.

Refs isamplesorg.github.io#138.

* Apply QUERY_SPEC v0.2 amendments from PQG conformance matrix

Amendments informed by isamplesorg/pqg#22 (conformance_matrix.md §4-§5),
which audited which shipped parquet files actually carry which spec
dimensions:

1. Rename `specimen` → `objectType` (§2.2). Every shipped parquet uses
   `object_type` / `hasSampleObjectType`; adopt the data-side name as
   canonical, keep `hasSpecimenCategory` as Solr alias.
2. Drop ghosts: `informalClassification` (§2.2) and `resultTimeRange`
   (§2.3) — both were in Solr but never migrated to any parquet. Also
   drop `time_range OVERLAPS` from §3.1 grammar and §5.3 Solr binding.
3. Add `thumbnailURL` to §2.1 as optional (ships in `wide` today for
   OpenContext only; moving to per-source sidecars — issue #131).
4. Update §5.1 `time BETWEEN` binding from "TBD" to real DuckDB cast:
   `TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2`. `result_time`
   IS in lite (as VARCHAR).
5. Document H3 column availability in §2.4: `wide_h3` and
   `h3_summary_res{4,6,8}` carry res 4/6/8; `lite` has res 8 only;
   plain `wide` / `narrow` carry no H3 columns.
6. Pick `tmodified` (INTEGER epoch) over `last_modified_time` (VARCHAR)
   for `sourceUpdatedTime` in §2.1; alias the VARCHAR as deprecated.
7. Bump version callout to v0.2.
8. §7 open questions: close Q2 (time filter in lite — now resolved);
   reframe Q1 around the new `objectType` naming.
9. Appendix B: reference conformance_matrix.md and SERIALIZATIONS.md
   (pqg#143) as companion documents.
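
As an illustration of amendment 4, the `time BETWEEN` binding could be rendered from Python datetimes like this. It is a sketch under assumptions: the function name is hypothetical, and the `TIMESTAMP '...'` literal form for `t1` and `t2` is one reasonable choice rather than something the spec mandates.

```python
from datetime import datetime


def time_between(t1, t2):
    """Build the DuckDB predicate for the time filter on lite's VARCHAR result_time.

    TRY_CAST yields NULL for strings that do not parse as timestamps, so those
    rows drop out of the BETWEEN filter instead of raising an error.
    """
    if t2 < t1:
        raise ValueError("t2 must not be earlier than t1")
    fmt = "%Y-%m-%d %H:%M:%S"
    lower = t1.strftime(fmt)
    upper = t2.strftime(fmt)
    return (
        "TRY_CAST(result_time AS TIMESTAMP) "
        f"BETWEEN TIMESTAMP '{lower}' AND TIMESTAMP '{upper}'"
    )


print(time_between(datetime(2020, 1, 1), datetime(2020, 12, 31, 23, 59, 59)))
```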

Refs isamplesorg/pqg#22, isamplesorg.github.io#138.

* fix(query-spec): Codex review — h3_summary column names, SERIALIZATIONS link

Two issues from Codex review:

1. **§2.4 callout wrong about h3_summary schema**: the previous text
   said the summary tier files carry `h3_res4`, `h3_res6`, `h3_res8`.
   They don't — they ship `h3_cell` (UBIGINT) + `resolution` (INTEGER)
   and filter by resolution. Corrected the callout and the §5.1
   DuckDB binding row to show the actual form
   (`h3_cell IN (...) AND resolution = 6`).

2. **Appendix B wrong link target**: the SERIALIZATIONS.md reference
   pointed at `isamplesorg/pqg/pull/143`, but the catalog PR is
   `#143`. Fixed.

* fix(query-spec): source dimension column is 'n' on wide/narrow

Codex round-2: §5.1 DuckDB binding claimed `source IN (…)` binds to
`source IN (…)` on wide / lite parquet. Wrong for wide, which uses `n`
(PQG convention), not `source`. The query as written fails with
"Referenced column source not found".

Updated the binding row to distinguish:
  wide / narrow: WHERE n IN (…)
  lite / sample_facets_v2: WHERE source IN (…) — alias already exposed
rdhyee added a commit that referenced this pull request Apr 24, 2026
A public-facing companion to SERIALIZATIONS.md (PR #143). Where the
catalog is internal reference ("every file with role, size, upstream,
consumers"), this page is the researcher/developer landing:

- Quick-pick table mapping "if you want to do X → use file Y"
- Five copy-pasteable DuckDB snippets (every one executed clean
  against live R2 URLs during authoring)
- H3 tier breakpoint reference for map authors
- Cross-links to SERIALIZATIONS, QUERY_SPEC, PQG spec, conformance
  matrix
- Data-source + licensing paragraph pointing to the Zenodo community
  (without speculating on specific license terms)

Lands at the site root alongside pubs.qmd and query-spec.qmd.

Note on column naming in snippets: the wide parquet uses `n` for the
source column (PQG convention); lite and sample_facets_v2 use the
friendlier alias `source`. Flagged inline in the snippet comment so
Binder/Colab first-timers don't trip on it.

Verified on 2026-04-24: all 6 snippets (incl. the callout quick-start)
execute against data.isamples.org, returning non-empty results.