diff --git a/query-spec.qmd b/query-spec.qmd new file mode 100644 index 0000000..675f248 --- /dev/null +++ b/query-spec.qmd @@ -0,0 +1,411 @@ +--- +title: "iSamples Query Specification" +subtitle: "A substrate-neutral contract for searching and filtering iSamples data" +author: "iSamples team" +date: today +toc: true +sidebar: false +categories: [spec, architecture, query] +--- + +::: {.callout-warning} +## Draft — v0.2 + +Field inventories are drawn from the Solr schema (authoritative +precedent) and the PQG metadata model. v0.2 incorporates findings from +the [PQG conformance matrix][cmatrix] (which parquet files actually +carry which dimensions) to resolve naming drift, drop ghosts, and +tighten substrate bindings. Comments and PRs welcome — see +[issue tracker][issues]. + +[issues]: https://github.com/isamplesorg/isamplesorg.github.io/issues +[cmatrix]: https://github.com/isamplesorg/pqg/blob/main/docs/conformance_matrix.md +::: + +## 1. Purpose and scope {#sec-scope} + +iSamples data is reached today through at least three substrates — and +potentially more in the future: + +- **DuckDB-WASM against parquet** (this website's Interactive Explorer) +- **DuckDB / Ibis against parquet** (the Python client and notebooks) +- **Apache Solr** (legacy iSamples Central; potentially revived) + +Each substrate has its own query dialect. Users and maintainers shouldn't +have to relearn the facet vocabulary, the text-search semantics, or the +spatial filter grammar when moving between them. This document specifies +a **substrate-neutral query model** that each implementation can bind to. 
+ +**What this spec covers:** + +- Canonical facet / filter dimensions and their names +- Filter grammar (an abstract syntax, not a wire format) +- Full-text search semantics (which fields participate) +- Spatial and temporal primitives +- Sample-card projection (what a clicked sample returns) +- Substrate binding tables (spec → DuckDB, spec → Solr) + +**What it does NOT cover:** + +- PQG graph traversal queries (edge walking, multi-hop joins). See + [QUERY_COMPARISON.md][qc] in the monorepo root for that work and the + Eric-vs-Observable alignment notes. +- Bulk export / download mechanics. See [how-to-use](how-to-use.qmd). +- Ingestion and metadata normalization. + +[qc]: https://github.com/isamplesorg/isamplesorg.github.io/blob/main/QUERY_COMPARISON.md + +**Normative precedent.** Where this spec names a field, the name mirrors +the iSamples metadata model's dotted-path form as used in the Solr schema +(`isamples_inabox/solr_schema_init/create_isb_core_schema.py`), because +that's the most complete, externally-documented query vocabulary the +project has shipped. Aliases for substrate-specific naming are provided +in §5. + +## 2. Canonical dimensions {#sec-dimensions} + +A **dimension** is an attribute of a material sample record that users +filter, facet, or search on. Every binding (§5) must provide at least +the **required** dimensions. 
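One way a binding might self-check against this rule is sketched below. It assumes the required set is exactly the dimensions marked ✅ in the §2.1 and §2.2 tables; `objectType` is left out because its status there is qualified. The function name and the example binding are hypothetical and not part of any shipped client.

```python
# Required dimensions, read off the check marks in the §2.1 and §2.2 tables.
# Assumption: `objectType` is excluded because the spec qualifies its status.
REQUIRED_DIMENSIONS = frozenset(
    {"pid", "source", "label", "description", "material", "context"}
)


def missing_required(binding_dimensions):
    """Return the required dimensions a binding fails to expose, sorted."""
    return sorted(REQUIRED_DIMENSIONS - set(binding_dimensions))


# Hypothetical binding that exposes only identity and coordinate fields.
example_binding = ["pid", "label", "source", "latitude", "longitude"]
print(missing_required(example_binding))
```

An empty list would indicate that the binding covers every required dimension; here the sketch reports `context`, `description`, and `material` as missing.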
+ +### 2.1 Identity and provenance + +| Dimension | Type | Required | Solr field | PQG path | Notes | +|---|---|---|---|---|---| +| `pid` | string | ✅ | `id` | `MaterialSampleRecord.pid` | Primary key | +| `source` | enum | ✅ | `source` | `MaterialSampleRecord.source_name` | `SESAR\|OPENCONTEXT\|GEOME\|SMITHSONIAN` | +| `label` | string | ✅ | `label` | `MaterialSampleRecord.label` | Display name | +| `description` | text | ✅ | `description` | `MaterialSampleRecord.description` | Free text | +| `registrant` | string | | `registrant` | `MaterialSampleRecord.registrant` | Who registered | +| `sourceUpdatedTime` | instant | | `sourceUpdatedTime` | `MaterialSampleRecord.tmodified` | Freshness; bind to `tmodified` (INTEGER epoch) — see note below | +| `thumbnailURL` | string | | — | `MaterialSampleRecord.thumbnail_url` | Optional; shipped in `wide` today (OpenContext only). Expected to move to per-source sidecars over time (see §4.2 sample card, issue #131) | + +::: {.callout-note} +**`sourceUpdatedTime` binding**: the `wide` parquet ships both +`last_modified_time` (VARCHAR) and `tmodified` (INTEGER unix epoch). +v0.2 picks `tmodified` as canonical because epoch is easier to filter +and sort; `last_modified_time` is kept as a deprecated alias for +backwards compatibility and will be removed in a future major release. 
+::: + +### 2.2 Classification (the four facets) + +| Dimension | Type | Required | Solr field | PQG path | +|---|---|---|---|---| +| `material` | enum | ✅ | `hasMaterialCategory` | `MaterialSampleRecord.has_material_category.label` | +| `context` | enum | ✅ | `hasContextCategory` | `MaterialSampleRecord.has_context_category.label` | +| `objectType` | enum | ⚠️ (see below) | `hasSampleObjectType` (alias `hasSpecimenCategory`) | `MaterialSampleRecord.has_sample_object_type.label` | +| `keywords` | multi-string | | `keywords` | `MaterialSampleRecord.keywords[]` | + +::: {.callout-note} +**Naming resolution (v0.2)**: v0.1 named this dimension `specimen` with +Solr field `hasSpecimenCategory`. Every shipped parquet file uses +`object_type` / `hasSampleObjectType`. v0.2 adopts the data-side name +(`objectType`) as canonical and keeps `hasSpecimenCategory` as a Solr +alias. See [PQG conformance matrix §3.2][cmatrix-3-2] for the audit +that prompted this rename. + +`objectType` is in the blessed vocabulary but is **not currently +exposed** in the web Explorer. Adding it is on the P1 stack. + +[cmatrix-3-2]: https://github.com/isamplesorg/pqg/blob/main/docs/conformance_matrix.md#32-classification-query_spec-22 +::: + +::: {.callout-note} +**Dropped from v0.2**: `informalClassification` was named in v0.1 but +no shipped parquet file carries it (it was a Solr-era remnant). It is +removed from the canonical dimension list until/unless the pipeline +adds it. +::: + +Each of these has a paired **confidence** field (`…Confidence`, `pfloat`) +in Solr. The spec allows filters to reference confidence (e.g. +`material.confidence >= 0.8`), but implementations MAY omit confidence +filtering if the substrate doesn't carry the field. 
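A minimal sketch of how a Solr binding might honor this, assuming the caller supplies the concrete confidence field name (the spec only gives the `…Confidence` pattern, so the names used below are placeholders, not real schema fields). The helper name is hypothetical. It returns `None` when the substrate lacks the field, which is one way to express the MAY-omit allowance.

```python
def confidence_fq(confidence_field, minimum, available_fields):
    """Build a Solr filter-query string for `<dim>.confidence >= minimum`.

    Returns None when the substrate does not carry the confidence field,
    so the caller can skip the filter instead of failing.
    """
    if confidence_field not in available_fields:
        return None
    # Solr range syntax: inclusive lower bound, unbounded upper bound.
    return f"{confidence_field}:[{minimum} TO *]"


# Placeholder field names for illustration only; not taken from the schema.
fields = {"hasMaterialCategory", "materialConfidenceField"}
print(confidence_fq("materialConfidenceField", 0.8, fields))
print(confidence_fq("contextConfidenceField", 0.8, fields))
```

A client would likely want to surface the skipped filter to the user rather than silently dropping it, in the same spirit as the text-search limitation note in §3.2.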
+ +### 2.3 Sampling event and site + +| Dimension | Type | Solr field | PQG path | +|---|---|---|---| +| `resultTime` | instant | `producedBy_resultTime` (`pdate`) | `SamplingEvent.result_time` | +| `samplingPurpose` | string | `samplingPurpose` | `SamplingEvent.sampling_purpose` | +| `featureOfInterest` | string | `producedBy_hasFeatureOfInterest` | `SamplingEvent.has_feature_of_interest` | +| `responsibility` | multi-string | `producedBy_responsibility` | `SamplingEvent.responsibility[]` | +| `siteLabel` | string | `producedBy_samplingSite_label` | `SamplingSite.label` | +| `siteDescription` | text | `producedBy_samplingSite_description` | `SamplingSite.description` | +| `placeName` | string | `producedBy_samplingSite_placeName` | `SamplingSite.place_name[]` | +| `elevation` | float | `producedBy_samplingSite_location_elevationInMeters` | `GeospatialCoordLocation.elevation` | + +::: {.callout-note} +**Dropped from v0.2**: `resultTimeRange` (Solr `producedBy_resultTimeRange`, +a `date_range` field) was named in v0.1 but no shipped parquet carries +an interval type. It was a Solr-era remnant that never migrated. Query +a `resultTime` range with `time BETWEEN t1 AND t2` (§3.1) instead. +::: + +### 2.4 Spatial {#sec-spatial} + +| Dimension | Type | Solr field | PQG path | +|---|---|---|---| +| `latitude` | float | `producedBy_samplingSite_location_latitude` | `GeospatialCoordLocation.latitude` | +| `longitude` | float | `producedBy_samplingSite_location_longitude` | `GeospatialCoordLocation.longitude` | +| `bbox` | bbox | `producedBy_samplingSite_location_bb` | derived | +| `h3[resN]` | h3-index | `producedBy_samplingSite_location_h3_{0..13}` | `samples_wide.h3_res{N}` | + +**H3 tier convention.** Resolutions 4, 6, and 8 are the spec-recommended +tier breakpoints for zoom-adaptive visualization. Other resolutions MAY +be materialized but 4/6/8 are load-bearing. 
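As an illustration of zoom-adaptive tiering, the sketch below picks one of the three recommended resolutions from an approximate viewport width. The function name and both breakpoints are assumptions made for this example; the spec does not define them (see the open question on tier breakpoints in §7).

```python
# The three load-bearing resolutions named by the spec.
H3_TIERS = (4, 6, 8)


def h3_resolution_for_span(viewport_span_km,
                           coarse_above_km=2000.0,
                           medium_above_km=300.0):
    """Pick an H3 tier from the viewport's approximate width in km.

    Both breakpoints are illustrative defaults, not values from the spec.
    """
    if viewport_span_km >= coarse_above_km:
        return H3_TIERS[0]  # wide view: coarse cells
    if viewport_span_km >= medium_above_km:
        return H3_TIERS[1]  # regional view: medium cells
    return H3_TIERS[2]      # local view: fine cells


for span in (5000, 500, 50):
    print(span, "km ->", "res", h3_resolution_for_span(span))
```

Keeping the breakpoints as parameters lets each client tune them without changing which resolutions are queried.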
+ +::: {.callout-important} +**H3 column availability across shipped parquet files (v0.2)**: + +- `wide_h3` ships three direct columns: `h3_res4`, `h3_res6`, `h3_res8`. +- `h3_summary_res{4,6,8}` tier files do NOT ship `h3_res{N}` columns — + they ship a single `h3_cell` (UBIGINT) plus a `resolution` (INTEGER) + column. Query them as `WHERE h3_cell = X AND resolution = N`. +- `lite` carries `h3_res8` (and `h3_res8_hex`) only — not res4 / res6. +- Plain `wide` and `narrow` do **not** carry H3 columns. To filter at + res 4 or res 6, query `wide_h3` or the appropriate `h3_summary` + tier file. + +See [PQG conformance matrix §3.4][cmatrix-3-4] for the full table. + +[cmatrix-3-4]: https://github.com/isamplesorg/pqg/blob/main/docs/conformance_matrix.md#34-spatial-query_spec-24 +::: + +### 2.5 Curation + +| Dimension | Type | Solr field | +|---|---|---| +| `curationLocation` | string | `curation_location` | +| `curationResponsibility` | string | `curation_responsibility` | +| `curationAccessConstraints` | string | `curation_accessContraints` | + +## 3. Filter grammar {#sec-grammar} + +A query is a conjunction (AND) of filters. Each binding is responsible +for translating the abstract filter into its dialect. + +### 3.1 Filter primitives + +```text +Filter := FieldFilter | TextFilter | SpatialFilter | TemporalFilter + +FieldFilter := dim IN (value, ...) + | dim = value + | dim >= value ( numeric / date only ) + | dim <= value + | dim CONTAINS token ( multi-string / keywords ) + +TextFilter := text MATCHES "phrase" + +SpatialFilter:= bbox WITHIN (min_lat, min_lon, max_lat, max_lon) + | h3 AT RES n IN (h3_cell, ...) 
+ +TemporalFilter + := time BETWEEN t1 AND t2 +``` + +### 3.2 Full-text search semantics {#sec-text} + +`text MATCHES "phrase"` searches the aggregate of these fields (the +Solr `searchText` copy-field target, canonical list): + +- `source`, `label`, `description` +- `keywords` +- `producedBy_label`, `producedBy_description`, `producedBy_hasFeatureOfInterest`, + `producedBy_responsibility` +- `producedBy_samplingSite_label`, `producedBy_samplingSite_description`, + `producedBy_samplingSite_placeName` +- `registrant`, `samplingPurpose` +- `curation_label`, `curation_description`, `curation_location` + +Substrates that can't index all 15 fields MUST document which subset +they cover and surface the limitation in UI. (The current web Explorer +covers `label` + `description` + `place_name` only — a known gap.) + +Multi-term queries default to **AND** with relevance ranking where the +substrate supports it (Solr, DuckDB FTS). See PR #95 for web-side FTS +work. + +### 3.3 Cross-filter counts + +A faceted UI exposing a dimension SHOULD show, next to each facet value, +the count of records matching **the current query *excluding* that +dimension's own filter**. This lets users see the effect of selecting +additional values without shrinking the list to zero. + +Substrates may pre-compute these counts (see +`isamples_202601_facet_cross_filter.parquet` for the single-filter +cache) or compute them on the fly. + +## 4. Result projections {#sec-projections} + +### 4.1 Map / globe point + +Minimum projection for a point on a map: + +``` +{ pid, label, source, latitude, longitude } +``` + +This is what the web Explorer's "lite parquet" already provides. 
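The projection above could be applied to a fuller record roughly as follows. This is a sketch only: the helper name and the sample record are made up for illustration, and a real client would more likely select these five columns directly in its query.

```python
# Field names of the §4.1 map-point projection.
MAP_POINT_FIELDS = ("pid", "label", "source", "latitude", "longitude")


def to_map_point(record):
    """Project a sample record (a dict) down to the map-point shape.

    Keys absent from the record become None; the caller decides whether
    a point without coordinates should be skipped.
    """
    return {field: record.get(field) for field in MAP_POINT_FIELDS}


# Made-up record for illustration; not real iSamples data.
record = {
    "pid": "example:0001",
    "label": "Example core",
    "source": "SESAR",
    "latitude": 12.5,
    "longitude": -45.25,
    "description": "dropped by the projection",
}
print(to_map_point(record))
```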
+ +### 4.2 Sample card + +Projection for a clicked / selected sample: + +``` +{ + pid, label, source, + description, + latitude, longitude, placeName, elevation, + material, context, objectType, keywords, + resultTime, samplingPurpose, + registrant, responsibility, + curationLocation, curationResponsibility, + sourceRecordURL, + thumbnailURL // see §2.1; ships in `wide` today (OpenContext + // only), moving to per-source sidecars — issue #131 +} +``` + +Fields MAY be null. The sample card UI in every binding SHOULD handle +missing values gracefully. + +### 4.3 Facet counts + +``` +{ dimension, value, count }[] +``` + +## 5. Substrate bindings {#sec-bindings} + +### 5.1 DuckDB-WASM on parquet (web) + +| Spec | Binding | +|---|---| +| `source IN (…)` | `n IN (…)` on wide / narrow (column is `n` per PQG); `source IN (…)` on lite / sample_facets_v2 (alias exposed) | +| `material IN (…)` | `pid IN (SELECT pid FROM sample_facets WHERE material IN (…))` | +| `text MATCHES "q"` | `(label ILIKE '%q%' OR description ILIKE '%q%' OR place_name ILIKE '%q%')` — currently a subset of §3.2 | +| `bbox WITHIN (…)` | `latitude BETWEEN … AND … AND longitude BETWEEN … AND …` | +| `h3 AT RES 6 IN (…)` | `h3_res6 IN (…)` on `wide_h3`; OR `h3_cell IN (…) AND resolution = 6` on `h3_summary_res6` (see §2.4 note) | +| `time BETWEEN …` | `TRY_CAST(result_time AS TIMESTAMP) BETWEEN t1 AND t2` — `result_time` ships as VARCHAR in `lite`, `wide`, and `narrow` | + +**Canonical data URL base**: `https://data.isamples.org/` (Cloudflare +Worker in front of the R2 bucket). Two layers: + +- **Versioned** `/isamples_YYYYMM_.parquet` — 1-yr immutable cache, + safe to pin in papers, spec examples, or reproducibility notebooks. +- **Alias** `/current/` — 302 redirect with 5-minute cache; tracks + whatever the latest snapshot is. Use for "always fresh" consumers. + +Never reference the raw `pub-a18234d962364c22a50c787b7ca09fa5.r2.dev/...` +URL — it bypasses the Worker and defeats the alias layer. 
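Some of the binding rows in the table above can be sketched as a small translator that emits SQL fragments with `?` placeholders, leaving value binding to the database driver. The function names are hypothetical, the example targets the `lite` column names (`source`, `latitude`, `longitude`), and the bounding box is assumed not to cross the antimeridian.

```python
def in_sql(column, values):
    """`dim IN (...)` -> `column IN (?, ?, ...)` plus parameters."""
    values = list(values)
    if not values:
        raise ValueError("IN filter needs at least one value")
    placeholders = ", ".join("?" for _ in values)
    return f"{column} IN ({placeholders})", values


def bbox_sql(min_lat, min_lon, max_lat, max_lon):
    """`bbox WITHIN (...)` -> two BETWEEN ranges (no antimeridian handling)."""
    sql = "latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ?"
    return sql, [min_lat, max_lat, min_lon, max_lon]


def text_sql(phrase):
    """`text MATCHES "q"` -> the three-column ILIKE subset from the table.

    LIKE wildcards inside `phrase` are not escaped in this sketch.
    """
    pattern = f"%{phrase}%"
    sql = "(label ILIKE ? OR description ILIKE ? OR place_name ILIKE ?)"
    return sql, [pattern, pattern, pattern]


def where_clause(fragments):
    """AND the fragments together, since a query is a conjunction (§3)."""
    sql = " AND ".join(fragment for fragment, _ in fragments)
    params = [p for _, fragment_params in fragments for p in fragment_params]
    return sql, params


sql, params = where_clause([
    in_sql("source", ["SESAR", "GEOME"]),
    bbox_sql(30, -120, 40, -110),
])
print(sql)
print(params)
```

Using placeholders rather than string interpolation keeps user-supplied text out of the SQL itself; the resulting pair can be passed to any driver that accepts positional parameters.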
+ +Data files: see [catalog in how-to-use](how-to-use.qmd#data-files). + +### 5.2 DuckDB / Ibis on parquet (Python) + +| Spec | Binding | +|---|---| +| Same DuckDB SQL as §5.1 | Same URLs under `https://data.isamples.org/` | +| Ibis expressions | `t.source.isin([...])` and so on | + +See `isamples-python/examples/basic/isamples_explorer.ipynb` for the +reference implementation. A `isamples_query.py` module extracting the +filter builder is planned. + +### 5.3 Apache Solr (if Central returns) + +| Spec | Binding | +|---|---| +| `source IN (a, b)` | `fq=source:(a OR b)` | +| `material IN (…)` | `fq=hasMaterialCategory:(…)` | +| `text MATCHES "q"` | `q=searchText:q` (relevance-ranked by default) | +| `bbox WITHIN (…)` | `fq={!field f=producedBy_samplingSite_location_rpt}Intersects(ENVELOPE(...))` | +| `time BETWEEN …` | `fq=producedBy_resultTime:[t1 TO t2]` | + +See `isamples_inabox/isb_web/isb_solr_query.py` for the full client. + +## 6. Versioning and compatibility {#sec-versioning} + +This spec uses semantic-ish versioning: + +- **Major** (1.0, 2.0): new required dimensions, renames, or grammar + changes that break existing clients. +- **Minor** (0.2, 0.3): new optional dimensions, clarifications, + additional binding rows. +- **Patch**: typo fixes. + +Breaking changes MUST be accompanied by a migration note and a sunset +window for the prior spec version. + +## 7. Open questions (for v0.3) {#sec-open} + +1. **`objectType` filter in the web Explorer.** Canonical vocabulary is + now `hasSampleObjectType` (resolved in v0.2; see §2.2). The + `sample_facets_v2` parquet carries `object_type` as a denormalized + URI string, so binding is straightforward. Which display labels + should the UI surface, and should `object_type` be added to `lite` + so specimen-type filters don't require a second file fetch? +2. **Text-search field coverage** in the web Explorer (currently 3 of + 15 post-v0.2). Which of the remaining 12 are worth indexing in a + browser FTS? See PR #95. +3. 
**Cross-filter cache shape** for multi-dimension filter combinations + (current cache handles single-filter only). +4. **Confidence thresholds** — should the spec define a default for + `*.Confidence` fields, or leave it per-client? +5. **H3 tier breakpoints** — when filters are active, what zoom level + triggers the switch from H3 clusters to individual points? The web + Explorer currently uses ~120 km; the Python notebooks use viewport + bounding box size. +6. **Sample-card thumbnail provenance** — `thumbnail_url` is now named + in §2.1 (v0.2) but lives in `wide` and is populated only for + OpenContext. Move to per-source sidecars per issue #131 / the + sidecar pattern memo. + +### Questions resolved in v0.2 + +- ~~**Specimen vs. objectType naming**~~ — resolved: adopt data-side + name `objectType` (Solr `hasSampleObjectType`) as canonical. See + §2.2 and conformance matrix §3.2. +- ~~**Time filter in lite parquet**~~ — resolved: `result_time` is + already present in `lite` (as VARCHAR). §5.1 binding now shows the + DuckDB cast. + +## Appendix A. Metadata model at a glance + +iSamples treats these as the core entity types (domain-agnostic): + +- `MaterialSampleRecord` — the sample itself +- `SamplingEvent` — the act of collection +- `SamplingSite` — the place +- `GeospatialCoordLocation` — lat/lon/elevation +- `MaterialSampleCuration` — curation metadata +- `IdentifiedConcept` — vocabulary terms (materials, contexts, specimens) +- `Agent` — people / institutions + +The canonical UML is in the +[isamplesorg-metadata](https://github.com/isamplesorg/metadata) repo. +PQG (the parquet property-graph binding) is specified in +[`pqg/docs/PQG_SPECIFICATION.md`](https://github.com/isamplesorg/isamples-python/blob/main/pqg/docs/PQG_SPECIFICATION.md). + +## Appendix B. 
Related documents + +- [`pqg/docs/conformance_matrix.md`](https://github.com/isamplesorg/pqg/blob/main/docs/conformance_matrix.md) + — which shipped parquet files cover which QUERY_SPEC dimensions + (companion to this spec; informed every v0.2 amendment) +- [`SERIALIZATIONS.md`](https://github.com/isamplesorg/isamplesorg.github.io/pull/143) (catalog of shipped parquet files, in `isamplesorg.github.io`) + — the three canonical parquet formats (export / narrow / wide) and + how they round-trip +- `QUERY_COMPARISON.md` — PQG traversal query alignment (Eric's Python + vs. the Observable JS, Oct 2025) +- `test_cesium_queries.js`, `test_python_js_alignment.py` — alignment + test harness at the monorepo root +- [Interactive Explorer](tutorials/progressive_globe.qmd) — the reference + web UI +- `isamples-python/examples/basic/isamples_explorer.ipynb` — the + reference Python UI +- `isamples_inabox/solr_schema_init/create_isb_core_schema.py` — the + authoritative Solr schema