Openconceptlab/ocl_issues#2505 | Fix US/British spelling mismatch in semantic concept search#868
Conversation
Adds get_spelling_variant() to generate US<->British spelling alternatives (leukemia/leukaemia, haem/hem, paed/ped, oedema/edema, -our/-or, -ise/-ize, etc.). In semantic search, additional kNN sub-queries are fired using the variant's embedding so that e.g. querying "leukemia" still retrieves "leukaemia" concepts when the index is large enough to push them out of the default top-50 candidates. The rescore query is also expanded to boost exact matches of either spelling, and a pre-existing crash risk (empty should-clause when no synonyms are provided) is fixed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…structure Models lexical variants as an OCL Source so they get versioning, release management, locale handling, editability, and discoverability through existing OCL infrastructure — replacing the hardcoded regex approach with a data-driven lookup that aligns with UMLS / SNOMED / OBO conventions. * core/common/lexical_variants.py — get_lexical_variants() and get_variant_terms() helpers with token-level lookup, dataclass result, and per-(source_uri, version) cache. Tokenization-first design avoids the regex false-positive problems (e.g. "themselves" no longer matches "hem", "hemisphere" no longer matches "haem"). * core/common/data/lexical-variants-en.json — OCL bulk import file seeding ocl/lexical-variants-en with 43 vetted whole-word spelling pairs across en-US and en-GB, plus a v1.0 Source Version. Load via the standard OCL bulk import path. * core/common/tests.py — LexicalVariantsTest covering tokenization, cache lifecycle, false-positive regressions (themselves, anthem, hemisphere, hemp, hemlock, remember), missing-source graceful degradation, and multi-token expansion. Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
… $match ConceptFuzzySearch.search() and MetadataToConceptsListView (the $match endpoint) now consume the dictionary helper instead of the inline regex. Variant expansion is OFF by default — clients opt in via the request body `variants` field. Same shape will be used for standard concept search when that wiring lands as a follow-up. Variants param accepted forms: * missing / null / false / "false" / "0" → disabled (default) * true / "true" / "1" → use DEFAULT_LEXICAL_VARIANTS_REPO * non-empty URI string → use that dictionary Source Sunny's kNN sub-query construction and rescore expansion are kept; only the variant source changes from hardcoded regex to dictionary lookup. The empty-synonyms rescore crash fix is preserved. * core/common/utils.py — remove get_spelling_variant() (replaced by the dictionary helper). * core/concepts/search.py — ConceptFuzzySearch.search() accepts variants_repo; gates variant kNN and rescore expansion on it being truthy. None means skip entirely. * core/concepts/views.py — MetadataToConceptsListView reads request.data.variants and normalizes it to a URI or None via _resolve_variants_repo before passing to ConceptFuzzySearch. Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
|
@snyaggarwal — heads up, I pushed two commits on top of yours that redirect this PR to a dictionary-driven approach (full rationale in the updated description above + the proposal doc). TL;DR — your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept. The variant source changed from inline regex to a dictionary lookup against Apologies for the in-flight redirect — the false-positive cases in the regex list ( Happy to walk through the diff or rework anything that doesn't sit right. |
…riants-en The OCL organization on prod has mnemonic OCL (uppercase), not ocl. Updates the bulk import file owner field, the DEFAULT_LEXICAL_VARIANTS_REPO constant, and test mock URIs to match. Co-Authored-By: Sunny Aggarwal <sny.aggarwal@gmail.com>
|
Two missing things:
@snyaggarwal to take on this |
…redis cache and making things class based from module based
paynejd
left a comment
There was a problem hiding this comment.
Reviewing the latest revision after the redis-cache refactor and use_lexical_variants field addition. Four things to address before merge:
1. LEXICAL_VARIANTS_CACHE_TIMEOUT env override is broken — os.environ.get returns a string when set, but django-redis's cache.set(timeout=...) expects an int. Suggestion inline.
2. use_lexical_variants MapProject flag is persistence-only. The field exists, the serializer surfaces it, but MetadataToConceptsListView.filter_queryset still only reads request.data.get('variants') — so the flag has no effect at $match time until oclmap is updated to read the project config and pass variants: true per request. That's consistent with how encoder_model flows (client passes it per call), so the field is correct as persistence; just call out the dependency in the PR body and file an oclmap follow-up ticket. Otherwise reviewers / Mapper users will reasonably assume flipping the flag does something on its own.
3. Fixture path changed. Sunny moved core/common/data/lexical-variants-en.json → core/lookup_fixtures/lexical-variants-en.json, but (a) import_lookup_values.py has a hardcoded list and does not auto-load this file, so the lookup_fixtures/ placement falsely implies auto-load, and (b) the PR body's deploy commands still reference the old core/common/data/... path — anyone running them on dev/qa/staging gets a 404. Either move the file back, or wire it into import_lookup_values and update the PR body.
4. Rebase needed. PR is CONFLICTING (8 commits behind master). Conflicts are in core/map_projects/{models,serializers,tests/tests}.py from master's prompt_output_locale config addition (c883bd0). Should be a clean resolve — just append use_lexical_variants after prompt_output_locale in each CONFIGURATION_FIELDS / serializer field list / prepare_object loop. Verify migration 0036_mapproject_use_lexical_variants.py doesn't collide with master's migration numbering after rebase.
The redis-cache refactor itself looks good — keying on (source.uri, version) cleanly invalidates released-version edits. HEAD edits will be stale up to the TTL (4 days default), which is fine for a 43-entry vetted dictionary; worth a one-line comment so future maintainers know. The is_core_user gate matches the encoder_model precedent.
Refs OpenConceptLab/ocl_issues#2505 (Problem B only — Problem A still open)
Refs OpenConceptLab/ocl_issues#2511 (Lexical Variant Dictionary infrastructure)
Heads up — approach changed mid-flight
Sunny — apologies for the redirect. After deeper review of #2505 with @paynejd, we landed on a substantially different architecture and felt it was faster to push the new approach directly than to iterate via comments. Your kNN sub-query construction, rescore expansion, and the empty-synonyms crash fix are all kept — only the variant source changed from inline regex to a dictionary-driven lookup. Happy to walk through the diff on a call if useful.
Three reasons for the redirect:
hem/haemmatchesthemselves,anthem,hemisphere,hemp,hemlock,remember— hard to fix defensively in regex without enumerating exclusions for every English word.ConceptFuzzySearch. Standard/concepts/?q=…has the same gap, and upcoming abbreviation-expansion work needs the same primitive. A dictionary-as-Source approach makes that reuse trivial.Tradeoff worth flagging
This approach puts pressure on oclapi2 to do caching well. The MVP scale is trivial (~40 entries, in-memory dict), but as dictionary types and sizes grow — especially abbreviation dictionaries (UMLS LRABR is ~250k rows) and once variants are enabled by default on high-QPS endpoints — the per-token DB lookup will need to graduate beyond the MVP's module-level cache. Likely Redis-backed token-indexed cache for >1k-entry dictionaries; ES-backed lookup for very large ones. Plan to revisit when (a) a dictionary exceeds ~5k entries, or (b) variants are enabled by default on a high-QPS endpoint. Captured in #2511.
What this PR does
core/common/lexical_variants.py—get_lexical_variants(text, source_uri)andget_variant_terms(text, source_uri). Tokenizes input (lowercase + whitespace + ASCII punctuation strip), looks each token up against the dictionary's Names, returns sibling Names from the matching Concept. Module-level cache keyed on(source_uri, version). Returns[]and never raises if the dictionary is missing — graceful degradation.core/common/data/lexical-variants-en.json— OCL bulk import file creatingOCL/lexical-variants-en(Source + 43 vetted Concepts + Source Versionv1.0). Each Concept has en-US and en-GB Names withname_type=Fully Specified. False-positive-prone substring entries dropped.$match:MetadataToConceptsListViewreadsvariantsfrom request body, passes the resolved URI (orNone) toConceptFuzzySearch.search(..., variants_repo=...). The kNN sub-query construction and rescore expansion are unchanged in shape; only the variant source changed.get_spelling_variant()fromcore/common/utils.py.API:
variantsrequest body fieldLexical variant expansion is OFF by default. Clients opt in. Same shape will apply to standard concept search when that wiring lands as a follow-up.
null/false/"false"/"0"true/"true"/"1"OCL/lexical-variants-en)"/orgs/myorg/sources/my-dictionary/")Tests
core/common/tests.pyLexicalVariantsTestcovers:themselves,anthem,hemisphere,hemp,hemlock,rememberreturn no variantsDeployment
The dictionary Source has already been seeded on prod (
OCL/lexical-variants-env1.0, 43 concepts, public read) so the wiring will work as soon as this PR merges and deploys. For dev / qa / staging environments, run:Verify with:
ocl repo get OCL lexical-variants-en(should show 43 concepts,source_type: "Lexical Variants",extras.dictionary_kind: "lexical_variant", versionv1.0released).The helper degrades gracefully (returns
[], never raises) if the Source doesn't exist yet — so a deploy gap on any environment won't break search.Out of scope (deferred follow-ups)
?variants=...on standard concept search (ConceptListView) — touchesCustomESSearch/get_raw_search_stringlayer, smaller blast radius as a separate PR$lexical-variantsHTTP operationapply_score()highlight) — entirely separate work; this PR no longer auto-closes that part of the ticketArchitecture / design references
$lexical-variantsoperation, abbreviation dictionary type, caching strategy, etc.)