Add SourceConceptId uniqueness validation to validateUsagiFile#70

Merged
javier-gracia-tabuenca-tuni merged 4 commits into development from copilot/add-duplicate-source-concept-id-validation
Mar 23, 2026
Conversation

Contributor

Copilot AI commented Mar 23, 2026

validateUsagiFile did not detect cases where the same ADD_INFO:sourceConceptId was assigned to multiple distinct source codes — a constraint violation since each source concept must own a unique concept ID.

Changes

  • R/validateUsagiFile.R — Added uniqueness check in the C&CR validation block using the validate:: package. Deduplicates to distinct (sourceCode, sourceConceptId) pairs first (to avoid false positives from valid multi-mapped codes where one source code legitimately maps to multiple target concepts), then uses is_unique() to detect sourceConceptIds shared across different source codes:

    usagiDistinct <- usagiTibble |>
        dplyr::filter(!is.na(`ADD_INFO:sourceConceptId`)) |>
        dplyr::distinct(sourceCode, `ADD_INFO:sourceConceptId`)
    
    distinctValidationRules <- validate::validator(
        SourceConceptId.is.not.unique = is_unique(`ADD_INFO:sourceConceptId`)
    )
    distinctValidations <- validate::confront(usagiDistinct, distinctValidationRules)

    Affected rows are flagged with "ERROR: SourceConceptId is not unique".

  • inst/testdata/VOCABULARIES/ICD10fi/ICD10fi_with_errors.usagi.csv

    • Fixed pre-existing data issue: rows A17.8+G63.1 and A17.8+G63.2 were unintentionally sharing sourceConceptId=2000500119 with A17.8+G63.0; assigned unique IDs 2000599997/2000599998.
    • Added row A99.0 ([SourceConceptId is not unique]Test duplicate sourceConceptId) that reuses sourceConceptId=2000500124 (also used by A18.0+M68.0) to deliberately trigger the new rule.
  • tests/testthat/test-validateUsagiFile.R — Added assertions verifying the new validation step appears in the summary, flags exactly the intended row, sets the correct ADD_INFO:validationMessages, and sets mappingStatus to FLAGGED.
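
The detection and flagging steps described above can be sketched end to end in base R. This is a minimal illustration, not the package's implementation: the toy table, the shortened column name `sourceConceptId` (standing in for `ADD_INFO:sourceConceptId`), and the `flagged` column are assumptions made for the example, and the PR does not show how the package maps results back to rows. Here every row carrying a shared id is flagged.

```r
# Toy Usagi-like table; column names follow the PR, values are invented.
usagi <- data.frame(
  sourceCode      = c("A1", "A1", "B2", "C3", "D4"),
  sourceConceptId = c(2000000001, 2000000001, 2000000002, 2000000002, NA),
  conceptId       = c(11, 12, 21, 31, 41),
  stringsAsFactors = FALSE
)

# Step 1: drop rows without a sourceConceptId and keep distinct
# (sourceCode, sourceConceptId) pairs, so a multi-mapped code counts once.
hasId <- !is.na(usagi$sourceConceptId)
pairs <- unique(usagi[hasId, c("sourceCode", "sourceConceptId")])

# Step 2: an id that occurs in more than one distinct pair is shared
# by different source codes.
sharedIds <- unique(pairs$sourceConceptId[duplicated(pairs$sourceConceptId)])

# Step 3: flag every original row that carries a shared id (assumed behaviour).
usagi$flagged <- hasId & usagi$sourceConceptId %in% sharedIds
```

In this toy table, `A1` maps to two target concepts with the same id and stays unflagged, while `B2` and `C3` share an id and are both flagged.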

Original prompt

This section details the original issue you should resolve

<issue_title>Usagi validation is not detecting duplicated source concept ids</issue_title>
<issue_description>SourceConceptId must be unique
this means that in the usagi file
SourceConceptId + SourceCode must be unique
this has to be added to the rules in the validate usagi function
also a test must be added to corresponding file test
and a test row to the usagi with error ICD10 file in order to trigger the test </issue_description>

Comments on the Issue (you are @copilot in this section)


Co-authored-by: javier-gracia-tabuenca-tuni <54809193+javier-gracia-tabuenca-tuni@users.noreply.github.com>
Agent-Logs-Url: https://github.com/FinOMOP/ROMOPMappingTools/sessions/7ce96e01-5b9b-4d67-9044-42baa2b415e2
Copilot AI changed the title from "[WIP] Add validation for duplicated source concept ids in Usagi" to "Add SourceConceptId uniqueness validation to validateUsagiFile" Mar 23, 2026
Comment thread R/validateUsagiFile.R Outdated
usagiTibble <- result$fileTibble
validationLogR6 <- result$validationLogR6

# Check SourceConceptId is unique (each sourceConceptId should belong to only one sourceCode)
Member

Why don't you do this in the upper section using the validate:: package?
Just like:
SourceConceptId.SourceCodes.are.unique = is_unique(`ADD_INFO:sourceConceptId`, sourceCode)

Contributor Author

Refactored in commit 3517032 to use validate:: — however, is_unique(sourceConceptId, sourceCode) directly on the full tibble would incorrectly flag valid multi-mapped codes (the valid file has 420 source codes that appear multiple times with the same sourceConceptId but different conceptIds).

The fix deduplicates to distinct (sourceCode, sourceConceptId) pairs first, then uses is_unique(sourceConceptId) on that deduplicated tibble:

usagiDistinct <- usagiTibble |>
    dplyr::filter(!is.na(`ADD_INFO:sourceConceptId`)) |>
    dplyr::distinct(sourceCode, `ADD_INFO:sourceConceptId`)

distinctValidationRules <- validate::validator(
    SourceConceptId.is.not.unique = is_unique(`ADD_INFO:sourceConceptId`)
)
distinctValidations <- validate::confront(usagiDistinct, distinctValidationRules)
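
The difference between the two checks can be seen on a tiny example. The sketch below uses base R `duplicated()` to mimic a uniqueness test rather than calling `validate::is_unique()`; the data and the shortened column name `sourceConceptId` are invented for illustration.

```r
# One multi-mapped code (A1 -> two target concepts) and one ordinary code.
usagi <- data.frame(
  sourceCode      = c("A1", "A1", "B2"),
  sourceConceptId = c(2000000001, 2000000001, 2000000002),
  conceptId       = c(11, 12, 21),
  stringsAsFactors = FALSE
)

# TRUE where a value occurs exactly once in the vector.
occursOnce <- function(x) !(duplicated(x) | duplicated(x, fromLast = TRUE))

# Check 1: uniqueness of the (sourceConceptId, sourceCode) pair on the full
# table. Both A1 rows fail, although they are a valid multi-mapping.
pairKey <- paste(usagi$sourceConceptId, usagi$sourceCode, sep = "|")
pairIsUnique <- occursOnce(pairKey)

# Check 2: deduplicate to distinct pairs first, then test the id alone.
# No false positive remains.
pairs <- unique(usagi[, c("sourceCode", "sourceConceptId")])
idIsUnique <- occursOnce(pairs$sourceConceptId)
```

Check 1 marks both `A1` rows as non-unique, while Check 2 passes for both distinct pairs.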

…icated tibble

Co-authored-by: javier-gracia-tabuenca-tuni <54809193+javier-gracia-tabuenca-tuni@users.noreply.github.com>
Agent-Logs-Url: https://github.com/FinOMOP/ROMOPMappingTools/sessions/26984a07-d241-421c-8b78-ee643cf070e7
@javier-gracia-tabuenca-tuni
Member

@copilot can you look at the error from the actions and fix it

Contributor Author

Copilot AI commented Mar 23, 2026

@javier-gracia-tabuenca-tuni The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 1c05da9f-f863-44c5-ad7d-4c5f6186af2a


@FinOMOP FinOMOP deleted a comment from Copilot AI Mar 23, 2026
@javier-gracia-tabuenca-tuni
Member

@copilot

actions return this error

|
|======================================================================| 100%
Executing SQL took 0.00365 secs
Inserting data took 0.0109 secs
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 248 ]

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-validateUsagiFile.R:196:3'): test validateUsagiFile returns errors with the errored usagi file ──
Expected dplyr::pull(...) to equal "ERROR: SourceConceptId is not unique".
Differences:
actual vs expected

- "ERROR OUTDATED conceptName: conceptName for conceptId 4121541 is different in the target vocabularies | ERROR: SourceConceptId is not unique | ERROR: A99.0 is not a valid concept code in vocabulary ICD10"
+ "ERROR: SourceConceptId is not unique"

fix it

… valid A18.8+E35.0 row

Co-authored-by: javier-gracia-tabuenca-tuni <54809193+javier-gracia-tabuenca-tuni@users.noreply.github.com>
Agent-Logs-Url: https://github.com/FinOMOP/ROMOPMappingTools/sessions/4322906e-3901-4f87-aacc-a2d929dab135
Contributor Author

Copilot AI commented Mar 23, 2026

@copilot

actions return this error
...

Fixed in commit 3df863e. The A99.0 row was triggering two extra validation errors:

  1. OUTDATED conceptName — conceptId 4121541 is used by another row (A17.8+G63.1) with the wrong name "Neutropenia", and the outdated-concepts join propagates that error to all rows with the same conceptId, including A99.0.
  2. Invalid parent concept code: A99.0 doesn't exist in the ICD10 vocabulary in the test DB.

Replaced A99.0 with A18.8+E35.0 which uses:

  • A valid ICD10fi source code with parent codes (A18|A18.8|E35.0) all present in the test DB
  • A clean conceptId 141777 ("Tuberculosis of thyroid gland") not used anywhere else in the error file, matching the test DB name exactly
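
The failure shows why an exact-equality assertion on the message column is sensitive to unrelated rules firing on the same row: the messages appear joined with " | ". As a hedged illustration only (the PR fixed the test data instead, and the variable names here are hypothetical), the base R sketch below splits the failing string from the log and compares exact equality with a membership check.

```r
# The message string reported as "actual" in the failing check.
actual <- paste(
  "ERROR OUTDATED conceptName: conceptName for conceptId 4121541 is different in the target vocabularies",
  "ERROR: SourceConceptId is not unique",
  "ERROR: A99.0 is not a valid concept code in vocabulary ICD10",
  sep = " | "
)
expected <- "ERROR: SourceConceptId is not unique"

# Exact equality fails as soon as any other rule also flags the row.
exactMatch <- identical(actual, expected)

# Splitting on the separator allows checking for one message in isolation.
messages <- strsplit(actual, " | ", fixed = TRUE)[[1]]
containsRule <- expected %in% messages
```

Keeping the exact-equality assertion, as the merged test does, has the advantage of also verifying that the test row triggers no other rule.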

@javier-gracia-tabuenca-tuni javier-gracia-tabuenca-tuni marked this pull request as ready for review March 23, 2026 09:18
@javier-gracia-tabuenca-tuni javier-gracia-tabuenca-tuni merged commit ef2cb52 into development Mar 23, 2026
1 check passed