docs: add retriever SDG toolkit dev note by shan-nvidia · Pull Request #666 · NVIDIA-NeMo/DataDesigner

shan-nvidia · 2026-05-15T20:24:14Z

📋 Summary

Adds a new dev note for the data-designer-retrieval-sdg toolkit, explaining why retriever synthetic data generation matters and how the toolkit turns documents into retriever training and BEIR evaluation artifacts.

🔗 Related Issue

N/A

🔄 Changes

Adds the Retriever SDG Toolkit dev note to MkDocs and Fern.
Adds a pipeline SVG showing document chunking, grounded QA generation, deduplication/judging, and conversion outputs.
Adds Steve Han to the dev-note author registries and uses that author for the new post.
Adds the new post to the MkDocs and Fern Dev Notes navigation/index.

🧪 Testing

.venv/bin/mkdocs build passes
make check-fern-docs-locally passes
Unit tests added/updated: N/A - docs-only change
E2E tests added/updated: N/A - docs-only change

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated: N/A - dev note only

Signed-off-by: Steve Han <sthan@nvidia.com>

github-actions · 2026-05-15T20:24:25Z

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

github-actions · 2026-05-15T20:27:10Z

Review: PR #666 — docs: add retriever SDG toolkit dev note

Summary

Docs-only PR adding a new "Retriever SDG Toolkit" dev note to both the MkDocs and Fern documentation sites. Changes:

New post: docs/devnotes/posts/retrieval-sdg-toolkit.md (MkDocs) and fern/versions/latest/pages/devnotes/posts/retrieval-sdg-toolkit.mdx (Fern), with parallel content adapted to each engine's syntax.
New pipeline diagram: pipeline.svg mirrored under docs/devnotes/posts/assets/retrieval-sdg-toolkit/ and fern/assets/retrieval-sdg-toolkit/ (the two files are byte-identical — verified).
New author entry sthan (Steve Han) added to all three author registries: docs/devnotes/.authors.yml, fern/components/devnotes/.authors.yml, and fern/components/devnotes/authors-data.ts.
Navigation/index updates: top of mkdocs.yml Dev Notes (after index), top of Fern latest.yml Dev Notes section, and a new lead BlogCard in fern/versions/latest/pages/devnotes/index.mdx.

PR is +940 / -0 across 10 files. No code is touched.

Findings

Consistency with existing dev note conventions — good

MkDocs frontmatter uses date: + authors: and the <!-- more --> excerpt marker, matching vlm-long-document-understanding.md.
Fern frontmatter uses title: / description: + <Authors ids={[...]} /> and {/* more */} — matches the Fern equivalent of the VLM post.
Author registration is mirrored across all three registries (yml × 2 + ts), which is the established pattern.
New post is placed first in both MkDocs nav and Fern nav, consistent with the "most recent → oldest" comment in mkdocs.yml. Date 2026-05-14 is one day before today (2026-05-15), so the ordering is correct.

Slug / filename mismatch — worth a note

The Fern file is retrieval-sdg-toolkit.mdx but its frontmatter sets slug: retriever-sdg-toolkit ("retrieval" vs "retriever"). The BlogCard href /dev-notes/retriever-sdg-toolkit matches the slug, so the link is not broken — but readers grepping for the URL by filename will be tripped up. Either:

Rename the file to retriever-sdg-toolkit.mdx (and adjust the path in latest.yml) so filename matches the slug, OR
Drop the explicit slug: and let it derive from filename (this would change the URL to /dev-notes/retrieval-sdg-toolkit; update the BlogCard href accordingly).

Either is fine; the asymmetry is the smell. The MkDocs post uses a third, longer slug (retriever-sdg-toolkit-from-documents-to-training-data), so MkDocs and Fern URLs already diverge. Probably tolerable since the two sites are separate properties, but a single canonical slug would be tidier.
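A check like the following could catch this kind of filename/slug drift automatically. It is a minimal sketch, not part of this repo's tooling: the helper names `frontmatter_slug` and `slug_mismatch` are hypothetical, and it assumes the slug sits on a single `slug:` line inside a leading `---` frontmatter block.

```python
import re
from pathlib import Path
from typing import Optional


def frontmatter_slug(text: str) -> Optional[str]:
    """Return the `slug:` value from a leading `---` frontmatter block, if any."""
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", text, flags=re.DOTALL)
    if match is None:
        return None  # no frontmatter at all
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "slug":
            return value.strip().strip("'\"")
    return None  # frontmatter present, but no explicit slug


def slug_mismatch(filename: str, text: str) -> bool:
    """True when an explicit slug's last path segment differs from the file stem."""
    slug = frontmatter_slug(text)
    if slug is None:
        return False  # slug is derived from the filename, so nothing to compare
    return slug.rstrip("/").split("/")[-1] != Path(filename).stem
```

Run over every `.mdx` file under the Fern pages directory, this would flag `retrieval-sdg-toolkit.mdx` as long as its frontmatter keeps `slug: retriever-sdg-toolkit`.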

External links — plausible but unverifiable from CI

The post links to several external resources that I cannot reach from this runner:

https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg
https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed
https://github.com/NVIDIA-NeMo/Nemotron/tree/preview/rerank-finetune-recipe-v1/src/nemotron/recipes/rerank
https://github.com/NVIDIA-NeMo/Automodel
https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1
https://huggingface.co/blog/nvidia/domain-specific-embedding-finetune

The reranking-recipe link uses the preview/rerank-finetune-recipe-v1 branch rather than main, which is fragile — branches like this are commonly squashed/deleted after merge. Recommend confirming with @shan-nvidia that the branch will remain valid for at least the lifetime of this post, or pinning to a commit SHA. The other URLs follow standard nvidia-nemo/HF naming and are plausible.
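To make the branch-versus-SHA distinction concrete, here is a rough sketch of a link lint. The function name `unpinned_links` and the `allowed_refs` default are assumptions chosen for illustration; the check only looks at the first path segment after `tree/` or `blob/`, so it treats anything that is neither an allowed ref nor a full 40-character commit SHA as unpinned.

```python
import re
from typing import List, Tuple

# GitHub tree/blob URL; group 1 is everything after "tree/" or "blob/".
TREE_URL = re.compile(r"https://github\.com/[^/\s]+/[^/\s]+/(?:tree|blob)/([^\s)]+)")
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def unpinned_links(text: str, allowed_refs: Tuple[str, ...] = ("main",)) -> List[str]:
    """Return GitHub URLs whose ref is neither an allowed ref nor a full commit SHA."""
    flagged = []
    for match in TREE_URL.finditer(text):
        first_segment = match.group(1).split("/")[0]
        if first_segment in allowed_refs or FULL_SHA.fullmatch(first_segment):
            continue
        flagged.append(match.group(0))
    return flagged
```

Applied to the six links above, only the `preview/rerank-finetune-recipe-v1` URL would be reported; a URL pinned to a 40-character SHA would pass.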

Code-snippet accuracy

The post documents APIs from the external data-designer-retrieval-sdg plugin (not in this repo), so I cannot statically verify them. Worth spot-checking with the plugin author:

from data_designer_retrieval_sdg.seed_source import DocumentChunkerSeedSource
from data_designer_retrieval_sdg import DocumentChunkerSeedSource, build_qa_generation_pipeline
from data_designer_retrieval_sdg.config import EmbeddingDedupColumnConfig
CLI flags: --input-dir, --output-dir, --num-files, --num-pairs, --batch-size, --preview, --corpus-id, --quality-threshold, --min-complexity, --min-hops, --similarity-threshold, --multi-doc
Constructor kwargs: path, file_pattern, recursive, file_extensions, min_text_length, sentences_per_chunk, num_sections, multi_doc, bundle_size, bundle_strategy, max_docs_per_bundle
data-designer plugin install data-designer-retrieval-sdg — verify this is the actual install command exposed by the plugin catalog.

The author has noted mkdocs build and make check-fern-docs-locally pass; that catches structural issues but not API drift against the external package. A quick pip install-and-import smoke test against the linked plugin repo would close that gap.
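One possible shape for that smoke test is sketched below. The module and attribute names in `DOCUMENTED` are copied from the import lines listed above and are unverified here; checking whether they exist is the whole point of the test. The helper `smoke_test_imports` is a hypothetical name, and the sketch assumes the plugin has already been installed into the current environment.

```python
import importlib
from typing import Dict, Iterable, Optional, Tuple


def smoke_test_imports(targets: Iterable[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Map each 'module.attr' to None on success, or to an error summary on failure."""
    results: Dict[str, Optional[str]] = {}
    for module_name, attr in targets:
        key = f"{module_name}.{attr}"
        try:
            module = importlib.import_module(module_name)
            getattr(module, attr)  # raises AttributeError if the name is missing
            results[key] = None
        except (ImportError, AttributeError) as exc:
            results[key] = f"{type(exc).__name__}: {exc}"
    return results


# Names taken from the post's snippets; treat them as claims to verify, not facts.
DOCUMENTED = [
    ("data_designer_retrieval_sdg.seed_source", "DocumentChunkerSeedSource"),
    ("data_designer_retrieval_sdg", "DocumentChunkerSeedSource"),
    ("data_designer_retrieval_sdg", "build_qa_generation_pipeline"),
    ("data_designer_retrieval_sdg.config", "EmbeddingDedupColumnConfig"),
]
```

Calling `smoke_test_imports(DOCUMENTED)` and failing the job on any non-`None` value would surface API drift. It would not cover the CLI flags or constructor kwargs, which need a separate check.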

Asset duplication

pipeline.svg is committed twice (docs/devnotes/posts/assets/... and fern/assets/...), byte-identical. This matches the project's existing pattern (other dev notes do the same), so not a blocker — just a maintenance tax: any future tweak has to land in both places. Out of scope for this PR.
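Until the duplication is addressed, a small guard could at least detect when the two copies drift apart. This is a sketch under the assumption that a plain content-hash comparison is enough; `sha256_of` and `assets_in_sync` are illustrative names, not existing project helpers.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def assets_in_sync(first: Path, second: Path) -> bool:
    """True when both files are byte-identical (compared by content hash)."""
    return sha256_of(first) == sha256_of(second)
```

For this PR the two arguments would be `docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg` and `fern/assets/retrieval-sdg-toolkit/pipeline.svg`.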

Minor copy notes

The post uses ASCII hyphens for em-dashes ("If your users ask questions that span multiple documents - for example..."). The VLM post uses real em-dashes ("—"). Minor stylistic divergence; pick one for consistency.
"Compatibility metadata and installation through the Data Designer plugin catalog." — unclear referent. Worth a one-line mention of which catalog/registry, since this is the first dev note that introduces the plugin-catalog concept end-to-end for users.

Security / sensitive content

Nothing concerning. No secrets, no internal hostnames, no embedded directives that look like injection attempts in the diff.

Verdict

Approve with minor revisions suggested. The PR is a clean, well-structured docs addition that follows existing dev-note conventions for both MkDocs and Fern. The pipeline SVG is well-crafted and accessible (has <title>/<desc>). Two recommended changes before merge:

Pin or verify the preview/rerank-finetune-recipe-v1 branch link — branch URLs rot quickly.
Reconcile the Fern filename / slug (retrieval-sdg-toolkit.mdx vs slug: retriever-sdg-toolkit) so future maintainers don't grep past the file.

Optional: align em-dash style with neighboring posts and consider deduplicating pipeline.svg via a build-time copy (broader project decision, not blocking).

greptile-apps · 2026-05-15T20:27:25Z

Greptile Summary

This PR adds a new dev note for the data-designer-retrieval-sdg toolkit, wiring it into both the MkDocs and Fern documentation systems along with a new author entry and a pipeline SVG diagram.

Adds docs/devnotes/posts/retrieval-sdg-toolkit.md and fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx with matching content explaining the four-stage document-to-retriever-data pipeline (chunk → generate QA → deduplicate/judge → convert).
Registers Steve Han (sthan) as an author across the MkDocs YAML, Fern YAML, and Fern TypeScript registries; inserts the new page into both navigation configs and the Fern dev-notes index card grid.
Adds an accessible SVG pipeline diagram (with role="img", aria-labelledby, <title>, and <desc>) duplicated into both the MkDocs and Fern asset trees.

Confidence Score: 5/5

Documentation-only change with no runtime code; both build systems verified locally by the author.

All changed files are documentation assets, navigation config, and author registry entries. Cross-references between the MkDocs and Fern systems (file paths, slugs, author IDs, asset URLs) are internally consistent, and the PR checklist confirms both mkdocs build and make check-fern-docs-locally pass.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/devnotes/posts/retrieval-sdg-toolkit.md | New MkDocs dev note covering the four-stage retriever SDG pipeline; content, code samples, and frontmatter are all consistent and well-structured. |
| fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx | Fern-formatted counterpart of the dev note; slug, author reference, and image path are correct and consistent with the Fern asset layout. |
| fern/versions/latest/pages/devnotes/index.mdx | Adds a BlogCard for the new post at the top of the dev-notes index; href, image src, date, and author ID all align with the new page and asset paths. |
| fern/components/devnotes/authors-data.ts | Adds the sthan author entry to the TypeScript registry; name, description, and avatar URL match the YAML author files exactly. |
| docs/devnotes/.authors.yml | Adds sthan author to the MkDocs authors registry; entry is well-formed and consistent with the Fern YAML and TypeScript counterparts. |
| fern/components/devnotes/.authors.yml | Adds sthan author to the Fern authors YAML; mirrors the MkDocs authors file correctly. |
| mkdocs.yml | Inserts the new dev note at the top of the Dev Notes nav section (most-recent-first order); file path reference matches the added markdown file. |
| fern/versions/latest.yml | Adds the Retriever SDG Toolkit page to the Fern navigation; path reference matches the new MDX file. |
| docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg | New SVG pipeline diagram for MkDocs; includes accessible title/desc elements and aria-labelledby attribute. |
| fern/assets/retrieval-sdg-toolkit/pipeline.svg | Identical SVG copied for the Fern docs asset path; content is identical to the MkDocs copy. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Source Documents\nDocs / policies / tickets / manuals] --> B[Stage 1: Bundle Docs\nsingle + multi-doc groups]
    A --> C[Stage 1: Chunk Docs\nstable segment IDs]
    B --> D[Stage 2: Extract Artifacts\nconcepts / entities / links]
    C --> E[Stage 2: Generate QA\ngrounded multi-hop questions]
    D --> F[Stage 3: Deduplicate\nnear-duplicate queries]
    E --> G[Stage 3: Judge Quality\nrelevance / support / clarity]
    F --> H[Stage 4: Convert\ntrain/val, BEIR qrels, AutoModel data]
    G --> H
    H --> I[train.json / val.json / corpus/]
    H --> J[eval_beir/ corpus.jsonl / qrels/test.tsv]

Reviews (5): Last reviewed commit: "docs: fix retriever SDG pipeline flow or..."


github-actions · 2026-05-15T20:40:22Z

MkDocs preview: https://c430f2c2.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-666.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

shan-nvidia · 2026-05-15T20:51:16Z

I have read the DCO document and I hereby sign the DCO.

nabinchha · 2026-05-18T20:04:39Z

@@ -0,0 +1,339 @@
+---
+date: 2026-05-14


You'll want this to be the date this will be published!


johnnygreco · 2026-05-18T21:24:20Z

+
+The new [`data-designer-retrieval-sdg`](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg) toolkit fills that gap: start with a directory of documents, generate synthetic query-positive examples with NeMo Data Designer, filter them, and export them for retriever fine-tuning and BEIR-style evaluation.
+
+<!-- more -->


nit: can we try moving this up so that there is a shorter "abstract" on the index page? This is the text that will appear above "Continue reading".

johnnygreco · 2026-05-18T21:26:03Z

+This is not just a demo package. The same toolkit produced the [Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) dataset from NVIDIA public documentation, and it powers the bootstrap SDG stage for both the NeMo [embedding fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed) and [reranking fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/e6e8a3281a11b8e1b7b47af098bbf54416c68d47/src/nemotron/recipes/rerank). It is now available as a standalone tool for generating high-quality, complex, multi-document, multi-hop retrieval data compatible with [AutoModel](https://github.com/NVIDIA-NeMo/Automodel).
+
+This post walks through what the toolkit does, why the generated labels matter, and how to make your first small run useful before you scale it up.
+
+---
+
+## **From Documents to Retriever Data**
+


I'm wondering if this and the "If you are building a RAG system..." narrative at the top can be merged into an intro / context-setting section before going straight to the contents of the plugin and the diagram.


johnnygreco · 2026-05-18T21:30:09Z

+
+The hard part is not asking an LLM to write questions about a document. The hard part is keeping every generated question tied to the exact chunk, document, or multi-hop evidence set that a retriever should recover. Many RAG tutorials stop at chunk, embed, retrieve, and prompt. Fine-tuning recipes often begin once labeled query-passage pairs already exist. The gap in between is where developers lose the most time.
+
+The new [`data-designer-retrieval-sdg`](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg) toolkit fills that gap: start with a directory of documents, generate synthetic query-positive examples with NeMo Data Designer, filter them, and export them for retriever fine-tuning and BEIR-style evaluation.


Can we call this a "plugin" rather than a "toolkit" throughout? You can still say "toolkit" if you prefer that noun, but maybe something like "the plugin contains a retrieval SDG toolkit" would help make clear that this is a Data Designer plugin.

johnnygreco · 2026-05-18T21:32:12Z

+| `document-chunker` | seed reader | Turns text files into sentence chunks with stable segment IDs, so each query can point back to the passages that answer it. |
+| `embedding-dedup` | column generator | Removes near-duplicate generated questions before judging and export, so the training data has more variety. |
+
+It also ships a normal Python API and a CLI:


We'll need to update this Dev Note when we figure out the long-term pattern for CLI-like functionality. Okay to keep this here, but we can't forget to update this later!

johnnygreco · 2026-05-18T21:36:30Z

+## **Why This Belongs in a Plugin**
+
+A blog recipe can teach the workflow. A plugin makes the workflow reusable.


IMO this "Why This Belongs in a Plugin" framing feels more like our internal discussions than how we should speak about it here. What do you think about framing it around how Data Designer plugins unlock custom use cases?

docs: add retriever SDG toolkit dev note

3d01277

Signed-off-by: Steve Han <sthan@nvidia.com>

shan-nvidia requested a review from a team as a code owner May 15, 2026 20:24

shan-nvidia temporarily deployed to agentic-ci May 15, 2026 20:24 — with GitHub Actions Inactive

docs: resolve retriever SDG dev note feedback

1dbd9f7

Signed-off-by: Steve Han <sthan@nvidia.com>

Merge branch 'main' into codex-sthan-retrieval-sdg-devnote

a6d4b81

github-actions Bot mentioned this pull request May 18, 2026

Agentic CI: Issue & PR Triage Tracker #562

Open

nabinchha reviewed May 18, 2026


docs: restyle retriever SDG pipeline diagram

7d3dc0d

Signed-off-by: Steve Han <sthan@nvidia.com>

johnnygreco reviewed May 18, 2026


docs: fix retriever SDG pipeline flow order

76a4ff9

Signed-off-by: Steve Han <sthan@nvidia.com>

johnnygreco reviewed May 18, 2026


docs: add retriever SDG toolkit dev note #666
shan-nvidia wants to merge 5 commits into main from codex-sthan-retrieval-sdg-devnote


3 participants

