docs: add retriever SDG toolkit dev note #666

Open
shan-nvidia wants to merge 5 commits into main from codex-sthan-retrieval-sdg-devnote

@shan-nvidia

📋 Summary

Adds a new dev note for the data-designer-retrieval-sdg toolkit, explaining why retriever synthetic data generation matters and how the toolkit turns documents into retriever training and BEIR evaluation artifacts.

🔗 Related Issue

N/A

🔄 Changes

  • Adds the Retriever SDG Toolkit dev note to MkDocs and Fern.
  • Adds a pipeline SVG showing document chunking, grounded QA generation, deduplication/judging, and conversion outputs.
  • Adds Steve Han to the dev-note author registries and uses that author for the new post.
  • Adds the new post to the MkDocs and Fern Dev Notes navigation/index.

🧪 Testing

  • .venv/bin/mkdocs build passes
  • make check-fern-docs-locally passes
  • Unit tests added/updated: N/A - docs-only change
  • E2E tests added/updated: N/A - docs-only change

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated: N/A - dev note only

Signed-off-by: Steve Han <sthan@nvidia.com>
@shan-nvidia shan-nvidia requested a review from a team as a code owner May 15, 2026 20:24
@github-actions
Contributor

github-actions Bot commented May 15, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@github-actions
Contributor

Review: PR #666 — docs: add retriever SDG toolkit dev note

Summary

Docs-only PR adding a new "Retriever SDG Toolkit" dev note to both the MkDocs and Fern documentation sites. Changes:

  • New post: docs/devnotes/posts/retrieval-sdg-toolkit.md (MkDocs) and fern/versions/latest/pages/devnotes/posts/retrieval-sdg-toolkit.mdx (Fern), with parallel content adapted to each engine's syntax.
  • New pipeline diagram: pipeline.svg mirrored under docs/devnotes/posts/assets/retrieval-sdg-toolkit/ and fern/assets/retrieval-sdg-toolkit/ (the two files are byte-identical — verified).
  • New author entry sthan (Steve Han) added to all three author registries: docs/devnotes/.authors.yml, fern/components/devnotes/.authors.yml, and fern/components/devnotes/authors-data.ts.
  • Navigation/index updates: top of mkdocs.yml Dev Notes (after index), top of Fern latest.yml Dev Notes section, and a new lead BlogCard in fern/versions/latest/pages/devnotes/index.mdx.

PR is +940 / -0 across 10 files. No code is touched.

Findings

Consistency with existing dev note conventions — good

  • MkDocs frontmatter uses date: + authors: and <!-- more --> excerpt marker — matches vlm-long-document-understanding.md.
  • Fern frontmatter uses title: / description: + <Authors ids={[...]} /> and {/* more */} — matches the Fern equivalent of the VLM post.
  • Author registration is mirrored across all three registries (yml × 2 + ts), which is the established pattern.
  • New post is placed first in both MkDocs nav and Fern nav, consistent with the "most recent → oldest" comment in mkdocs.yml. Date 2026-05-14 is one day before today (2026-05-15), so the ordering is correct.

Slug / filename mismatch — worth a note

The Fern file is retrieval-sdg-toolkit.mdx but its frontmatter sets slug: retriever-sdg-toolkit ("retrieval" vs "retriever"). The BlogCard href /dev-notes/retriever-sdg-toolkit matches the slug, so the link is not broken — but readers grepping for the URL by filename will be tripped up. Either:

  • Rename the file to retriever-sdg-toolkit.mdx (and adjust the path in latest.yml) so filename matches the slug, OR
  • Drop the explicit slug: and let it derive from filename (this would change the URL to /dev-notes/retrieval-sdg-toolkit; update the BlogCard href accordingly).

Either is fine; the asymmetry is the smell. The MkDocs post uses a third, longer slug (retriever-sdg-toolkit-from-documents-to-training-data), so MkDocs and Fern URLs already diverge. Probably tolerable since the two sites are separate properties, but a single canonical slug would be tidier.
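A check along these lines could catch the filename/slug drift automatically. This is a minimal sketch, assuming the slug lives on a plain `slug:` line inside a leading `---` frontmatter block; `frontmatter_slug` and `slug_mismatch` are hypothetical helper names, not part of the repo tooling.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def frontmatter_slug(text: str) -> Optional[str]:
    """Return the `slug:` value from a leading `---` frontmatter block, if any."""
    match = re.match(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", text, flags=re.DOTALL)
    if match is None:
        return None
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "slug":
            return value.strip().strip("'\"")
    return None


def slug_mismatch(path: str, text: str) -> Optional[Tuple[str, str]]:
    """Return (filename_stem, last_slug_segment) when they differ, else None."""
    slug = frontmatter_slug(text)
    if slug is None:
        return None  # no explicit slug, so the URL derives from the filename
    stem = Path(path).stem
    tail = slug.rstrip("/").rsplit("/", 1)[-1]
    return (stem, tail) if stem != tail else None
```

Running `slug_mismatch` over every post under the Fern `devnotes/posts/` directory would list the files whose name and slug disagree.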

External links — plausible but unverifiable from CI

The post links to several external resources that I cannot reach from this runner:

  • https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg
  • https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed
  • https://github.com/NVIDIA-NeMo/Nemotron/tree/preview/rerank-finetune-recipe-v1/src/nemotron/recipes/rerank
  • https://github.com/NVIDIA-NeMo/Automodel
  • https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1
  • https://huggingface.co/blog/nvidia/domain-specific-embedding-finetune

The reranking-recipe link uses the preview/rerank-finetune-recipe-v1 branch rather than main, which is fragile: branches like this are commonly squashed or deleted after merge. Recommend confirming with @shan-nvidia that the branch will remain valid for at least the lifetime of this post, or pinning to a commit SHA. The other URLs follow the usual NVIDIA-NeMo and Hugging Face naming patterns and look plausible.
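A link lint could flag this class of URL without network access. The sketch below is a heuristic under stated assumptions: it only inspects the first path segment after `/tree/` or `/blob/`, treats a full 40-character hex string as a commit SHA, and treats `main` as the only long-lived branch. `fragile_tree_links` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

# A full-length Git commit SHA (40 lowercase hex characters).
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def fragile_tree_links(urls, stable_refs=frozenset({"main"})):
    """Return GitHub tree/blob URLs that point at neither a SHA nor a stable ref."""
    fragile = []
    for url in urls:
        parts = urlparse(url).path.strip("/").split("/")
        # Expected shape: <owner>/<repo>/<tree|blob>/<ref...>/<path...>
        if len(parts) < 4 or parts[2] not in {"tree", "blob"}:
            continue  # repo root or non-GitHub-style link: nothing to check
        ref_head = parts[3]
        if _FULL_SHA.fullmatch(ref_head) or ref_head in stable_refs:
            continue
        fragile.append(url)
    return fragile
```

Because branch names may contain slashes, the ref cannot be split from the path reliably; checking only the first segment is enough to tell a SHA-pinned link from a branch link.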

Code-snippet accuracy

The post documents APIs from the external data-designer-retrieval-sdg plugin (not in this repo), so I cannot statically verify them. Worth spot-checking with the plugin author:

  • from data_designer_retrieval_sdg.seed_source import DocumentChunkerSeedSource
  • from data_designer_retrieval_sdg import DocumentChunkerSeedSource, build_qa_generation_pipeline
  • from data_designer_retrieval_sdg.config import EmbeddingDedupColumnConfig
  • CLI flags: --input-dir, --output-dir, --num-files, --num-pairs, --batch-size, --preview, --corpus-id, --quality-threshold, --min-complexity, --min-hops, --similarity-threshold, --multi-doc
  • Constructor kwargs: path, file_pattern, recursive, file_extensions, min_text_length, sentences_per_chunk, num_sections, multi_doc, bundle_size, bundle_strategy, max_docs_per_bundle
  • data-designer plugin install data-designer-retrieval-sdg — verify this is the actual install command exposed by the plugin catalog.

The author has noted mkdocs build and make check-fern-docs-locally pass; that catches structural issues but not API drift against the external package. A quick pip install-and-import smoke test against the linked plugin repo would close that gap.
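Such a smoke test could be as small as the sketch below. The module and symbol names in `EXPECTED` are copied from the list above and remain unverified; `missing_symbols` is a hypothetical helper, and the plugin would need to be installed in the environment before running it.

```python
import importlib


def missing_symbols(expected):
    """Return module names that fail to import and 'module.name' attributes that are absent."""
    missing = []
    for module_name, names in expected.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            missing.append(module_name)
            continue
        missing.extend(
            f"{module_name}.{name}" for name in names if not hasattr(module, name)
        )
    return missing


# Imports quoted in the dev note, as listed in this review (unverified).
EXPECTED = {
    "data_designer_retrieval_sdg": [
        "DocumentChunkerSeedSource",
        "build_qa_generation_pipeline",
    ],
    "data_designer_retrieval_sdg.seed_source": ["DocumentChunkerSeedSource"],
    "data_designer_retrieval_sdg.config": ["EmbeddingDedupColumnConfig"],
}
```

An empty result from `missing_symbols(EXPECTED)` would confirm only that the imports resolve; CLI flags and constructor kwargs would still need a manual check.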

Asset duplication

pipeline.svg is committed twice (docs/devnotes/posts/assets/... and fern/assets/...), byte-identical. This matches the project's existing pattern (other dev notes do the same), so not a blocker — just a maintenance tax: any future tweak has to land in both places. Out of scope for this PR.
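If the duplication stays, a small guard could keep the two copies from drifting. This is a sketch only; `assets_in_sync` is a hypothetical helper, and the paths in the usage comment are the two asset locations named in this PR.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def assets_in_sync(path_a, path_b):
    """True when both files are byte-identical (compared via SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Example usage from the repo root:
# assets_in_sync(
#     "docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg",
#     "fern/assets/retrieval-sdg-toolkit/pipeline.svg",
# )
```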

Minor copy notes

  • The post uses ASCII hyphens for em-dashes ("If your users ask questions that span multiple documents - for example..."). The VLM post uses real em-dashes ("—"). Minor stylistic divergence; pick one for consistency.
  • "Compatibility metadata and installation through the Data Designer plugin catalog." — unclear referent. Worth a one-line mention of which catalog/registry, since this is the first dev note that introduces the plugin-catalog concept end-to-end for users.
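For the hyphen-versus-em-dash note, a quick scan like the following could locate candidates. It is a rough sketch: the pattern matches any hyphen with a space on each side between word characters, so it will also flag intentional uses such as "N/A - docs-only change". `find_spaced_hyphens` is a hypothetical helper name.

```python
import re

# A hyphen with one space on each side, sitting between two word characters.
SPACED_HYPHEN = re.compile(r"(?<=\w) - (?=\w)")


def find_spaced_hyphens(text):
    """Return (line_number, line) pairs for lines that contain a spaced hyphen."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if SPACED_HYPHEN.search(line)
    ]
```

Markdown list bullets are not matched, because a bullet hyphen at the start of a line has no word character before it.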

Security / sensitive content

Nothing concerning. No secrets, no internal hostnames, no embedded directives that look like injection attempts in the diff.

Verdict

Approve with minor revisions suggested. The PR is a clean, well-structured docs addition that follows existing dev-note conventions for both MkDocs and Fern. The pipeline SVG is well-crafted and accessible (has <title>/<desc>). Two recommended changes before merge:

  1. Pin or verify the preview/rerank-finetune-recipe-v1 branch link — branch URLs rot quickly.
  2. Reconcile the Fern filename / slug (retrieval-sdg-toolkit.mdx vs slug: retriever-sdg-toolkit) so future maintainers don't grep past the file.

Optional: align em-dash style with neighboring posts and consider deduplicating pipeline.svg via a build-time copy (broader project decision, not blocking).

@greptile-apps
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR adds a new dev note for the data-designer-retrieval-sdg toolkit, wiring it into both the MkDocs and Fern documentation systems along with a new author entry and a pipeline SVG diagram.

  • Adds docs/devnotes/posts/retrieval-sdg-toolkit.md and fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx with matching content explaining the four-stage document-to-retriever-data pipeline (chunk → generate QA → deduplicate/judge → convert).
  • Registers Steve Han (sthan) as an author across the MkDocs YAML, Fern YAML, and Fern TypeScript registries; inserts the new page into both navigation configs and the Fern dev-notes index card grid.
  • Adds an accessible SVG pipeline diagram (with role="img", aria-labelledby, <title>, and <desc>) duplicated into both the MkDocs and Fern asset trees.

Confidence Score: 5/5

Documentation-only change with no runtime code; both build systems verified locally by the author.

All changed files are documentation assets, navigation config, and author registry entries. Cross-references between the MkDocs and Fern systems (file paths, slugs, author IDs, asset URLs) are internally consistent, and the PR checklist confirms both mkdocs build and make check-fern-docs-locally pass.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/devnotes/posts/retrieval-sdg-toolkit.md | New MkDocs dev note covering the four-stage retriever SDG pipeline; content, code samples, and frontmatter are all consistent and well-structured. |
| fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx | Fern-formatted counterpart of the dev note; slug, author reference, and image path are correct and consistent with the Fern asset layout. |
| fern/versions/latest/pages/devnotes/index.mdx | Adds a BlogCard for the new post at the top of the dev-notes index; href, image src, date, and author ID all align with the new page and asset paths. |
| fern/components/devnotes/authors-data.ts | Adds the sthan author entry to the TypeScript registry; name, description, and avatar URL match the YAML author files exactly. |
| docs/devnotes/.authors.yml | Adds sthan author to the MkDocs authors registry; entry is well-formed and consistent with the Fern YAML and TypeScript counterparts. |
| fern/components/devnotes/.authors.yml | Adds sthan author to the Fern authors YAML; mirrors the MkDocs authors file correctly. |
| mkdocs.yml | Inserts the new dev note at the top of the Dev Notes nav section (most-recent-first order); file path reference matches the added markdown file. |
| fern/versions/latest.yml | Adds the Retriever SDG Toolkit page to the Fern navigation; path reference matches the new MDX file. |
| docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg | New SVG pipeline diagram for MkDocs; includes accessible title/desc elements and aria-labelledby attribute. |
| fern/assets/retrieval-sdg-toolkit/pipeline.svg | Identical SVG copied for the Fern docs asset path; content is identical to the MkDocs copy. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Source Documents\nDocs / policies / tickets / manuals] --> B[Stage 1: Bundle Docs\nsingle + multi-doc groups]
    A --> C[Stage 1: Chunk Docs\nstable segment IDs]
    B --> D[Stage 2: Extract Artifacts\nconcepts / entities / links]
    C --> E[Stage 2: Generate QA\ngrounded multi-hop questions]
    D --> F[Stage 3: Deduplicate\nnear-duplicate queries]
    E --> G[Stage 3: Judge Quality\nrelevance / support / clarity]
    F --> H[Stage 4: Convert\ntrain/val, BEIR qrels, AutoModel data]
    G --> H
    H --> I[train.json / val.json / corpus/]
    H --> J[eval_beir/ corpus.jsonl / qrels/test.tsv]

Reviews (5): Last reviewed commit: "docs: fix retriever SDG pipeline flow or..."

Signed-off-by: Steve Han <sthan@nvidia.com>
@github-actions
Contributor

github-actions Bot commented May 15, 2026

MkDocs preview: https://c430f2c2.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-666.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@shan-nvidia
Author

I have read the DCO document and I hereby sign the DCO.

@@ -0,0 +1,339 @@
---
date: 2026-05-14
Contributor

You'll want this to be the date the post is actually published!

Signed-off-by: Steve Han <sthan@nvidia.com>

The new [`data-designer-retrieval-sdg`](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg) toolkit fills that gap: start with a directory of documents, generate synthetic query-positive examples with NeMo Data Designer, filter them, and export them for retriever fine-tuning and BEIR-style evaluation.

<!-- more -->
Contributor

nit: can we try moving this up so that there is a shorter "abstract" on the index page? This is the text that will appear above "Continue reading".

Comment on lines +20 to +27
This is not just a demo package. The same toolkit produced the [Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) dataset from NVIDIA public documentation, and it powers the bootstrap SDG stage for both the NeMo [embedding fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed) and [reranking fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/e6e8a3281a11b8e1b7b47af098bbf54416c68d47/src/nemotron/recipes/rerank). It is now available as a standalone tool for generating high-quality, complex, multi-document, multi-hop retrieval data compatible with [AutoModel](https://github.com/NVIDIA-NeMo/Automodel).

This post walks through what the toolkit does, why the generated labels matter, and how to make your first small run useful before you scale it up.

---

## **From Documents to Retriever Data**

Contributor

I'm wondering if this and the "If you are building a RAG system..." narrative at the top can be combined into an intro / context-setting section before going straight to the contents of the plugin and the diagram.

Signed-off-by: Steve Han <sthan@nvidia.com>

The hard part is not asking an LLM to write questions about a document. The hard part is keeping every generated question tied to the exact chunk, document, or multi-hop evidence set that a retriever should recover. Many RAG tutorials stop at chunk, embed, retrieve, and prompt. Fine-tuning recipes often begin once labeled query-passage pairs already exist. The gap in between is where developers lose the most time.

The new [`data-designer-retrieval-sdg`](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg) toolkit fills that gap: start with a directory of documents, generate synthetic query-positive examples with NeMo Data Designer, filter them, and export them for retriever fine-tuning and BEIR-style evaluation.
Contributor

Can we call this a "plugin" rather than "toolkit" throughout? You can still say "toolkit" if you prefer that noun, but maybe something like "the plugin contains a retrieval SDG toolkit" would help make clear that this is a Data Designer plugin.

| `document-chunker` | seed reader | Turns text files into sentence chunks with stable segment IDs, so each query can point back to the passages that answer it. |
| `embedding-dedup` | column generator | Removes near-duplicate generated questions before judging and export, so the training data has more variety. |

It also ships a normal Python API and a CLI:
Contributor

We'll need to update this Dev Note when we figure out the long-term pattern for CLI-like functionality. Okay to keep this here, but we can't forget to update this later!

Comment on lines +247 to +249
## **Why This Belongs in a Plugin**

A blog recipe can teach the workflow. A plugin makes the workflow reusable.
Contributor

IMO this "Why This Belongs in a Plugin" framing feels more like our internal discussions than how we should speak about it here. What do you think about framing it more around how Data Designer plugins unlock custom use cases?
