
FEAT Add VLGuard multimodal safety dataset loader#1447

Open
romanlutz wants to merge 9 commits into microsoft:main from romanlutz:romanlutz/vlguard-dataset

Conversation

@romanlutz
Contributor

Summary

Adds support for the VLGuard dataset (ICML 2024), a vision-language safety benchmark that evaluates whether multimodal models refuse unsafe content while remaining helpful on safe content.

What is VLGuard?

VLGuard contains ~2,000 image-instruction pairs across 4 categories (Privacy, Risky Behavior, Deception, Hateful Speech) and 8 subcategories (Personal Data, Professional Advice, Political, Sexually
Explicit, Violence, Disinformation, Discrimination by Sex, Discrimination by Race).

It supports three evaluation subsets:

  • unsafes — unsafe images with instructions (tests whether the model refuses to describe unsafe visual content)
  • safe_unsafes — safe images with unsafe instructions (tests whether the model refuses unsafe text prompts)
  • safe_safes — safe images with safe instructions (tests whether the model remains helpful)
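For orientation, the subset and category identifiers above might map to enums along these lines. This is a hypothetical sketch, not the PR's actual definitions; the string values are assumptions, except that a later commit in this PR confirms the dataset stores the fourth category as "discrimination":

```python
from enum import Enum


# Hypothetical sketch of the enums described above; the real definitions
# live in pyrit/datasets/seed_datasets/remote/vlguard_dataset.py.
class VLGuardSubset(Enum):
    UNSAFES = "unsafes"              # unsafe images with instructions
    SAFE_UNSAFES = "safe_unsafes"    # safe images, unsafe instructions
    SAFE_SAFES = "safe_safes"        # safe images, safe instructions


class VLGuardCategory(Enum):
    PRIVACY = "privacy"
    RISKY_BEHAVIOR = "risky behavior"
    DECEPTION = "deception"
    DISCRIMINATION = "discrimination"  # confirmed lowercase value per later commit
```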

Usage

```python
from pyrit.datasets.seed_datasets.remote import VLGuardDataset, VLGuardCategory, VLGuardSubset

# Load unsafe image examples (default)
loader = VLGuardDataset(token="hf...")
dataset = await loader.fetch_dataset()

# Load safe images with unsafe instructions, filtered to the Privacy category
loader = VLGuardDataset(
    subset=VLGuardSubset.SAFE_UNSAFES,
    categories=[VLGuardCategory.PRIVACY],
    token="hf...",
)
dataset = await loader.fetch_dataset()
```

Note: This is a gated dataset on HuggingFace. Users must accept the terms at https://huggingface.co/datasets/ys-zong/VLGuard and provide a HuggingFace token.

Changes

  • pyrit/datasets/seed_datasets/remote/vlguard_dataset.py — new dataset loader
  • pyrit/datasets/seed_datasets/remote/__init__.py — register exports
  • tests/unit/datasets/test_vlguard_dataset.py — 14 unit tests
  • doc/code/datasets/1_loading_datasets.ipynb — regenerated to show VLGuard in dataset list

romanlutz and others added 2 commits March 10, 2026 05:33
Add support for the VLGuard dataset (ICML 2024) which contains image-instruction
pairs for evaluating vision-language model safety across 4 categories (Privacy,
Risky Behavior, Deception, Hateful Speech) with 8 subcategories.

Supports three evaluation subsets:
- unsafes: unsafe images with instructions (tests refusal)
- safe_unsafes: safe images with unsafe instructions (tests refusal)
- safe_safes: safe images with safe instructions (tests helpfulness)

Downloads from HuggingFace (gated dataset, requires token and terms acceptance).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz force-pushed the romanlutz/vlguard-dataset branch from cac3cad to 255dd50 on March 10, 2026 13:24
romanlutz and others added 7 commits April 22, 2026 07:22
- Add brief explainer for each VLGuardCategory enum member
- Add VLGuardSubcategory enum for the 8 subcategories
- Add clarifying comment on max_examples * 2 check
- Fix Optional -> | None per style guide

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cover edge cases (invalid instr-resp, missing image field, no extractable
instruction) and both cache/download paths in _download_dataset_files_async.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The actual dataset uses 'harmful_category' and 'harmful_subcategory' (not
'category'/'subcategory'), with lowercase values. Also the fourth category
is 'discrimination' not 'Hateful Speech', and subcategories include 'sex',
'race', and 'other'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

@ValbuenaVC left a comment


Looks good! Two minor comments that aren't blocking

DISCRIMINATION = "discrimination"


class VLGuardSubcategory(Enum):
Nit: do all subcategories apply to all categories? If not, we should document this
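One hedged way to document (and enforce) the reviewer's question would be an explicit category-to-subcategory mapping with a small validator. The pairings below are guesses assembled from the PR description and the later field-name-fix commit, not from the dataset itself:

```python
# Hypothetical mapping for illustration; the real pairings should be taken
# from the dataset's harmful_category / harmful_subcategory fields.
SUBCATEGORIES_BY_CATEGORY: dict[str, set[str]] = {
    "privacy": {"personal data"},
    "risky behavior": {"professional advice", "political", "sexually explicit", "violence"},
    "deception": {"disinformation"},
    "discrimination": {"sex", "race", "other"},
}


def validate_pair(category: str, subcategory: str) -> bool:
    """Return True if the subcategory is documented for the given category."""
    return subcategory in SUBCATEGORIES_BY_CATEGORY.get(category, set())
```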

Returns:
tuple[list[dict], Path]: Tuple of (metadata list, image directory path).
"""
from huggingface_hub import hf_hub_download

Nit: why is this import down here?
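A function-local import like this is usually a deliberate lazy-import pattern: the heavy or optional dependency (here huggingface_hub) is only required when the download actually runs, so importing the module itself stays cheap and works without the extra package installed. Sketched below with a stdlib module standing in for the optional dependency:

```python
def download_metadata() -> str:
    # Deferred import: the dependency is resolved only when this function
    # runs, not at module import time. json stands in here for an optional
    # dependency such as huggingface_hub.
    import json

    return json.dumps({"dataset": "ys-zong/VLGuard"})
```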

