[📄 Paper] [🤗 ImagineAgent-7B] [🤗 ImagineAgent-7B-COT-SFT] [🤗 Dataset]
OV-HOI is a novel framework that leverages tool-augmented reinforcement learning for open-vocabulary human-object interaction (HOI) detection.
Unlike prior methods that treat human-object interactions as monolithic entities, OV-HOI decomposes interactions into discriminative spatial-temporal primitives and dynamically invokes domain-specific tools for cross-modal reasoning, enabling robust zero-shot generalization.
Input:
- Image containing human-object interactions
- Text query for open-vocabulary recognition
- Pre-defined action and object vocabulary
Output:
- Detected human-object pairs with interaction labels
- Reasoning chain with tool invocation
Reasoning Process:
- Human-Object Localization: Detect human and object instances in the image using detection tools
- Spatial Relation Analysis: Analyze spatial relationships between human and object (distance, orientation, contact)
- Action Reasoning: Infer possible interaction types based on detected entities and their relationships
- Scoring & Selection: Compare inferred interactions with text queries and select highest-scoring matches
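The four-step reasoning process above can be sketched as a tool-invocation loop. Every function below is an illustrative stub with hypothetical names: the real human/object detectors, spatial analyzer, and action inference are learned or tool-backed components, not lookup tables.

```python
# Hypothetical sketch of the four-stage reasoning loop; all functions
# are stand-ins for the real detection/analysis tools.

def detect_humans(image):            # stage 1: human localization (stub)
    return [{"box": (10, 10, 60, 120)}]

def detect_objects(image):           # stage 1: object localization (stub)
    return [{"box": (50, 80, 110, 130), "label": "bicycle"}]

def spatial_relation(h_box, o_box):  # stage 2: coarse relation (stub)
    """Return 'contact' if the two boxes overlap, else 'apart'."""
    overlap = not (h_box[2] < o_box[0] or o_box[2] < h_box[0]
                   or h_box[3] < o_box[1] or o_box[3] < h_box[1])
    return "contact" if overlap else "apart"

def infer_actions(obj_label, relation):  # stage 3: candidate actions (stub)
    table = {("bicycle", "contact"): ["ride", "hold"],
             ("bicycle", "apart"): ["watch"]}
    return table.get((obj_label, relation), [])

def detect_hoi(image, query_actions):
    """Stage 4: score candidate interactions against the open-vocab query."""
    results = []
    for h in detect_humans(image):
        for o in detect_objects(image):
            rel = spatial_relation(h["box"], o["box"])
            for act in infer_actions(o["label"], rel):
                if act in query_actions:
                    results.append((act, o["label"], rel))
    return results
```

In the actual framework, the model decides dynamically when to invoke each tool as part of its reasoning chain, rather than running a fixed pipeline.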
Pipeline of the proposed method. This work adopts a two-stage training process. The first stage introduces an innovative tool-augmented CoT dataset for SFT, which decomposes HOI into contextual sub-components with multi-tool integration (human detection, object detection, spatial analysis). The second stage proposes a hierarchical reward framework for GRPO optimization, which incorporates accuracy, format, tool-usage efficiency, and spatial relation relevance for reliable HOI detection.
- Qwen2.5-VL Base Model: Leverages state-of-the-art vision-language capabilities
- Tool-Augmented Reasoning: Dynamically invokes human detection, object detection, and spatial analysis tools
- Hierarchical Reward Design: GRPO with rewards balancing accuracy, format, tool-usage, and spatial relation relevance
- Cold-Start SFT: Supervised fine-tuning before RL to stabilize tool invocation
- Multi-Dataset Support: Trained and evaluated on HICO-DET and SWIG-HOI benchmarks
```
OV_HOI/
├── AFile/                       # Data storage
│   ├── datasets/                # Raw image data
│   ├── json/                    # HICO-DET and SWIG-HOI evaluation files
│   └── model/                   # Trained model checkpoints
├── AScripts/
│   ├── bagel/                   # BAGEL model integration
│   │   ├── eval/                # Evaluation scripts
│   │   ├── modeling/            # Model architecture
│   │   └── train/               # Training scripts
│   └── eval/                    # Benchmark evaluation
│       ├── hico/                # HICO-DET evaluation
│       └── swig/                # SWIG-HOI evaluation
├── src_HICO/
│   ├── r1-v/                    # Main training framework
│   │   ├── src/open_r1/
│   │   │   ├── trainer/         # GRPO trainers
│   │   │   └── oss_grpo_*.py    # GRPO training scripts
│   │   └── local_scripts/       # Data preparation
│   ├── qwen-vl-utils/           # Qwen-VL utilities
│   ├── eval_bench.py            # Benchmark evaluation
│   ├── generate_cot_vllm.py     # CoT data generation
│   └── inference_example.py     # Inference example
├── requirements.txt
└── README.md
```
| Dataset | Samples | Description |
|---|---|---|
| `hico_train.json` | ~10K | HICO-DET training set with reasoning annotations |
| `swig_train.json` | ~8K | SWIG-HOI training set with reasoning annotations |
| Dataset | Images | HOI Classes | Description |
|---|---|---|---|
| `hico_test.json` | ~9.6K | 600 | HICO-DET benchmark |
| `swig_test.json` | ~5K | 144 | SWIG-HOI benchmark |
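The converted JSON files are loaded as plain lists of records. A minimal loading sketch follows; the field names in the sample record are assumptions for illustration, since the exact schema is defined by the conversion scripts.

```python
import io
import json

# Hypothetical record; the actual keys in hico_test.json / swig_test.json
# are set by the conversion scripts and may differ.
sample = [{"image": "HICO_test2015_00000001.jpg",
           "query": "ride bicycle",
           "answer": "ride bicycle"}]

def load_records(fp):
    """Load a list of HOI records from an open JSON file object."""
    return json.load(fp)

# Simulate reading a dataset file from disk.
records = load_records(io.StringIO(json.dumps(sample)))
```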
Follow the instructions from HICO-DET and SWIG-HOI to download the original datasets. Place all downloaded data in:
```
AFile/datasets/
├── hico/
│   ├── images/
│   └── annotations/
└── swig/
    ├── images/
    └── annotations/
```
```bash
# Clone the repository
git clone [Anonymized Repository]
cd OV_HOI

# Create conda environment
conda create -n ov-hoi python=3.10
conda activate ov-hoi

# Install dependencies
pip install -r requirements.txt

# Install flash-attention
pip install flash_attn --no-build-isolation
```

Before running the training, you need to set up the BAGEL model for the generative imagination tools. Follow the official BAGEL repository to configure the model.
The BAGEL launch scripts are located in `AScripts/bagel/`:

- `AScripts/bagel/edit_final_parallel_train.sh`: Training with image editing
- `AScripts/bagel/edit_final_parallel_test.sh`: Testing with image editing
- `AScripts/bagel/out_paint_parallel_train.sh`: Training with outpainting
- `AScripts/bagel/out_paint_parallel_test.sh`: Testing with outpainting
```bash
cd src_HICO

# Run SFT training (cold-start phase)
bash ./scripts/run_sft_video_4subtol.sh
```

Coming Soon

```bash
# Run GRPO training with hierarchical rewards
bash ./scripts/run_grpo_video_4subtol_tool.sh
```

Coming Soon
```bash
cd AScripts/eval/hico

# Run inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python final_oss.py \
  --model-path /path/to/model \
  --json-file-path AFile/json/hico/test_hico_converted.json \
  --original-video-path /path/to/hico/images \
  --output-path output/model_predictions.json \
  --tensor-parallel-size 4

# Evaluate results
python evaluate_hoi_predictions.py \
  --predictions_file output/model_predictions.json \
  --anno_file AFile/json/hico/test_hico_ann_all.json \
  --output_dir evaluation_results \
  --zero_shot_type rare_first
```

```bash
cd AScripts/eval/swig

# Run inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python final_oss.py \
  --model-path /path/to/model \
  --json-file-path AFile/json/swig/swig_test_converted.json \
  --original-video-path /path/to/swig/images \
  --output-path output/swig/model_predictions.json \
  --tensor-parallel-size 4

# Evaluate results
python evaluate_hoi_predictions.py \
  --predictions_file output/swig/model_predictions.json \
  --anno_file AFile/json/swig/swig_test_1000.json \
  --output_dir evaluation_results
```

```
# HICO-DET evaluation output
# Zero-shot mAP: XX.XX
# Seen mAP: XX.XX
# Unseen mAP: XX.XX
# Full mAP: XX.XX
```

- Base Model: Qwen2.5-VL-3B-Instruct / Qwen2.5-VL-7B-Instruct
- Training Samples: ~10,000 images (reused for SFT and RL)
- Hardware: 8 ร NVIDIA H20 GPUs (90GB memory each)
- Batch Size: 8
- Learning Rate: 5e-7
- Training Iterations: 600 (1 epoch)
- Rollouts per Sample: 4
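With 4 rollouts per sample, GRPO computes each rollout's advantage relative to its own group rather than against a learned value baseline. The sketch below shows the standard group-relative normalization; it is not taken verbatim from this repository's trainer.

```python
# Group-relative advantage: normalize each rollout's reward against
# the mean and standard deviation of its rollout group.

def group_relative_advantages(rewards):
    """Return per-rollout advantages normalized within the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]
```

Because advantages are centered within each group, a sample whose four rollouts all earn the same reward contributes no gradient signal, which is why reward components that separate good from bad rollouts (e.g. tool-usage gating) matter.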
The total reward integrates four components:

- Accuracy Reward ($R_{\text{acc}}$): Evaluates correctness of the final HOI prediction
- Format Reward ($R_{\text{format}}$): Penalizes unstructured or incomplete reasoning chains
- Tool-Usage Reward ($R_{\text{tool}}$): Activated only when correct answers accompany valid tool invocations
- Spatial Relation Reward ($R_{\text{spatial}}$): Hierarchical weighting that prioritizes semantically salient spatial relations
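The four reward terms above might be combined per rollout as sketched below. The specific weights and the gating condition on the tool reward are illustrative assumptions, not the paper's exact formulation; only the gating described above (tool reward paid only for correct answers with valid tool calls) is taken from the text.

```python
# Illustrative combination of the four reward terms; weights are assumed.

def total_reward(correct: bool, well_formatted: bool,
                 valid_tool_call: bool, spatial_weight: float) -> float:
    r_acc = 1.0 if correct else 0.0
    r_format = 0.0 if well_formatted else -0.5   # penalize bad structure
    # Tool reward is gated: only paid when the answer is correct AND
    # the tool invocation was valid.
    r_tool = 0.5 if (correct and valid_tool_call) else 0.0
    r_spatial = spatial_weight                   # hierarchical weight in [0, 1]
    return r_acc + r_format + r_tool + r_spatial
```

The gating on $R_{\text{tool}}$ prevents the policy from farming reward by invoking tools spuriously on rollouts that get the answer wrong.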
If you find this work helpful for your research, please consider citing:
```bibtex
@article{yuan2026if,
  title={What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation},
  author={Yuan, Zhenlong and Qu, Xiangyan and Tang, Jing and Chen, Rui and Sun, Lei and Chen, Ruidong and Yu, Hongwei and Qian, Chengxuan and Chu, Xiangxiang and Li, Shuo and others},
  journal={arXiv preprint arXiv:2602.11499},
  year={2026}
}
```

We sincerely appreciate the contributions of the open-source community.