[📄 Paper] [🤗 ImagineAgent-7B] [🤗 ImagineAgent-7B-COT-SFT] [🤗 Dataset]
OV-HOI is a novel framework that leverages tool-augmented reinforcement learning for open-vocabulary human-object interaction (HOI) detection.
Unlike prior methods that treat human-object interactions as monolithic entities, OV-HOI decomposes interactions into discriminative spatial-temporal primitives and dynamically invokes domain-specific tools for cross-modal reasoning, enabling robust zero-shot generalization.
Input:
- Image containing human-object interactions
- Text query for open-vocabulary recognition
- Pre-defined action and object vocabulary
Output:
- Detected human-object pairs with interaction labels
- Reasoning chain with tool invocation
Reasoning Process:
- Human-Object Localization: Detect human and object instances in the image using detection tools
- Spatial Relation Analysis: Analyze spatial relationships between human and object (distance, orientation, contact)
- Action Reasoning: Infer possible interaction types based on detected entities and their relationships
- Scoring & Selection: Compare inferred interactions with text queries and select highest-scoring matches
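The four-step reasoning process above can be sketched as a tool-invocation loop. Every function below is an illustrative stub with hypothetical names: the real human/object detectors, spatial analyzer, and action inference are learned or tool-backed components, not lookup tables.

```python
# Hypothetical sketch of the four-stage reasoning loop; all functions
# are stand-ins for the real detection/analysis tools.

def detect_humans(image):            # stage 1: human localization (stub)
    return [{"box": (10, 10, 60, 120)}]

def detect_objects(image):           # stage 1: object localization (stub)
    return [{"box": (50, 80, 110, 130), "label": "bicycle"}]

def spatial_relation(h_box, o_box):  # stage 2: coarse relation (stub)
    """Return 'contact' if the two boxes overlap, else 'apart'."""
    overlap = not (h_box[2] < o_box[0] or o_box[2] < h_box[0]
                   or h_box[3] < o_box[1] or o_box[3] < h_box[1])
    return "contact" if overlap else "apart"

def infer_actions(obj_label, relation):  # stage 3: candidate actions (stub)
    table = {("bicycle", "contact"): ["ride", "hold"],
             ("bicycle", "apart"): ["watch"]}
    return table.get((obj_label, relation), [])

def detect_hoi(image, query_actions):
    """Stage 4: score candidate interactions against the open-vocab query."""
    results = []
    for h in detect_humans(image):
        for o in detect_objects(image):
            rel = spatial_relation(h["box"], o["box"])
            for act in infer_actions(o["label"], rel):
                if act in query_actions:
                    results.append((act, o["label"], rel))
    return results
```

In the actual framework, the model decides dynamically when to invoke each tool as part of its reasoning chain, rather than running a fixed pipeline.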
Pipeline of the proposed method. This work adopts a two-stage training process. The first stage introduces an innovative tool-augmented CoT dataset for SFT, which decomposes HOI into contextual sub-components with multi-tool integration (human detection, object detection, spatial analysis). The second stage proposes a hierarchical reward framework for GRPO optimization, which incorporates accuracy, format, tool-usage efficiency, and spatial relation relevance for reliable HOI detection.
- Qwen2.5-VL Base Model: Leverages state-of-the-art vision-language capabilities
- Tool-Augmented Reasoning: Dynamically invokes human detection, object detection, and spatial analysis tools
- Hierarchical Reward Design: GRPO with rewards balancing accuracy, format, tool-usage, and spatial relation relevance
- Cold-Start SFT: Supervised fine-tuning before RL to stabilize tool invocation
- Multi-Dataset Support: Trained and evaluated on HICO-DET and SWIG-HOI benchmarks
```
OV_HOI/
├── AFile/                       # Data storage
│   ├── datasets/                # Raw image data
│   ├── json/                    # HICO-DET and SWIG-HOI evaluation files
│   └── model/                   # Trained model checkpoints
├── AScripts/
│   ├── bagel/                   # BAGEL model integration
│   │   ├── eval/                # Evaluation scripts
│   │   ├── modeling/            # Model architecture
│   │   └── train/               # Training scripts
│   └── eval/                    # Benchmark evaluation
│       ├── hico/                # HICO-DET evaluation
│       └── swig/                # SWIG-HOI evaluation
├── src_HICO/
│   ├── r1-v/                    # Main training framework
│   │   ├── src/open_r1/
│   │   │   ├── trainer/         # GRPO trainers
│   │   │   └── oss_grpo_*.py    # GRPO training scripts
│   │   └── local_scripts/       # Data preparation
│   ├── qwen-vl-utils/           # Qwen-VL utilities
│   ├── eval_bench.py            # Benchmark evaluation
│   ├── generate_cot_vllm.py     # CoT data generation
│   └── inference_example.py     # Inference example
├── requirements.txt
└── README.md
```
| Dataset | Samples | Description |
|---|---|---|
| `hico_train.json` | ~10K | HICO-DET training set with reasoning annotations |
| `swig_train.json` | ~8K | SWIG-HOI training set with reasoning annotations |
| Dataset | Images | HOI Classes | Description |
|---|---|---|---|
| `hico_test.json` | ~9.6K | 600 | HICO-DET benchmark |
| `swig_test.json` | ~5K | 144 | SWIG-HOI benchmark |
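The converted JSON files are loaded as plain lists of records. A minimal loading sketch follows; the field names in the sample record are assumptions for illustration, since the exact schema is defined by the conversion scripts.

```python
import io
import json

# Hypothetical record; the actual keys in hico_test.json / swig_test.json
# are set by the conversion scripts and may differ.
sample = [{"image": "HICO_test2015_00000001.jpg",
           "query": "ride bicycle",
           "answer": "ride bicycle"}]

def load_records(fp):
    """Load a list of HOI records from an open JSON file object."""
    return json.load(fp)

# Simulate reading a dataset file from disk.
records = load_records(io.StringIO(json.dumps(sample)))
```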
Follow the instructions from HICO-DET and SWIG-HOI to download the original datasets. Place all downloaded data in:
```
AFile/datasets/
├── hico/
│   ├── images/
│   └── annotations/
└── swig/
    ├── images/
    └── annotations/
```
```bash
# Clone the repository
git clone [Anonymized Repository]
cd OV_HOI

# Create conda environment
conda create -n ov-hoi python=3.10
conda activate ov-hoi

# Install dependencies
pip install -r requirements.txt

# Install flash-attention
pip install flash_attn --no-build-isolation
```

Before running the training, you need to set up the BAGEL model for the generative imagination tools. Follow the official BAGEL repository to configure the model.
The BAGEL launch scripts are located in `AScripts/bagel/`:

- `AScripts/bagel/edit_final_parallel_train.sh`: Training with image editing
- `AScripts/bagel/edit_final_parallel_test.sh`: Testing with image editing
- `AScripts/bagel/out_paint_parallel_train.sh`: Training with outpainting
- `AScripts/bagel/out_paint_parallel_test.sh`: Testing with outpainting
```bash
cd src_HICO

# Run SFT training (cold-start phase)
bash ./scripts/run_sft_video_4subtol.sh
```

Coming Soon

```bash
# Run GRPO training with hierarchical rewards
bash ./scripts/run_grpo_video_4subtol_tool.sh
```

Coming Soon
```bash
cd AScripts/eval/hico

# Run inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python final_oss.py \
  --model-path /path/to/model \
  --json-file-path AFile/json/hico/test_hico_converted.json \
  --original-video-path /path/to/hico/images \
  --output-path output/model_predictions.json \
  --tensor-parallel-size 4

# Evaluate results
python evaluate_hoi_predictions.py \
  --predictions_file output/model_predictions.json \
  --anno_file AFile/json/hico/test_hico_ann_all.json \
  --output_dir evaluation_results \
  --zero_shot_type rare_first
```

```bash
cd AScripts/eval/swig

# Run inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python final_oss.py \
  --model-path /path/to/model \
  --json-file-path AFile/json/swig/swig_test_converted.json \
  --original-video-path /path/to/swig/images \
  --output-path output/swig/model_predictions.json \
  --tensor-parallel-size 4

# Evaluate results
python evaluate_hoi_predictions.py \
  --predictions_file output/swig/model_predictions.json \
  --anno_file AFile/json/swig/swig_test_1000.json \
  --output_dir evaluation_results
```

```
# HICO-DET evaluation output
# Zero-shot mAP: XX.XX
# Seen mAP: XX.XX
# Unseen mAP: XX.XX
# Full mAP: XX.XX
```

- Base Model: Qwen2.5-VL-3B-Instruct / Qwen2.5-VL-7B-Instruct
- Training Samples: ~10,000 images (reused for SFT and RL)
- Hardware: 8 ร NVIDIA H20 GPUs (90GB memory each)
- Batch Size: 8
- Learning Rate: 5e-7
- Training Iterations: 600 (1 epoch)
- Rollouts per Sample: 4
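With 4 rollouts per sample, GRPO computes each rollout's advantage relative to its own group rather than against a learned value baseline. The sketch below shows the standard group-relative normalization; it is not taken verbatim from this repository's trainer.

```python
# Group-relative advantage: normalize each rollout's reward against
# the mean and standard deviation of its rollout group.

def group_relative_advantages(rewards):
    """Return per-rollout advantages normalized within the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]
```

Because advantages are centered within each group, a sample whose four rollouts all earn the same reward contributes no gradient signal, which is why reward components that separate good from bad rollouts (e.g. tool-usage gating) matter.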
The total reward integrates four components:

- Accuracy Reward ($R_{\text{acc}}$): Evaluates correctness of the final HOI prediction
- Format Reward ($R_{\text{format}}$): Penalizes unstructured or incomplete reasoning chains
- Tool-Usage Reward ($R_{\text{tool}}$): Activated only when correct answers accompany valid tool invocations
- Spatial Relation Reward ($R_{\text{spatial}}$): Hierarchical weighting that prioritizes semantically salient spatial relations
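The four reward terms above might be combined per rollout as sketched below. The specific weights and the gating condition on the tool reward are illustrative assumptions, not the paper's exact formulation; only the gating described above (tool reward paid only for correct answers with valid tool calls) is taken from the text.

```python
# Illustrative combination of the four reward terms; weights are assumed.

def total_reward(correct: bool, well_formatted: bool,
                 valid_tool_call: bool, spatial_weight: float) -> float:
    r_acc = 1.0 if correct else 0.0
    r_format = 0.0 if well_formatted else -0.5   # penalize bad structure
    # Tool reward is gated: only paid when the answer is correct AND
    # the tool invocation was valid.
    r_tool = 0.5 if (correct and valid_tool_call) else 0.0
    r_spatial = spatial_weight                   # hierarchical weight in [0, 1]
    return r_acc + r_format + r_tool + r_spatial
```

The gating on $R_{\text{tool}}$ prevents the policy from farming reward by invoking tools spuriously on rollouts that get the answer wrong.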
If you find this work helpful for your research, please consider citing:
```bibtex
@article{yuan2026if,
  title={What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation},
  author={Yuan, Zhenlong and Qu, Xiangyan and Tang, Jing and Chen, Rui and Sun, Lei and Chen, Ruidong and Yu, Hongwei and Qian, Chengxuan and Chu, Xiangxiang and Li, Shuo and others},
  journal={arXiv preprint arXiv:2602.11499},
  year={2026}
}
```

We sincerely appreciate the contributions of the open-source community.