
# What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation


[📖 Paper] [🤗 ImagineAgent-7B] [🤗 ImagineAgent-7B-COT-SFT] [🤗 Dataset]

## 👀 About OV-HOI

OV-HOI is a framework that leverages tool-augmented reinforcement learning for open-vocabulary human-object interaction (HOI) detection.

Unlike prior methods that treat human-object interactions as monolithic entities, OV-HOI decomposes interactions into discriminative spatio-temporal primitives while dynamically invoking domain-specific tools for cross-modal reasoning, enabling robust zero-shot generalization.


## 🎯 Core Task

Input:

  • Image containing human-object interactions
  • Text query for open-vocabulary recognition
  • Pre-defined action and object vocabulary

Output:

  • Detected human-object pairs with interaction labels
  • Reasoning chain with tool invocation

Reasoning Process:

  1. Human-Object Localization: Detect human and object instances in the image using detection tools
  2. Spatial Relation Analysis: Analyze spatial relationships between human and object (distance, orientation, contact)
  3. Action Reasoning: Infer possible interaction types based on detected entities and their relationships
  4. Scoring & Selection: Compare inferred interactions with text queries and select highest-scoring matches
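The four steps above can be sketched as a toy pipeline. Everything here is illustrative: the `Box` type, the stubbed detections, and the IoU-based scoring are our own placeholders — in the real agent, localization comes from tool calls and scoring from the VLM's reasoning over the text query.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A detected instance: corner coordinates plus a class label."""
    x1: float
    y1: float
    x2: float
    y2: float
    label: str

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union, used here as a crude contact/proximity cue."""
    ix = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    iy = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = ix * iy
    union = ((a.x2 - a.x1) * (a.y2 - a.y1)
             + (b.x2 - b.x1) * (b.y2 - b.y1) - inter)
    return inter / union if union > 0 else 0.0

def reason(humans, objects, action_vocab):
    """Steps 1-4: pair localized humans and objects (1), score each pair's
    spatial relation (2), enumerate candidate actions (3), and keep the
    highest-scoring (score, action, object) triple (4)."""
    candidates = [
        (iou(h, o), action, o.label)
        for h in humans
        for o in objects
        for action in action_vocab
    ]
    return max(candidates) if candidates else None

humans = [Box(0, 0, 2, 4, "person")]
objects = [Box(1, 2, 3, 4, "bicycle"), Box(9, 9, 10, 10, "bench")]
best = reason(humans, objects, ["ride"])
print(best[1], best[2])  # ride bicycle
```

In the actual agent these steps are interleaved with chain-of-thought and tool invocations, and selection compares candidates against the open-vocabulary text query rather than a fixed action list.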

๐Ÿ—๏ธ Architecture

Pipeline of the proposed method. Training proceeds in two stages. The first stage builds a tool-augmented CoT dataset for SFT, decomposing each HOI into contextual sub-components with multi-tool integration (human detection, object detection, spatial analysis). The second stage applies a hierarchical reward framework for GRPO optimization, incorporating accuracy, format, tool-usage efficiency, and spatial-relation relevance for reliable HOI detection.


๐Ÿ“ Features

  • Qwen2.5-VL Base Model: Leverages state-of-the-art vision-language capabilities
  • Tool-Augmented Reasoning: Dynamically invokes human detection, object detection, and spatial analysis tools
  • Hierarchical Reward Design: GRPO with rewards balancing accuracy, format, tool-usage, and spatial relation relevance
  • Cold-Start SFT: Supervised fine-tuning before RL to stabilize tool invocation
  • Multi-Dataset Support: Trained and evaluated on HICO-DET and SWIG-HOI benchmarks

## 📂 Project Structure

```
OV_HOI/
├── AFile/                       # Data storage
│   ├── datasets/                # Raw image data
│   ├── json/                    # HICO-DET and SWIG-HOI evaluation files
│   └── model/                   # Trained model checkpoints
├── AScripts/
│   ├── bagel/                   # BAGEL model integration
│   │   ├── eval/                # Evaluation scripts
│   │   ├── modeling/            # Model architecture
│   │   └── train/               # Training scripts
│   └── eval/                    # Benchmark evaluation
│       ├── hico/                # HICO-DET evaluation
│       └── swig/                # SWIG-HOI evaluation
├── src_HICO/
│   ├── r1-v/                    # Main training framework
│   │   ├── src/open_r1/
│   │   │   ├── trainer/         # GRPO trainers
│   │   │   └── oss_grpo_*.py    # GRPO training scripts
│   │   └── local_scripts/       # Data preparation
│   ├── qwen-vl-utils/           # Qwen-VL utilities
│   ├── eval_bench.py            # Benchmark evaluation
│   ├── generate_cot_vllm.py     # CoT data generation
│   └── inference_example.py     # Inference example
├── requirements.txt
└── README.md
```

๐Ÿ” Dataset

### Training Data

| Dataset | Samples | Description |
| --- | --- | --- |
| `hico_train.json` | ~10K | HICO-DET training set with reasoning annotations |
| `swig_train.json` | ~8K | SWIG-HOI training set with reasoning annotations |

### Evaluation Data

| Dataset | Images | HOI Classes | Description |
| --- | --- | --- | --- |
| `hico_test.json` | ~9.6K | 600 | HICO-DET benchmark |
| `swig_test.json` | ~5K | 144 | SWIG-HOI benchmark |

### Download Datasets

Follow the instructions from HICO-DET and SWIG-HOI to download the original datasets. Place all downloaded data in:

```
AFile/datasets/
├── hico/
│   ├── images/
│   └── annotations/
└── swig/
    ├── images/
    └── annotations/
```
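After downloading, a quick sanity check that the on-disk layout matches the tree above can save a failed run later. This helper is our own addition; only the path names come from this README.

```python
from pathlib import Path

# Sub-directories this README expects under AFile/datasets/.
EXPECTED = [
    "hico/images", "hico/annotations",
    "swig/images", "swig/annotations",
]

def check_layout(root: str) -> list[str]:
    """Return the expected sub-directories that are missing under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_dir()]

missing = check_layout("AFile/datasets")
if missing:
    print("missing:", ", ".join(missing))
else:
    print("dataset layout OK")
```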

๐Ÿ“ Set up

```bash
# Clone the repository
git clone [Anonymized Repository]
cd OV_HOI

# Create conda environment
conda create -n ov-hoi python=3.10
conda activate ov-hoi

# Install dependencies
pip install -r requirements.txt

# Install flash-attention
pip install flash_attn --no-build-isolation
```

## 🚀 Training

### Prerequisites: BAGEL Tool Setup

Before training, set up the BAGEL model used by the generative imagination tools. Follow the official BAGEL repository to configure the model.

The BAGEL scripts for running the model are located in:

  • `AScripts/bagel/edit_final_parallel_train.sh` - Training with image editing
  • `AScripts/bagel/edit_final_parallel_test.sh` - Testing with image editing
  • `AScripts/bagel/out_paint_parallel_train.sh` - Training with outpainting
  • `AScripts/bagel/out_paint_parallel_test.sh` - Testing with outpainting

### Training Pipeline

#### Stage 1: Supervised Fine-Tuning (SFT)

**HICO-DET**

```bash
cd src_HICO

# Run SFT training (cold-start phase)
bash ./scripts/run_sft_video_4subtol.sh
```

**SWIG-HOI**

Coming Soon

#### Stage 2: Reinforcement Learning (GRPO)

**HICO-DET**

```bash
# Run GRPO training with hierarchical rewards
bash ./scripts/run_grpo_video_4subtol_tool.sh
```

**SWIG-HOI**

Coming Soon


## 🔮 Inference & Evaluation

### HICO-DET Evaluation

```bash
cd AScripts/eval/hico

# Run inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python final_oss.py \
  --model-path /path/to/model \
  --json-file-path AFile/json/hico/test_hico_converted.json \
  --original-video-path /path/to/hico/images \
  --output-path output/model_predictions.json \
  --tensor-parallel-size 4

# Evaluate results
python evaluate_hoi_predictions.py \
  --predictions_file output/model_predictions.json \
  --anno_file AFile/json/hico/test_hico_ann_all.json \
  --output_dir evaluation_results \
  --zero_shot_type rare_first
```

### SWIG-HOI Evaluation

```bash
cd AScripts/eval/swig

# Run inference
CUDA_VISIBLE_DEVICES=0,1,2,3 python final_oss.py \
  --model-path /path/to/model \
  --json-file-path AFile/json/swig/swig_test_converted.json \
  --original-video-path /path/to/swig/images \
  --output-path output/swig/model_predictions.json \
  --tensor-parallel-size 4

# Evaluate results
python evaluate_hoi_predictions.py \
  --predictions_file output/swig/model_predictions.json \
  --anno_file AFile/json/swig/swig_test_1000.json \
  --output_dir evaluation_results
```

### Evaluation Output Format

```text
# HICO-DET evaluation output
# Zero-shot mAP: XX.XX
# Seen mAP: XX.XX
# Unseen mAP: XX.XX
# Full mAP: XX.XX
```
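The mAP figures reported here are means of per-class average precision over ranked predictions. A minimal sketch of that computation, as our own simplification: the repository's evaluator additionally matches predicted boxes to ground truth by IoU before marking a prediction as a hit, which this sketch skips.

```python
# Toy per-class AP over ranked predictions (1 = correct match, 0 = miss).
def average_precision(ranked_hits, num_gt):
    """AP = sum of precision@k at each correct hit, divided by #ground truth."""
    tp, precisions = 0, []
    for k, hit in enumerate(ranked_hits, start=1):
        if hit:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / num_gt if num_gt else 0.0

def mean_ap(per_class):
    """mAP over a {class_name: (ranked_hits, num_gt)} mapping."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps)

# Two ground-truth instances, hits at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
ap = average_precision([1, 0, 1], num_gt=2)
print(round(ap, 4))  # 0.8333
```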

## 🔧 Implementation Details

### Models & Training Setup

  • Base Model: Qwen2.5-VL-3B-Instruct / Qwen2.5-VL-7B-Instruct
  • Training Samples: ~10,000 images (reused for SFT and RL)
  • Hardware: 8 ร— NVIDIA H20 GPUs (90GB memory each)
  • Batch Size: 8
  • Learning Rate: 5e-7
  • Training Iterations: 600 (1 epoch)
  • Rollouts per Sample: 4
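With 4 rollouts per sample, GRPO scores each rollout relative to its own group. A sketch of the group-normalized advantage commonly used with GRPO; the exact variant in this repo's trainer may differ.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean(r)) / (std(r) + eps), within one rollout group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of one training sample: two scored correct, two not.
advantages = group_advantages([1.0, 1.0, 0.0, 0.0])
print([round(a, 3) for a in advantages])  # [1.0, 1.0, -1.0, -1.0]
```

Normalizing within the group means only relative quality matters: if all four rollouts earn the same reward, every advantage is ~0 and the sample contributes no gradient.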

### Hierarchical Reward Design

The total reward integrates four components:

$$R(\tau) = R_{\text{acc}}(\tau) + R_{\text{format}}(\tau) + \mathbb{I}_{R_{\text{acc}}(\tau) > 0} \cdot \left(R_{\text{tool}}(\tau) + R_{\text{spatial}}(\tau) \right)$$

  • Accuracy Reward ($R_{\text{acc}}$): Evaluates correctness of final HOI prediction
  • Format Reward ($R_{\text{format}}$): Penalizes unstructured or incomplete reasoning chains
  • Tool-Usage Reward ($R_{\text{tool}}$): Activated only when correct answers accompany valid tool invocations
  • Spatial Relation Reward ($R_{\text{spatial}}$): Hierarchical weighting prioritizing semantically salient spatial relations
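Written out directly, the formula above gates the tool and spatial terms on a correct answer, so the model cannot farm tool bonuses on wrong predictions. A sketch with illustrative placeholder values; the component magnitudes are not specified in this README.

```python
def total_reward(r_acc, r_format, r_tool, r_spatial):
    """R = R_acc + R_format + 1[R_acc > 0] * (R_tool + R_spatial)."""
    gate = 1.0 if r_acc > 0 else 0.0
    return r_acc + r_format + gate * (r_tool + r_spatial)

# Correct answer: all four components count.
print(total_reward(1.0, 0.5, 0.2, 0.3))  # 2.0
# Wrong answer: tool and spatial bonuses are suppressed by the indicator.
print(total_reward(0.0, 0.5, 0.2, 0.3))  # 0.5
```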

## 📜 Citation

If you find this work helpful for your research, please consider citing:

```bibtex
@article{yuan2026if,
  title={What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation},
  author={Yuan, Zhenlong and Qu, Xiangyan and Tang, Jing and Chen, Rui and Sun, Lei and Chen, Ruidong and Yu, Hongwei and Qian, Chengxuan and Chu, Xiangxiang and Li, Shuo and others},
  journal={arXiv preprint arXiv:2602.11499},
  year={2026}
}
```

๐Ÿค Acknowledgements

We sincerely appreciate the contributions of the open-source community.
