EmbodiedDataTransfer is a small workflow project for:
- downloading and exporting LeRobot-style datasets episode by episode
- running NVIDIA Cosmos `edge/distilled` generation on exported robot videos
- generating multiple trajectory variants per episode with different seeds
- appending generated variants back into a local augmented dataset
- optionally uploading the augmented dataset to Hugging Face
The repository now exposes two ways to use the workflow:
- high-level shell scripts in scripts/ (documented in scripts/README.md)
- lower-level Python CLI commands in cli.py
If you want one command that runs the whole workflow on the full dataset, use:
```bash
DATA_PARALLEL=true NUM_TRAJECTORIES=4 GPU_IDS=0,1,2,3,4,5,6,7 UPLOAD=true HF_TOKEN=hf_xxx HF_REPO=Miical/so101-30episodes-augmented ./scripts/full_pipeline_episode.sh
```

That command will:
- download and export the dataset into data/episode_exports
- run Cosmos generation for all exported episodes
- append all generated variants into the local augmented dataset
- upload the augmented dataset to Hugging Face
If you prefer the lower-level CLI instead of wrapper scripts, the same workflow is built from:
```bash
PYTHONPATH=src python3 -m embodied_data_transfer process ...
PYTHONPATH=src python3 -m embodied_data_transfer run ...
PYTHONPATH=src python3 -m embodied_data_transfer append ...
```

Typical setup:
```bash
cd /file_system/liujincheng/Projects/EmbodiedDataTransfer
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

If your environment already uses the project venv, the scripts in scripts/ can be run directly.
```
EmbodiedDataTransfer/
├── prompts/
├── scripts/
├── src/
│   └── embodied_data_transfer/
│       ├── __main__.py
│       ├── cli.py
│       ├── common.py
│       ├── dataset_processing.py
│       ├── cosmos_workflow.py
│       ├── augmentation.py
│       └── dataset_workflow.py
├── tests/
├── pyproject.toml
└── README.md
```
- cli.py: command-line entrypoints and argument parsing
- common.py: shared path, naming, and metadata helpers
- dataset_processing.py: dataset download, inspection, and episode export
- cosmos_workflow.py: Cosmos spec generation, single-run execution, and data-parallel scheduling
- augmentation.py: appending generated trajectories into LeRobot datasets and optional upload
- dataset_workflow.py: compatibility facade that re-exports the main workflow functions
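For orientation, here is a minimal sketch of what the compatibility facade in dataset_workflow.py can look like; the re-exported function names below are illustrative placeholders, not the project's actual API.

```python
# dataset_workflow.py -- illustrative sketch of a compatibility facade.
# Older code importing from embodied_data_transfer.dataset_workflow keeps working
# while the real logic lives in the split modules. Function names are hypothetical.
from embodied_data_transfer.dataset_processing import process_dataset        # hypothetical name
from embodied_data_transfer.cosmos_workflow import run_cosmos_generation     # hypothetical name
from embodied_data_transfer.augmentation import append_generated_variants    # hypothetical name

__all__ = [
    "process_dataset",
    "run_cosmos_generation",
    "append_generated_variants",
]
```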
The recommended entrypoints are scripts/process_dataset.sh, scripts/run.sh, scripts/append.sh, and scripts/full_pipeline_episode.sh.
Shared defaults live in scripts/common.sh.
Before using the scripts, make sure you point them at your local Cosmos checkout instead of relying on the example default paths.
Common environment variables:
```bash
DATASET_ID=Miical/so101-30episodes
COSMOS_ROOT=/path/to/cosmos-transfer2.5
COSMOS_PYTHON=/path/to/cosmos-transfer2.5/.venv/bin/python
PROMPT_PATH=/file_system/liujincheng/Projects/EmbodiedDataTransfer/prompts/single_arm_scene_tuning_en.txt
HF_HOME=/file_system/liujincheng/models/cosmos_model_cache
NUM_TRAJECTORIES=4
GPU_IDS=0,1,2,3,4,5,6,7
HF_REPO=Miical/so101-30episodes-augmented
```

COSMOS_ROOT should point to your Cosmos repository, and COSMOS_PYTHON should point to the Python interpreter inside that repository's environment.
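To make the COSMOS_ROOT / COSMOS_PYTHON split concrete, here is a hedged sketch of how a wrapper can launch Cosmos with its own interpreter as a subprocess; the entrypoint module name and CLI flags are placeholders, not the actual Cosmos CLI.

```python
import os
import subprocess

# Illustrative only: run Cosmos with its own interpreter so its dependencies come
# from the Cosmos venv rather than from this project's venv.
cosmos_root = os.environ["COSMOS_ROOT"]      # e.g. /path/to/cosmos-transfer2.5
cosmos_python = os.environ["COSMOS_PYTHON"]  # interpreter inside that checkout's venv

env = dict(os.environ)
env["CUDA_VISIBLE_DEVICES"] = "0"  # pin this run to one GPU (assumption)

# "cosmos_inference_entrypoint" is a placeholder; the real module is whatever the
# Cosmos checkout exposes for edge/distilled generation.
subprocess.run(
    [cosmos_python, "-m", "cosmos_inference_entrypoint", "--seed", "1"],
    cwd=cosmos_root,
    env=env,
    check=True,
)
```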
scripts/process_dataset.sh downloads dataset metadata and videos, then exports one directory per episode.

```bash
./scripts/process_dataset.sh
```

Result:

- raw dataset cache under data/hf_raw
- exported episodes under data/episode_exports/<dataset_name>/episode_XXX
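As a quick sanity check after exporting, the exported layout can be listed with a few lines of Python; this only relies on the paths given above and makes no assumption about the files inside each episode directory.

```python
from pathlib import Path

# Expected layout: data/episode_exports/<dataset_name>/episode_XXX
export_root = Path("data/episode_exports")
for dataset_dir in sorted(p for p in export_root.iterdir() if p.is_dir()):
    episodes = sorted(dataset_dir.glob("episode_*"))
    print(f"{dataset_dir.name}: {len(episodes)} exported episodes")
```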
scripts/run.sh runs Cosmos generation.

Default behavior:

```bash
EPISODE_ID=all
DATA_PARALLEL=false
NUM_TRAJECTORIES=1
```

Run the full dataset:

```bash
NUM_TRAJECTORIES=4 ./scripts/run.sh
```

Run one episode:

```bash
EPISODE_ID=3 NUM_TRAJECTORIES=4 ./scripts/run.sh
```

Run the full dataset with data-parallel scheduling across GPUs:

```bash
DATA_PARALLEL=true NUM_TRAJECTORIES=4 GPU_IDS=0,1,2,3,4,5,6,7 ./scripts/run.sh
```

What it does:

- uses seed, seed + 1, seed + 2, ... for the different trajectory variants (see the sketch after this list)
- writes outputs into cosmos_edge_distilled/variants/variant_XXX
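As a rough illustration of the seed and output-directory convention above, the loop below derives one seed and one variant directory per trajectory; the helper name, the zero-padding width, and the assumption that the variants directory lives under each exported episode directory are all illustrative, not the exact implementation.

```python
from pathlib import Path


def plan_variants(episode_dir: Path, base_seed: int, num_trajectories: int):
    """Illustrative sketch: one (seed, output_dir) pair per trajectory variant."""
    plans = []
    for i in range(num_trajectories):
        seed = base_seed + i  # seed, seed + 1, seed + 2, ...
        # Output layout from the docs: cosmos_edge_distilled/variants/variant_XXX
        out_dir = episode_dir / "cosmos_edge_distilled" / "variants" / f"variant_{i:03d}"
        plans.append((seed, out_dir))
    return plans


# Example: 4 variants for one episode, starting at seed 1.
episode_dir = Path("data/episode_exports/so101-30episodes/episode_003")
for seed, out_dir in plan_variants(episode_dir, base_seed=1, num_trajectories=4):
    print(seed, out_dir)
```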
scripts/append.sh appends generated variants into the local augmented dataset, with optional upload.

Default behavior:

```bash
EPISODE_ID=all
UPLOAD=false
```

Append everything locally:

```bash
./scripts/append.sh
```

Append one episode only:

```bash
EPISODE_ID=3 ./scripts/append.sh
```

Append everything and upload:

```bash
UPLOAD=true HF_TOKEN=hf_xxx HF_REPO=Miical/so101-30episodes-augmented ./scripts/append.sh
```
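For the optional upload step, a minimal sketch using huggingface_hub is shown below; it assumes the augmented dataset sits in a local folder and that the token comes from HF_TOKEN. This is not necessarily how augmentation.py performs the upload, and the local folder path is an assumption.

```python
import os

from huggingface_hub import HfApi

# Illustrative sketch of uploading a local augmented dataset directory.
api = HfApi(token=os.environ["HF_TOKEN"])
api.create_repo("Miical/so101-30episodes-augmented", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="data/augmented_datasets/so101-30episodes-augmented",  # local path is an assumption
    repo_id="Miical/so101-30episodes-augmented",
    repo_type="dataset",
)
```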
scripts/full_pipeline_episode.sh runs the whole workflow:

```bash
./scripts/process_dataset.sh
EPISODE_ID=... ./scripts/run.sh
EPISODE_ID=... UPLOAD=... ./scripts/append.sh
```

Default behavior:
- if you pass no argument, EPISODE_ID=all
- if you pass one argument, it is treated as the episode id
Run everything for the whole dataset:
```bash
DATA_PARALLEL=true NUM_TRAJECTORIES=4 GPU_IDS=0,1,2,3,4,5,6,7 ./scripts/full_pipeline_episode.sh
```

Run everything for one episode:

```bash
DATA_PARALLEL=true NUM_TRAJECTORIES=4 GPU_IDS=0,1,2,3,4,5,6,7 ./scripts/full_pipeline_episode.sh 3
```

Run everything and upload at the end:

```bash
DATA_PARALLEL=true NUM_TRAJECTORIES=4 GPU_IDS=0,1,2,3,4,5,6,7 UPLOAD=true HF_TOKEN=hf_xxx HF_REPO=Miical/so101-30episodes-augmented ./scripts/full_pipeline_episode.sh
```

If you want to bypass the shell scripts, the Python CLI exposes these commands:
- inspect
- process
- run
- append
The CLI entrypoints live in cli.py, while the workflow logic is split across the modules listed above.
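The shape of cli.py can be approximated as an argparse program with one subcommand per workflow step; the sketch below only mirrors the documented subcommand names, while the wiring and option handling are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Sketch of the subcommand layout: inspect, process, run, append.
    # Each real subcommand adds its own options (--split, --export-dir, ...).
    parser = argparse.ArgumentParser(prog="embodied_data_transfer")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("inspect", "process", "run", "append"):
        cmd = subparsers.add_parser(name)
        cmd.add_argument("dataset_id")  # e.g. Miical/so101-30episodes
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.dataset_id)  # the real CLI dispatches to the workflow modules here
```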
The inspect command prints rows grouped by episode from the source dataset.

```bash
PYTHONPATH=src python3 -m embodied_data_transfer inspect \
    Miical/so101-30episodes \
    --split train \
    --cache-dir data/huggingface
```
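Conceptually, inspect can be reproduced with the datasets library by grouping rows on an episode index column; the column name episode_index is an assumption based on the usual LeRobot schema, and this assumes the source repo loads with datasets.load_dataset.

```python
from collections import defaultdict

from datasets import load_dataset

# Illustrative: load the source dataset and count rows per episode.
ds = load_dataset("Miical/so101-30episodes", split="train", cache_dir="data/huggingface")

rows_per_episode = defaultdict(int)
for row in ds:
    rows_per_episode[row["episode_index"]] += 1  # column name is an assumption

for episode_id, count in sorted(rows_per_episode.items()):
    print(f"episode {episode_id}: {count} rows")
```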
The process command downloads the dataset and exports episode directories.

```bash
PYTHONPATH=src python3 -m embodied_data_transfer process \
    Miical/so101-30episodes \
    --split train \
    --cache-dir data/huggingface \
    --raw-dir data/hf_raw \
    --export-dir data/episode_exports
```

The run command is the unified generation entrypoint.
Important parameters:
- `--episode-id 3` runs one episode
- `--episode-id all` runs all exported episodes
- `--data-parallel` enables multi-GPU scheduling
- `--gpu-ids 0,1,2,3` selects available GPUs
- `--num-trajectories 4` generates four variants per episode
Run one episode:
```bash
PYTHONPATH=src python3 -m embodied_data_transfer run \
    Miical/so101-30episodes \
    --episode-id 3 \
    --export-dir data/episode_exports \
    --cosmos-root /root/code/cosmos-transfer2.5 \
    --cosmos-python /root/code/cosmos-transfer2.5/.venv/bin/python \
    --prompt-path /file_system/liujincheng/Projects/EmbodiedDataTransfer/prompts/single_arm_scene_tuning_en.txt \
    --hf-home /file_system/liujincheng/models/cosmos_model_cache \
    --cosmos-model edge/distilled \
    --num-steps 4 \
    --seed 1 \
    --num-trajectories 4 \
    --nproc-per-node 1 \
    --master-port 12341
```

Run the full dataset with data-parallel scheduling:
```bash
PYTHONPATH=src python3 -m embodied_data_transfer run \
    Miical/so101-30episodes \
    --episode-id all \
    --export-dir data/episode_exports \
    --cosmos-root /root/code/cosmos-transfer2.5 \
    --cosmos-python /root/code/cosmos-transfer2.5/.venv/bin/python \
    --prompt-path /file_system/liujincheng/Projects/EmbodiedDataTransfer/prompts/single_arm_scene_tuning_en.txt \
    --hf-home /file_system/liujincheng/models/cosmos_model_cache \
    --cosmos-model edge/distilled \
    --num-steps 4 \
    --seed 1 \
    --num-trajectories 4 \
    --data-parallel \
    --gpu-ids 0,1,2,3,4,5,6,7 \
    --master-port 12341
```
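A hedged sketch of the data-parallel idea: episodes are distributed across the GPUs given in --gpu-ids, with each subprocess pinned to one GPU via CUDA_VISIBLE_DEVICES. The real scheduler in cosmos_workflow.py may differ in details such as queueing and port assignment, and the per-episode launcher name below is a placeholder.

```python
import os
import subprocess

# Illustrative: run at most one episode per GPU at a time,
# pinning each subprocess to a single GPU via CUDA_VISIBLE_DEVICES.
gpu_ids = [0, 1, 2, 3, 4, 5, 6, 7]
episodes = [f"episode_{i:03d}" for i in range(30)]  # placeholder episode list

for batch_start in range(0, len(episodes), len(gpu_ids)):
    batch = episodes[batch_start:batch_start + len(gpu_ids)]
    procs = []
    for episode, gpu in zip(batch, gpu_ids):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        # "run_one_episode.py" is a hypothetical per-episode launcher, not the real entrypoint.
        procs.append(subprocess.Popen(["python3", "run_one_episode.py", episode], env=env))
    for proc in procs:
        proc.wait()  # finish the whole batch before scheduling the next one
```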
The append command is the unified append entrypoint. Important parameters:

- `--episode-id 3` appends one episode
- `--episode-id all` appends all exported episodes
- `--upload` uploads after append
- `--hf-repo` selects the target Hugging Face dataset repo
Append everything locally:
```bash
PYTHONPATH=src python3 -m embodied_data_transfer append \
    Miical/so101-30episodes \
    --episode-id all \
    --export-dir data/episode_exports \
    --raw-dir data/hf_raw \
    --target-dir data/augmented_datasets \
    --cosmos-model edge/distilled
```

Append everything and upload:
```bash
PYTHONPATH=src python3 -m embodied_data_transfer append \
    Miical/so101-30episodes \
    --episode-id all \
    --export-dir data/episode_exports \
    --raw-dir data/hf_raw \
    --target-dir data/augmented_datasets \
    --cosmos-model edge/distilled \
    --upload \
    --hf-repo Miical/so101-30episodes-augmented \
    --hf-token-env HF_TOKEN
```

Append only one episode and upload:
```bash
PYTHONPATH=src python3 -m embodied_data_transfer append \
    Miical/so101-30episodes \
    --episode-id 3 \
    --export-dir data/episode_exports \
    --raw-dir data/hf_raw \
    --target-dir data/augmented_datasets \
    --cosmos-model edge/distilled \
    --upload \
    --hf-repo Miical/so101-30episodes-augmented \
    --hf-token-env HF_TOKEN
```

Notes:

- run and append both default to episode-id=all in the wrapper scripts.
- multi-trajectory generation uses incrementing seeds to make variants different.
- generated Cosmos outputs are stored per episode and per variant.
- if you use a SOCKS proxy with Cosmos downloads, make sure your Cosmos checkout includes the socksio fix in its checkpoint_db.py.

For more detail:
- high-level script examples: scripts/README.md
- prompt templates: prompts/README.md
- CLI implementation: cli.py
- dataset download and export: dataset_processing.py
- Cosmos generation workflow: cosmos_workflow.py
- dataset append and upload: augmentation.py
- compatibility re-exports: dataset_workflow.py