# Dataset

## Download Pre-Built Dataset (Recommended)
```bash
python scripts/setup/download_assets.py --only data
```
Then train directly with the shard directory:
```bash
python train_mimic/scripts/train.py --motion_file data/datasets/seed/train
```
For custom dataset construction, read on.
## Custom Dataset Construction

Data pipeline: typed source YAML -> preprocess/filter -> shard-only training data.
```bash
python train_mimic/scripts/data/build_dataset.py \
    --spec train_mimic/configs/datasets/twist2_full.yaml
```
### Output Structure
```
data/datasets/<dataset>/
├── clips/                  # Optional; only for per-clip intermediates
│   └── <source>/...
├── train/
│   └── shard_*.npz
├── val/
│   └── shard_*.npz
├── manifest_resolved.csv
└── build_info.json
```
- If the spec contains `bvh` or `npz` sources, the builder retains/generates `clips/`.
- If the spec is all `pkl` or `seed_csv` sources, the builder takes a batch path that produces split-level shards directly.
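After a build, it can be handy to sanity-check the output directory. This is a minimal sketch that counts shards per split and peeks at `build_info.json`; it only assumes the layout shown above, not any `build_dataset.py` internals.

```python
"""Sketch: inspect a built dataset directory (layout assumed from the tree above)."""
import json
from pathlib import Path


def summarize_dataset(root: str) -> dict:
    """Count shard files per split and list the keys recorded in build_info.json."""
    base = Path(root)
    shards = {split: sorted((base / split).glob("shard_*.npz"))
              for split in ("train", "val")}
    info_path = base / "build_info.json"
    info = json.loads(info_path.read_text()) if info_path.exists() else {}
    return {
        "train_shards": len(shards["train"]),
        "val_shards": len(shards["val"]),
        "build_info_keys": sorted(info.keys()),
    }
```

For example, `summarize_dataset("data/datasets/seed")` would report how many `shard_*.npz` files landed in `train/` and `val/`.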
## YAML Spec Format

Example (`train_mimic/configs/datasets/twist2_full.yaml`):
```yaml
name: twist2_full
target_fps: 30
val_percent: 5
hash_salt: ""
preprocess:
  normalize_root_xy: true
  ground_align: clip_min_foot
sources:
  - name: OMOMO_g1_GMR
    type: pkl
    input: data/twist2_retarget_pkl/OMOMO_g1_GMR
  - name: lafan1_v1
    type: bvh
    input: data/lafan1_bvh
    bvh_format: lafan1
```
### Field Reference

| Field | Description |
|---|---|
| `name` | Dataset name; maps to the output directory |
| `target_fps` | Target frame rate for resampling |
| `val_percent` | Validation split percentage (hash-based on `clip_id`) |
| `hash_salt` | Optional salt for the split hash |
| `preprocess.normalize_root_xy` | Normalize the root body's first-frame xy to the origin |
| `preprocess.ground_align` | `none` / `clip_min_foot` |
| `preprocess.min_frames` | Minimum clip length (frames) |
| `preprocess.max_root_lin_vel` | Root linear velocity filter threshold |
| `preprocess.min_peak_body_height` | Minimum peak body height |
| `preprocess.max_all_off_ground_s` | Max duration (seconds) with all feet off the ground |
| `sources[].name` | Source name (used for the `clips/` subdirectory) |
| `sources[].type` | `bvh` / `pkl` / `npz` / `seed_csv` |
| `sources[].input` | Input file or directory |
| `sources[].weight` | Optional sampling weight (default 1.0) |
| `sources[].bvh_format` | Required for BVH: `lafan1` / `hc_mocap` / `nokov` |
| `sources[].robot_name` | BVH only; default `unitree_g1` |
| `sources[].max_frames` | BVH only; 0 = full length |
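A hash-based split keyed on `clip_id` (with `val_percent` and `hash_salt` as in the table above) typically works like the sketch below. The exact hash `build_dataset.py` uses is not documented here; this only illustrates the general scheme: deterministic, so rebuilding never moves clips between splits, while changing `hash_salt` reshuffles the assignment.

```python
"""Illustrative hash-based train/val split; not build_dataset.py's actual code."""
import hashlib


def is_val_clip(clip_id: str, val_percent: float, hash_salt: str = "") -> bool:
    """Assign a clip to val iff its salted hash falls in the first val_percent buckets."""
    digest = hashlib.sha256((hash_salt + clip_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < val_percent
```

Because the bucket depends only on `clip_id` and the salt, adding or removing other sources leaves existing clips' split assignments unchanged.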
## Conversion Rules

All sources are converted to standard training shards. Each clip goes through preprocess/filter before being written to shards:
- `bvh` -> retarget pkl -> npz clip
- `pkl` -> npz clip (or direct batch shard for pkl-only datasets)
- `npz` -> validate + copy/reuse
Each shard contains `clip_starts`, `clip_lengths`, `clip_fps`, and `clip_weights`.
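The metadata fields above let you slice individual clips back out of a shard. A minimal sketch, assuming the per-frame data lives in an array named `qpos` (that name is an assumption; the metadata field names are from the list above):

```python
"""Sketch: recover one clip from a shard via clip_starts / clip_lengths."""
import numpy as np


def load_clip(shard_path: str, clip_index: int) -> dict:
    """Return one clip's frames and metadata from a shard_*.npz file."""
    with np.load(shard_path) as shard:
        start = int(shard["clip_starts"][clip_index])
        length = int(shard["clip_lengths"][clip_index])
        return {
            "fps": float(shard["clip_fps"][clip_index]),
            "weight": float(shard["clip_weights"][clip_index]),
            # "qpos" is an assumed array name for the concatenated frame data
            "frames": shard["qpos"][start:start + length],
        }
```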
## Common Commands
```bash
# Force rebuild
python train_mimic/scripts/data/build_dataset.py \
    --spec train_mimic/configs/datasets/twist2_full.yaml --force

# Parallel processing
python train_mimic/scripts/data/build_dataset.py \
    --spec train_mimic/configs/datasets/twist2_full.yaml --jobs 8

# Custom output root
python train_mimic/scripts/data/build_dataset.py \
    --spec train_mimic/configs/datasets/twist2_full.yaml \
    --output_root /tmp/my_datasets

# Print build report
python train_mimic/scripts/data/build_dataset.py \
    --spec train_mimic/configs/datasets/twist2_full.yaml --json
```
## Batch Ingest to NPZ Clips

Convert raw data to standard NPZ clips without merging:
```bash
python train_mimic/scripts/data/ingest_motion.py \
    --type bvh --input data/lafan1_bvh \
    --output data/datasets/lafan1_v1/clips/lafan1_v1 \
    --source lafan1_v1 --bvh_format lafan1 --jobs 8
```
## Check Clip FK Consistency

```bash
python train_mimic/scripts/data/check_motion_npz_fk.py \
    --npz data/datasets/<dataset>/clips/<source>/<clip>.npz
```
Recommended thresholds: `pos_max` < 1e-3 m, `quat_mean` < 0.05 rad, `quat_p95` < 0.10 rad.
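The thresholds above correspond to a max position error in metres plus mean/p95 quaternion geodesic angles in radians. A sketch of those metrics, assuming `(T, B, 3)` body positions and `(T, B, 4)` unit quaternions (these shapes are assumptions, not `check_motion_npz_fk.py`'s actual interface):

```python
"""Sketch of FK-consistency error metrics between two pose sets."""
import numpy as np


def fk_errors(pos_a, pos_b, quat_a, quat_b):
    """Return pos_max (m) and quat_mean / quat_p95 (rad) between two FK results."""
    # Euclidean position error per body per frame
    pos_err = np.linalg.norm(pos_a - pos_b, axis=-1)
    # Geodesic angle between orientations; |dot| handles the q / -q ambiguity
    dot = np.abs(np.sum(quat_a * quat_b, axis=-1)).clip(0.0, 1.0)
    ang_err = 2.0 * np.arccos(dot)
    return {
        "pos_max": float(pos_err.max()),
        "quat_mean": float(ang_err.mean()),
        "quat_p95": float(np.percentile(ang_err, 95)),
    }
```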
## Re-shard

Split large shards for distribution:

```bash
python train_mimic/scripts/data/split_shards.py \
    --input data/datasets/seed/train \
    --output data/datasets/seed/train_small_shards \
    --max_size_gb 2
```
Each shard is self-contained with full clip metadata.
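Since each shard is self-contained, a size-budgeted split amounts to greedily packing whole clips into shards until `--max_size_gb` would be exceeded. A sketch of that planning step (this mirrors the behaviour, not `split_shards.py` itself; per-clip byte sizes are an assumed input):

```python
"""Sketch: greedily group clip indices into shards under a size budget."""


def plan_shards(clip_bytes: list[int], max_size_gb: float) -> list[list[int]]:
    """Return lists of clip indices, each group staying under max_size_gb."""
    budget = int(max_size_gb * 1024 ** 3)
    shards, current, used = [], [], 0
    for i, size in enumerate(clip_bytes):
        # Start a new shard when adding this clip would exceed the budget
        if current and used + size > budget:
            shards.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        shards.append(current)
    return shards
```

Clips are never split across shards, which is what keeps every output shard independently loadable.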