Training Troubleshooting
Common training issues and solutions.
For training workflow, see Training Tutorial. For data preparation, see Dataset Reference.
Issue 1: Mean Episode Length = 1.00 (Robot Terminates on First Step)
Symptoms
- Mean episode length: 1.00
- Episode_Termination/anchor_pos near the total parallel environment count
- Metrics/motion/error_anchor_pos > 0.5 m
- Metrics/motion/error_body_rot very large (close to pi)
Root Cause
This is usually not a PPO hyperparameter issue. The labels in the motion NPZ are inconsistent with MuJoCo forward kinematics (FK):
- Body position coordinate error: Local coordinates used as world coordinates
- Body order error: Using PKL's 38-body order instead of mjlab G1's 30-body order
- Body orientation/angular velocity error: All bodies approximated to root orientation
The current convert_pkl_to_npz.py fixes these issues.
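To illustrate the first error above, here is a minimal sketch of turning a root-relative body position into a world position. It assumes the local coordinates are expressed in the root frame and that quaternions use MuJoCo's (w, x, y, z) order. The function names are hypothetical and are not taken from convert_pkl_to_npz.py, whose actual frame conventions may differ.

```python
import math

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * cross(q_xyz, v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + cross(q_xyz, t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )

def local_to_world(root_pos, root_quat, local_pos):
    """World position = root position + root rotation applied to the local offset."""
    rotated = quat_rotate(root_quat, local_pos)
    return tuple(p + r for p, r in zip(root_pos, rotated))

# Hypothetical example: root at (1, 2, 0.8), yawed 90 degrees about +z.
root_pos = (1.0, 2.0, 0.8)
half_yaw = math.pi / 4
root_quat = (math.cos(half_yaw), 0.0, 0.0, math.sin(half_yaw))
local_pos = (0.1, 0.0, -0.3)

world_pos = local_to_world(root_pos, root_quat, local_pos)
# Writing local_pos into the NPZ as if it were world_pos would put the body
# roughly 2.3 m away from where MuJoCo FK places it in this example.
```

If local offsets are stored as world positions, the anchor position error is on the order of the root's distance from the origin, which is consistent with error_anchor_pos exceeding 0.5 m on the first step.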
Quick Diagnosis
python train_mimic/scripts/data/check_motion_npz_fk.py \
--npz data/datasets/<dataset>/clips/<source>/<clip>.npz
Expected thresholds: pos_max < 1e-3 m, quat_mean < 0.05 rad, quat_p95 < 0.10 rad.
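For reference, the thresholds above can be expressed as a small check. This is only a sketch: the constant and function names are hypothetical, and the angle metric shown (geodesic angle between unit quaternions) is one common definition that check_motion_npz_fk.py may or may not use.

```python
import math

# Thresholds quoted above; the constant names are hypothetical.
POS_MAX_TOL = 1e-3     # metres
QUAT_MEAN_TOL = 0.05   # radians
QUAT_P95_TOL = 0.10    # radians

def quat_angle(q1, q2):
    """Geodesic angle in radians between two unit quaternions (w, x, y, z).

    The absolute value makes q and -q (the same rotation) compare as equal.
    """
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))

def fk_check_passes(pos_max, quat_mean, quat_p95):
    """True when all three error statistics are below their thresholds."""
    return (
        pos_max < POS_MAX_TOL
        and quat_mean < QUAT_MEAN_TOL
        and quat_p95 < QUAT_P95_TOL
    )

# A body labelled with an orientation flipped 180 degrees about z gives an
# error of pi, the "close to pi" symptom listed for this issue.
flipped_error = quat_angle((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0))
```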
If the check fails, regenerate the data and run a smoke test:
python train_mimic/scripts/train.py \
--num_envs 64 --max_iterations 100 \
--motion_file data/datasets/<dataset>/train
Expected: Mean episode length significantly > 1, error_anchor_pos starts decreasing.
Issue 2: Episode Length Not Growing
Symptoms
After 1000+ iterations, Mean episode length stays low (< 3) with no upward trend.
Possible Causes
- Poor retargeting quality (unreachable target poses)
- Tracking reward weight too low vs regularization
- Learning rate too high/low, clip_param mismatch
- Termination thresholds too strict
Diagnosis Steps
- Visualize reference motion with play.py
- Check reward distribution - tracking reward should dominate
- Temporarily increase bad_anchor_pos threshold (0.25 m -> 0.5 m)
- Compare with mjlab's built-in G1 tracking task
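The reward distribution check can be sketched as a comparison between the summed tracking terms and the summed magnitude of everything else. The term names, the "motion_" prefix, and the numbers below are hypothetical placeholders. Substitute the per-term episode means that your logger actually reports.

```python
def reward_balance(term_means, tracking_prefix="motion_"):
    """Return (tracking_total, other_magnitude) from per-term mean rewards.

    Terms whose name starts with tracking_prefix count as tracking rewards;
    all other terms are treated as regularization and summed by magnitude.
    """
    tracking = sum(
        value for name, value in term_means.items()
        if name.startswith(tracking_prefix)
    )
    other = sum(
        abs(value) for name, value in term_means.items()
        if not name.startswith(tracking_prefix)
    )
    return tracking, other

# Hypothetical per-term means copied by hand from a training log.
term_means = {
    "motion_body_pos": 0.4,
    "motion_body_ori": 0.3,
    "action_rate_l2": -0.2,
    "joint_limit": -0.05,
}

tracking, other = reward_balance(term_means)
tracking_dominates = tracking > other
```

If tracking_dominates is False, the policy may be optimizing for smoothness or joint limits rather than following the reference, which matches the "tracking reward weight too low" cause listed above.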
Issue 3: Slow Training
Symptoms
Training speed < 1000 steps/s (expected 1500-2000 on RTX 4090).
Solutions
- Increase --num_envs to 4096 (needs 24 GB VRAM)
- Disable --video during training
- Use TensorBoard instead of W&B (default)
Issue 4: nefc overflow - please increase njmax
Symptoms
nefc overflow - please increase njmax to 257
Root Cause
MuJoCo constraint buffer insufficient. When the robot falls or has many contacts, active constraints exceed njmax. The mjlab training default is sim.njmax=250.
Solution
Already fixed in the repository. The env builder in train_mimic/tasks/tracking/config/env.py overrides:
self.sim.njmax = 500
self.sim.nconmax = 150_000
If the warning persists and reports a larger required value, increase njmax further (for example, to 800).
Only modifying the robot XML is insufficient - the simulation-level njmax in mjlab takes precedence.
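When choosing a new value, it can help to scan a training log for the largest njmax the warning asked for. The sketch below assumes the message format shown under Symptoms. The function name and the 2x headroom factor are hypothetical choices, not mjlab defaults.

```python
import math
import re

# Matches the warning text shown under Symptoms and captures the requested value.
NJMAX_WARNING = re.compile(r"nefc overflow - please increase njmax to (\d+)")

def suggested_njmax(log_text, headroom=2.0):
    """Return the largest requested njmax times a headroom factor, or None.

    headroom is an arbitrary safety margin (an assumption, not an mjlab setting).
    """
    requested = [int(value) for value in NJMAX_WARNING.findall(log_text)]
    if not requested:
        return None
    return math.ceil(max(requested) * headroom)

# Hypothetical log excerpt containing two warnings.
log_text = (
    "nefc overflow - please increase njmax to 257\n"
    "iteration 12 ...\n"
    "nefc overflow - please increase njmax to 301\n"
)
new_njmax = suggested_njmax(log_text)
```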
Issue 5: Benchmark Video Problems
Video has only 1 frame
Ensure num_eval_steps >= video_length:
python train_mimic/scripts/benchmark.py \
--checkpoint logs/rsl_rl/g1_general_tracking/<run>/model_30000.pt \
--motion_file data/datasets/<dataset>/val \
--num_envs 1 --num_eval_steps 2000 \
--video --video_length 600
EGL/OpenGL errors
Install OpenGL/EGL dependencies:
conda install -c conda-forge libopengl libglx libegl libglvnd pyopengl
If GPU EGL is unavailable, try CPU rendering:
MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa \
python train_mimic/scripts/benchmark.py ... --video
Issue 6: Foot Sliding in Sim2Sim (Benchmark OK but ONNX Inference Slides)
Root Cause
Sim2sim configuration parameters do not match the training environment:
- default_angles mismatch (critical): Different joint defaults cause action offset and observation errors
- Missing joint armature: Training environment has non-zero armature; zero armature causes overshoot
- condim mismatch: Different collision parameters between training and sim2sim
Diagnosis
from mjlab.asset_zoo.robots import get_g1_robot_cfg
cfg = get_g1_robot_cfg()
print(cfg.init_state.joint_pos) # Must match g1.yaml default_angles
Solution
Update teleopit/configs/robot/g1.yaml and g1_mjlab.xml to match training environment values (default angles, armature, condim).
This fix also affects the sim2real path since default_angles is shared by rl_policy.py and observation.py.
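A minimal sketch of comparing the two sets of default angles once both are available as plain joint-name-to-angle dictionaries. How to obtain those dictionaries is left out on purpose: cfg.init_state.joint_pos and g1.yaml may use different layouts (for example name patterns or an ordered list), so the helper name and the values below are hypothetical.

```python
def diff_default_angles(train_angles, deploy_angles, tol=1e-6):
    """Return {joint: (train, deploy)} for joints that differ or are missing.

    A value of None means the joint is absent from that side.
    """
    mismatches = {}
    for name in sorted(set(train_angles) | set(deploy_angles)):
        train = train_angles.get(name)
        deploy = deploy_angles.get(name)
        if train is None or deploy is None or abs(train - deploy) > tol:
            mismatches[name] = (train, deploy)
    return mismatches

# Illustrative values only; read the real ones from the training config
# and from the deployment YAML.
train_angles = {"left_hip_pitch_joint": -0.312, "left_knee_joint": 0.669}
deploy_angles = {"left_hip_pitch_joint": -0.312, "left_knee_joint": 0.3}

mismatches = diff_default_angles(train_angles, deploy_angles)
for joint, (train, deploy) in mismatches.items():
    print(f"{joint}: training={train} deploy={deploy}")
```

An empty result means the defaults agree within tol. Armature and condim still need to be compared separately.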