Training Troubleshooting

Common training issues and solutions.

info

For training workflow, see Training Tutorial. For data preparation, see Dataset Reference.

Issue 1: Mean Episode Length = 1.00 (Robot Terminates on First Step)

Symptoms

Mean episode length: 1.00
Episode_Termination/anchor_pos near total parallel environment count
Metrics/motion/error_anchor_pos > 0.5 m
Metrics/motion/error_body_rot very large (close to pi)

Root Cause

Usually not a PPO hyperparameter issue, but motion NPZ labels inconsistent with MuJoCo FK:

Body position coordinate error: Local coordinates used as world coordinates
Body order error: Using PKL's 38-body order instead of mjlab G1's 30-body order
Body orientation/angular velocity error: All bodies approximated to root orientation

The current convert_pkl_to_npz.py fixes these issues.

Quick Diagnosis

python train_mimic/scripts/data/check_motion_npz_fk.py \
    --npz data/datasets/<dataset>/clips/<source>/<clip>.npz

Expected thresholds: pos_max < 1e-3 m, quat_mean < 0.05 rad, quat_p95 < 0.10 rad.

If check fails, regenerate data and run a smoke test:

python train_mimic/scripts/train.py \
    --num_envs 64 --max_iterations 100 \
    --motion_file data/datasets/<dataset>/train

Expected: Mean episode length significantly > 1, error_anchor_pos starts decreasing.

Issue 2: Episode Length Not Growing

Symptoms

After 1000+ iterations, Mean episode length stays low (< 3) with no upward trend.

Possible Causes

Poor retargeting quality (unreachable target poses)
Tracking reward weight too low vs regularization
Learning rate too high/low, clip_param mismatch
Termination thresholds too strict

Diagnosis Steps

Visualize reference motion with play.py
Check reward distribution - tracking reward should dominate
Temporarily increase bad_anchor_pos threshold (0.25m -> 0.5m)
Compare with mjlab's built-in G1 tracking task

Issue 3: Slow Training

Symptoms

Training speed < 1000 steps/s (expected 1500-2000 on RTX 4090).

Solutions

Increase --num_envs to 4096 (needs 24 GB VRAM)
Disable --video during training
Use TensorBoard instead of W&B (default)

Issue 4: `nefc overflow - please increase njmax`

Symptoms

nefc overflow - please increase njmax to 257

Root Cause

MuJoCo constraint buffer insufficient. When the robot falls or has many contacts, active constraints exceed njmax. The mjlab training default is sim.njmax=250.

Solution

Already fixed in the repository. The env builder in train_mimic/tasks/tracking/config/env.py overrides:

self.sim.njmax = 500
self.sim.nconmax = 150_000

If warnings persist at higher values, increase to njmax = 800.

note

Only modifying the robot XML is insufficient - the simulation-level njmax in mjlab takes precedence.

Issue 5: Benchmark Video Problems

Video has only 1 frame

Ensure num_eval_steps >= video_length:

python train_mimic/scripts/benchmark.py \
    --checkpoint logs/rsl_rl/g1_general_tracking/<run>/model_30000.pt \
    --motion_file data/datasets/<dataset>/val \
    --num_envs 1 --num_eval_steps 2000 \
    --video --video_length 600

EGL/OpenGL errors

Install OpenGL/EGL dependencies:

conda install -c conda-forge libopengl libglx libegl libglvnd pyopengl

If GPU EGL is unavailable, try CPU rendering:

MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa \
    python train_mimic/scripts/benchmark.py ... --video

Issue 6: Foot Sliding in Sim2Sim (Benchmark OK but ONNX Inference Slides)

Root Cause

Sim2sim configuration parameters mismatch with training environment:

default_angles mismatch (critical): Different joint defaults cause action offset and observation errors
Missing joint armature: Training environment has non-zero armature; zero armature causes overshoot
condim mismatch: Different collision parameters between training and sim2sim

Diagnosis

from mjlab.asset_zoo.robots import get_g1_robot_cfg
cfg = get_g1_robot_cfg()
print(cfg.init_state.joint_pos)  # Must match g1.yaml default_angles

Solution

Update teleopit/configs/robot/g1.yaml and g1_mjlab.xml to match training environment values (default angles, armature, condim).

This fix also affects the sim2real path since default_angles is shared by rl_policy.py and observation.py.

Issue 1: Mean Episode Length = 1.00 (Robot Terminates on First Step)​

Symptoms​

Root Cause​

Quick Diagnosis​

Issue 2: Episode Length Not Growing​

Symptoms​

Possible Causes​

Diagnosis Steps​

Issue 3: Slow Training​

Symptoms​

Solutions​

Issue 4: nefc overflow - please increase njmax​

Symptoms​

Root Cause​

Solution​

Issue 5: Benchmark Video Problems​

Video has only 1 frame​

EGL/OpenGL errors​

Issue 6: Foot Sliding in Sim2Sim (Benchmark OK but ONNX Inference Slides)​

Root Cause​

Diagnosis​

Solution​

Issue 1: Mean Episode Length = 1.00 (Robot Terminates on First Step)

Symptoms

Root Cause

Quick Diagnosis

Issue 2: Episode Length Not Growing

Symptoms

Possible Causes

Diagnosis Steps

Issue 3: Slow Training

Symptoms

Solutions

Issue 4: `nefc overflow - please increase njmax`

Symptoms

Root Cause

Solution

Issue 5: Benchmark Video Problems

Video has only 1 frame

EGL/OpenGL errors

Issue 6: Foot Sliding in Sim2Sim (Benchmark OK but ONNX Inference Slides)

Root Cause

Diagnosis

Solution