Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes.
Figure 1: Overview of our proposed low-cost framework. It efficiently enables robot imitation learning from human videos with minimal computational resources.
Demo video showcasing EasyMimic performing various manipulation tasks.
Our method first extracts 3D hand trajectories from the RGB videos. The physical alignment module then maps these trajectories to the gripper control space of the robot, ensuring precise motion transfer from human demonstrations to robot execution.
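To make the mapping concrete, here is a minimal per-frame sketch of how such an alignment could be implemented, assuming 3D hand keypoints (wrist, thumb tip, index tip) in the camera frame are already available from an off-the-shelf hand pose estimator; the function name, the camera-to-base calibration matrix, and the maximum gripper opening are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def align_hand_to_gripper(wrist, thumb_tip, index_tip, T_cam_to_base,
                          max_opening_m=0.08):
    """Map one frame of 3D hand keypoints (camera frame, meters) to a
    gripper target in the robot base frame. Names and the calibration
    matrix are illustrative, not the paper's actual interface."""
    # Pinch point: midpoint between thumb and index fingertips.
    p_cam = 0.5 * (thumb_tip + index_tip)
    p_base = (T_cam_to_base @ np.append(p_cam, 1.0))[:3]   # 4x4 extrinsics

    # Approach direction: wrist -> pinch point, rotated into the base frame.
    approach_cam = p_cam - wrist
    approach_cam = approach_cam / (np.linalg.norm(approach_cam) + 1e-8)
    approach_base = T_cam_to_base[:3, :3] @ approach_cam

    # Gripper opening: fingertip distance normalized to [0, 1].
    opening = np.linalg.norm(thumb_tip - index_tip) / max_opening_m
    opening = float(np.clip(opening, 0.0, 1.0))

    return p_base, approach_base, opening
```

The fingertip distance acts as a proxy for the opening of a parallel-jaw gripper, which is one simple way to derive a two-finger gripper command from a five-fingered human hand.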
Figure 2: Physical Alignment. The physical alignment module maps extracted human hand trajectories to the robot's gripper control space.
Figure 3: Co-Training Strategy. Overview of the co-training process leveraging both human and robot data for robust policy learning.
We employ a co-training strategy that fine-tunes the policy on both the processed human data and a small amount of robot data. This enables the robot to rapidly adapt to new tasks and generalize effectively across different manipulation scenarios.
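As a rough sketch of how such co-training could be wired up in PyTorch, the snippet below mixes the two data sources within each batch; `human_dataset`, `robot_dataset`, and the 50/50 mixing ratio are placeholders and assumptions rather than the paper's reported configuration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_cotraining_loader(human_dataset, robot_dataset, batch_size=32,
                           human_ratio=0.5):
    """Build one loader that mixes human and robot samples at a fixed
    ratio in expectation, regardless of the two dataset sizes."""
    mixed = ConcatDataset([human_dataset, robot_dataset])

    # Per-sample weights so each source contributes `human_ratio` and
    # `1 - human_ratio` of the draws (human samples come first in `mixed`).
    w_human = human_ratio / len(human_dataset)
    w_robot = (1.0 - human_ratio) / len(robot_dataset)
    weights = torch.tensor([w_human] * len(human_dataset) +
                           [w_robot] * len(robot_dataset))

    sampler = WeightedRandomSampler(weights, num_samples=len(mixed),
                                    replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```

Weighting by source rather than simply concatenating the datasets prevents the much larger human set from dominating every batch.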
Our experimental platform uses a 6-DoF so100-plus robotic arm equipped with a two-finger gripper. The vision system includes two monocular RGB cameras: one fixed above the robot's base for a top-down global view, and another mounted on the wrist for an end-effector-centric first-person view. We evaluate our method on four tabletop manipulation tasks:
- **Pick:** The robot picks up a toy duck and places it into a bowl. Success (1.0) = grasping (0.5) + placing (0.5).
- **Pull:** The robot pulls open a drawer and then pushes it closed. Success (1.0) = pulling (0.5) + pushing (0.5).
- **Stack:** The robot stacks a cube on a block, then a pyramid on the cube. Each successful stack awards 0.5 points.
- **LC (Language-Conditioned):** The robot executes natural language instructions specifying the target object and goal (e.g., "pick up the pink duck...").
We use the pre-trained Gr00T N1.5-3B foundation VLA model. The policy network outputs absolute actions (6-DoF end-effector poses + gripper states). Training: 5,000 gradient steps, AdamW optimizer (lr=1e-4), batch size 32, on a single NVIDIA RTX 4090 GPU.
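This recipe can be reproduced with a standard PyTorch loop; the sketch below is a generic outline in which `policy` stands in for the fine-tuned VLA model, `loader` for the co-training dataloader, and a simple regression loss for the model's actual action objective (all assumptions, not the GR00T N1.5 API).

```python
import torch

def finetune(policy, loader, steps=5_000, lr=1e-4, device="cuda"):
    """Generic fine-tuning loop matching the reported hyperparameters
    (5,000 gradient steps, AdamW, lr=1e-4; batch size set by the loader)."""
    policy.to(device).train()
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)

    data_iter = iter(loader)
    for step in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:              # restart the loader between epochs
            data_iter = iter(loader)
            batch = next(data_iter)

        obs = batch["observation"].to(device)
        actions = batch["action"].to(device)   # absolute EE pose + gripper state

        pred = policy(obs)
        loss = torch.nn.functional.mse_loss(pred, actions)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```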
For each task, we collect 100 human video demonstrations and 20 robot teleoperation trajectories. Human data can be collected far faster (up to 12.5 demos/min) than robot teleoperation data (at most 2 demos/min), highlighting the efficiency of video-based demonstration collection.
| Strategy | Pick | Pull | Stack | LC | Avg. |
|---|---|---|---|---|---|
| Robot-Only (10 traj.) | 0.30 | 0.30 | 0.15 | 0.30 | 0.26 |
| Robot-Only (20 traj.) | 0.60 | 0.70 | 0.35 | 0.40 | 0.51 |
| Pretrain-Finetune | 0.80 | 0.90 | 0.50 | 0.80 | 0.75 |
| EasyMimic (Ours) | 1.00 | 0.90 | 0.70 | 0.90 | 0.88 |
Table 2: Performance evaluation. EasyMimic achieves the best performance (0.88 avg), surpassing baselines significantly while maintaining data efficiency.
Comparison Analysis: Training on scarce robot data alone (10-20 trajectories) yields limited performance (0.26-0.51 average success). Incorporating human data boosts performance significantly: EasyMimic outperforms Pretrain-Finetune by 0.13 points and Robot-Only (10 traj.) by 0.62 points, showing that co-training effectively exploits the complementary strengths of human and robot data.
On language-conditioned tasks involving distinct object-goal combinations (e.g., pink/green duck into bowl/block), EasyMimic achieves a success rate of 0.90, compared to just 0.40 for the robot-only baseline, demonstrating the substantial value of human demonstrations for complex instruction following.
Figure 4(a): Varying human data (fixed 10 robot traj).
Figure 4(b): Varying robot data (fixed 50 human videos).
Increasing the number of human demonstrations consistently improves performance, with the largest gains observed up to roughly 50 demos. For robot data, performance saturates quickly beyond 10-20 trajectories when complemented by human data. This confirms that high performance can be achieved with minimal expensive robot data as long as sufficient human data is available.
Both Action Alignment (AA) and Visual Alignment (VA) are critical. VA-Partial (masking only fingers) performs poorly, indicating full-hand augmentation is necessary.
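One straightforward way to realize full-hand visual augmentation is to mask and inpaint the entire hand region given 2D hand keypoints; the OpenCV sketch below is our own illustration, and the convex-hull mask, dilation radius, and inpainting method are assumptions rather than the paper's exact pipeline.

```python
import cv2
import numpy as np

def mask_full_hand(frame, hand_keypoints_2d, pad_px=15):
    """Remove the whole hand from an RGB frame (VA-Full style).

    frame: HxWx3 uint8 image; hand_keypoints_2d: Nx2 pixel coordinates.
    Returns the frame with the hand region inpainted from the background."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)

    # Cover the full hand with the convex hull of all keypoints,
    # then dilate so the wrist boundary is removed as well.
    hull = cv2.convexHull(hand_keypoints_2d.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    kernel = np.ones((2 * pad_px + 1, 2 * pad_px + 1), dtype=np.uint8)
    mask = cv2.dilate(mask, kernel)

    # Fill the masked region from surrounding background pixels.
    return cv2.inpaint(frame, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
```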
Independent action heads prevent interference between human/robot data. Combining EasyMimic with pre-training yields the best results.
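As an architectural illustration of what independent action heads might look like, the minimal PyTorch sketch below routes human-video and robot samples through separate output heads on a shared backbone; the dimensions and module structure are illustrative only, not the model actually used.

```python
import torch
import torch.nn as nn

class DualHeadPolicy(nn.Module):
    """Shared backbone with a separate action head per data source,
    so human-video and robot supervision do not interfere directly."""
    def __init__(self, obs_dim=512, hidden_dim=256, action_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # action_dim = 6-DoF end-effector pose + 1 gripper state
        self.human_head = nn.Linear(hidden_dim, action_dim)
        self.robot_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_features, source="robot"):
        h = self.backbone(obs_features)
        head = self.robot_head if source == "robot" else self.human_head
        return head(h)
```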
Figure 5: Failure Modes. (a) Premature release, (b) Imprecise grasp, (c) Collision, (d) Unstable placement.
Figure 6: Visual Alignment. Comparison of original human hand, partial masking, and full masking (VA-Full).
Tested on unseen objects (Green Duck, Pink Cube) after training only on Pink Duck.
@inproceedings{zhang2026easymimic,
title={EasyMimic: A Low-Cost Robot Imitation Learning Framework from Human Videos},
author={Tao Zhang and Song Xia and Ye Wang and Qin Jin},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2026}
}