Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes.
Figure 1: Overview of our proposed low-cost framework. It efficiently enables robot imitation learning from human videos with minimal computational resources.
Demo video showcasing EasyMimic performing various manipulation tasks.
Our method first extracts 3D hand trajectories from the RGB videos. The physical alignment module then maps these trajectories to the gripper control space of the robot, ensuring precise motion transfer from human demonstrations to robot execution.
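To make the mapping concrete, here is a minimal per-frame sketch of how such an alignment could be implemented, assuming 3D hand keypoints (wrist, thumb tip, index tip) in the camera frame are already available from an off-the-shelf hand pose estimator; the function name, the camera-to-base calibration matrix, and the maximum gripper opening are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def align_hand_to_gripper(wrist, thumb_tip, index_tip, T_cam_to_base,
                          max_opening_m=0.08):
    """Map one frame of 3D hand keypoints (camera frame, meters) to a
    gripper target in the robot base frame. Names and the calibration
    matrix are illustrative, not the paper's actual interface."""
    # Pinch point: midpoint between thumb and index fingertips.
    p_cam = 0.5 * (thumb_tip + index_tip)
    p_base = (T_cam_to_base @ np.append(p_cam, 1.0))[:3]   # 4x4 extrinsics

    # Approach direction: wrist -> pinch point, rotated into the base frame.
    approach_cam = p_cam - wrist
    approach_cam = approach_cam / (np.linalg.norm(approach_cam) + 1e-8)
    approach_base = T_cam_to_base[:3, :3] @ approach_cam

    # Gripper opening: fingertip distance normalized to [0, 1].
    opening = np.linalg.norm(thumb_tip - index_tip) / max_opening_m
    opening = float(np.clip(opening, 0.0, 1.0))

    return p_base, approach_base, opening
```

The fingertip distance acts as a proxy for the opening of a parallel-jaw gripper, which is one simple way to derive a two-finger gripper command from a five-fingered human hand.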
Figure 2: Physical Alignment. The physical alignment module maps extracted human hand trajectories to the robot's gripper control space.
Figure 3: Co-Training Strategy. Overview of the co-training process leveraging both human and robot data for robust policy learning.
We employ a co-training strategy that fine-tunes the policy on both the processed human data and a small amount of robot data. This enables the robot to rapidly adapt to new tasks and generalize effectively across different manipulation scenarios.
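As a rough sketch of how such co-training could be wired up in PyTorch, the snippet below mixes the two data sources within each batch; `human_dataset`, `robot_dataset`, and the 50/50 mixing ratio are placeholders and assumptions rather than the paper's reported configuration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_cotraining_loader(human_dataset, robot_dataset, batch_size=32,
                           human_ratio=0.5):
    """Build one loader that mixes human and robot samples at a fixed
    ratio in expectation, regardless of the two dataset sizes."""
    mixed = ConcatDataset([human_dataset, robot_dataset])

    # Per-sample weights so each source contributes `human_ratio` and
    # `1 - human_ratio` of the draws (human samples come first in `mixed`).
    w_human = human_ratio / len(human_dataset)
    w_robot = (1.0 - human_ratio) / len(robot_dataset)
    weights = torch.tensor([w_human] * len(human_dataset) +
                           [w_robot] * len(robot_dataset))

    sampler = WeightedRandomSampler(weights, num_samples=len(mixed),
                                    replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)
```

Weighting by source rather than simply concatenating the datasets prevents the much larger human set from dominating every batch.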
Our experimental platform uses a 6-DoF so100-plus robotic arm equipped with a two-finger gripper. The vision system includes two monocular RGB cameras: one fixed above the robot's base for a top-down global view, and another mounted on the wrist for an end-effector-centric first-person view. We evaluate our method on four tabletop manipulation tasks:
- **Pick:** The robot picks up a toy duck and places it into a bowl. Success (1.0) = grasping (0.5) + placing (0.5).
- **Pull:** The robot pulls open a drawer and then pushes it closed. Success (1.0) = pulling (0.5) + pushing (0.5).
- **Stack:** The robot stacks a cube on a block, then a pyramid on the cube. Each successful stack awards 0.5 points.
- **LC (Language-Conditioned):** The robot executes natural language instructions specifying the target object and goal (e.g., "pick up the pink duck...").
We use the pre-trained Gr00T N1.5-3B foundation VLA model. The policy network outputs absolute actions (6-DoF end-effector poses + gripper states). Training: 5,000 gradient steps, AdamW optimizer (lr=1e-4), batch size 32, on a single NVIDIA RTX 4090 GPU.
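This recipe can be reproduced with a standard PyTorch loop; the sketch below is a generic outline in which `policy` stands in for the fine-tuned VLA model, `loader` for the co-training dataloader, and a simple regression loss for the model's actual action objective (all assumptions, not the GR00T N1.5 API).

```python
import torch

def finetune(policy, loader, steps=5_000, lr=1e-4, device="cuda"):
    """Generic fine-tuning loop matching the reported hyperparameters
    (5,000 gradient steps, AdamW, lr=1e-4; batch size set by the loader)."""
    policy.to(device).train()
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)

    data_iter = iter(loader)
    for step in range(steps):
        try:
            batch = next(data_iter)
        except StopIteration:              # restart the loader between epochs
            data_iter = iter(loader)
            batch = next(data_iter)

        obs = batch["observation"].to(device)
        actions = batch["action"].to(device)   # absolute EE pose + gripper state

        pred = policy(obs)
        loss = torch.nn.functional.mse_loss(pred, actions)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```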
For each task, we collect 100 human video demonstrations and 20 robot teleoperation trajectories. Human data can be collected far faster (up to 12.5 demos/min) than robot teleoperation data (at most 2 demos/min), highlighting the efficiency of video-based demonstration collection.
| Strategy | Pick | Pull | Stack | LC | Avg. |
|---|---|---|---|---|---|
| Robot-Only (10 traj.) | 0.30 | 0.30 | 0.15 | 0.30 | 0.26 |
| Robot-Only (20 traj.) | 0.60 | 0.70 | 0.35 | 0.40 | 0.51 |
| Pretrain-Finetune | 0.80 | 0.90 | 0.50 | 0.80 | 0.75 |
| EasyMimic (Ours) | 1.00 | 0.90 | 0.70 | 0.90 | 0.88 |
Table 2: Performance evaluation. EasyMimic achieves the best performance (0.88 avg), surpassing baselines significantly while maintaining data efficiency.
Comparison Analysis: Training on scarce robot data alone (10-20 trajectories) yields limited performance (0.26-0.51 average success). Incorporating human data boosts performance significantly: EasyMimic outperforms Pretrain-Finetune by 0.13 points and Robot-Only (10 traj.) by 0.62 points, showing that co-training effectively exploits the complementary strengths of human and robot data.
On language-conditioned tasks involving distinct object-goal combinations (e.g., pink/green duck into bowl/block), EasyMimic achieves a success rate of 0.90, compared to just 0.40 for the robot-only baseline, demonstrating the substantial value of human demonstrations for complex instruction following.
Figure 4(a): Varying human data (fixed 10 robot traj).
Figure 4(b): Varying robot data (fixed 50 human videos).
Increasing the number of human demonstrations consistently improves performance, with the largest gains observed up to roughly 50 demos. For robot data, performance saturates quickly beyond 10-20 trajectories when complemented by human data. This confirms that high performance can be achieved with minimal expensive robot data as long as sufficient human data is available.
Both Action Alignment (AA) and Visual Alignment (VA) are critical. VA-Partial (masking only fingers) performs poorly, indicating full-hand augmentation is necessary.
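One straightforward way to realize full-hand visual augmentation is to mask and inpaint the entire hand region given 2D hand keypoints; the OpenCV sketch below is our own illustration, and the convex-hull mask, dilation radius, and inpainting method are assumptions rather than the paper's exact pipeline.

```python
import cv2
import numpy as np

def mask_full_hand(frame, hand_keypoints_2d, pad_px=15):
    """Remove the whole hand from an RGB frame (VA-Full style).

    frame: HxWx3 uint8 image; hand_keypoints_2d: Nx2 pixel coordinates.
    Returns the frame with the hand region inpainted from the background."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)

    # Cover the full hand with the convex hull of all keypoints,
    # then dilate so the wrist boundary is removed as well.
    hull = cv2.convexHull(hand_keypoints_2d.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    kernel = np.ones((2 * pad_px + 1, 2 * pad_px + 1), dtype=np.uint8)
    mask = cv2.dilate(mask, kernel)

    # Fill the masked region from surrounding background pixels.
    return cv2.inpaint(frame, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
```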
Independent action heads prevent interference between human/robot data. Combining EasyMimic with pre-training yields the best results.
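As an architectural illustration of what independent action heads might look like, the minimal PyTorch sketch below routes human-video and robot samples through separate output heads on a shared backbone; the dimensions and module structure are illustrative only, not the model actually used.

```python
import torch
import torch.nn as nn

class DualHeadPolicy(nn.Module):
    """Shared backbone with a separate action head per data source,
    so human-video and robot supervision do not interfere directly."""
    def __init__(self, obs_dim=512, hidden_dim=256, action_dim=7):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # action_dim = 6-DoF end-effector pose + 1 gripper state
        self.human_head = nn.Linear(hidden_dim, action_dim)
        self.robot_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs_features, source="robot"):
        h = self.backbone(obs_features)
        head = self.robot_head if source == "robot" else self.human_head
        return head(h)
```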
Figure 5: Failure Modes. (a) Premature release, (b) Imprecise grasp, (c) Collision, (d) Unstable placement.
Figure 6: Visual Alignment. Comparison of original human hand, partial masking, and full masking (VA-Full).
Tested on unseen objects (Green Duck, Pink Cube) after training only on Pink Duck.
@inproceedings{zhang2026easymimic,
title={EasyMimic: A Low-Cost Robot Imitation Learning Framework from Human Videos},
author={Tao Zhang and Song Xia and Ye Wang and Qin Jin},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2026}
}