EasyMimic: A Low-Cost Robot Imitation Learning Framework from Human Videos

Accepted to ICRA 2026
1 AIM3 Lab, Renmin University of China
*Equal contribution, Project lead, Corresponding author

Abstract

Robot imitation learning is often hindered by the high cost of collecting large-scale, real-world data. This challenge is especially significant for low-cost robots designed for home use, as they must be both user-friendly and affordable. To address this, we propose the EasyMimic framework, a low-cost and replicable solution that enables robots to quickly learn manipulation policies from human video demonstrations captured with standard RGB cameras. Our method first extracts 3D hand trajectories from the videos. An action alignment module then maps these trajectories to the gripper control space of a low-cost robot. To bridge the human-to-robot domain gap, we introduce a simple and user-friendly hand visual augmentation strategy. We then use a co-training method, fine-tuning a model on both the processed human data and a small amount of robot data, enabling rapid adaptation to new tasks. Experiments on the low-cost LeRobot platform demonstrate that EasyMimic achieves high performance across various manipulation tasks. It significantly reduces the reliance on expensive robot data collection, offering a practical path for bringing intelligent robots into homes.

Low-Cost Framework Overview


Figure 1: Overview of our proposed low-cost framework. It efficiently enables robot imitation learning from human videos with minimal computational resources.

Demo Video

Demo video showcasing EasyMimic performing various manipulation tasks.

Methodology

Part 1

Physical Alignment

Our method first extracts 3D hand trajectories from the RGB videos. The physical alignment module then maps these trajectories to the gripper control space of the robot, ensuring precise motion transfer from human demonstrations to robot execution.
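As a concrete illustration, the per-frame mapping could look like the following sketch, assuming 21-keypoint 3D hand poses (MediaPipe-style indexing) and a calibrated camera-to-robot extrinsic transform. All names and constants here are illustrative, not the paper's implementation; end-effector orientation is omitted for brevity.

import numpy as np

# Hypothetical per-frame mapping from 3D hand keypoints to a gripper
# action. Keypoint indexing follows the common 21-point convention
# (0 = wrist, 4 = thumb tip, 8 = index fingertip).

def hand_to_gripper_action(keypoints_3d, T_cam_to_robot, max_aperture=0.08):
    """keypoints_3d: (21, 3) hand keypoints in the camera frame.
    T_cam_to_robot: (4, 4) homogeneous camera-to-robot-base transform
    from a one-time extrinsic calibration.
    max_aperture: pinch distance (m) treated as a fully open gripper."""
    wrist = keypoints_3d[0]
    thumb_tip = keypoints_3d[4]
    index_tip = keypoints_3d[8]

    # End-effector position: the wrist, expressed in the robot base frame.
    wrist_h = np.append(wrist, 1.0)
    ee_pos = (T_cam_to_robot @ wrist_h)[:3]

    # Gripper command: normalize the thumb-index pinch distance to [0, 1].
    pinch = np.linalg.norm(thumb_tip - index_tip)
    gripper = np.clip(pinch / max_aperture, 0.0, 1.0)

    return ee_pos, gripper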


Figure 2: Physical Alignment. The physical alignment module maps extracted human hand trajectories to the robot's gripper control space.

Part 2

Co-Training Strategy


Figure 3: Co-Training Strategy. Overview of the co-training process leveraging both human and robot data for robust policy learning.

We employ a co-training strategy that fine-tunes the policy on both the processed human data and a small amount of robot data. This enables the robot to rapidly adapt to new tasks and generalize effectively across different manipulation scenarios.
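A minimal sketch of such a mixed data loader in PyTorch is shown below; the balanced 50/50 sampling between the two sources is our assumption for illustration, not the paper's reported mixing ratio.

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

# Illustrative co-training loader: mix processed human demonstrations
# with a small set of robot trajectories so that both sources appear
# roughly equally often per batch, despite their different sizes.

def make_cotrain_loader(human_ds, robot_ds, batch_size=32):
    combined = ConcatDataset([human_ds, robot_ds])

    # Inverse-frequency weights balance the two sources.
    w_human = 1.0 / len(human_ds)
    w_robot = 1.0 / len(robot_ds)
    weights = torch.tensor([w_human] * len(human_ds) +
                           [w_robot] * len(robot_ds))

    sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)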

Experimental Results

In this section, we introduce the manipulation tasks and experimental setups. We then present a comparison with baseline methods and provide a detailed analysis to validate the effectiveness of our approach.

1. Experimental Setup

Hardware and Tasks

Our experimental platform uses a 6-DoF so100-plus robotic arm equipped with a two-finger gripper. The vision system includes two monocular RGB cameras: one fixed above the robot's base for a top-down global view, and another mounted on the wrist for an end-effector-centric first-person view. We evaluate our method on four tabletop manipulation tasks, each scored with partial credit (a scoring sketch follows the list):

  • Pick and Place (Pick):

    The robot picks up a toy duck and places it into a bowl. Success (1.0) = grasping (0.5) + placing (0.5).

  • Pull and Push (Pull):

    The robot pulls open a drawer and pushes it closed. Success (1.0) = pulling (0.5) + pushing (0.5).

  • Stacking (Stack):

    The robot stacks a cube on a block, then a pyramid on the cube. Each successful stack awards 0.5.

  • Language Conditioned (LC):

    The robot executes natural-language instructions specifying the target object and goal (e.g., "pick up the pink duck...").
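The partial-credit rubric above can be expressed as a small scoring helper. The subgoal labels below are hypothetical names for the checks described in each task; the LC task is scored analogously per instruction.

# Illustrative partial-credit scorer: each task is a sequence of
# subgoals worth 0.5 each. Detecting whether a subgoal was achieved
# (grasp/place checks) is assumed to be done by the experimenter.

TASK_SUBGOALS = {
    "pick":  ["grasp", "place"],
    "pull":  ["pull_open", "push_closed"],
    "stack": ["cube_on_block", "pyramid_on_cube"],
}

def score_episode(task, achieved):
    """Return the episode score in [0, 1] given the set of achieved subgoals."""
    subgoals = TASK_SUBGOALS[task]
    return sum(0.5 for g in subgoals if g in achieved)

# e.g., score_episode("pick", {"grasp"}) -> 0.5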

Model & Training

We use the pre-trained Gr00T N1.5-3B vision-language-action (VLA) foundation model. The policy network outputs absolute actions (6-DoF end-effector poses plus gripper states). We fine-tune for 5,000 gradient steps with the AdamW optimizer (lr = 1e-4) and a batch size of 32 on a single NVIDIA RTX 4090 GPU.
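A minimal sketch of a fine-tuning loop under these hyperparameters follows; `policy.loss` and the batch format are placeholders, not the actual Gr00T N1.5 interfaces.

import torch
from torch.optim import AdamW

# Hypothetical fine-tuning loop matching the reported hyperparameters
# (5,000 steps, AdamW, lr=1e-4, batch size 32). `policy` and `loader`
# are placeholders for the actual model and co-training data loader.

def finetune(policy, loader, steps=5_000, lr=1e-4, device="cuda"):
    policy.to(device).train()
    opt = AdamW(policy.parameters(), lr=lr)
    it = iter(loader)
    for step in range(steps):
        try:
            batch = next(it)
        except StopIteration:  # restart the loader between epochs
            it = iter(loader)
            batch = next(it)
        obs = batch["obs"].to(device)
        actions = batch["actions"].to(device)
        loss = policy.loss(obs, actions)  # placeholder behavior-cloning loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy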

Data Collection

For each task, we collect 100 human video demonstrations and 20 robot teleoperation trajectories. Human data collection is substantially faster (up to 12.5 demos/min, versus at most 2 demos/min for teleoperation): the 100 human demos take roughly 8 minutes of demonstration time, while the 20 robot trajectories take at least 10 minutes.

2. Main Results

Comparison of Training Strategies

Strategy                 Pick   Pull   Stack   LC     Avg.
Robot-Only (10 traj.)    0.30   0.30   0.15    0.30   0.26
Robot-Only (20 traj.)    0.60   0.70   0.35    0.40   0.51
Pretrain-Finetune        0.80   0.90   0.50    0.80   0.75
EasyMimic (Ours)         1.00   0.90   0.70    0.90   0.88

Table 2: Performance evaluation. EasyMimic achieves the best performance (0.88 avg), surpassing baselines significantly while maintaining data efficiency.

Comparison Analysis: Training with scarce robot data (10-20 trajectories) yields limited performance (0.26-0.51). Incorporating human data boosts performance significantly. EasyMimic outperforms Pretrain-Finetune by 0.13 points and Robot-Only (10 traj) by 0.62 points, showing that co-training effectively leverages both human and robot data strengths.

Language Condition Performance

In language-conditioned tasks involving distinct object-goal combinations (e.g., pink/green duck into bowl/block), EasyMimic achieves a success rate of 0.90, compared to just 0.40 for the robot-only baseline. This demonstrates substantial value in utilizing human demonstrations for complex instruction following.

3. Further Analysis

Effect of Data Scale


Figure 4(a): Varying human data (fixed 10 robot traj).


Figure 4(b): Varying robot data (fixed 50 human videos).

Increasing the number of human demonstrations consistently improves performance, with gains largely saturating around 50 demos. For robot data, performance saturates quickly beyond 10-20 trajectories once complemented by human data. This confirms that high performance can be achieved with minimal expensive robot data when sufficient human data is available.

Ablation: Alignment Strategies

  • EasyMimic (Full) 0.87
  • w/o Action Alignment (AA) 0.60 (-0.27)
  • w/o Visual Alignment (VA) 0.40 (-0.47)
  • VA-Partial Masking 0.27 (-0.60)

Both Action Alignment (AA) and Visual Alignment (VA) are critical. VA-Partial (masking only the fingers) performs even worse than removing VA entirely, indicating that full-hand augmentation is necessary.
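For intuition, full-hand visual alignment could be approximated as below, assuming 2D hand keypoints are available per frame; the convex-hull fill and flat-gray overlay are our illustrative choices, not necessarily the paper's exact augmentation.

import cv2
import numpy as np

# Illustrative full-hand masking: cover the entire hand region in a
# human demo frame so the policy does not latch onto hand appearance.

def mask_full_hand(frame, keypoints_2d, dilate_px=15, color=(127, 127, 127)):
    """frame: HxWx3 BGR image; keypoints_2d: (21, 2) pixel coordinates."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(keypoints_2d.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)

    # Dilate so the mask also covers skin pixels just outside the hull.
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    mask = cv2.dilate(mask, kernel)

    out = frame.copy()
    out[mask > 0] = color
    return out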

Ablation: Pretraining & Heads

  • EasyMimic (Indep. Heads) 0.87
  • Shared Action Head 0.47 (-0.40)
  • w/o Pretraining (EasyMimic) 0.53
  • w/o Pretraining (Robot-Only) 0.15

Independent action heads prevent interference between human/robot data. Combining EasyMimic with pre-training yields the best results.
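One way to realize independent heads is sketched below in PyTorch; the shared trunk features, layer sizes, and per-sample embodiment flag are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

# Sketch of embodiment-specific action heads: shared trunk features are
# routed to one of two small MLP heads by a per-sample embodiment tag,
# so gradients from human and robot data do not interfere in the heads.

class DualHeadPolicy(nn.Module):
    def __init__(self, feat_dim=512, action_dim=7):
        super().__init__()
        self.head_human = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
        self.head_robot = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, features, is_robot):
        """features: (B, feat_dim); is_robot: (B,) bool tensor."""
        out_h = self.head_human(features)
        out_r = self.head_robot(features)
        # Route each sample through its own embodiment's head.
        return torch.where(is_robot.unsqueeze(-1), out_r, out_h)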

Case Analysis & Visual Alignment


Figure 5: Failure Modes. (a) Premature release, (b) Imprecise grasp, (c) Collision, (d) Unstable placement.


Figure 6: Visual Alignment. Comparison of original human hand, partial masking, and full masking (VA-Full).

Zero-Shot Generalization

We test generalization to unseen objects (green duck, pink cube) after training only on the pink duck.

Method        Avg. Score (unseen objects)
Robot-Only    0.35
EasyMimic     0.65

Citation

@inproceedings{zhang2026easymimic,
  title     = {EasyMimic: A Low-Cost Robot Imitation Learning Framework from Human Videos},
  author    = {Tao Zhang and Song Xia and Ye Wang and Qin Jin},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}