Data Collection

Egocentric Data Collection for Humanoid Robot Training

Open egocentric datasets give you 2D video with no depth, no pose, no tactile signal. Humanoid training requires all three. How do you build a multimodal setup that captures what open data structurally cannot?
This case study describes how to build a multimodal data collection setup that captures real 3D spatial data, the body skeleton, and tactile information while keeping the pipeline scalable.

The Problem

The client, a humanoid robot developer, needed training data that captures real first-person human interaction with objects. Existing open datasets did not meet this requirement: 2D video without depth information or camera pose is suitable only for pretraining, not for final model fine-tuning.

The primary objective was collecting first-person video with depth, camera intrinsics, and extrinsics — the data consistently missing from open datasets.

Solution

Hardware Setup

Existing egocentric datasets are flat 2D video: no depth, no camera pose, often no metadata. They work for pretraining but not for final training. The setup needed to solve this problem and remain scalable.

The foundation is the Pico 4 Ultra, a VR headset with a stereo camera. It provides what standard egocentric capture cannot:

a depth map and metric distance to objects, rather than a flat image
camera position in space for every frame
lens intrinsic parameters, without which full scene reconstruction is not possible
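
To illustrate why these three items matter together, here is a minimal sketch of back-projecting a single depth pixel into world coordinates with a standard pinhole camera model. It is not the project's actual code: the function names, the intrinsics values, and the camera-to-world pose convention are assumptions made for illustration, and the headset's real calibration format is not described here.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pixel (u, v) with metric depth -> 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def camera_to_world(p_cam: Vec3, rotation: Sequence[Sequence[float]],
                    translation: Vec3) -> Vec3:
    """Apply an assumed camera-to-world pose: p_world = R * p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical values, for illustration only
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0          # intrinsics in pixels
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
headset_position = (0.0, 1.6, 0.0)                   # metres

p_cam = backproject(420.0, 240.0, 2.0, fx, fy, cx, cy)
p_world = camera_to_world(p_cam, identity, headset_position)
print(p_cam, p_world)
```

Without depth, the pixel only defines a ray; without intrinsics, the ray direction is unknown; without the per-frame pose, points from different frames cannot be placed in one shared scene.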

The setup is fully autonomous — no power connection or external computer required. Without that autonomy, half of the field scenarios become impractical from the start.

Body and Hand Capture

First-person video carries no information about what the body is doing. For humanoids this is critical: the torso and legs influence how a person ultimately grasps an object. To capture the full skeleton, motion trackers are connected to the headset.

Tracker placement:

one on each hand and foot
one on the waist
algorithms build a full-body skeleton in real time from six tracked points: the five trackers plus, presumably, the headset itself

Hand tracking is a harder problem: the built-in AI palm tracking works in real time but loses accuracy under occlusion, when an object blocks part of the hand. To compensate, an additional monocular camera is mounted on each wrist. The final setup has three cameras in total: the stereo camera on the headset and two monocular cameras on the wrists.
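
One way to make use of the extra wrist footage is to flag the frames where headset hand tracking is unreliable, so those segments can be reviewed against the wrist cameras. The sketch below assumes the tracker reports a per-frame confidence score between 0 and 1; that score, the threshold, and the function name are hypothetical and not taken from the project.

```python
from typing import List, Optional, Tuple


def low_confidence_spans(confidence: List[float],
                         threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame-index ranges where confidence < threshold."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, c in enumerate(confidence):
        if c < threshold and start is None:
            start = i                      # a low-confidence run begins
        elif c >= threshold and start is not None:
            spans.append((start, i - 1))   # the run ended on the previous frame
            start = None
    if start is not None:                  # the run extends to the last frame
        spans.append((start, len(confidence) - 1))
    return spans


# Hypothetical per-frame confidence from headset hand tracking
conf = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]
occluded = low_confidence_spans(conf)
print(occluded)
```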

Tactile Layer

Visual and kinematic signal alone is not sufficient. Real robots have tactile sensors on their manipulators, and training data needs to reflect that.

Problems that cannot be addressed without tactile data:

Heterogeneous objects with an off-center mass are visually indistinguishable from uniform ones
Load distribution across grip points differs when handling such objects — the model needs this information
Surface texture and stiffness determine the grasping strategy but are not visible in video

Tactile gloves measuring contact-point pressure are included in the setup. This is the data layer teams typically add after the fact, when the model is already failing on real objects. It is built in from the start here.
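
As an illustration of what contact-point pressure adds, the sketch below computes the pressure-weighted center of the grip and each contact's share of the load. The sensor names, their positions, and the units are hypothetical; the actual glove layout and output format are not described here.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def center_of_pressure(positions: Dict[str, Point],
                       pressures: Dict[str, float]) -> Point:
    """Pressure-weighted mean of the contact positions (same units as positions)."""
    total = sum(pressures.values())
    if total <= 0:
        raise ValueError("no contact pressure recorded")
    x = sum(positions[k][0] * p for k, p in pressures.items()) / total
    y = sum(positions[k][1] * p for k, p in pressures.items()) / total
    return (x, y)


def load_share(pressures: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the total measured pressure carried by each contact point."""
    total = sum(pressures.values())
    if total <= 0:
        raise ValueError("no contact pressure recorded")
    return {k: p / total for k, p in pressures.items()}


# Hypothetical contact layout (cm) and readings (arbitrary pressure units)
positions = {"thumb": (0.0, 0.0), "index": (4.0, 1.0), "middle": (4.0, -1.0)}
uniform_object = {"thumb": 10.0, "index": 10.0, "middle": 10.0}
off_center_object = {"thumb": 30.0, "index": 10.0, "middle": 10.0}

cop_uniform = center_of_pressure(positions, uniform_object)
cop_off_center = center_of_pressure(positions, off_center_object)
shares = load_share(off_center_object)
print(cop_uniform, cop_off_center, shares)
```

In this toy example the two grasps would look identical on video, but the center of pressure shifts toward the thumb for the off-center object, which is the kind of signal the tactile layer is meant to record.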

Phase	Input	Scope of Work	Quality Control
Setup Selection & Configuration	Client requirements, target interaction scenarios	Hardware selection, stereo camera calibration, tracker synchronization	Depth map accuracy, intrinsic and extrinsic correctness
Motion Tracking Setup	Motion trackers, Pico 4 Ultra headset	Tracker placement on joints, skeleton construction configuration, palm tracking calibration	Skeleton stability, hand tracking quality under occlusion
Tactile Layer Integration	Tactile gloves, objects of varying mass and texture	Pressure sensor integration into the setup, synchronization with video and kinematic streams	Pressure map accuracy, timestamp synchronization
Pilot Collection	Test recording sessions	Full stream verification in combination: video, depth, skeleton, tactile	Identification of systematic tracking errors, scenario coverage check
Main Data Collection	Actors, target objects and spaces	Recording manipulation scenarios: grasping, transfer, interaction with heterogeneous objects	Stream completeness, tracking stability across the session
Validation & Annotation	Raw recordings from all streams	Timestamp synchronization check, tracking failure filtering, dataset packaging	Stream consistency, absence of artifacts in depth map and skeleton
Final Delivery	Validated multi-stream dataset	Packaging with format documentation, handoff to client	Compatibility with client training pipeline, metadata completeness
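
The timestamp synchronization checks in the table can be sketched as nearest-timestamp matching with a tolerance. This is a minimal illustration under assumed conditions, not the project's validation code: the function names, the millisecond timestamps, and the 5 ms tolerance are all hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_index(sorted_ts: List[int], t: int) -> int:
    """Index of the timestamp in sorted_ts closest to t (sorted_ts must be non-empty)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    # Choose the closer of the two neighbouring samples
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1


def align(reference_ts: List[int], other_ts: List[int],
          tolerance: int) -> List[Optional[int]]:
    """For each reference timestamp, the index of the nearest sample in other_ts,
    or None when no sample lies within the tolerance."""
    aligned: List[Optional[int]] = []
    for t in reference_ts:
        j = nearest_index(other_ts, t)
        aligned.append(j if abs(other_ts[j] - t) <= tolerance else None)
    return aligned


# Hypothetical timestamps in milliseconds
video_ms = [0, 33, 66, 100]
tactile_ms = [1, 11, 21, 31, 41, 90]

matches = align(video_ms, tactile_ms, tolerance=5)
print(matches)
```

Frames that come back as None have no tactile sample close enough in time and would be candidates for filtering during validation.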

Weeks 1–2: Setup Design
Week 3: Pilot Collection
Weeks 4–6: Main Data Collection
Week 7: Validation & Delivery

The Results

A reusable collection setup suitable for field conditions without a fixed power source
A synchronized multi-stream dataset with 3D coordinates, depth, and tactile data
Full-body skeleton and hand tracking in a format compatible with simulation environments
Documented format structure for integration into the client’s training pipeline

"Tactile data is what teams usually think of at the end, when the model is already failing on real objects. We decided to build this layer in from the start, because collecting it afterward is a separate project, no less complex than the first."

Martinian Letunovsky: Head of IT Operations
