The Problem
The client, a humanoid robot developer, needed training data that captures real first-person human interaction with objects. Existing open datasets did not meet this requirement: 2D video without depth information or camera pose is suitable only for pretraining, not for final model fine-tuning.
The primary objective was to collect first-person video with depth, camera intrinsics, and camera extrinsics: the data consistently missing from open datasets.
The Solution
Hardware Setup
Existing egocentric datasets are flat 2D video: no depth, no camera pose, and often no metadata. They work for pretraining but not for final fine-tuning. The capture setup had to close that gap and remain scalable.
The foundation is the Pico 4 Ultra, a VR headset with a stereo camera. It provides what standard egocentric capture cannot:
- a depth map and metric distance to objects, rather than a flat image
- camera position in space for every frame
- lens intrinsic parameters, without which full scene reconstruction is not possible
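To illustrate why all three items matter together, here is a minimal sketch of how a single depth pixel can be lifted to a 3D point in world coordinates under a standard pinhole camera model. The `Intrinsics` type, the function names, and the numeric values are hypothetical and are not taken from the Pico 4 Ultra SDK. The pose is assumed to be given as a camera-to-world rotation matrix plus the camera position.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics (hypothetical container, not an SDK type)."""
    fx: float  # focal length along x, in pixels
    fy: float  # focal length along y, in pixels
    cx: float  # principal point x, in pixels
    cy: float  # principal point y, in pixels


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with metric depth to a point in the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def camera_to_world(
    p_cam: Vec3,
    rotation: Sequence[Sequence[float]],  # 3x3 camera-to-world rotation
    translation: Vec3,                    # camera position in the world frame
) -> Vec3:
    """Apply the per-frame extrinsics: p_world = R * p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Illustrative values only.
k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

p_cam = backproject(u=420.0, v=240.0, depth_m=2.0, k=k)
p_world = camera_to_world(p_cam, identity, (1.0, 0.0, 0.5))
print(p_cam, p_world)
```

Without depth the pixel only defines a ray, without intrinsics the ray direction is unknown, and without the per-frame pose the point cannot be placed in a shared scene.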
The setup is fully autonomous: it needs no power connection or external computer. Without that autonomy, many field scenarios would be impractical from the start.
Body and Hand Capture
First-person video carries no information about what the body is doing. For humanoids this is critical: the torso and legs influence how a person ultimately grasps an object. To capture the full skeleton, motion trackers are connected to the headset.
Tracker placement:
- one on each hand and foot
- one on the waist
Algorithms build a full-body skeleton in real time from six points: the five trackers and the headset.
Hand tracking is a harder problem: the built-in AI palm tracking works in real time but loses accuracy under occlusion, when an object blocks part of the hand. To compensate, an additional monocular camera is mounted on each wrist. The final setup has three cameras in total: the stereo camera on the headset and two monocular cameras on the wrists.
Tactile Layer
Visual and kinematic signals alone are not sufficient. Real robots have tactile sensors on their manipulators, and training data needs to reflect that.
Problems that cannot be addressed without tactile data:
- Heterogeneous objects with an off-center mass are visually indistinguishable from uniform ones
- Load distribution across grip points differs when handling such objects, and the model needs this information
- Surface texture and stiffness determine the grasping strategy but are not visible in video
Tactile gloves that measure contact-point pressure are included in the setup. This is the data layer teams typically add after the fact, once the model is already failing on real objects. Here it is built in from the start.
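As a rough illustration of what contact-point pressure adds, the sketch below computes two quantities that video cannot show: the share of total load carried by each contact point, and the force-weighted center of pressure. The sensor names, contact positions, and readings are hypothetical and do not describe the actual glove output format.

```python
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def load_shares(readings: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the total measured load carried by each contact point."""
    total = sum(readings.values())
    if total <= 0:
        # No contact at all: every share is zero.
        return {name: 0.0 for name in readings}
    return {name: value / total for name, value in readings.items()}


def center_of_pressure(points: Sequence[Vec3], forces: Sequence[float]) -> Vec3:
    """Force-weighted mean of the contact positions."""
    if len(points) != len(forces):
        raise ValueError("points and forces must have the same length")
    total = sum(forces)
    if total <= 0:
        raise ValueError("no contact force measured")
    return tuple(
        sum(f * p[i] for p, f in zip(points, forces)) / total
        for i in range(3)
    )


# Hypothetical readings for a three-finger grasp (arbitrary units).
shares = load_shares({"thumb": 2.0, "index": 1.0, "middle": 1.0})
cop = center_of_pressure([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 3.0])
print(shares, cop)
```

For an object with off-center mass, the shares shift toward the fingers nearest the heavy side even when the grasp looks identical on video.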
| Phase | Input | Scope of Work | Quality Control |
|---|---|---|---|
| Setup Selection & Configuration | Client requirements, target interaction scenarios | Hardware selection, stereo camera calibration, tracker synchronization | Depth map accuracy, intrinsic and extrinsic correctness |
| Motion Tracking Setup | Motion trackers, Pico 4 Ultra headset | Tracker placement on joints, skeleton construction configuration, palm tracking calibration | Skeleton stability, hand tracking quality under occlusion |
| Tactile Layer Integration | Tactile gloves, objects of varying mass and texture | Pressure sensor integration into the setup, synchronization with video and kinematic streams | Pressure map accuracy, timestamp synchronization |
| Pilot Collection | Test recording sessions | Full stream verification in combination: video, depth, skeleton, tactile | Identification of systematic tracking errors, scenario coverage check |
| Main Data Collection | Actors, target objects and spaces | Recording manipulation scenarios: grasping, transfer, interaction with heterogeneous objects | Stream completeness, tracking stability across the session |
| Validation & Annotation | Raw recordings from all streams | Timestamp synchronization check, tracking failure filtering, dataset packaging | Stream consistency, absence of artifacts in depth map and skeleton |
| Final Delivery | Validated multi-stream dataset | Packaging with format documentation, handoff to client | Compatibility with client training pipeline, metadata completeness |
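The timestamp synchronization checks listed in the table can be sketched as a nearest-neighbor match between streams. This is a minimal example under stated assumptions, not the pipeline used in the project: the function names, the tolerance, and the sample timestamps are hypothetical, and each stream is assumed to carry sorted timestamps in seconds on a shared clock.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the entry in a sorted, non-empty sequence closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose whichever neighbor is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(
    reference: Sequence[float],
    other: Sequence[float],
    tolerance_s: float,
) -> List[Optional[int]]:
    """For each reference timestamp, the matching index in `other`, or None
    when the closest sample lies outside the tolerance."""
    matches: List[Optional[int]] = []
    for t in reference:
        if not other:
            matches.append(None)
            continue
        j = nearest_index(other, t)
        matches.append(j if abs(other[j] - t) <= tolerance_s else None)
    return matches


# Hypothetical timestamps: video frames versus tactile samples with a gap.
video = [0.000, 0.033, 0.066, 0.100]
tactile = [0.001, 0.034, 0.120]
matches = align(video, tactile, tolerance_s=0.005)
print(matches)
```

Frames that come back as `None` for any stream are candidates for exclusion during validation, since they have no tactile or skeleton sample close enough to pair with.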
The Results
- A reusable collection setup suitable for field conditions without a fixed power source
- A synchronized multi-stream dataset with 3D coordinates, depth, and tactile data
- Full-body skeleton and hand tracking in a format compatible with simulation environments
- Documented format structure for integration into the client’s training pipeline