The Problem
Humanoid robot developers keep running into the same wall: egocentric video data is scarce, expensive to collect, and slow to accumulate — nowhere near the volume needed to meaningfully advance training. Simulation solves the scale problem: you can spin up hundreds of parallel environments and generate millions of iterations in a short time. But those simulation environments need to be populated with realistic spaces and objects the robot will actually interact with.
Building environments by hand is costly and slow — it requires designers, developers, and virtual environment specialists. The goal was to collect simulation data directly from the real world, without a full production team.
The work split into two parallel streams:
Space scanning
Photorealistic 3D models of apartments and rooms, ready to load into simulation environments like IsaacSim — where a robot can be placed and its object interactions recorded.
Object scanning
3D models of everyday items — mugs, boxes, tools — to populate simulation scenes with geometrically accurate, properly textured objects.
The Solution
Space Scanning
Building environments manually without a large team is not realistic. Instead, we scan real rooms and load them directly into the simulator. The setup uses a 360-degree camera with an integrated lidar. Lidar provides metric accuracy; the camera provides photorealistic textures.
The scanning workflow runs with real-time coverage monitoring:
- the operator walks through the space with the scanner
- the software flags zones with insufficient coverage
- data is uploaded to volumetric reconstruction software
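The coverage-flagging step above can be illustrated with a small sketch. The scanner software used in the project is not specified, so this is not its algorithm: it is a minimal illustration that assumes coverage is judged by counting lidar samples per floor-grid cell. The function name, the 0.5 m cell size, and the 200-point threshold are all hypothetical.

```python
import math
from collections import Counter


def undercovered_cells(points, bounds, cell_size=0.5, min_points=200):
    """Return (col, row) floor-grid cells that received fewer than min_points samples.

    points: iterable of (x, y, z) positions in metres.
    bounds: ((x_min, y_min), (x_max, y_max)) footprint of the room in metres.
    cell_size and min_points are illustrative defaults, not values from the project.
    """
    (x_min, y_min), (x_max, y_max) = bounds
    n_cols = math.ceil((x_max - x_min) / cell_size)
    n_rows = math.ceil((y_max - y_min) / cell_size)

    # Count how many samples fall into each cell of the floor grid.
    counts = Counter()
    for x, y, _z in points:
        if x_min <= x < x_max and y_min <= y < y_max:
            col = int((x - x_min) // cell_size)
            row = int((y - y_min) // cell_size)
            counts[(col, row)] += 1

    # Cells with no samples at all count as zero and are flagged too.
    return [
        (col, row)
        for col in range(n_cols)
        for row in range(n_rows)
        if counts[(col, row)] < min_points
    ]
```

An operator-facing tool would presumably show the flagged cells as an overlay on the floor plan so the operator can walk back to them before leaving the site.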
Environments are static: drawers and cabinet doors do not open. For most scenarios that involve manipulating objects on surfaces, this is not a meaningful limitation. More importantly, the scans cover the same spaces where egocentric footage is recorded. That shared location is the direct integration point between the two data types.
Object Scanning
Lidar is not suitable for individual objects: its resolution is too low to capture fine details and textures. The method of choice is photogrammetry with reconstruction via 3D Gaussian Splatting (3DGS).
The pipeline works as follows:
- approximately 150 shots from different angles under controlled lighting
- processing in ColMap: computing the position of each frame relative to the object
- reconstruction in 3DGS: a precise model with realistic textures as output
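The pose-estimation step can be sketched with the standard COLMAP command-line stages: feature extraction, matching, and sparse mapping. The exact settings used in the project are not stated, so the folder layout (`images/`, `database.db`, `sparse/`) and the choice of exhaustive matching are assumptions. The sketch only builds the commands and does not run them.

```python
import shlex
from pathlib import Path


def colmap_sfm_commands(workspace: Path) -> list[list[str]]:
    """Build the three COLMAP CLI calls that estimate a camera pose for each frame.

    Assumed layout (hypothetical): <workspace>/images holds the ~150 photos.
    """
    images = workspace / "images"
    database = workspace / "database.db"
    sparse = workspace / "sparse"  # must exist before the mapper runs
    return [
        # 1. Detect keypoints and descriptors in every image.
        ["colmap", "feature_extractor",
         "--database_path", str(database), "--image_path", str(images)],
        # 2. Match features between all image pairs (feasible for ~150 frames).
        ["colmap", "exhaustive_matcher",
         "--database_path", str(database)],
        # 3. Recover camera poses and a sparse point cloud.
        ["colmap", "mapper",
         "--database_path", str(database), "--image_path", str(images),
         "--output_path", str(sparse)],
    ]


if __name__ == "__main__":
    for cmd in colmap_sfm_commands(Path("scans/mug_01")):
        # To execute for real: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

The resulting sparse model, with camera poses and points, is what a 3DGS trainer takes as input. The training command depends on the 3DGS implementation, which the text does not name, so it is left out here.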
The hardest part is lighting. Three exposure parameters (ISO, aperture, and shutter speed) are in constant tension, and there is no universal solution.
Here is what happens when each one is off:
- High ISO reduces the need for light but introduces grain — ColMap stops correctly matching points between frames
- Wide aperture lets in more light but narrows depth of field — part of the object goes out of focus and reconstruction in those zones degrades
- Long exposure requires the camera to be completely still — any movement interferes with frame matching
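The tension between these settings can be put in numbers with two textbook formulas: the stop difference between two f-numbers and the thin-lens depth-of-field approximation. The 50 mm lens, 0.5 m shooting distance, and 0.03 mm circle of confusion below are illustrative assumptions, not the project's actual settings.

```python
import math


def stops_between_apertures(n_from: float, n_to: float) -> float:
    """Stops of light lost when stopping down from f/n_from to f/n_to."""
    # Light gathered scales with 1 / N^2, so each stop is a factor of sqrt(2) in N.
    return 2 * math.log2(n_to / n_from)


def depth_of_field_mm(focal_mm: float, f_number: float, subject_mm: float,
                      coc_mm: float = 0.03) -> float:
    """Approximate depth of field in mm; valid when the subject is closer than
    the hyperfocal distance. coc_mm is the assumed circle of confusion."""
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = subject_mm * (hyperfocal - focal_mm) / (hyperfocal + subject_mm - 2 * focal_mm)
    far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return far - near


if __name__ == "__main__":
    lost = stops_between_apertures(2.8, 8.0)
    print(f"f/2.8 -> f/8 costs {lost:.2f} stops, i.e. {2 ** lost:.1f}x less light")
    for n in (2.8, 8.0):
        print(f"f/{n}: depth of field {depth_of_field_mm(50, n, 500):.1f} mm")
```

Under these assumptions, stopping down from f/2.8 to f/8 widens the in-focus zone from roughly 15 mm to roughly 43 mm. It also costs about three stops of light, which has to be recovered with about eight times the shutter time, eight times the ISO, or more lamps.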
After testing phone cameras, we switched to DSLRs: the larger sensor produces acceptable results in low light without pushing ISO into the range where grain breaks frame matching. Shooting parameters were calibrated separately for each object class.
Integration with Egocentric Data
Egocentric recordings and scans of the same spaces form a unified dataset where real data and simulation point to the same environment.
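One way to picture the link is a manifest in which every egocentric recording carries the identifier of the scanned space it was filmed in. The record types and field names below are hypothetical; the actual dataset schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpaceScan:
    space_id: str    # hypothetical identifier shared with the recordings
    scene_path: str  # hypothetical path to the reconstructed scene asset


@dataclass(frozen=True)
class EgoRecording:
    recording_id: str
    space_id: str    # the space the footage was recorded in
    video_path: str


def link_recordings(scans, recordings):
    """Group recording ids under the scan of the same space.

    Returns (by_space, unlinked): a dict of space_id -> recording ids, plus the
    ids of recordings whose space has no scan yet.
    """
    by_space = {scan.space_id: [] for scan in scans}
    unlinked = []
    for rec in recordings:
        if rec.space_id in by_space:
            by_space[rec.space_id].append(rec.recording_id)
        else:
            unlinked.append(rec.recording_id)
    return by_space, unlinked
```

A check like this also surfaces gaps in both directions: spaces that were scanned but have no footage, and footage from spaces that still need a scan.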
This gives the client capabilities that are unavailable when the two data types are purchased separately:
- reproduce a scenario from egocentric video in simulation — with the same geometry and the same objects
- adapt data to the physics of a specific robot: different grip, different height, different degrees of freedom
- collect additional data in simulation without another field visit — adjust lighting, object placement, trajectories
A single egocentric data collection session becomes a scalable source of simulation data.
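Collecting additional data without a field visit amounts to sampling variations of a scanned scene. The sketch below draws seeded lighting multipliers and small object shifts in plain Python. It makes no simulator calls, and the parameter names and ranges are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneVariation:
    light_scale: float  # multiplier on the scene's baseline light intensity
    object_offsets_m: dict[str, tuple[float, float]]  # per-object (dx, dy) shift in metres


def sample_variations(object_ids, count, seed=0,
                      max_shift_m=0.10, light_range=(0.6, 1.4)):
    """Draw `count` reproducible scene variations for one scanned space.

    max_shift_m and light_range are illustrative defaults, not project values.
    """
    rng = random.Random(seed)  # seeded so a variation set can be regenerated
    variations = []
    for _ in range(count):
        offsets = {
            obj: (rng.uniform(-max_shift_m, max_shift_m),
                  rng.uniform(-max_shift_m, max_shift_m))
            for obj in object_ids
        }
        variations.append(SceneVariation(rng.uniform(*light_range), offsets))
    return variations
```

Each variation would then be applied to the loaded scene before a rollout is recorded. How that is done depends on the simulator's scene API, which is outside this sketch.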
| Phase | Input | Scope of Work | Quality Control |
|---|---|---|---|
| Preparation & Calibration | Client requirements, list of target spaces and objects | Lidar scanning and photogrammetry setup, shooting parameters for lighting conditions | Coverage accuracy, ISO / aperture / shutter balance |
| Pilot Scanning | Test room, set of objects | Trial scans of spaces and objects, identifying problem areas: dark corners, reflective surfaces, fine details | Texture quality, geometry completeness, absence of reconstruction artifacts |
| Space Scanning | 360-degree camera with integrated lidar | Walk-through with real-time coverage monitoring, upload to volumetric reconstruction software | Metric dimensional accuracy, photorealistic textures, no uncovered zones |
| Object Scanning | DSLR camera, interaction objects | ~150 frames from different angles, ColMap processing, 3DGS reconstruction | Full object in focus, correct frame matching in ColMap |
| Reconstruction & Processing | Raw scanning and photogrammetry data | Building final 3D models of spaces and objects, geometry and scale verification | Processing fits within available GPU / RAM resources, no degradation in underlit zones |
| Egocentric Integration | Room scans, egocentric video from the same spaces | Aligning scans with recordings, test scene loading in IsaacSim, format compatibility check | Geometry match between scan and real space from video |
| Final Delivery | Validated 3D models and linked dataset | Packaging with format documentation, handoff with instructions for simulator import | IsaacSim compatibility, metadata completeness, pipeline reproducibility |
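The scale verification mentioned in the table can be sketched as follows. A reconstruction built from images alone recovers geometry only up to an unknown scale factor, so one common approach is to measure a known reference length on the real object and compare it with the same span in the model. The project's actual verification procedure is not described; the function names and the 1% tolerance are assumptions.

```python
import math


def scale_factor(p_a, p_b, real_distance_m: float) -> float:
    """Factor that converts model units to metres, from one reference span.

    p_a, p_b: the two ends of the reference span in model coordinates.
    """
    reconstructed = math.dist(p_a, p_b)
    if reconstructed == 0:
        raise ValueError("reference points coincide")
    return real_distance_m / reconstructed


def within_tolerance(measured_m: float, expected_m: float, rel_tol: float = 0.01) -> bool:
    """True if a second, independent measurement agrees after scaling."""
    return abs(measured_m - expected_m) <= rel_tol * expected_m
```

For example, if a mug's 0.095 m height spans 1.9 model units, the factor is 0.05, and a second span such as the rim diameter can then be scaled and checked with `within_tolerance`.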
The Results
- Photorealistic 3D scenes of real spaces, ready to load into IsaacSim
- A library of 3D objects with accurate geometry and textures for populating simulation environments
- A linked dataset: egocentric video tied to scans of the same spaces
- A reproducible scanning pipeline that does not require a large team of virtual environment specialists