The Task
The client asked us to collect 750 unique audio recordings of children's laughter, crying, and speech within one month. Each child could participate only once, so the same participants could not be recorded more than once. Strict quality and diversity requirements made the task harder still.
The Solution
To ensure an efficient data collection process, we divided it into several stages:
Dataset design and methodology:
- Defined the target age range and prioritized ethnic and regional groups
- Developed an age-verification approach combining visual assessment and metadata analysis
- Created clear, standardized instructions for participants and crowd platforms, including capture examples
Data collection approach:
- A pilot phase on the Yandex.Toloka platform proved too slow for the one-month deadline.
- We switched to an in-house collection strategy, engaging parents through social media and childcare institutions.
- To verify authenticity, we required submissions in video format. This let us confirm that the laughter, crying, and speech genuinely came from a child and that no participant appeared more than once.
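The one-recording-per-child rule described above can be sketched as a simple intake check. This is a minimal illustration, not the team's actual tooling: the `Submission` and `IntakeLog` names, the `participant_id` field, and the status strings are all hypothetical, and it assumes a reviewer has already watched the video and set `video_verified`.

```python
from dataclasses import dataclass, field

TARGET_TOTAL = 750  # from the brief: 750 unique recordings
CATEGORIES = {"laughter", "crying", "speech"}


@dataclass
class Submission:
    participant_id: str   # hypothetical ID assigned during the video check
    category: str         # one of CATEGORIES
    video_verified: bool  # reviewer confirmed the sound comes from the child


@dataclass
class IntakeLog:
    # Maps participant_id to the single accepted submission for that child.
    accepted: dict = field(default_factory=dict)

    def try_accept(self, sub: Submission) -> str:
        """Accept a submission only if it is verified and the child is new."""
        if sub.category not in CATEGORIES:
            return "rejected: unknown category"
        if not sub.video_verified:
            return "rejected: not video-verified"
        if sub.participant_id in self.accepted:
            return "rejected: repeat participant"
        self.accepted[sub.participant_id] = sub
        return "accepted"

    def remaining(self) -> int:
        """Number of recordings still needed to reach the target."""
        return max(TARGET_TOTAL - len(self.accepted), 0)
```

For example, a second submission from the same `participant_id` is rejected even if it belongs to a different category, which mirrors the rule that each child could participate only once.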
Data collection:
- Leveraged established crowd platforms and tested new sources to expand geographic coverage
- Designed simple, engaging tasks to encourage complete, high-quality submissions
- Provided fair compensation to reduce drop-off and incomplete submissions
- Monitored incoming data in real time to address quality issues early
Validation and quality control:
- Combined automated checks with manual expert review to confirm age and ownership of each submission
- Applied multi-layer validation, with multiple reviewers cross-checking each submission
- Minimized inconsistencies and labeling errors, achieving a very low inaccuracy rate
- Delivered a clean, production-ready dataset suitable for model training and research
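The multi-reviewer cross-check mentioned above could look roughly like the following. It is a sketch under stated assumptions: the two-thirds agreement threshold, the label strings, and the `consensus` function name are illustrative choices, not details taken from the project.

```python
from collections import Counter


def consensus(votes: list[str], min_agreement: float = 2 / 3) -> str:
    """Return the majority label if enough reviewers agree.

    `votes` holds one label per reviewer, e.g. "accept" or "reject".
    If no label reaches `min_agreement` (an assumed threshold), the
    submission is escalated for another round of review.
    """
    if not votes:
        return "needs_review"
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return "needs_review"
```

With three reviewers, two matching votes are enough to settle a submission, while an even split between two reviewers is escalated rather than decided arbitrarily.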
The Results
- Achieved high confidence in age accuracy and metadata reliability
- Identified consistent patterns of facial development across diverse ethnic and regional groups
- Enabled training for face recognition, anti-fraud systems, and academic research
Biometric spoofing resilience is built through repeated real-world attack attempts, not static datasets. System performance improves when diverse participants continuously test its limits under varied conditions.
Lucy Mamedoff, Data Collection Project Manager