We built and scaled a highly selective image description project to support the training of a large multimodal model focused on cultural nuance and contextual accuracy. This was not mass labeling. It required structured training, multi-stage examinations, and strict manual validation. Within two months, we scaled the team to 120 qualified annotators while maintaining an average quality level of around 90 percent.
The Task
The objective was to produce high-precision image descriptions aligned with detailed guidelines spanning over 30 pages. The model required culturally accurate, fact-based, and visually grounded descriptions.
This was not about describing generic cats or landscapes. The images reflected specific cultural contexts, local cuisine, traditional clothing, vehicles, environments, and symbolic elements. The model needed to understand nuance. For example, if an image showed a traditional dish with sauce on top, the description had to reflect that exact relationship rather than listing the components separately.
Subjective language was strictly prohibited. Phrases like "pleasant atmosphere" or "beautiful scene" were not acceptable. Annotators had to ground every description in what was visible, while balancing factual reference information against direct visual description.
Each image included a predefined main object. Annotators were required to prioritize it and expand meaningfully on it. In some cases, they researched contextual background to ensure precision, while carefully avoiding excessive reference content that could distort model learning.
The Selection and Training Process
We implemented a multi-stage hiring funnel:
- Initial logic and language screening: a general reasoning and language clarity test to filter baseline candidates.
- Training phase: candidates studied instructions and reviewed sample image descriptions.
- Two examination stages:
  - Writing descriptions from scratch
  - Editing and improving pre-generated descriptions
Each stage included approximately two case sets and was fully reviewed manually by our validation team.
The exams were intentionally demanding. The pass rate averaged around 50 percent, which allowed us to maintain high standards without slowing down delivery.
We refined the funnel in parallel with production. Early edge cases and instruction inconsistencies were identified through candidate feedback and validator review. We updated guidelines, clarified ambiguous scenarios, and continuously optimized the onboarding flow.
At steady state, approximately 30 new candidates entered the exam stage daily, creating a predictable and scalable recruitment pipeline.
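To make the steady-state numbers concrete, here is a rough back-of-envelope sketch. It uses only the figures quoted in this article (about 30 exam entrants per day, a pass rate of around 50 percent, 120 annotators). The function names and the `stages` parameter are illustrative assumptions, and the text does not say whether the 50 percent applies to the whole exam or to each stage, so both readings are shown. The estimate ignores ramp-up, attrition, and manual review time, none of which are quantified here.

```python
import math


def overall_pass_rate(pass_rate: float, stages: int = 1) -> float:
    """Combined pass rate if `pass_rate` applies independently to each of `stages` exams.

    With stages=1 the quoted rate is treated as the rate for the whole exam.
    """
    return pass_rate ** stages


def days_to_target(daily_entrants: int, pass_rate: float, target: int, stages: int = 1) -> int:
    """Days of steady-state intake needed to reach `target` qualified annotators."""
    passes_per_day = daily_entrants * overall_pass_rate(pass_rate, stages)
    return math.ceil(target / passes_per_day)


# Reading 1 (assumption): 50 percent is the pass rate for the exam as a whole.
whole_exam_days = days_to_target(daily_entrants=30, pass_rate=0.5, target=120)

# Reading 2 (assumption): 50 percent applies to each of the two exam stages.
per_stage_days = days_to_target(daily_entrants=30, pass_rate=0.5, target=120, stages=2)

print(whole_exam_days, per_stage_days)
```

Under either reading, the steady-state intake covers the headcount in a matter of weeks, which suggests that most of the two-month timeline went into building the funnel rather than running it.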
Quality Control
Every submission was manually validated.
The target metric was 95 percent quality. The working average stabilized around 90 percent, which is strong performance for high-complexity descriptive tasks.
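The article does not define how the quality figure was calculated. As one plausible reading, the sketch below treats quality as the share of manually validated submissions that a validator accepted, and measures the gap to the 95 percent target. The function names and the sample batch are hypothetical and for illustration only.

```python
def quality_rate(verdicts: list[bool]) -> float:
    """Share of validated submissions marked acceptable (True) by a validator."""
    if not verdicts:
        raise ValueError("no validated submissions")
    return sum(verdicts) / len(verdicts)


def gap_to_target(rate: float, target: float = 0.95) -> float:
    """Positive result means the batch is below the target quality level."""
    return target - rate


# Illustrative batch (not real project data): 9 of 10 submissions accepted.
sample_verdicts = [True] * 9 + [False]
rate = quality_rate(sample_verdicts)
gap = gap_to_target(rate)

print(f"quality={rate:.0%}, gap to target={gap * 100:.0f} percentage points")
```

Tracking the gap per batch rather than only the long-run average would make it easier to see whether guideline updates move quality toward the target.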
Extended training and rigorous exams significantly reduced production-stage errors. Rather than correcting issues later, we filtered for quality upfront.
Challenges
The main complexity lay in balancing three dimensions:
- Strict visual grounding without subjective interpretation
- Cultural specificity without overloading descriptions with background facts
- Emphasis on the main object while maintaining contextual integrity
Additionally, early versions of the guidelines contained ambiguities and edge cases. Instead of treating them as blockers, we used validator feedback loops to refine documentation and align expectations.
This iterative approach allowed us to stabilize processes quickly while continuing to scale.
Stage Overview
| Stage | Input | Workflow Scope | Main Quality Checks |
|---|---|---|---|
| Initial Screening | Candidate applications | Logic and language test | Clarity, baseline reasoning |
| Training | Guidelines + sample images | Instruction study and review | Understanding of constraints |
| Exam Stage 1 | Raw images | Independent description writing | Object focus, factual accuracy |
| Exam Stage 2 | Pre-generated descriptions | Editing and refinement | Precision, compliance with rules |
| Production | Approved annotators | Ongoing image description | Manual validation, cultural consistency |
| Guideline Optimization | Validator feedback | Documentation updates | Edge-case clarity |
The Results
- 120 qualified annotators onboarded in under two months
- An exam pass rate of around 50 percent, which kept hiring selective
- Average quality level around 90 percent
- Stable, scalable pipeline with continuous improvement mechanisms
Strong production quality is built before production starts. Rigorous training and demanding exams reduce downstream errors and allow us to scale without sacrificing precision. Cultural nuance cannot be crowdsourced casually. It must be structured, validated, and continuously refined.
Albina Romanova, Head of Speech Labeling & Data Generation