The Challenge
The client needed validation of math problems and solutions generated by an AI model. Each task had a layered structure: verifying the problem statement, analyzing the step-by-step solution, and producing a final correctness assessment.
Experts were required to classify each problem statement as correct or flawed, assign tags with reasoned commentary, walk through every step of the solution, identify and explain errors, and deliver a final verdict — highlighting model strengths where the solution held up.
Problems were segmented by difficulty: easy / medium / hard.
The core challenges were significant: the cost of an error at any stage was high, solutions were often long and complex, cognitive load varied sharply across difficulty levels, and the operational team had no in-house mathematical expertise to fall back on.
The Solution
Expert Recruitment
Standard annotator pools were not an option. We recruited math olympiad students, university instructors, and tutors through university partnerships and referral networks. Every candidate completed a test assignment that mirrored real project tasks. Selection prioritized quality of reasoning, not speed — which allowed us to build a compact but high-caliber team.
Quality Control
The focus was on making expert thinking transparent. Every solution required detailed written commentary, errors were localized to specific steps, and final assessments had to be fully argued. Consistency across evaluation stages was monitored throughout. This meant the process of reasoning was evaluated, not just the outcome.
Managing Difficulty Levels
Segmentation by difficulty became a core operational lever.
- Easy problems corresponded to advanced high school mathematics.
- Medium problems involved greater complexity and a heavier cognitive load.
- Hard problems required deep mathematical knowledge at the level of upper-year university coursework.
One finding reshaped the project's planning: medium-level problems — not the hardest ones — consumed the most time per task, due to the volume of text and the number of solution steps. This directly influenced workload planning, time estimation, and how tasks were distributed across the team.
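As an illustration of how such a finding can feed workload planning, the sketch below derives a median time per task for each difficulty level from logged timings and uses it to estimate the expert hours for a batch. The function names and all numbers are hypothetical; the timings are made up purely to show the shape of the input and are not project data.

```python
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_minutes(logged: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group logged (difficulty, minutes) pairs and take the median per level."""
    by_level: Dict[str, List[float]] = {}
    for level, minutes in logged:
        by_level.setdefault(level, []).append(minutes)
    return {level: median(values) for level, values in by_level.items()}


def estimate_hours(batch: Dict[str, int], per_task: Dict[str, float]) -> float:
    """Estimate total expert hours for a batch given task counts per level."""
    total_minutes = sum(count * per_task[level] for level, count in batch.items())
    return total_minutes / 60


# Made-up timings (assumption), chosen so that medium tasks take the longest.
sample_log = [
    ("easy", 10), ("easy", 14),
    ("medium", 40), ("medium", 50),
    ("hard", 30), ("hard", 34),
]
per_task = median_minutes(sample_log)
hours = estimate_hours({"easy": 10, "medium": 10, "hard": 10}, per_task)
```

Using the median rather than the mean is a design choice for this sketch: a few unusually long solutions would otherwise skew the per-task estimate.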
Process Management
Validation followed a fixed sequential structure: problem statement analysis, step-by-step solution review, error identification and classification, and final validation. Detailed action tracking tools were in place throughout. Regular error review sessions involved the strongest experts on the team, and quality criteria were refined iteratively as patterns emerged.
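The four stages can be pictured as one record per problem. The following is a minimal sketch, not the project's actual tooling: every class, field, and label name is an assumption introduced for illustration. It shows how step-level commentary, error classification, and the final verdict could be stored together and checked for completeness and consistency.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepReview:
    index: int                        # 1-based position in the solution
    is_correct: bool
    comment: str                      # written commentary for this step
    error_type: Optional[str] = None  # hypothetical label, e.g. "algebra"


@dataclass
class ValidationRecord:
    problem_id: str
    difficulty: str                   # "easy", "medium", or "hard"
    statement_is_correct: bool
    statement_comment: str
    verdict_is_correct: bool
    verdict_comment: str
    statement_tags: List[str] = field(default_factory=list)
    steps: List[StepReview] = field(default_factory=list)

    def first_error_step(self) -> Optional[int]:
        """Return the index of the earliest step marked incorrect, if any."""
        flagged = [s.index for s in self.steps if not s.is_correct]
        return min(flagged) if flagged else None

    def issues(self) -> List[str]:
        """List completeness and consistency problems found in the record."""
        found = []
        if not self.statement_comment.strip():
            found.append("statement assessment has no commentary")
        for s in self.steps:
            if not s.comment.strip():
                found.append(f"step {s.index} has no commentary")
            if not s.is_correct and s.error_type is None:
                found.append(f"step {s.index} is flagged but unclassified")
        if not self.verdict_comment.strip():
            found.append("final verdict is not argued")
        if self.verdict_is_correct and self.first_error_step() is not None:
            found.append("verdict says correct but a step is flagged")
        return found


# Hypothetical record with deliberate gaps, to exercise the checks.
record = ValidationRecord(
    problem_id="demo-1",
    difficulty="medium",
    statement_is_correct=True,
    statement_comment="Statement is well posed.",
    verdict_is_correct=True,
    verdict_comment="",
    steps=[
        StepReview(1, True, "Correct substitution."),
        StepReview(2, False, "Sign error when expanding."),
    ],
)
problems = record.issues()
```

In this sketch the verdict stays an expert judgment stored in the record; the code only flags contradictions between stages rather than computing the verdict itself.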
The Results
- 3,500 math problems validated
- A structured approach to reasoning verification, with documented error patterns in model logic
- A training dataset built on step-level validation
- Consistent output quality maintained across high-complexity tasks
"We were essentially building a reasoning audit system. Experts weren't just checking math, but also stress-testing the model's logical consistency across long solution chains. What stood out was that medium-level problems turned out to be the most expensive in terms of attention and time."
Vladislav Barsukov, Head of SLM&LLM Annotation