The Challenge
The client needed annotation and validation of financial queries and model-generated responses. The material included complex financial cases with calculations, specialized terminology where domain understanding mattered as much as language proficiency, and multi-step model solutions.
Experts were required to:
- assess the correctness of each query (both linguistically and economically)
- validate every step of the model's response
- identify errors in calculations and logic
- evaluate terminology accuracy
- deliver a final verdict on each answer.
The CFA component added further constraints: tasks were in Russian, structured at an examination level comparable to an international certification standard, and required narrower specialization than the FinQA track.
A financial analysis pilot with even deeper domain requirements is currently in progress.
Key Challenges
The candidate pool was extremely narrow because the role required both economics expertise and specialized vocabulary. Hiring conversion ran at roughly 20–25%. The operational team had no in-house domain knowledge, which made independent validation of expert decisions impossible and created heavy reliance on the client for interpreting task requirements.
Designing the test assignment presented a separate problem: it could not be created without involving domain experts from the outset.
The Solution
Expert Recruitment
Candidates were drawn from economists, finance and analytics professionals, and specialists with verified English proficiency. The test assignment was developed with a domain expert and modeled real project cases. Selection criteria prioritized quality of reasoning and command of the subject area over throughput.
This produced a core team of 8 experts for FinQA, which was later expanded to 14 for the CFA track.
Workflow Organization
It was clear from the start that the process needed a reliable mechanism for resolving edge cases, given that the operational team could not adjudicate expert decisions independently.
The solution was a centralized document where experts logged ambiguous cases with examples. These were escalated to the client, and responses were distributed back to the full team. For time-sensitive issues, direct communication channels were used.
Annotation Process
Each task followed a fixed sequence: query review covering both language and economic meaning, step-by-step response validation, analysis of calculations and logic, terminology check, and final assessment. Quality control used a three-annotator overlap per task, with tag and score comparison to ensure inter-annotator consistency.
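The overlap check described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual tooling: the tag names, the 1–5 score scale, the `max_score_spread` threshold, and the choice of pairwise tag agreement as the consistency measure are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical data: task_id -> [(verdict_tag, score)] from three annotators.
# Tag names and the 1-5 score scale are illustrative assumptions.
annotations = {
    "task-001": [("correct", 5), ("correct", 5), ("correct", 4)],
    "task-002": [("calc_error", 2), ("correct", 4), ("calc_error", 2)],
    "task-003": [("logic_error", 1), ("calc_error", 2), ("correct", 4)],
}

def pairwise_tag_agreement(tags):
    """Share of annotator pairs that assigned the same tag."""
    pairs = list(combinations(tags, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def review_task(records, max_score_spread=1):
    """Compare tags and scores for one task and flag it if annotators diverge."""
    tags = [tag for tag, _ in records]
    scores = [score for _, score in records]
    top_tag, votes = Counter(tags).most_common(1)[0]
    has_majority = votes > len(tags) / 2
    spread = max(scores) - min(scores)
    return {
        "majority_tag": top_tag if has_majority else None,
        "tag_agreement": pairwise_tag_agreement(tags),
        "score_spread": spread,
        # Assumed rule: escalate when there is no majority tag or scores diverge.
        "needs_escalation": (not has_majority) or spread > max_score_spread,
    }

report = {task_id: review_task(recs) for task_id, recs in annotations.items()}
overall_agreement = mean(r["tag_agreement"] for r in report.values())

for task_id, result in report.items():
    print(task_id, result)
print(f"mean pairwise tag agreement: {overall_agreement:.2f}")
```

In this sketch, tasks without a majority tag or with a wide score spread are flagged, which mirrors how ambiguous cases were logged and escalated rather than resolved by the operational team. A chance-corrected statistic could replace the raw pairwise agreement if a stricter measure is needed.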
Scaling Expertise
On the CFA track, the initial pool was deliberately narrow given the certification-level subject matter. Senior experts trained the broader team, which made it possible to scale without compromising quality. The ongoing financial analysis pilot has so far confirmed that deep within-domain specialization is a prerequisite, not an option, for projects of this type.
| Stage | Input | Workflow Scope | Main Quality Checks |
|---|---|---|---|
| Project Setup | Client requirements, financial task formats | Task design, evaluation criteria, annotation guidelines | Task logic consistency, evaluation clarity |
| Expert Onboarding | Candidate pool (finance background) | Recruitment, testing, interviews, onboarding | Expertise depth, language precision |
| Annotation Execution | Financial Q&A tasks (FinQA, CFA-like) | Step-by-step validation of answers, calculations, reasoning | Calculation accuracy, logical consistency |
| Multi-Review Process | Annotated tasks | Cross-review by 3 experts, disagreement resolution | Consensus alignment, error detection |
| Validation & Analysis | Reviewed datasets | Error classification, pattern analysis, guideline refinement | Result consistency, systematic error control |
| Reporting & Iteration | Validated financial datasets | Weekly reporting, feedback loops, quality improvement | Trend accuracy, continuous quality alignment |
The Results
- A working expert annotation model for financial AI delivered
- A process built and sustained without in-house domain expertise on the operations side
- Specialist knowledge successfully scaled across the team
- Stable inter-annotator consistency achieved in a multi-annotator setup
- Quality positively assessed by the client
Financial reasoning in AI is not built on volume, but on the consistency of expert judgment. Models improve when every answer is challenged, validated, and aligned across multiple reviewers.
Vladislav Barsukov, Head of SLM&LLM Annotation