SUMMARY
We executed two distinct multilingual GenAI activities. The first was an LLM evaluation, where we assessed the performance of two LLMs in generating accurate, relevant responses across English, French, and Spanish. The second involved human-led content generation, where trained linguists created multilingual text from image prompts by following structured guidelines. With no existing benchmarks, we built separate workflows and rubrics for each track, delivering model comparison insights and high-quality human-generated data for GenAI training and analysis.
THE CHALLENGE
- Automated metrics miss context and tone, risking inaccurate or culturally inappropriate outputs in multilingual markets.
- Undetected hallucinations and biases can lead to misinformation, brand risk, and loss of customer trust.
SOLUTION
- Designed structured rubrics for both text and image-based LLM outputs
- Onboarded trained linguists with multilingual capabilities
- Set up a tool for text reviews and the Oprimes Survey module for image tasks
- Created detailed rating guidelines and qualitative tagging logic (see the sketch after this list)
- Executed a multi-stage workflow, from rubric finalization to full-scale review
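As a concrete illustration of what a structured rubric with qualitative tagging can look like, here is a minimal sketch of a per-task score card in Python. The dimension names, pass threshold, and tag list are hypothetical placeholders, not the project's actual guidelines.

```python
from dataclasses import dataclass, field

# Hypothetical rubric dimensions and qualitative tags -- illustrative only,
# not the actual project guidelines.
RUBRIC_DIMENSIONS = ("accuracy", "relevance", "fluency", "cultural_fit")
QUALITATIVE_TAGS = ("hallucination", "bias", "mistranslation", "tone_mismatch")

@dataclass
class TaskReview:
    task_id: str
    language: str                    # e.g. "en", "fr", "es"
    scores: dict[str, int]           # rubric dimension -> 1-5 rating from a linguist
    tags: list[str] = field(default_factory=list)  # qualitative issue tags

    def passes(self, threshold: int = 4) -> bool:
        """A task passes if every rubric dimension meets the threshold
        and no blocking tag (e.g. a hallucination) was applied."""
        dimensions_ok = all(self.scores.get(d, 0) >= threshold for d in RUBRIC_DIMENSIONS)
        return dimensions_ok and "hallucination" not in self.tags

# One reviewed French response that meets the hypothetical bar
review = TaskReview(
    task_id="fr-0142",
    language="fr",
    scores={"accuracy": 5, "relevance": 4, "fluency": 5, "cultural_fit": 4},
)
print(review.passes())  # True
```

Keeping each review as a small structured record like this makes the pass/fail decision transparent and the qualitative tags easy to aggregate later.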
KEY OUTCOMES
- Evaluated 540+ tasks per LLM across 3 languages with structured, detailed scoring.
- Generated 1,000+ questions from image prompts through human-led multilingual content creation.
- LLM comparison results (see the aggregation sketch below):
  - LLM 1: ~93% overall pass rate
  - LLM 2: ~83% overall pass rate
- Delivered actionable insights and language-specific improvement areas
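To show how overall pass rates such as the ~93% and ~83% above can be derived from task-level reviews, here is a minimal aggregation sketch. The grouping by language and the sample data are assumptions for illustration, not the project's real results.

```python
from collections import defaultdict

def pass_rates(reviews):
    """reviews: iterable of (language, passed) pairs for one LLM.
    Returns the overall pass rate and a per-language breakdown."""
    totals, passes = defaultdict(int), defaultdict(int)
    for language, passed in reviews:
        totals[language] += 1
        passes[language] += int(passed)
    per_language = {lang: passes[lang] / totals[lang] for lang in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_language

# Tiny made-up sample -- not the project's data.
sample = [("en", True), ("en", True), ("fr", True), ("fr", False), ("es", True)]
overall, by_lang = pass_rates(sample)
print(f"Overall: {overall:.0%}")   # Overall: 80%
print(by_lang)                     # {'en': 1.0, 'fr': 0.5, 'es': 1.0}
```

Breaking the rate down per language is what surfaces the language-specific improvement areas noted above.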