GenAI - Boosting Multilingual LLM Accuracy: From Cultural Nuance to 93% Pass Rates

SUMMARY

We executed two distinct multilingual GenAI activities. The first was an LLM evaluation, where we assessed the performance of two LLMs in generating accurate, relevant responses across English, French, and Spanish. The second involved human-led content generation, where trained linguists created multilingual text from image prompts by following structured guidelines. With no existing benchmarks, we built separate workflows and rubrics for each track, delivering model comparison insights and high-quality human-generated data for GenAI training and analysis.

THE CHALLENGE

  • Automated metrics miss context and tone, risking inaccurate or culturally inappropriate
    outputs in multilingual markets.
  • Undetected hallucinations and biases can lead to misinformation, brand risk, and
    customer trust issues.

SOLUTION

  • Designed structured rubrics for both text and image-based LLM outputs
  • Onboarded trained linguists with multilingual capabilities
  • Set up a review tool for text tasks and the Oprimes Survey module for image tasks
  • Created detailed rating guidelines and qualitative tagging logic
  • Executed a multi-stage workflow: rubric finalization → full-scale review
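The rubric-based scoring step above can be sketched in code. This is a minimal illustration, not the project's actual tooling: the dimension names (`accuracy`, `relevance`, `cultural_fit`), the 1–5 scale, and the pass threshold are all hypothetical stand-ins for the undisclosed rubric.

```python
from collections import defaultdict

# Hypothetical rubric: each dimension is scored 1-5 by a linguist,
# and a task passes only if every dimension meets the threshold.
DIMENSIONS = ("accuracy", "relevance", "cultural_fit")
PASS_THRESHOLD = 4

def task_passes(scores):
    """A task passes only if all rubric dimensions meet the threshold."""
    return all(scores[d] >= PASS_THRESHOLD for d in DIMENSIONS)

def pass_rates_by_language(reviews):
    """Aggregate per-task linguist reviews into per-language pass rates.

    `reviews` is a list of dicts like:
    {"language": "fr", "scores": {"accuracy": 5, "relevance": 4, "cultural_fit": 4}}
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for review in reviews:
        totals[review["language"]] += 1
        if task_passes(review["scores"]):
            passes[review["language"]] += 1
    return {lang: passes[lang] / totals[lang] for lang in totals}

reviews = [
    {"language": "en", "scores": {"accuracy": 5, "relevance": 5, "cultural_fit": 4}},
    {"language": "en", "scores": {"accuracy": 3, "relevance": 5, "cultural_fit": 4}},
    {"language": "fr", "scores": {"accuracy": 4, "relevance": 4, "cultural_fit": 5}},
]
print(pass_rates_by_language(reviews))  # {'en': 0.5, 'fr': 1.0}
```

The all-dimensions-must-pass rule makes the overall pass rate conservative: a fluent but culturally off response still fails, which is the point of human review over automated metrics.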

KEY OUTCOMES

  • Evaluated 540+ tasks per LLM across 3 languages with structured, detailed scoring.
  • Generated 1,000+ questions from image prompts through human-led multilingual content creation.
  • LLM comparison results:
      – LLM 1: ~93% overall pass rate
      – LLM 2: ~83% overall pass rate

  • Delivered actionable insights and language-specific improvement areas.