VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark

*Equal contribution
KAIST, MBZUAI
TL;DR

We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark with 2.5k multimodal questions across 7 tasks. Despite strong Vietnamese OCR performance, proprietary models achieve only 68.70% mean accuracy. Analysis shows the primary sources of failure are multimodal grounding and reasoning over text and visual evidence, rather than OCR.

Overview

Vision Language Models (VLMs) have made rapid progress on multimodal benchmarks, demonstrating strong performance on complex visual and textual reasoning tasks in English. However, it remains unclear how well these models perform in low-resource languages, especially when language understanding, visual grounding, and reasoning are all required to complete a task.

We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision language models interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning.

All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art (SOTA) proprietary VLMs on VMMU and find that despite strong Vietnamese OCR performance, proprietary models achieve only 68.70% mean accuracy. Further analysis shows that the primary sources of failure are not OCR, but instead multimodal grounding and reasoning over text and visual evidence.

Example of a Vietnamese multimodal exam question from VMMU, combining text and visual elements.

5 Key Research Findings

Proprietary VLMs Score Relatively Low

Non-thinking proprietary VLMs achieve only 50-71% accuracy on VMMU, while thinking VLMs perform significantly better at 73-86%. This indicates substantial room for improvement in Vietnamese multimodal understanding.

OCR Is Not the Bottleneck

Across 5 proprietary SOTA VLMs, Vietnamese embedded-text transcription is consistently strong (mean BLEU 89.01%, F1 94.30%, CER 6.59%, WER 9.33%). Most remaining errors are attributable to multimodal grounding and downstream reasoning, not OCR failures.
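
For reference, CER and WER are length-normalized edit distances over characters and whitespace tokens, respectively. Below is a minimal pure-Python sketch of these two metrics; it is illustrative only and not the paper's evaluation code, whose tokenization and normalization choices may differ.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # consume only the ref item
                            curr[j - 1] + 1,           # consume only the hyp item
                            prev[j - 1] + (r != h)))   # match or substitute
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

# Toy example with Vietnamese embedded text (one wrong character -> one wrong word)
print(cer("dòng điện xoay chiều", "dòng diện xoay chiều"))  # 0.05
print(wer("dòng điện xoay chiều", "dòng diện xoay chiều"))  # 0.25
```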

Disentangling Text from Visual Evidence Improves Performance

When the question and options are given as text instead of being rendered within the image (Split-MM), accuracy increases for every model (+6 percentage points on average), showing that separating text from visual evidence reduces interference and improves multimodal reasoning.
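
To make the ablation concrete, here is a minimal sketch of how the two input conditions could be constructed for a chat-style VLM API. The OpenAI-style message schema, the field names in the `question` dict, and the Vietnamese answer-format instruction are illustrative assumptions; the benchmark's actual harness and prompts may differ.

```python
import base64

def image_part(path: str) -> dict:
    """Encode an image file as a data-URL content part (OpenAI-style schema)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_messages(question: dict, split_mm: bool) -> list:
    """Build one user turn for either condition.

    question = {
        "full_image": ...,      # question + options rendered inside the image
        "evidence_image": ...,  # cropped figure/diagram only
        "text": ...,            # question stem + options as plain text
    }
    """
    instruction = "Trả lời bằng một chữ cái duy nhất (A, B, C hoặc D)."  # "Answer with a single letter (A, B, C, or D)."
    if split_mm:
        # Split-MM: visual evidence as image, question/options as separate text
        content = [image_part(question["evidence_image"]),
                   {"type": "text", "text": f"{question['text']}\n{instruction}"}]
    else:
        # Original: everything rendered inside a single exam image
        content = [image_part(question["full_image"]),
                   {"type": "text", "text": instruction}]
    return [{"role": "user", "content": content}]
```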

English Translation Does Not Help

Translating the Vietnamese text into English consistently reduces accuracy for all models (-2 percentage points on average). This suggests that Vietnamese language understanding is not the primary bottleneck, and translation introduces mismatches that degrade grounded multimodal reasoning.

Removing Visual Evidence Reveals Possible Reliance on Priors

When visual evidence is removed, accuracy drops significantly (-21.27 points); yet VLMs still perform above random chance (25.92%), suggesting substantial reliance on text-only priors and exam heuristics rather than genuine visual grounding.

Additional Findings

A.1 Image-Required Questions Are Harder

Questions that truly require image understanding are significantly more challenging than text-sufficient questions, demonstrating the benchmark's focus on visual reasoning.

A.2 Answer Shuffling Has Minor Effect

Shuffling answer options has a minor impact on model performance, indicating models are not heavily reliant on answer position biases.

A.3 Multiple Runs Show Consistent Rankings

Running evaluations multiple times does not change model rankings or main conclusions, confirming the robustness of the benchmark results.

A.4 Non-Thinking VLMs Are Overconfident

Vision-language models without reasoning capabilities tend to be overconfident in their incorrect predictions.

A.5 SOTA VLMs Follow Answer Format Reliably

State-of-the-art vision-language models consistently follow the required answer format, demonstrating strong instruction-following capabilities.

A.6 Open-Source VLMs Significantly Underperform

Open-source vision-language models show substantial performance gaps compared to proprietary models on VMMU benchmark tasks.

A.7 Weak OCR Drives Open-Source Failures

The primary weakness of open-source VLMs is their inferior embedded-text OCR capabilities, which significantly impacts overall performance.

How We Built VMMU: The Data Curation Pipeline

To ensure the quality and validity of our benchmark, we developed a systematic 3-stage pipeline to collect, filter, and classify exam questions. This process was crucial for building a dataset that genuinely tests multimodal reasoning, moving beyond simple text-only questions found in previous benchmarks.

VMMU's data curation pipeline: (1) sourcing raw exam PDFs and converting them to images, (2) an automated pipeline to detect and classify questions, and (3) a final manual verification loop by native Vietnamese speakers.

Our 3-Stage Process Explained:

Stage 1: Data Sourcing & Conversion. We began by automatically crawling thousands of exam papers from popular Vietnamese educational websites. Each paper was then converted into high-resolution images, preparing them for analysis.

Stage 2: Automated Classification. Next, our custom pipeline used OCR to identify individual questions on each page. A specialized analyzer then scanned the content of each question, automatically distinguishing multimodal (image-based) questions from text-only ones.
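
As a rough illustration of Stage 2, the sketch below OCRs a page, splits it at question markers, and applies a keyword heuristic to flag likely multimodal questions. It assumes a Tesseract backend (via pytesseract) with the Vietnamese language pack and hypothetical file names; the pipeline's actual detector and analyzer are more sophisticated.

```python
import re
import pytesseract            # OCR wrapper around Tesseract (requires the "vie" language pack)
from PIL import Image

QUESTION_RE = re.compile(r"^\s*Câu\s+\d+[.:]", re.MULTILINE)   # "Câu 12." = "Question 12."
FIGURE_RE = re.compile(r"hình|đồ thị|biểu đồ", re.IGNORECASE)  # "figure", "graph", "chart"

def split_questions(page_image):
    """OCR one exam page and split the text at question markers."""
    text = pytesseract.image_to_string(page_image, lang="vie")
    starts = [m.start() for m in QUESTION_RE.finditer(text)]
    return [text[s:e].strip() for s, e in zip(starts, starts[1:] + [len(text)])]

def looks_multimodal(question_text):
    """Crude keyword heuristic: flag questions that mention a figure, graph, or chart.
    A real analyzer would also inspect the page layout for non-text regions."""
    return bool(FIGURE_RE.search(question_text))

# Hypothetical file name; the flagged candidates feed Stage 3's manual verification
page = Image.open("exam_page_1.png")
candidates = [q for q in split_questions(page) if looks_multimodal(q)]
```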

Stage 3: Human Verification. Finally, to ensure maximum accuracy, every single question was manually reviewed by a team of native Vietnamese speakers. They used a custom-built tool to correct any classification errors and validate the final dataset.

Dataset at a Glance

Our VMMU benchmark features 2,548 questions across 7 diverse domains.

Mathematics: 456 questions
Physics: 361 questions
Chemistry: 302 questions
Biology: 341 questions
Geography: 481 questions
Driving Test: 367 questions
IQ Test: 240 questions

Overall Performance

SOTA proprietary VLMs (68.70% mean accuracy) edge out the average human test-taker (66.54%) and far outperform open-source VLMs (35.53%). Human performance is based on official Vietnamese national high school graduation exam results (VnExpress, 2024), so human scores are available only for the five graduation-exam subjects. All table entries are accuracy in percent.
| Model | Math | Physics | Chemistry | Biology | Geography | Driving | IQ | Mean |
|---|---|---|---|---|---|---|---|---|
| Human (Average) | 64.50 | 66.70 | 66.80 | 62.80 | 71.90 | – | – | 66.54 |
| Human (Best) | 98.00 | 100.0 | 100.0 | 100.0 | 100.0 | – | – | 99.60 |
| Random baseline | 25.00 | 24.66 | 24.38 | 24.67 | 25.00 | 33.24 | 24.46 | 25.92 |
| Open-source VLMs | | | | | | | | |
| Aya-Vision-8B | 10.75 | 8.03 | 3.64 | 3.23 | 2.49 | 26.98 | 12.50 | 9.66 |
| Aya-Vision-32B | 12.94 | 14.40 | 16.89 | 17.30 | 21.41 | 32.97 | 24.58 | 20.07 |
| Gemma-3-4B | 26.32 | 17.45 | 22.19 | 22.29 | 27.65 | 40.60 | 21.67 | 25.45 |
| Gemma-3-27B | 47.37 | 30.47 | 38.74 | 30.79 | 47.40 | 43.87 | 35.00 | 39.09 |
| Qwen-2.5-VL-32B | 61.18 | 49.17 | 44.70 | 42.82 | 71.31 | 54.50 | 41.25 | 52.13 |
| Qwen-2.5-VL-72B | 58.33 | 49.86 | 48.34 | 40.47 | 64.66 | 53.95 | 46.67 | 51.75 |
| Llama-4-Scout | 59.65 | 44.88 | 45.03 | 38.71 | 69.02 | 49.05 | 37.08 | 49.06 |
| Llama-4-Maverick | 42.54 | 29.36 | 32.78 | 27.57 | 53.22 | 53.95 | 29.17 | 38.37 |
| Mistral-Small-3.2-24B | 34.65 | 30.28 | 28.48 | 28.15 | 30.98 | 39.78 | 30.83 | 31.88 |
| Mistral-Medium-3 | 44.74 | 36.57 | 37.75 | 29.33 | 36.59 | 45.78 | 33.75 | 37.78 |
| Mean (open-source) | 39.85 | 31.05 | 31.85 | 28.06 | 42.47 | 44.14 | 31.25 | 35.53 |
| SOTA proprietary VLMs | | | | | | | | |
| Gemini-2.5-Flash | 82.46 | 67.04 | 78.15 | 63.05 | 85.24 | 71.39 | 52.08 | 71.34 |
| Sonnet-4.0 | 64.25 | 41.00 | 53.31 | 44.87 | 48.44 | 58.04 | 44.17 | 50.58 |
| GPT-4.1 | 46.27 | 43.21 | 44.37 | 44.87 | 69.85 | 66.21 | 46.25 | 51.58 |
| o3 | 84.87 | 68.98 | 82.78 | 67.16 | 88.98 | 74.66 | 50.42 | 73.98 |
| Gemini-3.0-Pro | 92.54 | 81.16 | 91.72 | 80.94 | 94.59 | 91.28 | 72.08 | 86.33 |
| GPT-5.0 | 91.23 | 76.45 | 85.10 | 72.73 | 89.40 | 75.20 | 58.75 | 78.41 |
| Mean (proprietary) | 76.94 | 62.97 | 72.57 | 62.27 | 79.42 | 72.80 | 53.96 | 68.70 |
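
Each value in the Mean column is the unweighted average over the domains for which a row has scores (all seven domains for models and the random baseline, the five graduation-exam subjects for humans). For the random baseline, for instance: (25.00 + 24.66 + 24.38 + 24.67 + 25.00 + 33.24 + 24.46) / 7 ≈ 25.92, the chance level cited in the findings above. The driving-test chance level exceeds 25%, presumably because some of its questions offer fewer than four answer options.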

Conclusion

Primary Contribution

This study documents systematic challenges in VLMs' Vietnamese multimodal reasoning through a comprehensive evaluation across 7 domains, demonstrating that failures stem from multimodal integration and reasoning rather than from basic text recognition.

Key Results

SOTA proprietary VLMs reached a mean accuracy of 68.70% on Vietnamese multimodal exam questions, only narrowly above the average human test-taker (66.54%), while open-source models averaged just 35.53%.

Core Finding

VLMs consistently struggle with Vietnamese multimodal content integration. Thinking models such as o3, Gemini-3.0-Pro, and GPT-5.0 (73.98-86.33% mean accuracy) substantially outperform non-thinking models (50.58-71.34%), suggesting that future work on reasoning-centric and tool-augmented approaches is needed to close the remaining gap on complex multimodal reasoning tasks.