PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

1Shanghai Artificial Intelligence Laboratory, 2University of Chinese Academy of Sciences,
3Peking University, 4Sun Yat-Sen University, 5Chinese University of Hong Kong

* Equal contribution    Project lead    Correspondence: heconghui@pjlab.org.cn


Overview of PM4Bench, which includes parallel corpora in 10 languages and features two settings (traditional and vision) and four tasks (MDUR, MIQA, MMJB, and MSOCR). Based on PM4Bench, we comprehensively evaluate the usefulness and safety of LVLMs, delving into the relationship between their underlying OCR capabilities and higher-level abilities.

Abstract

Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously see, read, and think, aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We released PM4Bench at https://github.com/opendatalab/PM4Bench.

Warning: This paper contains potentially offensive and harmful text.

🔥 Highlight

  • Parallel Text for Multi-Modal: We offer the first Parallel Multilingual Multi-Modal Multi-task Benchmark, built on parallel corpora across 10 languages, enabling fair and in-depth multilingual evaluation and analysis.
  • Comprehensive Evaluation: We conduct extensive evaluations of 11 LVLMs, establishing a comprehensive foundation for comparative analysis.
  • Meticulous Analysis: Our further analysis reveals greater cross-lingual imbalance in the vision setting and shows that OCR capability correlates strongly with LVLM performance, providing guidance for future improvements.

Introduction

Comprehensive evaluation of LVLMs in multilingual scenarios is crucial for identifying shortcomings and guiding further optimization. However, most existing benchmarks have certain limitations: (1) Some rely on language-specific corpora, coupling linguistic ability with cultural knowledge, making it difficult to discern whether performance gaps arise from cultural knowledge deficiencies or fundamental linguistic capabilities; (2) Text and images are processed separately, unlike how humans naturally interact with multi-modal information in the real world; and (3) Safety evaluation is neglected, posing risks for responsible deployment.

To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench includes 10 languages and uses parallel corpora focused on world knowledge, decoupling performance from cultural contexts. It includes the vision setting, where text and queries are embedded in images, aligning with real-world application scenarios such as multi-modal agents, free-form web interaction, and the perception and self-learning of embodied AI robots. Additionally, PM4Bench evaluates LVLM safety in multilingual and multi-modal contexts, filling a critical gap. A detailed comparison between PM4Bench and other benchmarks is provided in Benchmark Comparison.

Using PM4Bench, we evaluated 11 LVLMs, including leading open-source LVLMs, commercial APIs, lightweight LVLMs, and recent reasoning LVLMs, revealing significant cross-linguistic performance disparities, particularly in the vision setting. We found that increasing model size does not mitigate these imbalances, with Optical Character Recognition (OCR) capability identified as the key factor.

Overview of PM4Bench

Design Principles

Our core motivation is to comprehensively evaluate the performance of LVLMs in both usefulness and safety within multilingual & multi-modal scenarios. We aim to align more closely with real-world user applications, and assess LVLMs' cross-lingual performance disparities faithfully and systematically. Furthermore, we aim to accurately analyze and identify the underlying issues and shortcomings of current LVLMs, providing clear guidance for model optimization.

To achieve this, we propose the following design principles:

  • Targeted Language Selection: The selected languages should cover diverse language families and a variety of writing scripts.
  • Parallel Corpus: The content across languages must be semantically identical. This ensures that language-specific and culturally related knowledge is decoupled from the evaluation tasks, allowing us to remain focused on assessing fundamental language capabilities.
  • Vision Setting: To simulate real-world applications and human perception, text and queries are "printed" onto images in vision setting.
  • Task Diversity: The benchmark should encompass a wide range of tasks, including perception, knowledge recall and reasoning, generation, and safety.

Task Introduction

🤔 MDUR (Multi-Discipline Understanding and Reasoning)

MDUR aims to evaluate LVLMs' multi-modal understanding, knowledge application, and reasoning capabilities. We therefore chose MMMU-pro as our data source. MMMU-pro is a comprehensive dataset created to assess multi-modal models on college-level tasks that demand specialized knowledge and critical reasoning. It contains 1730 samples, each an English multiple-choice question with exactly one correct option.

We translated the text of the original English questions into the 9 other languages and generated the vision-setting images. It is important to note that some of the inserted images in the MDUR task contain English characters inherited from the original MMMU-pro samples; we believe their presence has minimal impact on our "parallel" design principle.

Finally, we obtain the MDUR dataset covering 10 languages, with 1730 questions per language. With the MDUR task, we can extensively evaluate an LVLM's capability to handle complicated knowledge understanding, reasoning, and application under multilingual scenarios. Examples of MDUR samples can be found below (left: traditional setting; right: vision setting).

💬 MIQA (Multi-Image Question Answering)

MIQA focuses on open-ended question answering in multi-image input scenarios. We used MMDU, a multi-turn, multi-image dialogue understanding benchmark containing over 1.6K rounds of QA, as our data source. We sampled 109 QA pairs from MMDU, prioritizing questions with more image inputs. These questions and their corresponding reference answers were then translated into the 9 other languages. As with the MDUR task, we provide both traditional and vision input settings for MIQA.

It's worth noting that all the questions and answers in the MIQA dataset are sourced from Wikipedia, which encompasses a wide range of general and specialized knowledge. Consequently, the LVLM must possess not only strong visual perception and reasoning skills but also a comprehensive and robust knowledge base. Meanwhile, multi-image input also challenges the model's ability to acquire, compare, and analyze information across images. MIQA adopts an LLM as judge to score the LVLM's open-ended answers along multiple dimensions.
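The exact judging prompt and rubric are not reproduced here; below is a minimal Python sketch of what an LLM-as-judge scoring call might look like. The `call_llm` wrapper and the dimension names are illustrative assumptions and may differ from the setup actually used in PM4Bench.

```python
import json

# Hypothetical judging dimensions; the exact rubric used by PM4Bench may differ.
DIMENSIONS = ["accuracy", "completeness", "relevance", "fluency"]

JUDGE_PROMPT = """You are an impartial judge. Given a question, a reference answer,
and a model answer, rate the model answer on each dimension from 1 to 10.
Return only a JSON object mapping dimension names to integer scores.

Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Dimensions: {dims}
"""

def judge_miqa_answer(question: str, reference: str, candidate: str, call_llm) -> dict:
    """Score an open-ended MIQA answer with an LLM judge.

    `call_llm` is any callable that takes a prompt string and returns the judge
    model's text completion (e.g. a thin wrapper around an API client).
    """
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference,
        candidate=candidate, dims=", ".join(DIMENSIONS),
    )
    raw = call_llm(prompt)
    scores = json.loads(raw)  # expects the judge to return pure JSON, e.g. {"accuracy": 8, ...}
    scores["overall"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores
```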

We expect the MIQA task to extensively evaluate an LVLM's perception, understanding, knowledge application, and generation capabilities under multi-image, multilingual inputs. Examples of MIQA samples can be found below (left: traditional setting; right: vision setting).

🔐 MMJB (Multi-Modal JailBreaking Challenge)

This task aims to evaluate LVLM safety under multi-modal & multilingual scenarios. We select SafeBench as our seed dataset, which contains 500 harmful instructions covering 10 safety topics. We translate these instructions into a parallel corpus of 9 other languages, and then synthesize the multilingual queries into images following SafeBench's method. We adopt an LLM as judge to determine whether the LVLM's response to the image is harmful. We also provide a traditional input setting for MMJB, where only text-form instructions are fed to the model. Examples of MMJB samples can be found below (left: traditional setting; right: vision setting).

🧐 MSOCR (Multi-Size OCR Challenge)

This task aims to evaluate an LVLM's ability to recognize words and characters in various languages. We built the MSOCR dataset from scratch by randomly selecting a series of word entries (together with their parallel counterparts) from Wikipedia and plotting the words on a plain white canvas to form the vision input. Each image contains 20 lines of words in a specific language; the words, when combined, carry no actual meaning.

The font size decreases from 40 to 2 from the top line to the bottom line. The LVLM is required to recognize all the text in the image from top to bottom. We then identify the line at which the model first makes a recognition error, thereby estimating the smallest font size the model can reliably recognize.
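A minimal sketch of this layout and scoring logic, assuming a uniform step of 2 between consecutive font sizes (which yields exactly 20 lines from 40 down to 2) and an exact-match criterion per line; the released benchmark may use a different matching rule.

```python
# 20 lines of decreasing font size: 40, 38, ..., 4, 2 (assumed uniform step).
FONT_SIZES = list(range(40, 0, -2))

def first_error_line(ground_truth: list[str], prediction: list[str]) -> int:
    """Return the 0-based index of the first line the model gets wrong,
    or len(ground_truth) if every line is recognized correctly."""
    for i, gt in enumerate(ground_truth):
        pred = prediction[i].strip() if i < len(prediction) else ""
        if pred != gt.strip():   # exact-match criterion, an assumption for illustration
            return i
    return len(ground_truth)

def smallest_readable_font(ground_truth: list[str], prediction: list[str]):
    """Smallest font size the model read correctly before its first error."""
    i = first_error_line(ground_truth, prediction)
    return FONT_SIZES[i - 1] if i > 0 else None
```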

We constructed 10 sets of images, one per language, with each set containing 100 images. For each image, the text across its different language versions is semantically identical, which guarantees a fair comparison across linguistic contexts. In this way, we aim to provide a simple yet efficient method for assessing LVLMs' OCR performance across different languages. Examples of MSOCR samples can be found below (the MSOCR task only has a vision setting).

[Figure: Example samples of the MDUR, MIQA, MMJB, and MSOCR tasks]

PM4Bench Construction

Translation Pipeline

To ensure data quality, we adopt an LLM-and-human-expert-in-the-loop translation pipeline to build the parallel corpora for the MDUR, MIQA, and MMJB tasks. As shown in translation_process, the pipeline consists of 3 stages: LLM translation, manual correction, and selection.

[Figure: translation_process — the three-stage translation pipeline]

We first utilized GPT-4o-2024-08-06, which is not the model being evaluated in this paper, to translate the original English corpus into the target languages. Next, we provided both the original English corpus and the translated results to two native speaker annotators, who are also proficient in English. They worked independently and refined the machine-translated results based on their expertise. This process yielded 3 versions of the translations: the original machine translation and the two refined versions. Finally, we submitted the original English text along with the 3 translation versions to Claude-3.5-sonnet to select the optimal translation. As a result, for the MIQA task, 51% of the selected translations were refined by human experts. For the MDUR and MMJB tasks, this proportion exceeded 99%.
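Schematically, the pipeline can be expressed as the sketch below. `llm_translate`, `annotators`, and `llm_select_best` are hypothetical placeholders standing in for the GPT-4o translation call, the two native-speaker annotators, and the Claude-3.5-sonnet selection step, respectively; they are not the actual tooling used.

```python
def build_parallel_corpus(english_text: str, target_lang: str,
                          llm_translate, annotators, llm_select_best) -> str:
    """Three-stage translation pipeline sketched from the description above.

    llm_translate(text, lang)            -> machine translation (stage 1, GPT-4o in the paper)
    annotators                           -> two independent native-speaker refiners (stage 2)
    llm_select_best(source, candidates)  -> index of the best candidate (stage 3, Claude-3.5-sonnet)
    """
    machine = llm_translate(english_text, target_lang)
    refined = [annotator.refine(english_text, machine) for annotator in annotators]
    candidates = [machine] + refined          # 3 versions in total
    best = llm_select_best(english_text, candidates)
    return candidates[best]
```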

Construction of vision setting

When constructing vision setting samples, we maintained consistent layout and style across 10 language versions, with differences only in text content. This ensures that variations in cross-lingual evaluation results are primarily due to the model's language proficiency.

For the MDUR task, we integrate the question, options, and inserted images into a single webpage using an HTML template (adapted from MMMU-pro's open-source version) and save the screenshot. To increase complexity, we randomly varied text styles, such as font size, weight, style, underline, and shadow. For the MIQA task, we use a plain white canvas with a fixed width of 1280 pixels; text is wrapped, and inserted images are resized and plotted using the PIL library. For the MMJB task, before plotting, we wrap text lines at 15 characters for ko and zh and at 25 characters for the other languages. For the MSOCR task, we use a 1280×720-pixel plain white canvas, a commonly used screen resolution.
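As an illustration, here is a minimal PIL sketch of how a MIQA-style vision-setting image could be rendered onto a 1280-pixel-wide white canvas. The font file, font size, wrap width, and margins are illustrative assumptions, not the exact values used to build PM4Bench.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_miqa_vision_sample(question: str, images: list,
                              font_path: str = "NotoSans-Regular.ttf",  # assumed font file
                              canvas_width: int = 1280,
                              font_size: int = 28,                      # assumed font size
                              wrap_chars: int = 60):                    # assumed wrap width
    """Render question text and inserted images onto a plain white canvas,
    roughly following the MIQA vision-setting description above."""
    font = ImageFont.truetype(font_path, font_size)
    line_height = int(font_size * 1.4)
    lines = textwrap.wrap(question, width=wrap_chars)

    # Resize inserted images to fit the canvas width while keeping aspect ratio.
    resized = []
    for im in images:
        scale = min(1.0, (canvas_width - 40) / im.width)
        resized.append(im.resize((int(im.width * scale), int(im.height * scale))))

    height = 40 + len(lines) * line_height + sum(im.height + 20 for im in resized)
    canvas = Image.new("RGB", (canvas_width, height), "white")
    draw = ImageDraw.Draw(canvas)

    y = 20
    for line in lines:                       # wrapped question text
        draw.text((20, y), line, fill="black", font=font)
        y += line_height
    for im in resized:                       # inserted images below the text
        canvas.paste(im, (20, y))
        y += im.height + 20
    return canvas
```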

PM4Bench Evaluation

How do LVLMs perform on PM4Bench?

[Table: Overall results of the 11 evaluated LVLMs on PM4Bench]

For each task and each LVLM, we compute the average score Savg. and the coefficient of variation Scv. across the scores of the 10 languages. Scv. reflects the performance variability of an LVLM across languages and is calculated as Scv. = (σ / μ) × 100%, where σ is the standard deviation and μ is the mean of the scores across the 10 languages.
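For concreteness, a small sketch of this computation is shown below; whether the population or sample standard deviation is used in the paper is not specified, so the population form is assumed here.

```python
import statistics

def summarize_scores(lang_scores: dict) -> tuple:
    """Compute Savg. and Scv. (in %) over a model's per-language scores."""
    scores = list(lang_scores.values())
    mu = statistics.mean(scores)        # Savg.
    sigma = statistics.pstdev(scores)   # population standard deviation (assumed)
    s_cv = sigma / mu * 100             # Scv. = (sigma / mu) * 100%
    return mu, s_cv

# Usage with made-up numbers, e.g.:
# summarize_scores({"en": 62.1, "zh": 58.4, "ar": 41.0})
```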

As shown in the table, Gemini-2.0-flash-thinking-exp dominates both settings of the MDUR and MIQA tasks, and QVQ-72B ranks top in MMJB's traditional setting. As for MSOCR, the newly released Qwen2.5-VL-72B achieves SOTA. These results demonstrate the superior overall performance of recent reasoning models and validate the effectiveness of the reasoning architecture in multilingual and multi-modal scenarios. Further investigation shows that in certain scenarios, LVLMs demonstrate notable performance disparities across languages, as indicated by Scv. in the table, where a higher value reflects greater cross-lingual disparity.

🔍❓ Several Research Questions

🔍❓RQ1: How large is the performance gap between the traditional and vision settings?

[Figure: Safe_Use_final — change in each model's usefulness and safety from the traditional to the vision setting]

We further divide the scores into two dimensions: the average of MDUR and MIQA represents the usefulness of the model, while the performance on MMJB represents its safety. The figure Safe_Use_final above visualizes the change in each model's performance between the traditional and vision settings along these two dimensions.

It is clear that for most models, usefulness decreases under the vision setting, while safety increases. The decrease in usefulness may be due to the model's limited ability to perceive textual content in the vision-setting images, hindering its capacity to obtain useful information in the MDUR and MIQA tasks. At the same time, this same limitation conversely enhances safety by inhibiting the extraction of harmful information in the MMJB task. Our subsequent analysis of OCR capabilities further supports this hypothesis.

[Figure: Variance Comparison — Scv. under the traditional vs. vision settings]

We further examine the cross-language disparity between the traditional and vision settings, where a higher Scv. indicates greater cross-language disparity. The results, shown in Variance Comparison, reveal that for MDUR, MIQA, and MMJB, the percentage of models demonstrating greater cross-language variability in the vision setting than in the traditional setting is 82%, 100%, and 73%, respectively. This indicates that the vision setting not only compromises the overall performance of LVLMs but also intensifies cross-language imbalance.

🔍❓RQ2: Does model size matter?

[Figure: Impact of model size on MDUR performance (Savg. and Scv.)]

In recent years, scaling up model size has been widely acknowledged by both academia and industry as a crucial step toward achieving AGI. We summarize the impact of model size on MDUR performance in the figure above.

In terms of overall performance (characterized by Savg.), LVLM performance shows an increasing trend in both the traditional and vision settings as model size increases. However, the picture is less optimistic for reducing cross-language imbalance (represented by Scv.). Although the InternVL2.5-MPO, Qwen2.5-VL, and GPT-4o series all show some degree of improvement in the traditional setting as model size increases, in the vision setting the differences between languages do not noticeably improve, and even worsen for the InternVL2.5-MPO and Qwen2.5-VL series. Therefore, for the vision setting, we need to further explore the factors affecting cross-language differences to better guide efficient model optimization.

🔍❓RQ3: OCR really matters!

[Figure: MDUR_regression — vision-setting scores vs. OCR-setting scores]

The findings presented above collectively demonstrate that vision settings pose significant challenges for current LVLMs in multilingual contexts: (1) LVLMs exhibit marked underperformance in vision settings compared to traditional settings, (2) cross-lingual performance disparities are exacerbated in vision settings compared to traditional settings, and (3) crucially, these limitations persist despite model scaling efforts.

Therefore, it is reasonable to infer that the inferior performance in the vision setting may stem from LVLMs' inadequate implicit OCR capabilities for multilingual text, which cannot be remedied simply by using larger models.

To validate this hypothesis, we additionally designed OCR settings for the MDUR, MIQA, and MMJB tasks to evaluate how well a model recognizes the text content of the vision-setting images. We then compared and analyzed the relationship between each model's OCR-setting score and its vision-setting score on these 3 tasks. Preliminary visualization results are shown in MDUR_regression. Furthermore, we calculated the Pearson Correlation Coefficients (PCCs). The statistics reveal that for the MDUR, MIQA, and MMJB tasks, the proportion of models whose PCC has an absolute value exceeding 0.5 (indicating a strong correlation) is 90.91%, 72.73%, and 72.73% of all 11 models, respectively.
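A minimal sketch of how such a PCC can be computed for one model; pairing the OCR-setting and vision-setting scores per language is an assumption made here for illustration.

```python
import numpy as np

def pearson_r(ocr_scores: list, vision_scores: list) -> float:
    """PCC between a model's OCR-setting and vision-setting scores,
    paired per language (an assumption for illustration)."""
    x = np.asarray(ocr_scores, dtype=float)
    y = np.asarray(vision_scores, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# A model counts toward the "strong correlation" group when abs(pearson_r(...)) > 0.5,
# the threshold used in the analysis above.
```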

The above results demonstrate a high correlation between a model's task performance and its OCR accuracy, indicating that OCR capability is a key factor influencing performance in the vision setting. For MDUR and MIQA, better OCR results lead to higher VQA accuracy and quality. For the MMJB task, superior OCR performance enables the model to more accurately recognize and interpret harmful instructions, which in turn increases the risk of jail-breaking.

🔍❓RQ4: Do reasoning models have anything special?

In this section, we aim to analyze the characteristics of reasoning models in multilingual and multi-modal scenarios. Notably, Gemini-2.0-flash-thinking does not provide details of its reasoning process, so our case study is limited to QVQ-72B.

As shown in the overall score table, in the MDUR and MIQA tasks, Gemini-2.0-flash-thinking achieved the highest average scores in both the vision and traditional settings. QVQ-72B ranked second in both settings of MDUR, and second and third in the vision and traditional settings of MIQA, respectively. Both models also exhibited low Scv. values. This indicates that reasoning models excel in knowledge recall, knowledge reasoning, and multi-image comprehension, with relatively balanced multilingual capabilities. The case study of QVQ-72B revealed that its reasoning process involves a deep understanding of questions and logical deduction of answers, which likely contributes to its higher accuracy. Additionally, both models occasionally used English for reasoning in non-English tasks, which may partially mitigate cross-lingual performance disparities.

In the MMJB task, Gemini-2.0-flash-thinking did not perform well, while QVQ-72B outperformed all other models in the traditional setting for zh, achieving a safety rate of 98.2. The case study revealed that when QVQ-72B refused to answer, it often did so without providing a reasoning process. This suggests that the model's safety performance primarily depends on its alignment efforts, and the influence of the reasoning chain remains unclear.

As for the MSOCR task, Gemini-2.0-flash-thinking-exp ranked first, while QVQ-72B also performed well. However, our case study revealed that although QVQ-72B engaged in extensive reasoning before giving OCR results, its reasoning did not involve correcting the OCR results but rather consisted of reminders about its own task. Therefore, we believe the models' strong performance cannot simply be attributed to their reasoning capabilities.

In summary, we find that for tasks involving knowledge application, knowledge reasoning, and analyzing logical relationships within the input content, the reasoning process of reasoning models significantly enhances their performance. However, for OCR- or safety-related tasks, it remains uncertain whether the reasoning process directly contributes to task performance.

BibTeX

@misc{gao2025pm4benchparallelmultilingualmultimodal,
      title={PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model}, 
      author={Junyuan Gao and Jiahe Song and Jiang Wu and Runchuan Zhu and Guanlin Shen and Shasha Wang and Xingjian Wei and Haote Yang and Songyang Zhang and Weijia Li and Bin Wang and Dahua Lin and Lijun Wu and Conghui He},
      year={2025},
      eprint={2503.18484},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18484}, 
}