ME2: Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Abstract

With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding.

To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that, aside from recent large-scale open-source and closed-source models, most generalist open-source models, and even math-specialist models, struggle with the multimodal solution explanation task. This highlights a significant gap in current LLMs' ability to reason and explain with visual grounding in educational contexts.

ME2 Benchmark Overview

ME2 is a multimodal solution explanation benchmark consisting of 1,000 instances. Each instance contains a problem text (T_p), a problem image (I_p), an explanatory solution text (T_s), a solution image (I_s), and visual keypoints (VK) that highlight how the solution image differs from the original, along with a concise summary of the explanation.
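As a rough illustration of the instance layout described above, the following minimal sketch models one record as a Python dataclass. The class name, field names, and file paths are hypothetical and chosen only to mirror T_p, I_p, T_s, I_s, VK, and the summary; the released data may use a different schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ME2Instance:
    """Hypothetical container for one benchmark instance (names are assumptions)."""
    problem_text: str            # T_p: the problem statement
    problem_image: str           # I_p: path to the original problem figure
    solution_text: str           # T_s: explanatory solution text
    solution_image: str          # I_s: path to the annotated solution figure
    visual_keypoints: List[str]  # VK: elements added in I_s relative to I_p
    summary: str                 # concise summary of the explanation


# Illustrative values only; not taken from the dataset.
example = ME2Instance(
    problem_text="In triangle ABC, AB = AC. Find angle ADB.",
    problem_image="images/problem_0001.png",
    solution_text="Draw auxiliary line AD perpendicular to BC.",
    solution_image="images/solution_0001.png",
    visual_keypoints=["auxiliary line AD", "angle ADB"],
    summary="Drop the altitude from A to BC.",
)
```

Keeping the keypoints as a plain list of short descriptions makes it easy to compare them against model output in later steps.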

Task Overview

We propose two subtasks to robustly analyze multimodal solution explanation capacity: (1) Visual Keypoint Identification, which challenges machines to recognize visual keypoints useful for subsequent explanation, and (2) Keypoint-based Explanation Generation, which requires models to generate explanations that explicitly reference the identified visual keypoints.
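To make the second subtask concrete, here is a minimal sketch of a check for whether an explanation explicitly mentions each keypoint. It uses naive case-insensitive substring matching, which is an assumption made for illustration and not the benchmark's scoring method; the function name `referenced_keypoints` is hypothetical.

```python
from typing import Dict, List


def referenced_keypoints(explanation: str, keypoints: List[str]) -> Dict[str, bool]:
    """Map each keypoint to True if its text appears in the explanation.

    Matching is a case-insensitive substring test, so paraphrases
    (e.g. "segment AD" for "line AD") are not detected.
    """
    text = explanation.lower()
    return {kp: kp.lower() in text for kp in keypoints}


# Illustrative input only; not taken from the dataset.
coverage = referenced_keypoints(
    "Draw auxiliary line AD perpendicular to BC, so angle ADB is 90 degrees.",
    ["line AD", "angle ADB", "point E"],
)
```

A substring test is deliberately simple: it shows what "explicitly reference" means at the text level, while a real evaluation would need to handle paraphrases and notation variants.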

Dataset Statistics

Left: Distribution of mathematical topics covered in the ME2 benchmark across geometry and algebra domains. Right: Detailed statistics showing problem types, text lengths, and visual keypoint characteristics across the dataset.

Experimental Results

Left: Visual Keypoint Identification accuracy across different models. Right: Keypoint-based Explanation Generation performance showing correctness, fidelity, and referencing scores.

Key Findings

Visual Keypoint Identification: Most open-source models struggle to identify relevant visual keypoints, with accuracy ranging from 0.006 to 0.149 for top models.
Keypoint-based Explanation Generation: Even when provided with correct keypoints, models find it challenging to generate explanations that effectively reference visual elements.
Performance Gap: Closed-source models (GPT-4o, Gemini 2.0 Flash) significantly outperform open-source alternatives, highlighting the complexity of multimodal explanation tasks.
Educational Impact: The benchmark reveals critical limitations in current AI systems' ability to provide educationally effective visual explanations.

These results underscore the need for further research in multimodal reasoning and visual grounding for educational applications.

Qualitative Results

Left: Examples of multimodal solution explanations from the ME2 benchmark showing visual keypoints and corresponding explanatory text. Right: Additional examples demonstrating the diversity of visual elements and explanation styles in the ME2 dataset.

Failure Case Analysis

Analysis of common failure patterns in multimodal solution explanations. The figure shows different types of errors: (1) Identical elements with different descriptions, (2) Extra elements not present in the solution, and (3) Missing elements that are crucial for understanding the solution process.
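The error types listed above can be approximated with a simple set comparison between predicted and reference keypoints. The sketch below assumes keypoints are short text labels and uses exact matching after normalization; both the function names and the matching rule are illustrative assumptions, not the analysis procedure used for the figure.

```python
from typing import Dict, Iterable, List


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so 'Line  AD' and 'line ad' compare equal."""
    return " ".join(label.lower().split())


def compare_keypoints(predicted: Iterable[str], reference: Iterable[str]) -> Dict[str, List[str]]:
    """Split keypoints into matched, extra (predicted only), and missing (reference only)."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(r) for r in reference}
    return {
        "matched": sorted(pred & ref),
        "extra": sorted(pred - ref),
        "missing": sorted(ref - pred),
    }


# Illustrative labels only; not taken from the dataset.
report = compare_keypoints(
    predicted=["Line AD", "point E"],
    reference=["line  ad", "angle ABC"],
)
```

Under exact matching, the first error type (the same element described differently, such as "segment AD" versus "line AD") shows up as one extra and one missing entry at once, which is why that case needs semantic rather than string comparison.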

Visual Attention Analysis

Visualization of model attention patterns when processing mathematical diagrams. The heatmaps show how different layers (7th, 14th, 21st) of multimodal models focus on various regions of geometric figures, revealing insights into the visual reasoning process for both successful and failed cases.
