WorldSense

Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong1, Shilin Yan1†, Jiayin Cai1, Xiaolong Jiang1, Yao Hu1, Weidi Xie2‡
†Project Leader  ‡Corresponding Author
1Xiaohongshu Inc. 2Shanghai Jiao Tong University

Introduction

We introduce WorldSense, the first benchmark to assess multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several distinctive features: (i) collaboration of omni-modality: the evaluation tasks are designed with a strong coupling of audio and video, requiring models to effectively exploit the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense comprises a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover a broad range of scenarios, together with 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure quality. Based on WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope WorldSense can serve as a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
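To make the multiple-choice QA format concrete, the sketch below shows one plausible way such an item could be represented and scored. The field names (video_id, domain, task, options, answer) are our own assumptions for illustration, not the released WorldSense schema.

```python
# Illustrative sketch only: the field names below are assumptions for
# exposition, not the released WorldSense data schema.
from dataclasses import dataclass
from typing import Dict


@dataclass
class WorldSenseItem:
    video_id: str             # audio-visual synchronised clip the question refers to
    domain: str               # one of the 8 primary domains (e.g. "Sports")
    task: str                 # one of the 26 task types
    question: str
    options: Dict[str, str]   # option key -> option text, e.g. {"A": "...", "B": "..."}
    answer: str               # ground-truth option key


def is_correct(item: WorldSenseItem, predicted_option: str) -> bool:
    """Multiple-choice scoring: exact match on the predicted option key."""
    return predicted_option.strip().upper() == item.answer.strip().upper()
```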

Leaderboard

Overall performance on WorldSense. Existing models demonstrate significant limitations in real-world omnimodal understanding.

| # | Model | Institution | LLM Size | Modality | Tech & Science | Culture & Politics | Daily Life | Film & TV | Performance | Games | Sports | Music | Average |
|---|-------|-------------|----------|----------|----------------|--------------------|------------|-----------|-------------|-------|--------|-------|---------|
| 1 | Gemini 1.5 Pro | Google | - | A+V | 53.7 | 47.2 | 50.3 | 50.4 | 52.4 | 46.8 | 40.2 | 42.0 | 48.0 |
| 2 | GPT-4o | OpenAI | - | V | 48.0 | 44.0 | 38.3 | 43.5 | 41.9 | 41.2 | 42.6 | 42.7 | 42.6 |
| 3 | LLaVA-Video | Bytedance & NTU S-Lab | 7B | V | 41.6 | 38.6 | 40.6 | 42.1 | 40.4 | 39.7 | 37.0 | 40.9 | 40.2 |
| 4 | InternVL2.5 | Shanghai AI Lab | 8B | V | 43.7 | 40.9 | 34.6 | 39.7 | 37.8 | 36.2 | 39.4 | 41.1 | 39.1 |
| 5 | LLaVA-OneVision | Bytedance & NTU S-Lab | 7B | V | 38.9 | 38.9 | 36.3 | 37.6 | 37.8 | 37.9 | 36.3 | 39.1 | 37.7 |
| 6 | Claude 3.5 Sonnet | Anthropic | - | V | 43.7 | 31.7 | 30.6 | 36.5 | 30.7 | 31.9 | 36.6 | 33.9 | 34.8 |
| 7 | mPLUG-Owl3 | Alibaba | 7B | V | 37.5 | 31.4 | 31.0 | 34.1 | 33.3 | 33.2 | 32.1 | 30.5 | 32.9 |
| 8 | Qwen2-VL | Alibaba | 7B | V | 33.5 | 29.0 | 28.4 | 33.6 | 30.3 | 32.3 | 34.7 | 38.5 | 32.4 |
| 9 | LLaMA3.2 | Meta | 7B | V | 27.5 | 25.7 | 28.9 | 25.9 | 27.7 | 21.1 | 29.0 | 26.8 | 27.1 |
| 10 | Unified-IO-2 XXL | AllenAI | 7B | A+V | 27.1 | 31.7 | 23.9 | 23.7 | 25.5 | 23.7 | 25.7 | 27.3 | 25.9 |
| 11 | VideoLLaMA 2 | Alibaba | 7B | A+V | 29.4 | 25.4 | 21.8 | 24.5 | 26.2 | 24.6 | 25.5 | 27.1 | 25.4 |
| 12 | Unified-IO-2 XL | AllenAI | 3B | A+V | 26.5 | 24.4 | 22.5 | 23.5 | 24.7 | 28.0 | 25.7 | 24.2 | 24.7 |
| 13 | Unified-IO-2 L | AllenAI | 1B | A+V | 19.3 | 22.8 | 23.1 | 25.6 | 25.8 | 24.1 | 22.9 | 25.3 | 23.3 |
| 14 | OneLLM | CUHK & Shanghai AI Lab | 7B | A+V | 26.7 | 25.1 | 19.0 | 22.7 | 27.0 | 23.7 | 22.4 | 19.8 | 22.8 |
| 15 | Video-LLaVA | PKU | 7B | V | 23.6 | 20.8 | 19.1 | 17.3 | 23.6 | 17.2 | 20.8 | 20.1 | 20.3 |

If you want to add your model to our leaderboard, please contact jaaackhong@gmail.com and tattoo.ysl@gmail.com.
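Each leaderboard row reports per-domain accuracy (in %) and an overall average. The snippet below is a minimal aggregation sketch, not the official evaluation code; it assumes the overall average is taken over all QA pairs (question-weighted) rather than as an unweighted mean of the domain columns.

```python
# Minimal aggregation sketch (not the official WorldSense evaluation code).
# Assumes the overall average is question-weighted over all QA pairs.
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize(results: Iterable[Tuple[str, bool]]) -> Tuple[Dict[str, float], float]:
    """results: (domain, is_correct) pairs, one per QA item."""
    per_domain = defaultdict(lambda: [0, 0])          # domain -> [correct, total]
    for domain, correct in results:
        per_domain[domain][0] += int(correct)
        per_domain[domain][1] += 1
    domain_acc = {d: 100.0 * c / t for d, (c, t) in per_domain.items()}
    total_correct = sum(c for c, _ in per_domain.values())
    total = sum(t for _, t in per_domain.values())
    return domain_acc, 100.0 * total_correct / total  # per-domain and overall accuracy
```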

WorldSense

Overview

Our WorldSense has the following three main features:

(1) Collaboration of Omni-Modality, (2) Diversity of Videos and Tasks, (3) High-Quality Annotations.

Distribution of WorldSense.

(a) Video category hierarchy. (b) Task distribution. (c) Acoustic signals distribution. (d) Video duration distribution.

Benchmark Curation

Data collection and QA annotation pipelines.

(a) Data collection and curation process. (b) QA annotation and quality control pipeline.

Benchmark Statistics

Statistical comparison with other benchmarks.

Experiment Results

Overall Performance


Overall performance on our WorldSense.

Fine-grained Results

In-depth Analysis


Impact of vision information.


Impact of audio information.


Impact of audio information for Video MLLMs.


Impact of video frames.

Citation


@article{hong2025worldsenseevaluatingrealworldomnimodal,
  title={WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs},
  author={Jack Hong and Shilin Yan and Jiayin Cai and Xiaolong Jiang and Yao Hu and Weidi Xie},
  year={2025},
  eprint={2502.04326},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.04326},
}