WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Introduction

We introduce WorldSense, the first benchmark to assess multimodal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has three distinguishing features: (i) Collaboration of omni-modality: the evaluation tasks feature a strong coupling of audio and video, requiring models to effectively exploit the synergistic perception of omni-modality. (ii) Diversity of videos and tasks: WorldSense comprises 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, together with 3,172 multiple-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation. (iii) High-quality annotations: all QA pairs are manually labeled by 80 expert annotators, with multiple rounds of correction to ensure quality. Using WorldSense, we extensively evaluate various state-of-the-art models. The experimental results indicate that existing models face significant challenges in understanding real-world scenarios (48.0% best accuracy). We hope WorldSense can serve as a platform for evaluating the ability to construct and understand coherent contexts from omni-modality.
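Since the benchmark is scored as accuracy on multiple-choice QA pairs, a minimal sketch of such scoring may help make the metric concrete. This is not the official WorldSense evaluation code: the record fields (`domain`, `answer`, `response`), the assumption of four options labelled A to D, and the simple letter-matching heuristic are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict
from typing import Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter (A-D) in a model response.

    Assumption: options are labelled A-D. This heuristic is deliberately
    naive; a response beginning with the article "A" would be misread.
    """
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def accuracy_by_group(records, key):
    """Compute accuracy (%) per value of `key`, e.g. per domain or per task."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record[key]] += 1
        if extract_choice(record["response"]) == record["answer"]:
            correct[record[key]] += 1
    return {group: 100.0 * correct[group] / total[group] for group in total}


# Hypothetical records, invented purely to exercise the functions.
records = [
    {"domain": "Music", "answer": "B", "response": "B. The violin enters first."},
    {"domain": "Music", "answer": "C", "response": "The answer is (A)."},
    {"domain": "Sports", "answer": "D", "response": "D"},
]

per_domain = accuracy_by_group(records, "domain")
print(per_domain)
```

Grouping by a different key (for instance a hypothetical `task` field) would give per-task accuracy in the same way.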

Leaderboard

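Overall performance on WorldSense. Existing models demonstrate significant limitations in real-world omnimodal understanding. All scores are accuracy (%); in the Modality column, A denotes audio input and V denotes visual (video) input.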

Model	Organization	LLM Size	Modality	Tech & Science	Culture & Politics	Daily Life	Film & TV	Performance	Games	Sports	Music	Average
Gemini 2.5 Pro Adaptive-Thinking	Google	-	A+V	64.9	66.0	65.8	68.1	69.7	65.7	63.5	61.3	65.1
video-SALMONN 2+ 72B	Tsinghua & ByteDance	72B	A+V	59.0	63.1	54.0	59.9	58.1	54.1	51.9	54.4	56.5
Gemini 2.5 Flash Thinking	Google	-	A+V	51.8	50.2	54.1	51.2	59.6	50.6	51.6	51.5	52.3
Gemini 2.5 Flash No-Thinking	Google	-	A+V	55.1	48.2	53.0	48.8	56.2	47.2	46.3	50.0	50.9
video-SALMONN 2+ 7B	Tsinghua & ByteDance	7B	A+V	57.1	54.4	48.9	50.9	49.1	51.1	44.9	51.0	50.9
video-SALMONN 2+ 3B	Tsinghua & ByteDance	3B	A+V	54.5	51.5	49.5	49.3	43.8	48.1	44.4	46.8	48.8
Gemini 1.5 Pro	Google	-	A+V	53.7	47.2	50.3	50.4	52.4	46.8	40.2	42.0	48.0
Qwen 2.5 Omni	Alibaba	-	A+V	47.8	49.8	43.6	43.8	48.3	39.1	43.5	47.3	45.4
GPT-4o	OpenAI	-	V	48.0	44.0	38.3	43.5	41.9	41.2	42.6	42.7	42.6
LLaVA-Video	ByteDance & NTU S-Lab	7B	V	41.6	38.6	40.6	42.1	40.4	39.7	37.0	40.9	40.2
InternVL2.5	Shanghai AI Lab	8B	V	43.7	40.9	34.6	39.7	37.8	36.2	39.4	41.1	39.1
LLaVA-OneVision	ByteDance & NTU S-Lab	7B	V	38.9	38.9	36.3	37.6	37.8	37.9	36.3	39.1	37.7
VITA-1.5	NJU & Tencent Youtu Lab	7B	A+V	38.2	35.9	34.3	39.8	41.2	32.6	34.7	39.9	36.9
Claude 3.5 Sonnet	Anthropic	-	V	43.7	31.7	30.6	36.5	30.7	31.9	36.6	33.9	34.8
mPLUG-Owl3	Alibaba	7B	V	37.5	31.4	31.0	34.1	33.3	33.2	32.1	30.5	32.9
Qwen2-Audio	Alibaba	7B	A	33.5	33.7	32.7	33.2	28.5	28.3	28.8	40.9	32.8
Qwen2-VL	Alibaba	7B	V	33.5	29.0	28.4	33.6	30.3	32.3	34.7	38.5	32.4
LLaMA3.2	Meta	7B	V	27.5	25.7	28.9	25.9	27.7	21.1	29.0	26.8	27.1
Unified-IO-2 XXL	AllenAI	7B	A+V	27.1	31.7	23.9	23.7	25.5	23.7	25.7	27.3	25.9
VideoLLaMA 2	Alibaba	7B	A+V	29.4	25.4	21.8	24.5	26.2	24.6	25.5	27.1	25.4
Unified-IO-2 XL	AllenAI	3B	A+V	26.5	24.4	22.5	23.5	24.7	28.0	25.7	24.2	24.7
Unified-IO-2 L	AllenAI	1B	A+V	19.3	22.8	23.1	25.6	25.8	24.1	22.9	25.3	23.3
OneLLM	CUHK & Shanghai AI Lab	7B	A+V	26.7	25.1	19.0	22.7	27.0	23.7	22.4	19.8	22.8
Video-LLaVA	PKU	7B	V	23.6	20.8	19.1	17.3	23.6	17.2	20.8	20.1	20.3
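
When reading the Average column, note that it does not appear to be the plain mean of the eight domain columns: for the top row the plain mean is about 65.6, while the reported average is 65.1. One explanation consistent with this, though it is an assumption and not something stated on this page, is that the average is computed over all questions, so that domains with more questions carry more weight. The sketch below contrasts the two conventions. The function names are hypothetical, and the question counts in the second example are toy numbers, since per-domain counts are not listed here.

```python
def macro_average(domain_scores):
    """Unweighted mean of per-domain accuracies."""
    return sum(domain_scores.values()) / len(domain_scores)


def micro_average(domain_scores, question_counts):
    """Mean of per-domain accuracies weighted by questions per domain.

    This equals overall accuracy computed over all questions pooled together.
    """
    total_questions = sum(question_counts[d] for d in domain_scores)
    weighted_sum = sum(domain_scores[d] * question_counts[d] for d in domain_scores)
    return weighted_sum / total_questions


# Domain scores copied from the first leaderboard row.
gemini_25_pro = {
    "Tech & Science": 64.9,
    "Culture & Politics": 66.0,
    "Daily Life": 65.8,
    "Film & TV": 68.1,
    "Performance": 69.7,
    "Games": 65.7,
    "Sports": 63.5,
    "Music": 61.3,
}
reported_average = 65.1  # Average column, same row

gap = macro_average(gemini_25_pro) - reported_average
print(f"plain mean minus reported average: {gap:.3f}")

# Toy illustration (invented numbers): unequal counts shift the average.
toy_scores = {"a": 50.0, "b": 100.0}
toy_counts = {"a": 3, "b": 1}
print(macro_average(toy_scores), micro_average(toy_scores, toy_counts))
```

In the toy case the unweighted mean is 75.0 while the question-weighted mean is 62.5, which shows how a domain with more questions pulls the overall figure toward its own score.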

If you want to add your model to our leaderboard, please contact jaaackhong@gmail.com and tattoo.ysl@gmail.com.

Overview

WorldSense has three main features:

(1) Collaboration of Omni-Modality, (2) Diversity of Videos and Tasks, (3) High-Quality Annotations.

Distribution of WorldSense.

(a) Video category hierarchy. (b) Task distribution. (c) Acoustic signals distribution. (d) Video duration distribution.

Benchmark Curation

Data collection and QA annotation pipelines.

(a) Data collection and curation process. (b) QA annotation and quality control pipeline.

Benchmark Statistics

Statistical comparison with other benchmarks.

Overall Performance

Overall performance on our WorldSense.

Fine-grained Results

Fine-grained results on task category.

Fine-grained results on audio type.

In-depth Analysis

Impact of vision information.

Impact of audio information.

Impact of audio information for Video MLLMs.

Impact of video frames.


Citation

@article{hong2025worldsenseevaluatingrealworldomnimodal,
  title={WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs},
  author={Jack Hong and Shilin Yan and Jiayin Cai and Xiaolong Jiang and Yao Hu and Weidi Xie},
  year={2025},
  eprint={2502.04326},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.04326},
}
