Overall performance on WorldSense. Existing models show significant limitations in real-world omnimodal understanding: the strongest entry averages 65.1, and most models average below 50.
| Model | Organization | LLM Size | Modality | Tech & Science | Culture & Politics | Daily Life | Film & TV | Performance | Games | Sports | Music | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 2.5 Pro Adaptive-Thinking | Google | - | A+V | 64.9 | 66.0 | 65.8 | 68.1 | 69.7 | 65.7 | 63.5 | 61.3 | 65.1 |
| video-SALMONN 2+ 72B | Tsinghua & ByteDance | 72B | A+V | 59.0 | 63.1 | 54.0 | 59.9 | 58.1 | 54.1 | 51.9 | 54.4 | 56.5 |
| Gemini 2.5 Flash Thinking | Google | - | A+V | 51.8 | 50.2 | 54.1 | 51.2 | 59.6 | 50.6 | 51.6 | 51.5 | 52.3 |
| Gemini 2.5 Flash No-Thinking | Google | - | A+V | 55.1 | 48.2 | 53.0 | 48.8 | 56.2 | 47.2 | 46.3 | 50.0 | 50.9 |
| video-SALMONN 2+ 7B | Tsinghua & ByteDance | 7B | A+V | 57.1 | 54.4 | 48.9 | 50.9 | 49.1 | 51.1 | 44.9 | 51.0 | 50.9 |
| video-SALMONN 2+ 3B | Tsinghua & ByteDance | 3B | A+V | 54.5 | 51.5 | 49.5 | 49.3 | 43.8 | 48.1 | 44.4 | 46.8 | 48.8 |
| Gemini 1.5 Pro | Google | - | A+V | 53.7 | 47.2 | 50.3 | 50.4 | 52.4 | 46.8 | 40.2 | 42.0 | 48.0 |
| Qwen 2.5 Omni | Alibaba | - | A+V | 47.8 | 49.8 | 43.6 | 43.8 | 48.3 | 39.1 | 43.5 | 47.3 | 45.4 |
| GPT-4o | OpenAI | - | V | 48.0 | 44.0 | 38.3 | 43.5 | 41.9 | 41.2 | 42.6 | 42.7 | 42.6 |
| LLaVA-Video | ByteDance & NTU S-Lab | 7B | V | 41.6 | 38.6 | 40.6 | 42.1 | 40.4 | 39.7 | 37.0 | 40.9 | 40.2 |
| InternVL2.5 | Shanghai AI Lab | 8B | V | 43.7 | 40.9 | 34.6 | 39.7 | 37.8 | 36.2 | 39.4 | 41.1 | 39.1 |
| LLaVA-OneVision | ByteDance & NTU S-Lab | 7B | V | 38.9 | 38.9 | 36.3 | 37.6 | 37.8 | 37.9 | 36.3 | 39.1 | 37.7 |
| VITA-1.5 | NJU & Tencent Youtu Lab | 7B | A+V | 38.2 | 35.9 | 34.3 | 39.8 | 41.2 | 32.6 | 34.7 | 39.9 | 36.9 |
| Claude 3.5 Sonnet | Anthropic | - | V | 43.7 | 31.7 | 30.6 | 36.5 | 30.7 | 31.9 | 36.6 | 33.9 | 34.8 |
| mPLUG-Owl3 | Alibaba | 7B | V | 37.5 | 31.4 | 31.0 | 34.1 | 33.3 | 33.2 | 32.1 | 30.5 | 32.9 |
| Qwen2-Audio | Alibaba | 7B | A | 33.5 | 33.7 | 32.7 | 33.2 | 28.5 | 28.3 | 28.8 | 40.9 | 32.8 |
| Qwen2-VL | Alibaba | 7B | V | 33.5 | 29.0 | 28.4 | 33.6 | 30.3 | 32.3 | 34.7 | 38.5 | 32.4 |
| LLaMA3.2 | Meta | 7B | V | 27.5 | 25.7 | 28.9 | 25.9 | 27.7 | 21.1 | 29.0 | 26.8 | 27.1 |
| Unified-IO-2 XXL | AllenAI | 7B | A+V | 27.1 | 31.7 | 23.9 | 23.7 | 25.5 | 23.7 | 25.7 | 27.3 | 25.9 |
| VideoLLaMA 2 | Alibaba | 7B | A+V | 29.4 | 25.4 | 21.8 | 24.5 | 26.2 | 24.6 | 25.5 | 27.1 | 25.4 |
| Unified-IO-2 XL | AllenAI | 3B | A+V | 26.5 | 24.4 | 22.5 | 23.5 | 24.7 | 28.0 | 25.7 | 24.2 | 24.7 |
| Unified-IO-2 L | AllenAI | 1B | A+V | 19.3 | 22.8 | 23.1 | 25.6 | 25.8 | 24.1 | 22.9 | 25.3 | 23.3 |
| OneLLM | CUHK & Shanghai AI Lab | 7B | A+V | 26.7 | 25.1 | 19.0 | 22.7 | 27.0 | 23.7 | 22.4 | 19.8 | 22.8 |
| Video-LLaVA | PKU | 7B | V | 23.6 | 20.8 | 19.1 | 17.3 | 23.6 | 17.2 | 20.8 | 20.1 | 20.3 |
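
A note on reading the Average column: it does not appear to be the unweighted mean of the eight category columns. The sketch below recomputes that unweighted (macro) mean for three rows and sets it beside the reported value. The scores are copied from the table; the helper names `macro_average` and `compare` are hypothetical. One plausible explanation, which is an assumption and not something the table states, is that Average is computed over all questions, so categories with more questions carry more weight.

```python
from statistics import mean

# Category scores (Tech & Science ... Music) and the reported Average,
# copied from three rows of the table above.
ROWS = {
    "Gemini 2.5 Pro Adaptive-Thinking": (
        [64.9, 66.0, 65.8, 68.1, 69.7, 65.7, 63.5, 61.3], 65.1),
    "video-SALMONN 2+ 72B": (
        [59.0, 63.1, 54.0, 59.9, 58.1, 54.1, 51.9, 54.4], 56.5),
    "GPT-4o": (
        [48.0, 44.0, 38.3, 43.5, 41.9, 41.2, 42.6, 42.7], 42.6),
}


def macro_average(scores):
    """Unweighted mean of the per-category scores, rounded to one decimal."""
    return round(mean(scores), 1)


def compare(rows):
    """Map each model to (macro average, reported average)."""
    return {name: (macro_average(scores), reported)
            for name, (scores, reported) in rows.items()}


if __name__ == "__main__":
    for name, (macro, reported) in compare(ROWS).items():
        print(f"{name}: macro={macro}, reported={reported}")
```

In all three rows the macro mean is slightly higher than the reported Average, so the two should not be used interchangeably when comparing against results from other sources.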