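Overall performance on WorldSense. Existing models show significant limitations in real-world omnimodal understanding: the best-performing model, Gemini 1.5 Pro, reaches an average of only 48.0. Modality: A+V = audio and video input; V = video input only.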
# | Model | LLM Size | Modality | Tech & Science | Culture & Politics | Daily Life | Film & TV | Performance | Games | Sports | Music | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Gemini 1.5 Pro | - | A+V | 53.7 | 47.2 | 50.3 | 50.4 | 52.4 | 46.8 | 40.2 | 42.0 | 48.0 |
2 | GPT-4o (OpenAI) | - | V | 48.0 | 44.0 | 38.3 | 43.5 | 41.9 | 41.2 | 42.6 | 42.7 | 42.6 |
3 | LLaVA-Video (Bytedance & NTU S-Lab) | 7B | V | 41.6 | 38.6 | 40.6 | 42.1 | 40.4 | 39.7 | 37.0 | 40.9 | 40.2 |
4 | InternVL2.5 (Shanghai AI Lab) | 8B | V | 43.7 | 40.9 | 34.6 | 39.7 | 37.8 | 36.2 | 39.4 | 41.1 | 39.1 |
5 | LLaVA-OneVision (Bytedance & NTU S-Lab) | 7B | V | 38.9 | 38.9 | 36.3 | 37.6 | 37.8 | 37.9 | 36.3 | 39.1 | 37.7 |
6 | Claude 3.5 Sonnet (Anthropic) | - | V | 43.7 | 31.7 | 30.6 | 36.5 | 30.7 | 31.9 | 36.6 | 33.9 | 34.8 |
7 | mPLUG-Owl3 (Alibaba) | 7B | V | 37.5 | 31.4 | 31.0 | 34.1 | 33.3 | 33.2 | 32.1 | 30.5 | 32.9 |
8 | Qwen2-VL (Alibaba) | 7B | V | 33.5 | 29.0 | 28.4 | 33.6 | 30.3 | 32.3 | 34.7 | 38.5 | 32.4 |
9 | LLaMA3.2 (Meta) | 7B | V | 27.5 | 25.7 | 28.9 | 25.9 | 27.7 | 21.1 | 29.0 | 26.8 | 27.1 |
10 | Unified-IO-2 XXL (AllenAI) | 7B | A+V | 27.1 | 31.7 | 23.9 | 23.7 | 25.5 | 23.7 | 25.7 | 27.3 | 25.9 |
11 | VideoLLaMA 2 (Alibaba) | 7B | A+V | 29.4 | 25.4 | 21.8 | 24.5 | 26.2 | 24.6 | 25.5 | 27.1 | 25.4 |
12 | Unified-IO-2 XL (AllenAI) | 3B | A+V | 26.5 | 24.4 | 22.5 | 23.5 | 24.7 | 28.0 | 25.7 | 24.2 | 24.7 |
13 | Unified-IO-2 L (AllenAI) | 1B | A+V | 19.3 | 22.8 | 23.1 | 25.6 | 25.8 | 24.1 | 22.9 | 25.3 | 23.3 |
14 | OneLLM (CUHK & Shanghai AI Lab) | 7B | A+V | 26.7 | 25.1 | 19.0 | 22.7 | 27.0 | 23.7 | 22.4 | 19.8 | 22.8 |
15 | Video-LLaVA (PKU) | 7B | V | 23.6 | 20.8 | 19.1 | 17.3 | 23.6 | 17.2 | 20.8 | 20.1 | 20.3 |