🏆 AI Model Benchmark Comparison

Qwen 3.6 Max (preview)
Qwen 3.6 plus
Qwen 3.5 plus
Claude 4.5 Opus
GLM 5.1

SuperGPQA: Graduate-Level Knowledge
AA-Omniscience Index: Knowledge Reliability and Hallucination
GDPval-AA: Real-World Valuable Task
QwenChineseBench: Chinese Real-World Knowledge
QwenClawBench: Real-World Agent
SkillsBench: Agent Skills
ToolcallFormatIFBench: Real-World Toolcall Following
QwenWebBench (Elo Rating): Artifacts
SciCode: Research Coding
NL2Repo: Long-Horizon Coding
Terminal-Bench 2.0 (Terminus-2): Agentic Terminal Coding
SWE-bench Pro: Agentic Coding

📝 Notes