SuperGPQA: Graduate-Level Knowledge
AA-Omniscience Index: Knowledge Reliability and Hallucination
GDPval-AA: Real-World Valuable Task
QwenChineseBench: Chinese Real-World Knowledge
QwenClawBench: Real-World Agent
ToolcallFormatIFBench: Real-World Toolcall Following
QwenWebBench (Elo Rating): Artifacts
NL2Repo: Long-Horizon Coding
Terminal-Bench 2.0 (Terminus-2): Agentic Terminal Coding
SWE-bench Pro: Agentic Coding