- End-to-end autonomous agent evaluation across 60 real-world tasks in 6 categories, run task by task with live updates
- Per-category average scores, counting only runnable tasks in the averages (see the sketch below the table)
- Root cause analysis of GLM-5.1's benchmark performance

GLM-5.1 vs leaderboard models:
| Model | Provider | Score | Status | Note |
|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 51.1% | Full run | Best on leaderboard |
| GPT-5.4 Turbo | OpenAI | ~30% | Full run | Reference comparison |
| GLM-5.1 | Zhipu AI | loading… | In progress | Task-by-task run, live updates |
| Step 3.5 Flash | StepFun | ~15% | Full run | Most cost-efficient |
* Reference scores from internlm.github.io/WildClawBench.
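For concreteness, here is a minimal Python sketch of the scoring loop described above: tasks are processed one by one, non-runnable tasks are skipped, and per-category averages are updated live after each task. The record layout, category names, and scores are illustrative assumptions, not the benchmark's actual schema or results.

```python
from collections import defaultdict

# Illustrative (category, runnable, score) records: made-up values,
# not the benchmark's real schema or results.
results = [
    ("browsing", True,  0.80),
    ("browsing", False, None),   # not runnable: excluded from the average
    ("coding",   True,  0.50),
    ("coding",   True,  0.70),
]

scores = defaultdict(list)
for category, runnable, score in results:   # task by task, as in a live run
    if not runnable:
        continue                            # only runnable tasks are counted
    scores[category].append(score)
    running = {c: sum(v) / len(v) for c, v in scores.items()}
    print(f"after a {category} task: {running}")  # live per-category averages
```

Skipping non-runnable tasks, rather than scoring them as zero, keeps each category's average limited to tasks the agent could actually attempt.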