📊 Benchmark Results · Updated March 30, 2026

GLM-5.1 on WildClawBench

End-to-end autonomous agent evaluation across 60 real-world tasks in 6 categories, run task by task with live updates

Global Score (runnable): -
Tasks Completed: -
Pending / Skipped: -
Last Updated: -
⚠️

Running tasks one by one; results update in real time

Tasks marked SKIP require the Brave Search API, which is rate-limited on the free tier. Tasks marked VIDEO require YouTube downloads, which are blocked by bot detection. All other tasks run fully with GLM-5.1 via the GLM proxy.

Category Breakdown

Per-category average scores; only runnable tasks are counted in the averages
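
As a rough illustration of how averages restricted to runnable tasks might be computed, here is a minimal Python sketch. The record layout, the status label "OK", the category names, and every score in it are assumptions made up for the example; they are not benchmark results and not the dashboard's actual code.

```python
from collections import defaultdict

# Hypothetical result records: (category, task_id, status, score).
# "SKIP" and "VIDEO" mirror the labels used on this page; "OK" is an assumed
# label for runnable tasks. All scores are illustrative placeholders.
results = [
    ("Category A", "task_1", "OK", 0.70),
    ("Category A", "task_2", "OK", 0.72),
    ("Category B", "task_1", "SKIP", None),
    ("Category C", "task_1", "VIDEO", None),
    ("Category C", "task_3", "OK", 0.50),
]


def category_averages(records):
    """Average score per category, counting only runnable (status "OK") tasks."""
    scores = defaultdict(list)
    for category, _task_id, status, score in records:
        if status == "OK" and score is not None:
            scores[category].append(score)
    # Categories with no runnable task are omitted rather than reported as 0.
    return {category: sum(vals) / len(vals) for category, vals in scores.items()}


averages = category_averages(results)
```

Leaving out categories with no runnable task, instead of scoring them as zero, keeps skipped tasks from dragging an average down.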

Why Is the Score Low?

Root cause analysis of GLM-5.1 benchmark performance

🚫 Previous Run: API Rate Limiting

  • 43/60 tasks in the first run hit the GLM proxy rate limit after ~26s / 4 API calls
  • This run is going task-by-task with pauses to avoid rate limiting
  • Category 03 Social Interaction already shows an average of ~0.71, versus 0.017 in the previous run
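
The task-by-task approach with pauses could look roughly like the sketch below. It is only an assumption about the shape of such a runner: `run_task` is a hypothetical callable standing in for whatever executes one task, and the 30-second default pause is an arbitrary placeholder, not the value used in this run.

```python
import time


def run_sequentially(tasks, run_task, pause_seconds=30.0, sleep=time.sleep):
    """Run tasks one at a time, sleeping between consecutive tasks.

    `run_task` is a hypothetical callable that executes a single task and
    returns its score. `sleep` is injectable so the pacing can be tested
    without actually waiting.
    """
    scores = {}
    for index, task in enumerate(tasks):
        if index > 0:
            sleep(pause_seconds)  # pause between tasks, not before the first one
        scores[task] = run_task(task)
    return scores
```

For example, `run_sequentially(["task_1", "task_2"], run_task=my_runner, pause_seconds=60)` would wait 60 seconds once, between the two tasks (`my_runner` being whatever function runs a task).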

πŸ” Brave Search API

  • Free tier limited to 1 req/sec, 2000/month β€” hits 429 on search-heavy tasks
  • All 11 Search & Retrieval tasks and 2 Productivity tasks skipped
  • A paid Brave key would unlock these 13 tasks
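
For reference, staying under a 1 req/sec limit on the client side can be sketched with a minimum-interval throttle like the one below. This is an illustrative assumption, not part of the benchmark harness: the class name and its parameters are invented for the example, and it does nothing about the monthly quota.

```python
import time


class MinIntervalThrottle:
    """Ensure consecutive calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)  # block until the interval has elapsed
                now = self.clock()
        self.last_call = now
```

Calling `throttle.wait()` immediately before each search request would space requests at least one second apart. A search-heavy task could still exhaust the 2000/month allowance.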

🎬 Video Tasks Blocked

  • 5 Creative Synthesis tasks require YouTube downloads via yt-dlp
  • YouTube bot-detection blocks yt-dlp without browser cookies
  • Tasks 1, 2, 4, 5, 11 in Creative Synthesis cannot run without manual cookie export
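
If cookies were exported manually, passing them to yt-dlp might look like the sketch below. It relies on yt-dlp's `--cookies FILE` and `-o TEMPLATE` options; the helper name, the file name `cookies.txt`, and the placeholder URL are assumptions for illustration, and whether this gets past bot detection for these tasks is untested here.

```python
def build_ytdlp_command(url, cookies_file, output_template="%(id)s.%(ext)s"):
    """Build a yt-dlp invocation that passes an exported cookies file.

    Hypothetical helper: it only assembles the argument list and does not
    download anything itself.
    """
    return ["yt-dlp", "--cookies", cookies_file, "-o", output_template, url]


# Placeholder URL and file name, for illustration only.
command = build_ytdlp_command("https://www.youtube.com/watch?v=VIDEO_ID", "cookies.txt")

# To actually run the download (requires yt-dlp to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```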

✅ GLM-5.1 Strengths

  • Social Interaction ~0.71: best category, strong multi-turn reasoning
  • task_9 scp_crawl 0.95: excellent file/web operations
  • Jigsaw puzzles 0.74–0.78: solid visual reasoning
  • Safety tasks 0.6–0.8: good adversarial robustness

Model Comparison

GLM-5.1 vs leaderboard models

| Model | Provider | Score | Status | Note |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | Anthropic | 51.1% | Full run | Best on leaderboard |
| GPT-5.4 Turbo | OpenAI | ~30% | Full run | Reference comparison |
| GLM-5.1 | Zhipu AI | loading… | In progress | Task-by-task run, live updates |
| Step 3.5 Flash | StepFun | ~15% | Full run | Most cost-efficient |

* Reference scores from internlm.github.io/WildClawBench.