- End-to-end autonomous agent evaluation across 60 real-world tasks in 6 categories, run task by task with live updates
- Per-category average scores, counting only runnable tasks in the averages (see the sketch below the table)
- Root cause analysis of GLM-5.1's benchmark performance

GLM-5.1 vs leaderboard models:
| Model | Provider | Score | Status | Note |
|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 51.1% | Full run | Best on leaderboard |
| GPT-5.4 Turbo | OpenAI | ~30% | Full run | Reference comparison |
| GLM-5.1 | Zhipu AI | loading… | In progress | Task-by-task run, live updates |
| Step 3.5 Flash | StepFun | ~15% | Full run | Most cost-efficient |
* Reference scores from internlm.github.io/WildClawBench.
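For concreteness, here is a minimal Python sketch of the scoring loop described above: tasks are processed one by one, non-runnable tasks are skipped, and per-category averages are updated live after each task. The record layout, category names, and scores are illustrative assumptions, not the benchmark's actual schema or results.

```python
from collections import defaultdict

# Illustrative (category, runnable, score) records: made-up values,
# not the benchmark's real schema or results.
results = [
    ("browsing", True,  0.80),
    ("browsing", False, None),   # not runnable: excluded from the average
    ("coding",   True,  0.50),
    ("coding",   True,  0.70),
]

scores = defaultdict(list)
for category, runnable, score in results:   # task by task, as in a live run
    if not runnable:
        continue                            # only runnable tasks are counted
    scores[category].append(score)
    running = {c: sum(v) / len(v) for c, v in scores.items()}
    print(f"after a {category} task: {running}")  # live per-category averages
```

Skipping non-runnable tasks, rather than scoring them as zero, keeps each category's average limited to tasks the agent could actually attempt.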