AI Benchmarks for Code

Logical Intelligence Tops Leading AI Verification Benchmarks as Verified Code Generation Nears Reality with Aleph

Aleph, an AI coding agent, sets new records on four major formal reasoning benchmarks, proving that automated code generation can be formally verified for mission-critical systems.

Gomboc AI Publishes First Open Benchmark for AI Code Remediation

15 cloud scenarios. 43 merge-ready fixes. 100% loop closure. 12 minutes and $17 to author once; seconds and zero-cost ...

SD Times

Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

Value stream management involves people in the organization examining workflows and other processes to ensure they are deriving the maximum value from their efforts while eliminating waste — of ...

Hosted on MSN

What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests? This is a useful question, but it is too narrow. Software development is iterative.

13d

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and ...

Anthropic says 80% of its new production code is now authored by Claude — how your enterprise can keep up

Achieving an 80 percent automated codebase requires more than purchasing API tokens or configuring agent loops; it demands a ...

Crypto Briefing

Claude now authors over 80% of code merged into its own codebase

Anthropic reveals Claude Code now writes over 80% of merged production code, up from low single digits in early 2025, reshaping AI development and engineer ...

Decrypt

Huawei's New Benchmark Gives AI Agents Months of Your Life—Then Watches Them Fail

Claw-Anything simulates a real digital existence and asks AI assistants to handle it. GPT-5.5, the best model available, scored 34.5%.

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

25d

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing ...

Forbes

AI Models Still Struggle With Reasoning — And Here’s Why

Forbes contributors publish independent expert analyses and insights. I write about the economics of AI. What looks like intelligence in AI models may just be memorization. A closer look at benchmarks ...
