AI · Benchmark · Integration

84 Tools, 60 Papers, One Question: Is AI Auditing Ready?

Dan Ogurtsov · Mar 2026 · 45 min read

This is a long read. Feel free to navigate using your AI setup — ask it to summarize a specific chapter, find a particular tool, or jump to conclusions.

Why We Wrote This

Our team set out to answer a question that kept coming up in conversations with auditors, protocol teams, and investors: can AI actually audit smart contracts, or is it all hype?

We couldn't find a single source that covered the full picture — tools, benchmarks, funding, real results, independence of evaluations — all in one place. Plenty of vendor blogs claim breakthroughs. Plenty of academic papers report 90%+ accuracy. But when you dig into independent evaluations, the numbers tell a very different story.

So we decided to do it ourselves. We collected everything publicly available: academic papers, GitHub repositories, benchmark leaderboards, audit firm disclosures, independent evaluations, competition results, and funding data. Nine primary research files. 4,716 lines of structured facts. 84 distinct tools. 16 benchmarks. 60+ academic papers.

What follows is the result — a comprehensive longread covering the AI smart contract auditing landscape as of early 2026. We tried to be fair, cite everything, and note where the same team built both the benchmark and the top-scoring tool. The picture that emerged is more nuanced than either the hype or the skepticism suggests.

The short version: AI has gone from academic toy (2019) to winning a $500K audit competition against 1,600+ human researchers (October 2025) in six years. But the best tools still catch only 30-40% of bugs in independent tests, an estimated ~80% of real vulnerabilities are business logic issues that no AI can auto-detect, and every major benchmark was created by an entity whose own tool ranks first on it. The field is real. The limitations are also real.

What's Inside

| # | Section | What you'll learn |
|---|---------|-------------------|
| 1 | The $150K Problem | Why AI auditing exists — the cost gap, the supply shortage, and the first signals that automation works |
| 2 | Timeline: 2019-2026 | Seven years from academic paper to production win, era by era |
| 3 | How They Work | Eight architecture patterns across 84 tools — from rule-based to multi-agent |
| 4 | The Benchmark Landscape | 16 benchmarks, four metric families, and why no two numbers are comparable |
| 5 | The Numbers | Tool-by-tool performance data across every available benchmark |
| 6 | Who Benchmarks Whom | Every major benchmark creator's tool ranks first on their own benchmark |
| 7 | What Actually Works | Independent evaluations, production deployments, and the 30-40% reality |
| 8 | The Funding Paradox | Why a $0 solo project outperforms a $1M-funded startup |
| 9 | Exponential Growth | From 2% to 72% exploit rate in one year — and what it means |
| 10 | Conclusions | Five key findings and practical recommendations |

Key Numbers at a Glance

| Metric | Value |
|--------|-------|
| AI audit tools cataloged | 84 |
| Best independent recall | 30-40% |
| Best independent precision | 4.1-17.9% |
| Top exploit rate (EVMBench) | 72.2% (GPT-5.3-Codex) |
| Top exploit rate (SCONE-bench) | 51.11% collective; 65% Claude Opus 4.5 post-cutoff |
| First AI competition win | $500K Monad audit, Oct 2025 |
| Bugs AI can't auto-detect | ~80% (business logic, industry estimate) |
| AI audit cost | $0.01 - $13 |

Chapter 1: The $150K Problem

A professional smart contract audit for complex protocols can cost $150,000+ and take weeks to months (source). The supply of qualified auditors is finite; the demand is not. Meanwhile, DeFiHackLabs has reproduced 680+ hacking incidents dating back to 2017 across Ethereum, BSC, and Base (DeFiHackLabs) – each one a contract that was either unaudited or audited and still exploited.

This is not a theoretical problem. When Nethermind ran AuditAgent retroactively against ResupplyFi's contracts after the protocol lost $9.8M in a June 2025 exploit, the tool flagged the exact exchange rate logic flaw that caused the hack (Nethermind blog). A caveat: retroactive testing (running a tool after a vulnerability is known) is easier than prospective detection (scanning a full codebase without knowing where the bug is). Still, the result demonstrates the vulnerability was within the tool's detection capability.

When Anthropic's red team built SCONE-bench – a benchmark of 405 real-world exploited contracts – and pointed 10 frontier AI models at them, those models produced working exploits for 207 contracts, draining $550M in simulated stolen funds. On contracts exploited after the models' knowledge cutoff, AI agents still found $4.6M in exploitable value across 19 post-cutoff contracts (55.8%) (Source: SCONE-bench).

The economics are stark. Here is what an AI audit costs versus a human one:

| Approach | Cost | Source |
|----------|------|--------|
| GPTScan (hybrid GPT + static analysis) | ~$0.01 per 1K lines of code | GPTScan paper |
| Nethermind AuditAgent | $0.02-$0.10 per billable line of code | Nethermind pricing |
| SCONE-bench agent run (average) | $1.22 per run | SCONE-bench |
| Veritas Protocol (full audit) | $13.08 per audit | Veritas Protocol |
| Bunzz Audit | 90% cheaper than human audit | Bunzz |
| Traditional manual audit | $150,000+ | Veritas Protocol |

The gap between $0.01 and $150,000 is not a rounding error – though these figures measure different things (per-1K-lines cost vs. per-engagement cost). Even the most expensive AI audit tool in the table – Veritas at $13.08 per full audit – is ~11,500x cheaper than hiring a human team.
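As a sanity check on that ratio, the arithmetic is trivial. A toy calculation using the figures from the table above — these prices measure different things (per-engagement vs. per-audit), so this is a rough illustration, not a like-for-like comparison:

```python
# Sanity-checking the cost ratio quoted above, using table figures.
veritas_full_audit = 13.08   # USD, Veritas Protocol full audit
manual_audit = 150_000       # USD, traditional engagement (lower bound)

ratio = manual_audit / veritas_full_audit
print(f"~{ratio:,.0f}x")     # ~11,468x, i.e. the '~11,500x' figure above
```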

This price asymmetry has attracted serious capital and research attention. The field now includes 84 distinct AI audit tools cataloged across production systems, research prototypes, MCP servers, and Claude Code skills. Total identified funding exceeds $38M across tools like Octane Security ($6.75M seed co-led by Archetype and Winklevoss Capital, with Gemini and Circle), AgentLISA ($12M via token launch), Sherlock ($5.5M), Olympix ($4.3M), and Cantina ($7.83M). Academic interest is equally deep: 60+ academic papers have been published on machine learning and deep learning for smart contract security, spanning venues from ICSE and NDSS to IEEE S&P (survey).

But cost reduction without accuracy has limited practical value. The central tension of this field – the reason this research exists – is the gap between what AI can detect and what still requires a human. As of August 2025, no single AI tool could find more than one-third of high-severity issues in a controlled benchmark (Source: Viggiano benchmark). By October 2025, an AI tool won a $500K audit competition against 1,600+ researchers (Source: Octane/Code4rena).

The field was changing fast. To understand where we are, we need to understand how we got here.

Chapter 2: Timeline – From Academic Paper to Audit Win (2019-2026)

The idea of using machine learning to find smart contract vulnerabilities is not new. It is seven years old. What has changed is whether the idea works.

Era 1: Academic Foundations (2019-2023)

The first generation of tools was purely academic. In 2019, SmartEmbed introduced structural code embeddings for clone detection and bug detection in smart contracts, published at ICSME. A year later, GNNSCVulDetector brought graph neural networks to vulnerability detection at IJCAI 2020, modeling contracts as graph structures to catch reentrancy, timestamp dependence, and infinite loop vulnerabilities.

These academic tools produced impressive numbers in controlled settings. Peculiar, using GraphCodeBERT, achieved 91.80% precision and 92.40% recall for reentrancy detection across 40,932 contracts (ISSRE 2021). ContractWard reported Micro-F1 and Macro-F1 scores above 96% on 49,502 contracts using XGBoost (IEEE TNSE 2021). On paper, the problem looked nearly solved.
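For readers less familiar with these metrics: the F1 scores quoted throughout this chapter are the harmonic mean of precision (P) and recall (R), F1 = 2PR / (P + R). A one-line check against Peculiar's reported numbers:

```python
# F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R).
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Peculiar's reported 91.80% precision / 92.40% recall:
print(f"{f1(0.9180, 0.9240):.4f}")  # ~0.9210
```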

Then GPT-4 arrived, and the community got excited – and then disappointed. In 2023, Zellic (one of the most respected audit firms, having reviewed Solana and LayerZero) published a blog post titled "Can GPT Audit Smart Contracts?" and concluded: no. GPT-4 failed all trials on a known bug. A separate academic study, "Do You Still Need a Manual Audit?", found that GPT-4 and Claude correctly identified vulnerability types in only 40% of 52 compromised DeFi contracts, with a high false positive rate.

Era 2: Hybrid Tools and Competition (2023-2024)

The breakthrough was not making LLMs smarter. It was combining them with existing tools. GPTScan (ICSE 2024) married GPT with static analysis and achieved >90% precision on token contracts, finding 9 bugs that human auditors had missed, at a cost of $0.01 per 1,000 lines of code. GPTLens (IEEE TPS 2023) introduced an adversarial two-stage framework – one LLM attacks, another validates – doubling detection success from 38.5% to 76.9%.

Meanwhile, the competition arena was producing its own signal. LightChaser – notably, a traditional pattern-matching system, not an AI/LLM tool – dominated Code4rena's Bot Races throughout 2024, competing in 60+ races with 1,000+ detection patterns and consistently placing at the top. This raised an important question: was AI actually better than well-crafted rules?

Era 3: Production Reality (2025-2026)

2025 is when the field graduated from benchmarks to production – with all the messiness that entails.

August 2025: Antonio Viggiano (Size Protocol) ran the first rigorous independent benchmark. Result: no single AI tool found more than 1/3 of high-severity issues. Humans found 8/12 High/Med issues; the best AI earned $1.7K in simulated payouts versus $5.4K for humans. But there was a silver lining: different AI tools collectively found the 4 issues that humans missed (Viggiano benchmark).

Mid 2025: Nethermind reported 30% average recall across 29 real completed audits (mean 11.6 contracts, 725 LOC per project), detecting valid issues in 62% of projects (Nethermind blog).

September-October 2025: The inflection point. Octane Security won the $500K Monad Audit Competition on Code4rena, placing #1 among 1,600+ researchers, catching 3 of 4 high-severity findings in a novel Rust/C++ codebase. This was the first time an AI tool had won a major audit competition outright.

December 2025: Anthropic's red team published SCONE-bench. On 405 real-world exploited contracts, 10 frontier models collectively produced working exploits for 207 contracts (51.11%), draining $550M in simulated stolen funds. Claude Opus 4.5 achieved the highest individual exploit rate on the post-cutoff subset: 65% (13/20 contracts). Two novel zero-day vulnerabilities were discovered across 2,849 unknown contracts (SCONE-bench).

February 2026: OpenAI and Paradigm released EVMBench, and GPT-5.3-Codex exploited 72.2% of bugs – up from under 20% when the project began. Claude Opus 4.6 scored the highest detection rate at 45.6% (EVMBench).

The Full Timeline

| Period | Milestone | Era |
|--------|-----------|-----|
| 2019 | SmartEmbed (ICSME) – first code embedding for SC bug detection | Academic |
| 2020 | GNNSCVulDetector (IJCAI) – first GNN for SC vulnerability detection | Academic |
| 2020 | SolidiFI benchmark released (ISSTA) – first systematic tool evaluation | Academic |
| 2021 | Peculiar achieves 91.8% precision for reentrancy (ISSRE) | Academic |
| 2023 | GPTLens – first adversarial LLM framework (IEEE TPS) | Hybrid |
| 2023 | Zellic: "ChatGPT cannot audit smart contracts" – GPT-4 fails all trials | Reality check |
| 2023 | "Do You Still Need Manual Audit?" – GPT-4/Claude correct 40% of the time | Reality check |
| 2024 | GPTScan (ICSE) – first GPT + static analysis hybrid, >90% precision | Hybrid |
| 2024 | LightChaser dominates Code4rena Bot Races (60+ races) | Hybrid |
| Early 2025 | PropertyGPT wins NDSS Distinguished Paper, finds 12 zero-days | Production |
| Aug 2025 | Viggiano benchmark: no AI finds >1/3 high-severity issues | Production |
| Mid 2025 | Nethermind reports 30% recall on 29 real audits | Production |
| Sep 2025 | Savant Chat achieves top 6 in Sherlock's Symbiotic contest | Production |
| Sep-Oct 2025 | Octane wins $500K Monad audit – first AI to win major competition | Production |
| Dec 2025 | SCONE-bench: 10 models exploit 51.11% of 405 contracts | Production |
| Feb 2026 | EVMBench: GPT-5.3-Codex exploits 72.2% of bugs | Production |

The trajectory from SmartEmbed's code embeddings in 2019 to GPT-5.3-Codex exploiting 72.2% of benchmark bugs in 2026 spans seven years and three distinct eras. The academic era proved the concept. The hybrid era showed that combining AI with traditional tools was the key. The production era revealed that AI can now compete with – and sometimes beat – human auditors in structured competitions.

With AI audit tools clearly evolving, the question becomes: how do these systems actually work under the hood?

Chapter 3: How They Work — Architecture Patterns

LightChaser, an anonymous bot with 1,000+ detection patterns, competed in 60+ Code4rena Bot Races in 2024 — consistently placing at the top — without a single neural network parameter. Meanwhile, PropertyGPT — a system combining retrieval-augmented generation with formal verification — discovered 12 zero-day vulnerabilities and earned $8,256 in bug bounties (NDSS 2025 Distinguished Paper). These two tools share almost nothing in their design, yet both outperform raw GPT-4 at finding smart contract bugs. The architecture matters more than the model.

Eight distinct architecture patterns have emerged across the 84 tools cataloged in this research. Understanding them is the difference between choosing a tool that finds reentrancy bugs in milliseconds and one that discovers novel logic flaws nobody has seen before.

1. Rule-Based Pattern Matching. The simplest and fastest approach. LightChaser deploys 1,000+ detection patterns and 100+ gas optimization checks, runs entirely locally with no API calls, and delivers reports within 24 hours. Slither, the foundational static analyzer by Trail of Bits, has accumulated 6,149 GitHub stars as the most widely used smart contract security tool.

Strength: Speed and low false positives. Weakness: Cannot find novel bugs — only patterns someone already wrote a rule for. Industry practitioners estimate ~80% of actual bugs are business logic issues not auto-detectable by current AI tools.
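To make the rule-based pattern concrete, here is a deliberately minimal sketch in the spirit of these tools. The rules and the Solidity snippet are illustrative inventions, not LightChaser's or Slither's implementation; real tools match on ASTs and intermediate representations rather than raw text, which is exactly why this style cannot find novel logic bugs:

```python
import re

# Toy rule-based detector: each rule is a regex plus a finding title.
RULES = [
    ("tx.origin used for authorization", re.compile(r"\btx\.origin\b")),
    ("low-level call (check return value / reentrancy)", re.compile(r"\.call\{")),
    ("block timestamp dependence", re.compile(r"\bblock\.timestamp\b")),
]

def scan(source: str) -> list[tuple[int, str]]:
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for title, pattern in RULES:
            if pattern.search(line):
                findings.append((lineno, title))
    return findings

contract = """\
function withdraw() external {
    require(tx.origin == owner);
    (bool ok, ) = msg.sender.call{value: balance}("");
}"""
for lineno, title in scan(contract):
    print(lineno, title)
```

Fast, reproducible, and zero API cost — but a bug no rule describes is invisible to it.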

2. Static Analysis + ML Ensemble. Wake Arena combines 108 specialized detectors (87 from a private library battle-tested on billion-dollar audits) with multi-agent AI and graph-driven reasoning over data dependency graphs. The source data categorizes Wake Arena under "graph-driven reasoning" architectures; it is listed here because its detector ensemble also functions as a static analysis layer. Aderyn, a Rust-based analyzer with 733 GitHub stars, traverses ASTs with sub-second analysis times and integrates as an MCP server allowing AI models to use it as an external tool.

According to Ackee's own reporting: Wake Arena achieved 43/94 high-severity findings (45.7%) on its historical benchmark — the highest among all tested tools — and contributed 33% of all findings in real audits of Lido, Printr, and Everstake (October-December 2025). Note: these are self-reported results from the tool's developer; no independent verification was available (Wake Arena blog).

3. Fine-Tuned LLM. audit_gpt by FuzzLand fine-tunes GPT-3/4 on vulnerability data sourced from Solodit for approximately $16 in total fine-tuning cost. FTSmartAudit takes this further with multi-stage knowledge distillation: classical distillation from large teacher models, external domain knowledge from audit reports, and reward-guided learning. Its dataset includes 6,454 contracts from 72 Code4rena projects with 784 H/M findings across 120 distinct vulnerability labels.

Key finding: Distilled lightweight models outperform both commercial tools and larger models on complex vulnerabilities. The tradeoff is that fine-tuned models overfit to known patterns and lose generality on novel bug classes (FTSmartAudit).

4. LLM + Static Analysis Hybrid. GPTScan, published at ICSE 2024, was the first tool to combine GPT with static analysis for logic vulnerability detection. It breaks vulnerability types into scenarios and properties, uses GPT to match candidate vulnerable functions, then validates findings with static confirmation. Result: >90% precision on token contracts, >70% recall on ground-truth logic vulnerabilities, and 9 new bugs missed by human auditors — at a cost of ~$0.01 per 1K lines.

Nethermind AuditAgent follows a similar hybrid approach, combining ML models, symbolic execution, and a continuously-updated exploit knowledge base. On 29 completed real audits, it achieved 30% average recall and detected valid issues in 62% of projects. When tested retroactively against the ResupplyFi contract after the $9.8M hack (June 2025), it flagged the exact exchange rate logic flaw — demonstrating the vulnerability was within the tool's detection capability (Nethermind blog).
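A minimal sketch of the hybrid control flow described above — prefilter statically, match a vulnerability "scenario" with an LLM, then confirm statically before reporting. `llm_matches_scenario` is a keyword stub standing in for a real model call; nothing here is GPTScan's or AuditAgent's actual code:

```python
# Hybrid pipeline sketch: static prefilter -> LLM scenario match -> static confirm.

def static_prefilter(functions):
    # keep only functions that touch external calls or transfers
    return [f for f in functions if "call" in f["body"] or "transfer" in f["body"]]

def llm_matches_scenario(function, scenario):
    # placeholder for a prompt like "does this function match <scenario>?"
    return scenario["keyword"] in function["body"]

def static_confirm(function, scenario):
    # e.g. confirm state is written only *after* the external call
    body = function["body"]
    if scenario["state_write"] not in body:
        return False
    return body.index("call") < body.index(scenario["state_write"])

def hybrid_scan(functions, scenario):
    findings = []
    for fn in static_prefilter(functions):
        if llm_matches_scenario(fn, scenario) and static_confirm(fn, scenario):
            findings.append(fn["name"])
    return findings

reentrancy = {"keyword": "call", "state_write": "balances["}
funcs = [
    {"name": "withdraw", "body": 'msg.sender.call{value: amt}(""); balances[msg.sender] = 0;'},
    {"name": "deposit",  "body": "balances[msg.sender] += msg.value;"},
]
print(hybrid_scan(funcs, reentrancy))  # ['withdraw'] survives all three stages
```

The key design idea survives even in this toy: the static stages keep the LLM cheap and keep its false positives out of the report.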

5. Multi-Agent Systems. Sherlock AI combines static analysis techniques, auditor-informed heuristics, and machine learning models trained on real vulnerabilities, routing findings to domain-specific analyzers (reentrancy, access control, price manipulation). Its training data comes from verified findings in Sherlock contests and exploited codebases, with knowledge transfer from 0x52, a top security researcher whose auditing techniques are encoded as heuristics. Octane Security uses AI models trained on millions of code/exploit instances and custom-tuned per codebase — and became the first AI tool to win a major audit competition, taking #1 in the $500K Monad competition among 1,600+ researchers (Code4rena Monad audit).

6. RAG + Formal Verification. AgentLISA operates a multi-agent pipeline: vulnerability scanning, invariant generation, and formal proof. PropertyGPT — from the same NTU research group — generates properties using an LLM, validates them with a fuzzer, and refines through counter-example guidance. It embeds existing human-written properties into a vector database and uses RAG for in-context learning. The broader AgentLISA ecosystem draws training data from 10 audit platforms and 3,086 security specialists' validated findings. PropertyGPT achieved 80% recall versus ground truth and detected 26/37 CVEs.

7. Graph-Driven Reasoning. Hound, built by Bernhard Mueller (creator of Mythril), uses a knowledge graph + belief system architecture. It builds a graph representation of the contract's state space, control flow, and data flow. The LLM generates hypotheses about vulnerabilities; the belief system validates or refutes them against the graph. Only hypotheses that survive graph validation are reported. GNNSCVulDetector, published at IJCAI 2020, constructs contract graphs capturing syntactic and semantic structures, then applies DR-GCN and TMP neural network models. This graph-first approach enables cross-contract and cross-function analysis that prompt-first tools miss.

The funding dynamics here are notable. Hound, developed solo and unfunded, achieves 31.2% recall on ScaBench — though it's worth noting that Mueller co-developed ScaBench, so this is partially a self-evaluation (see Chapter 6). Nethermind AuditAgent reports 30% recall on a separate 29-audit production evaluation — a different measurement context (curated benchmark vs. real-world audits), making direct comparison difficult. Nethermind is backed by significant Ethereum Foundation grants for client development at the company level, though that funding is not audit-tool-specific (Nethermind blog; Hound GitHub).

8. Hypothesis Generator/Critic. GPTLens, published at IEEE TPS 2023, introduced a two-stage adversarial framework: an AUDITOR role generates broad vulnerability candidates through LLM scanning; a CRITIC role evaluates and filters out false positives. Multiple auditor instances independently review the code, then the critic validates findings. This adversarial pattern achieves 76.9% success rate versus 38.5% for one-stage detection — the critic effectively doubles effectiveness. Savant Chat extends this with thousands of parallel LLM calls coordinating specialized models, plus PoC generation delegated to an open-source SWE agent.
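The auditor/critic split can be sketched in a few lines. In a real system such as GPTLens both roles are LLM calls; here they are stubbed with trivial heuristics purely to show the control flow:

```python
# Two-stage generator/critic sketch: auditors over-generate, the critic filters.

def auditor(code: str, persona: str) -> list[str]:
    # each auditor instance proposes candidate vulnerabilities broadly
    candidates = []
    if "call" in code:
        candidates.append("reentrancy")
    if persona == "paranoid":
        candidates.append("integer overflow")  # over-eager guess
    return candidates

def critic(code: str, candidate: str) -> bool:
    # the critic keeps only candidates the code can actually support
    evidence = {"reentrancy": "call", "integer overflow": "unchecked"}
    needle = evidence.get(candidate)
    return needle is not None and needle in code

def audit(code: str) -> set[str]:
    raw = []
    for persona in ("conservative", "paranoid"):  # multiple auditor instances
        raw.extend(auditor(code, persona))
    return {c for c in set(raw) if critic(code, c)}

print(audit('msg.sender.call{value: amt}("");'))  # {'reentrancy'}
```

Separating generation from validation is what lets the generator stay aggressive without flooding the final report with false positives.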

Architecture Comparison

| Architecture | Examples | Strengths | Weaknesses | Best For |
|--------------|----------|-----------|------------|----------|
| Rule-based patterns | LightChaser (1000+ patterns), Slither | Speed, low FP, reproducible | Cannot find novel bugs | Known vulnerability classes, CI/CD gates |
| Static analysis + ML ensemble | Wake Arena (108 det.), Aderyn | High detection rate, cross-function | Complex setup, detector maintenance | Production audits, pair auditing |
| Fine-tuned LLM | audit_gpt, FTSmartAudit | Cheap ($16), domain-specific | Overfits to training data | Specific vulnerability classes |
| LLM + static hybrid | GPTScan, Nethermind AuditAgent | GPTScan >90% precision on token contracts; finds novel logic bugs | Relies on LLM quality, API cost | Logic vulnerability detection |
| Multi-agent systems | Sherlock AI, Octane | Specialist routing, competition-winning | Complexity, orchestration overhead | Broad audits, novel codebases |
| RAG + formal verification | AgentLISA, PropertyGPT | Zero-day finding, mathematical proof | Slow, requires formal specs | Critical infrastructure, formal guarantees |
| Graph-driven reasoning | Hound, GNNSCVulDetector | Cross-contract analysis, interpretable | Graph construction overhead | Complex multi-contract protocols |
| Hypothesis generator/critic | GPTLens, Savant Chat | Doubles detection vs naive LLM | High token cost, parallel compute | Systematic exploration, FP reduction |

Understanding how tools work is one thing. Understanding how we know they work is another — and that brings us to the benchmarks.

Chapter 4: How We Measure — The Benchmark Landscape

GPT-4 detects 0.9% of violations on SC-Bench without an oracle. Savant Chat claims 87% accuracy on CTFBench. GPTScan reports >90% precision on token contracts. These numbers describe completely different things, measured on completely different datasets, using completely different metrics. There is no "MMLU for smart contract auditing" — everyone measures differently, and the results are not comparable.

Across the 84 tools and 60+ academic papers surveyed, we identified 16 distinct benchmarks used to evaluate smart contract audit tools. They differ in scale by three orders of magnitude, use at least four fundamentally different metric families, and are often built by entities that also produce competing tools.

Benchmark Types

Synthetic benchmarks inject known bugs into contracts to create controlled environments. SolidiFI (UBC, ISSTA 2020) systematically injected 9,369 bugs across 7 vulnerability types into 50 contracts. SmartBugs (ASE 2020) curates 143 contracts with 208 tagged vulnerabilities annotated with SWC tags directly in code comments. These provide clean, reproducible signals — but the bugs are artificial and may be easier to detect than real-world flaws.

CTF-based benchmarks use single-vulnerability contracts for focused testing. CTFBench (AuditDB) gives each contract exactly 1 injected vulnerability — providing clean signal but unrealistic simplicity. Ethernaut and Damn Vulnerable DeFi serve as educational CTF platforms that double as evaluation targets.

Real-audit benchmarks draw from actual vulnerability disclosures and represent the most demanding evaluation tier. EVMBench (OpenAI + Paradigm, February 2026) curates 120 vulnerabilities from 40 real audits, with three evaluation modes: Detect (120 vulns), Patch (45 vulns), and Exploit (24 vulns in Docker-sandboxed blockchain forks). ScaBench (Bernhard Mueller / Nethermind, 2025) covers 31 projects with 555 vulnerabilities curated from Code4rena, Cantina, and Sherlock. LISABench (CertiK + NTU) scales to 10,185 code-complete cases from 584 protocols — 25x larger than SCONE-bench — sourced from 10 audit platforms and validated by 3,086 security specialists.

Live competition benchmarks measure tools against humans in real time. Code4rena Bot Races have run 60+ races since 2024. Separately, the $500K Monad Audit Competition on Code4rena (a full audit competition, not a Bot Race) produced the first AI-wins-over-humans result when Octane placed #1 among 1,600+ researchers. Viggiano's benchmark tested 10 AI tools against a human audit of a single ERC4626 project (743 nSLOC, 12 known issues): humans earned $5.4K in simulated earnings; the best AI earned $1.7K. No single AI found more than 1/3 of high-severity issues, but different AI tools found the 4 issues that humans missed.

Metric Families

The field uses at least four distinct measurement approaches, and conflating them is one of the most common errors in evaluating these tools.

Recall/Precision/F1 dominates academic work. Of the 16 benchmarks, 14 report recall, 11 report precision, and 8 compute F1 scores. But these numbers are rarely comparable because ground-truth definitions vary. Lyubenov's independent evaluation found Nethermind AuditAgent at 40% recall / 4.1% precision and Savant Chat at 35% recall / 17.9% precision — meaning AuditAgent produces ~24 false positives per true finding (Lyubenov eval).
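The false-positive figure follows directly from the definition of precision: at precision p, a tool emits roughly 1/p reported findings for every true one. A quick check against the numbers above:

```python
# At precision p, a tool emits roughly 1/p findings per true positive --
# the source of the "~24 per true finding" figure quoted above.
def findings_per_true_positive(precision: float) -> float:
    return 1 / precision

print(round(findings_per_true_positive(0.041)))  # AuditAgent: ~24
print(round(findings_per_true_positive(0.179)))  # Savant Chat: ~6
```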

VDR + OI (Vulnerability Detection Rate + Overreporting Index) is used by CTFBench. VDR measures matched vulnerabilities per total contracts; OI quantifies false positives per line of code. CTFBench is the only benchmark explicitly measuring overreporting (CTFBench paper).
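Under the definitions above, both metrics are simple ratios. A toy computation — the inputs are invented numbers, not CTFBench results:

```python
# CTFBench's metric pair: VDR counts detected target vulnerabilities per
# contract; OI counts false positives per line of code audited.
def vdr(detected: int, total_contracts: int) -> float:
    """Vulnerability Detection Rate: fraction of injected vulns found."""
    return detected / total_contracts

def oi(false_positives: int, total_loc: int) -> float:
    """Overreporting Index: spurious findings per line of code."""
    return false_positives / total_loc

print(vdr(19, 20))   # 0.95 -- found 19 of 20 single-vuln contracts
print(oi(12, 3000))  # 0.004 -- 12 spurious findings across 3,000 LOC
```

Measuring both together is the point: a tool that flags everything gets a perfect VDR and a terrible OI.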

Exploit rate + dollar value represents the most adversarial measurement approach. SCONE-bench (Anthropic, December 2025) tested 10 frontier models on 405 contracts derived from DeFiHackLabs incidents. The results: agents produced working exploits for 207 of 405 contracts, simulating $550M in stolen funds. On post-knowledge-cutoff contracts, agents found $4.6M in exploits across 19 contracts (55.8%). Average cost per agent run: $1.22 (SCONE-bench).

Severity-stratified scoring is used by real-audit benchmarks like ScaBench and Wake Arena's internal evaluation, which reports detection broken down as 2H, 6M, 1L, 1W with only 2 false positives in a pure AI audit (Wake Arena blog).

Scale Variance

The sheer range of benchmark scale makes apples-to-apples comparison nearly impossible. The spectrum spans from 24 vulnerabilities (EVMBench Exploit mode) to 15,975 violations (SC-Bench ERC standard compliance). Between these extremes: SmartBugs curated (208 vulns), ACToolBench (180 access control vulns where all 6 evaluated tools achieved only 3-8% recall), ScaBench (555 vulns), SolEval (1,507 samples), SC-Bench (5,377 contracts), SolidiFI (9,369 injected bugs), and LISABench (10,185 cases).

Who Builds Them — and Why It Matters

Every major benchmark was created by a team that also builds one of the tools being evaluated — something worth keeping in mind when interpreting published results.

OpenAI + Paradigm built EVMBench — and OpenAI's GPT-5.3-Codex holds the 72.2% exploit score record on it. Anthropic built SCONE-bench — and Claude Opus 4.5 achieved the highest individual post-cutoff exploit rate at 65%, while the collective rate across all 10 models was 51.11%. AuditDB created CTFBench and their own Savant Chat scores VDR 0.952 on it, while independent evaluations show only 35% recall. AgentLISA created LISABench and claims top performance on their own benchmark. ScaBench was co-developed by Nethermind's team, whose AuditAgent's scoring algorithm serves as the standard scorer.

Among all sources reviewed for this research, only two evaluations are fully independent of tool vendors: Lyubenov's evaluation (comparing AuditAgent, Savant Chat, and AlmanaxAI on the same test set) and Viggiano's benchmark (comparing 10 AI tools against a human firm's audit). Both consistently show lower numbers than self-reported results.

Complete Benchmark Comparison

| Benchmark | Scale | Primary Metrics | Creator | Year | Public? |
|-----------|-------|-----------------|---------|------|---------|
| EVMBench | 120 vulns / 40 audits | Recall, Patch rate, Exploit rate | OpenAI + Paradigm | 2026 | Yes |
| SCONE-bench | 405 contracts | Exploit %, $ value stolen | Anthropic Red Team | 2025 | Partial |
| ScaBench | 555 vulns / 31 projects | Recall, Precision by severity | Nethermind / Bernhard Mueller | 2025 | Yes |
| CTFBench | 1 vuln/contract series | VDR, OI (overreporting) | AuditDB (Igor Gulamov) | 2024-25 | Yes |
| LISABench | 10,185 cases / 584 protocols | Recall, Precision, Severity | CertiK + NTU (Prof. Yang Liu) | 2025 | Yes |
| SmartBugs | 208 vulns / 143 contracts | Recall/Precision by SWC | Academic (smartbugs.github.io) | 2020 | Yes |
| SolidiFI | 9,369 injected bugs / 50 contracts | Precision/Recall per vuln type | UBC (DependableSystemsLab) | 2020 | Yes |
| SC-Bench | 15,975 violations / 5,377 contracts | ERC violation recall | Purdue CS (system-pclub) | 2024-25 | Yes |
| SolEval | 1,507 samples / 28 repos | Pass@k, Gas@k, Vul@k | pzy2000 | 2025 | Yes |
| SolidityBench | 25 tasks + OZ specs | pass@1, pass@3 | BrainDAO / IQ | 2024-25 | Yes |
| ACToolBench | 180 AC vulns | Recall on 5 AC subtypes | ASE 2025 (Daoyuan Wu group) | 2025 | Partial |
| VeriSmart Benchmarks | 487 CVE contracts + suites | Recall on CVE vulns | Korea University (KUPL) | 2019-20 | Yes |
| SC Benchmark Suites | 46,186 contracts | Recall/Precision at scale | renardbebe (academic) | 2021 | Yes |
| Viggiano's Benchmark | 12 issues / 1 project | Score out of 12, $ equivalent | Antonio Viggiano (Size Protocol) | 2025 | Partial |
| Code4rena Bot Races | 60+ races, hundreds of findings | Competitive ranking vs humans | Code4rena | 2024+ | Yes |
| Lyubenov's Evaluation | Multiple tools, same test set | Recall, Precision | Lyuboslav Lyubenov (independent) | 2025 | Yes |

Now that we understand what the rulers look like, let's see what they actually measure.

Chapter 5: The Numbers — What Tools Actually Score

GPT-5.3-Codex exploits 72.2% of critical smart contract bugs on EVMBench. On SCONE-bench, 10 frontier models collectively exploited 51.11% of 405 real-world vulnerable contracts, with Claude Opus 4.5 achieving the highest individual post-cutoff exploit rate at 65%. Collectively, these models drained $4.6M in simulated funds from post-cutoff contracts. One year earlier, the best AI agents managed 2%. These are the headline numbers. Below them lies a far more complicated dataset, one that demands careful reading rather than quick conclusions.

EVMBench (February 2026)

Created by OpenAI and Paradigm (with frontend support from OtterSec), EVMBench evaluates models across three modes: detect (120 vulnerabilities), patch (45 vulnerabilities), and exploit (24 vulnerabilities from 16 repos). The benchmark uses Solidity repos with Foundry test harnesses in Docker-based sandboxed blockchain environments (EVMBench whitepaper).

| Model | Mode | Score |
|-------|------|-------|
| GPT-5.3-Codex | Exploit | 72.2% |
| Claude Opus 4.6 | Detect | 45.6% (highest detection rate) |
| Gemini 3 Pro | Detect | Tested, score not published |
| GPT-5 | Detect/Patch/Exploit | Tested, exact scores not published |
| GPT-4o | Detect | Tested, baseline comparison |

A critical distinction emerges from these results: GPT-5.3-Codex dominates on exploitation – writing working attack code – while Claude Opus 4.6 leads on detection, identifying that vulnerabilities exist. These are fundamentally different capabilities. OpenAI announced $10M in API credits for open-source security as part of the same broader security initiative (OpenAI announcement).

SCONE-bench (December 2025)

Anthropic's Red Team, working with MATS Fellows, built SCONE-bench from 405 smart contracts with real-world vulnerabilities exploited between 2020 and 2025 across Ethereum, BSC, and Base. Agents operate in Docker containers with locally forked blockchains, under a 60-minute time limit per contract, at an average cost of $1.22 per agent run (SCONE-bench).

| Model | Exploit Rate | Notes |
|-------|--------------|-------|
| 10 frontier models combined | 51.11% (207/405) | $550M simulated stolen |
| Claude Opus 4.5 | 65% (13/20 post-cutoff) | Highest individual post-cutoff rate |
| GPT-5 | Tested | Exact score not published |
| Claude Sonnet 4.5 | Tested | Lower than Opus |

The most striking data point: in one year, AI agents went from 2% to a collective 51.11% exploit rate (207/405), and from $5K to $4.6M in simulated stolen funds on post-cutoff contracts. On the post-cutoff subset specifically, agents exploited 19 contracts (55.8%). Additionally, 2 novel zero-day vulnerabilities were discovered across 2,849 previously unknown contracts (SCONE-bench).

ScaBench (2025)

Co-developed by Bernhard Mueller and Nethermind (lead researcher: Cristiano Silva, PhD), ScaBench contains 31 projects with 555 vulnerabilities curated from Code4rena, Cantina, and Sherlock audits (ScaBench).

| Tool | Recall |
|------|--------|
| Hound (Bernhard Mueller) | 31.2% |
| Nethermind AuditAgent | 30% (from separate 29-audit evaluation) |
| 13 tools on leaderboard | All showing "TBD" as of research date |

A notable observation: Hound, a solo unfunded project, achieves 31.2% recall on ScaBench. Nethermind AuditAgent reports 30% on a separate 29-audit production evaluation – a different measurement context, but comparable magnitude. Nethermind's significant Ethereum Foundation grants for client development represent company-level funding rather than tool-specific investment, yet the scale difference between it and a one-person project yielding near-identical recall remains striking (ScaBench; Nethermind blog).

CTFBench (2024-2025)

Created by AuditDB, CTFBench uses small contracts with exactly one injected vulnerability each, measuring VDR (Vulnerability Detection Rate) and OI (Overreporting Index) (CTFBench).

| Tool | VDR | Notes |
|---|---|---|
| Savant Chat | 0.952 | Note: benchmark created by same team |

Code4rena Bot Races

| Tool | Result |
|---|---|
| LightChaser | Top performer across 60+ races (2024) |
| Octane | #1 on Monad $500K competition (Sept-Oct 2025), 1,600+ researchers competing, 3/4 high-severity findings caught |

Independent Evaluations

Lyuboslav Lyubenov (independent researcher and Solodit MCP creator) is the only fully independent third-party evaluator found in this research. His results (Lyubenov's evaluation):

| Tool | Recall | Precision |
|---|---|---|
| Nethermind AuditAgent | 40% | 4.1% (~24 false positives per true finding) |
| Savant Chat | 35% | 17.9% |
| AlmanaxAI | 5% | 5.9% |
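The precision figures translate directly into reviewer workload. Since precision = TP / (TP + FP), the number of false positives per true finding is (1 − precision) / precision. A quick sketch, using the precision figures from the table above:

```python
def false_positives_per_true_finding(precision: float) -> float:
    # precision = TP / (TP + FP)  =>  FP / TP = (1 - precision) / precision
    return (1.0 - precision) / precision

# Lyubenov's reported precision figures
for tool, p in [("AuditAgent", 0.041), ("Savant Chat", 0.179), ("AlmanaxAI", 0.059)]:
    print(f"{tool}: ~{false_positives_per_true_finding(p):.1f} false positives per true finding")
```

At 4.1% precision that works out to roughly 23 false positives for every real finding (about 24 total reports per true positive), which is why triage becomes the dominant cost of running these tools.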

Viggiano's benchmark (Antonio Viggiano, Size Protocol, August 2025) tested a single project – Size Meta Vault, an ERC4626 contract with 743 nSLOC and 12 known High/Medium issues. Humans found 8/12 issues ($5.4K simulated earnings). The best AI tool found issues worth $1.7K. No single AI found more than 1/3 of high-severity issues. Yet different AI tools collectively found the 4 issues that humans missed, pointing toward complementary rather than replacement value (Viggiano's benchmark).

Wake Arena real audits (October-December 2025) were deployed on Lido, Printr, and Everstake audits. AI contributed 33% of all findings. Across all three protocols: 5 of 10 critical vulnerabilities found (50%) and 5 unique findings beyond what human auditors discovered. A purely AI audit produced 10 findings (2H, 6M, 1L, 1W), with only 2 false positives (Wake Arena blog).

Academic Papers: The 90%+ Club

| Paper | Key Result | Venue |
|---|---|---|
| iAudit | F1=91.21%, accuracy=91.11% on 263 real vulnerabilities | ICSE 2025 |
| LLM-SmartAudit | 98% accuracy on common vulnerabilities, 12/13 CVEs | IEEE TSE 2025 |
| SmartAuditFlow | 100% accuracy on common/critical, 41.2% on real-world, all 13 CVEs | ACM TOSEM 2025 |
| PropertyGPT | 26/37 CVEs, 12 zero-days ($8,256 bounties), 80% recall | NDSS 2025 |
| SCVHunter (HGAT) | Reentrancy 93.72%, nested call 85.41%, transaction state dependency 87.37%, block info 91.07% | ICSE 2024 |

Master Comparison Table

| Tool / Model | EVMBench Detect | EVMBench Exploit | SCONE-bench | ScaBench Recall | CTFBench VDR | Lyubenov Recall | Lyubenov Precision | Real Audit Coverage |
|---|---|---|---|---|---|---|---|---|
| GPT-5.3-Codex | | 72.2% | | | | | | |
| Claude Opus 4.6 | 45.6% | | | | | | | |
| Claude Opus 4.5 | | | 65% (post-cutoff); 51.11% collective | | | | | |
| Gemini 3 Pro | Tested | Tested | | | | | | |
| GPT-5 | Tested | Tested | Tested | | | | | |
| GPT-5 (plain, no tools) | | | | | | | | 24/94 high-sev (25.5%) |
| Nethermind AuditAgent | | | | –* | | 40% | 4.1% | 30% avg on 29 audits |
| Hound | | | | 31.2% | | | | |
| Savant Chat | | | | | 0.952 | 35% | 17.9% | Top 6 Sherlock contest |
| Wake Arena | | | | | | | | 33% (Lido, Printr, Everstake) |
| Zellic V12 | | | | | | | | 41/94 high-sev (43.6%) |
| AlmanaxAI | | | | | | 5% | 5.9% | |
| LightChaser | | | | | | | | #1 C4 Bot Races (60+) |
| Octane | | | | | | | | #1 Monad $500K |

*Nethermind AuditAgent's 30% recall is from its own 29-audit production evaluation, not from ScaBench directly. Hound's 31.2% is the ScaBench benchmark result.

These numbers paint a compelling but incomplete picture. The benchmarks themselves have structural nuances worth understanding — and the pattern is systematic.

Chapter 6: Benchmarks and Independence — Who Benchmarks Whom

A note before we begin: In every case below, the same team that built a benchmark also built a tool that scores well on it. This is natural in an emerging field — the people with enough domain expertise to design meaningful benchmarks are often the same people building the tools. We are not implying any manipulation or bad faith. Many of these benchmarks are genuinely valuable contributions. The point is simply that when a third party independently tests the same tools, the numbers tend to be lower — and that's useful context for interpreting results.

In five out of five major smart contract audit benchmarks released between 2024 and 2026, the creator's own tool or model ranks first. This is a consistent structural pattern — natural and understandable, since the people best positioned to build benchmarks are often those who also build the tools. But it does mean that independent evaluations carry extra weight when interpreting results.

The Pattern: Creator Tops Creator's Benchmark

Case 1: CTFBench and Savant Chat (Full overlap)

AuditDB, led by Igor Gulamov, created CTFBench. AuditDB's own product, Savant Chat, tops it with a VDR of 0.952 and claimed 87% accuracy. When Lyuboslav Lyubenov, an independent researcher, evaluated Savant Chat on his own test set, performance dropped to 35% recall and 17.9% precision. The metrics are not directly comparable (accuracy vs. recall), but the gap between 87% claimed and 35% independently measured highlights how much evaluation context matters (CTFBench; Lyubenov eval).

Case 2: LISABench and AgentLISA (Full overlap)

AgentLISA, backed by CertiK and NTU's Prof. Yang Liu (600+ publications, h-index 60+), created LISABench – 10,185 code-complete cases from 584 protocols, 25x larger than SCONE-bench. AgentLISA claims $7.3M+ in real exploits detected since June 2025. No independent verification of these claims was found in any source reviewed for this research. The benchmark also sits adjacent to a $12M LISA token launch, adding a financial dimension worth considering when evaluating claims (AgentLISA).

Case 3: ScaBench and Nethermind AuditAgent (Partial overlap)

Bernhard Mueller and Nethermind co-developed ScaBench. Nethermind's AuditAgent scoring algorithm serves as ScaBench's standard scorer – meaning the tool's creator defines what counts as a correct answer. Mueller's separate solo project, Hound, scored 31.2% recall, offering partial independent data. But the fact that the tool's creator also defines the scoring rules is worth noting (ScaBench methodology).

Case 4: EVMBench and GPT-5.3-Codex (Partial overlap)

OpenAI and Paradigm released EVMBench (with frontend support from OtterSec). OpenAI's GPT-5.3-Codex achieved the top exploit score of 72.2%. This is a partially mitigated overlap: other models were tested on the same benchmark, including Claude Opus 4.6 at 45.6% detection and Gemini 3 Pro (score not published). The benchmark methodology, dataset, and code are public. The primary concern is selection bias in task design – exploitation-focused tasks may favor models optimized for agentic code generation, which is precisely what Codex was built for (EVMBench whitepaper; Paradigm blog).

Case 5: SCONE-bench and Claude Opus 4.5 (Partial overlap, most mitigated)

Anthropic's Red Team created SCONE-bench. Claude Opus 4.5 achieved the highest individual post-cutoff exploit rate at 65% (13/20), while the collective rate across all 10 models was 51.11% (207/405). This case has the strongest mitigating factors: the methodology is fully transparent, the benchmark is open-source, 10 frontier models were tested including GPT-5, and the paper explicitly frames the work as a safety evaluation rather than a marketing exercise. The benchmark was derived from the existing DefiHackLabs repository (680+ incidents), reducing the potential for cherry-picking (SCONE-bench).

Degrees of Overlap

Not all cases are equal. A useful framework for distinguishing them:

| Degree | Definition | Examples |
|---|---|---|
| Full | Creator's tool tops creator's benchmark, no independent testing, no third-party scores published | CTFBench/Savant Chat, LISABench/AgentLISA |
| Partial | Creator's model leads, but other models tested on same benchmark with published scores | EVMBench/GPT-5.3-Codex, ScaBench/AuditAgent |
| Mitigated | Creator's model leads, methodology transparent, open-source data, multiple third-party models tested | SCONE-bench/Claude Opus 4.5 |

The Correctors: Independent Evaluators

Two individuals stand out as independent correctors in a field short on them.

Lyuboslav Lyubenov tested Nethermind AuditAgent, Savant Chat, and AlmanaxAI against the same contract set. His results provide a useful independent baseline. Savant Chat drops from 87% accuracy (self-reported) to 35% recall / 17.9% precision — different metrics, but a notable difference. AuditAgent shows 40% recall but only 4.1% precision – roughly 24 false positives per genuine finding. AlmanaxAI, which raised $1M in pre-seed funding, scored 5% recall / 5.9% precision — suggesting the tool was still early in development at the time of testing (Lyubenov's evaluation).

Antonio Viggiano (Size Protocol) ran a single-project benchmark in August 2025: 743 nSLOC, 12 known High/Medium issues. His central finding: no single AI tool found more than 1/3 of high-severity issues. Humans found 8/12 issues. The best AI tool recovered $1.7K in simulated earnings against humans' $5.4K. Some AI submissions tended to overrate severity levels. But Viggiano also noted that different AI tools collectively found the 4 issues that humans missed entirely – evidence for complementary, not replacement, value (Viggiano's benchmark).

The "90% Academic" Problem

Academic papers routinely report metrics above 90%. iAudit achieves F1=91.21% (ICSE 2025). LLM-SmartAudit claims 98% accuracy (IEEE TSE 2025). SmartAuditFlow reports 100% accuracy on common vulnerability categories (ACM TOSEM 2025). PropertyGPT reaches 80% recall (NDSS 2025).

Meanwhile, real-world recall sits at 30-40% (Nethermind blog; Lyubenov eval). SmartAuditFlow itself acknowledges the gap: its 100% accuracy on "common/critical" vulnerabilities drops to 41.2% on "comprehensive real-world projects." The difference is the dataset. Academic benchmarks use curated vulnerability sets with known, often well-documented vulnerability types. Real-world codebases contain novel logic flaws, protocol-specific attack surfaces, and business logic errors that industry practitioners estimate represent approximately 80% of actual bugs – the portion that current tools cannot auto-detect (SmartAuditFlow).

Summary: Who Built What

| Benchmark | Creator | Top Scorer | Independence | Independent Verification Status |
|---|---|---|---|---|
| CTFBench | AuditDB (Igor Gulamov) | Savant Chat (AuditDB) – VDR 0.952 | Full | Lyubenov: 35% recall, 17.9% precision |
| LISABench | AgentLISA/NTU (Prof. Yang Liu) | AgentLISA – claims $7.3M | Full | No independent verification found |
| ScaBench | Nethermind/Bernhard Mueller | AuditAgent – scoring algo is theirs | Partial | Hound 31.2% provides some data |
| EVMBench | OpenAI + Paradigm (OtterSec frontend) | GPT-5.3-Codex – 72.2% exploit | Partial | Claude 45.6% detect; other models tested |
| SCONE-bench | Anthropic Red Team | Claude Opus 4.5 – 65% post-cutoff; 51.11% collective | Mitigated | GPT-5 tested, 10 models total, open methodology |

Side-by-Side: Claimed vs. Independently Verified

| Tool | Claimed Performance | Independent Performance | Gap |
|---|---|---|---|
| Savant Chat | 87% accuracy (CTFBench) | 35% recall / 17.9% precision (Lyubenov) | Different metrics; independent recall far below claimed accuracy |
| Nethermind AuditAgent | 62% of projects have valid issues | 40% recall / 4.1% precision (Lyubenov) | 62% measures project-level detection, not precision; actual precision is 4.1% |
| AlmanaxAI | "Various security detections" (website) | 5% recall / 5.9% precision (Lyubenov) | Early-stage, limited results at time of testing |
| AgentLISA | $7.3M in exploits detected | No independent verification | Unknown |
| Academic tools (avg) | 90-100% F1/accuracy on curated sets | 30-40% recall in real audits | Metrics not directly comparable (F1 vs recall); gap is real but multiplier is approximate |

If the benchmarks can't be fully trusted, what does real-world deployment actually tell us? Let's look at the production data.

Chapter 7: Reality Check — What Actually Works in Production

In Lyubenov's independent evaluation — the only fully third-party benchmark conducted outside any vendor's control — the best-performing AI audit tool achieved 40% recall and 4.1% precision (Lyubenov eval). That precision figure means roughly 24 false positives for every genuine vulnerability found. The second-best tool managed 35% recall at 17.9% precision. The lowest scorer came in at 5% recall and 5.9% precision — suggesting it was still in an early development stage at the time of evaluation.

The Performance Ceiling

Across every independent measurement available, the realistic performance ceiling for AI smart contract audit tools sits at 30–40% recall and 4.1–17.9% precision in best-case scenarios. Nethermind's own data from 29 completed real-world audits confirms this range: 30% average recall, with some individual projects reaching 50%, and valid issues detected in 62% of projects (Nethermind blog).

These numbers are consistent. Viggiano's benchmark on Size Protocol's Meta Vault (743 nSLOC, 12 known high/medium issues) found that no single AI tool discovered more than one-third of high-severity issues. The best AI submission was worth $1.7K in simulated earnings, while human auditors found 8 of 12 issues for $5.4K (Viggiano's benchmark).

The ~80% Detection Gap

A widely cited estimate among industry practitioners — including organizations like Cyfrin (behind the Aderyn static analyzer and the Solodit vulnerability database with 50K+ findings) — is that ~80% of actual bugs are business logic issues not auto-detectable by machines, including AI. We were unable to find a rigorous study behind this specific number — it reflects a practitioner consensus rather than a measured statistic. But the directional claim is supported by benchmark data: the best tools plateau at 30-40% recall in independent tests, consistent with a large category of bugs they cannot reach. The vulnerabilities that matter most in production — protocol-specific economic assumptions, governance manipulation vectors, incentive misalignments — require understanding the intent of the code, not just its syntax.
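The arithmetic behind the ceiling is simple: if only a fraction of bugs falls into machine-detectable classes, overall recall is capped by that fraction times how well a tool does within it. A minimal sketch; the 0.4 and 0.9 inputs are illustrative assumptions, not measured values:

```python
def overall_recall(detectable_fraction: float, within_class_recall: float) -> float:
    # Overall recall is bounded by the share of bugs that are machine-detectable
    # at all, multiplied by recall within that detectable share.
    return detectable_fraction * within_class_recall

# Illustrative: if ~40% of bugs sit in detectable classes and a tool catches
# 90% of those, overall recall lands squarely in the observed 30-40% band.
print(round(overall_recall(0.4, 0.9), 2))  # 0.36
```

Even a hypothetically perfect detector of known patterns cannot exceed the detectable fraction itself, which is why recall plateaus rather than climbing toward 100%.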

What AI Finds vs. What It Misses

The pattern is clear and consistent across tools: AI excels at detecting known vulnerability classes with well-documented signatures, and struggles with anything requiring protocol-specific reasoning.

| Category | AI Detects Reliably | AI Misses Consistently |
|---|---|---|
| Classic patterns | Reentrancy, integer overflow/underflow, unchecked external calls | Protocol-specific logic flaws |
| Access control | Missing access modifiers, unprotected functions | Subtle permission escalation chains |
| Known templates | Timestamp dependence, tx.origin misuse, delegatecall risks | Economic attack vectors (flash loan exploits, oracle manipulation) |
| Code-level | Dead code, gas inefficiencies, standard violations | Cross-protocol interaction bugs, governance manipulation |

This taxonomy holds across tool architectures. Mythril reliably catches reentrancy and integer overflows. Sherlock AI routes vulnerabilities to domain-specific analyzers for reentrancy, access control, and price manipulation. Yet none of these tools consistently identify the business logic flaws that the ~80% industry estimate points to.
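To make the "known templates" column concrete, here is a toy version of signature-based detection, the style of check that pattern matchers in the LightChaser mold run thousands of: flag a function where an external call textually precedes a state write, the classic reentrancy shape. Real tools (Slither, Mythril) work on ASTs or bytecode rather than raw text; this regex sketch only illustrates why well-documented patterns are easy and intent-level bugs are not:

```python
import re

VULNERABLE = """
function withdraw() public {
    (bool ok, ) = msg.sender.call{value: balances[msg.sender]}("");
    require(ok);
    balances[msg.sender] = 0;  // state updated AFTER the external call
}
"""

SAFE = """
function withdraw() public {
    uint256 amount = balances[msg.sender];
    balances[msg.sender] = 0;  // checks-effects-interactions: update first
    (bool ok, ) = msg.sender.call{value: amount}("");
    require(ok);
}
"""

def flags_reentrancy(source: str) -> bool:
    # Heuristic: an external call appearing before any write to `balances`
    # suggests the reentrancy-prone call-then-update ordering.
    call_pos = source.find(".call{value:")
    write = re.search(r"balances\[[^\]]+\]\s*=", source)
    return call_pos != -1 and write is not None and call_pos < write.start()

print(flags_reentrancy(VULNERABLE), flags_reentrancy(SAFE))  # True False
```

A signature like this generalizes to every contract using the same idiom, but it has no notion of what the protocol is supposed to do, which is exactly the gap the right-hand column describes.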

Production Deployments That Matter

Strip away the benchmarks and self-reported metrics. Here is what has actually been demonstrated in production or competition settings with verifiable results:

| Tool | Context | Results | Evidence |
|---|---|---|---|
| Wake Arena | 3 real audits: Lido, Printr, Everstake (Oct-Dec 2025) | 33% of all findings; across all 3 protocols: 5/10 critical vulns found + 5 unique findings beyond human auditors; pure AI audit (LUKSO): 10 findings (2H, 6M, 1L, 1W), only 2 false positives | Wake Arena blog |
| Nethermind AuditAgent | 29 completed production audits | 30% recall average; retroactively detected the exact ResupplyFi exchange rate flaw that caused the $9.8M hack | Nethermind blog |
| Octane Security | $500K Monad Audit Competition on Code4rena (Sept-Oct 2025) | #1 among 1,600+ researchers; caught 3/4 high-severity findings in novel Rust/C++ codebase; also a top performer on Immunefi leaderboard | Code4rena Monad |
| Sherlock AI | Internal evaluation | Combines static analysis, heuristics from 0x52, and ML on verified contest findings | Sherlock AI |
| Savant Chat | Sherlock Symbiotic/DeFi Audit Contest (Sept 2025) | Top 6 competing against dozens of human auditors | Savant announcement |

The Nethermind result deserves special attention: when tested retroactively against the ResupplyFi contract after the $9.8M hack (June 2025), AuditAgent flagged the exact exchange rate logic flaw — demonstrating the vulnerability was within the tool's detection capability and could have been prevented had the tool been applied pre-deployment (Nethermind blog).

The "Pair Auditor" Consensus

The most credible organizations in smart contract security consistently position AI as an augmentation tool, not a replacement. Nethermind explicitly designs AuditAgent as a "pair auditor" that augments manual review (Nethermind). Trail of Bits licenses its Claude Code skills under Creative Commons (CC BY-SA 4.0), positioning them as augmentation tools rather than standalone solutions. OpenZeppelin built an MCP server integrating its contract standards into AI workflows, positioning it as a development-time guardrail rather than a standalone audit solution.

No serious player claims full automation. The consensus is clear: AI handles the first pass, humans handle the judgment.

Complementary Coverage

Perhaps the most important finding from Viggiano's benchmark is not what any single tool found, but what the union of tools found: "Different AI tools found the 4 issues that humans missed" (Viggiano's benchmark). The tools that caught those issues were not the same tools that performed best overall. Different architectures — knowledge graphs, multi-agent systems, hypothesis-critic patterns — have different blind spots.

This points to the real value proposition: running multiple AI tools in parallel during the initial scan phase, widening the coverage surface, and letting human auditors focus their limited attention on the business logic and economic reasoning that AI cannot reach.
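The ensemble effect is just set union over per-tool findings. A minimal sketch with hypothetical finding IDs (illustrative only, not benchmark data) shows why non-overlapping blind spots make the union worth more than the best single tool:

```python
# Hypothetical finding IDs per tool -- illustrative only, not benchmark data.
findings = {
    "tool_a": {"H-1", "M-2", "M-3"},
    "tool_b": {"M-2", "H-4", "L-5"},
    "tool_c": {"M-3", "H-6"},
}

union = set().union(*findings.values())
best_single = max(findings.values(), key=len)

print(len(union), len(best_single))  # 6 3
```

The union doubles the best single tool's coverage here precisely because the overlaps are small, which mirrors Viggiano's observation that the tools catching the human-missed issues were not the overall top performers.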

Production results also reveal a surprising secondary finding: the tools with the best results are not always the best funded. In fact, the correlation between money and performance is nearly zero.

Chapter 8: The Funding Paradox

Hound, a solo open-source project with $0 in funding, achieves 31.2% recall on ScaBench. AlmanaxAI, backed by $1M in pre-seed capital, scored 5% recall and 5.9% precision in Lyubenov's evaluation — the lowest among independently tested tools at the time. LightChaser, another unfunded tool from an anonymous developer, was the top-placing bot in Code4rena Bot Races across 60+ races in 2024 (Hound; AlmanaxAI; LightChaser). The relationship between money and results in AI smart contract auditing is, at best, nonexistent.

The Funding Landscape

The total identifiable funding in dedicated AI audit tooling exceeds $38M when including Cantina's $7.83M Series A. That capital is spread unevenly across a handful of players, and $12M of the total is a token launch, not traditional VC:

| Funding | Tool | Best Verified Result | Verdict |
|---|---|---|---|
| $12M (token launch) | AgentLISA | Claims $7.3M in exploits detected (unverified); created own benchmark (LISABench) | Unverified |
| $6.75M (seed) | Octane Security | Won $500K Monad competition (#1 among 1,600+); #1 Immunefi | Delivers |
| $5.5M (seed) | Sherlock AI | 21 valid findings in eval (vs ChatGPT 5.2's 3) | Delivers |
| $4.3M (seed) | Olympix | Claims 30%+ of Solidity developers as users (self-reported); no public benchmark data | Unverified |
| ~$1.2M (reported) | QuillShield | CTFBench participant, no standout results reported | Limited public data |
| $1M (pre-seed) | AlmanaxAI | 5% recall, 5.9% precision in independent test | Early stage |
| $0 | Hound (Mueller) | 31.2% recall on ScaBench | Delivers |
| $0 | LightChaser* | Top-placing bot in Code4rena Bot Races 2024 (60+ races) | Delivers |

*LightChaser is likely a traditional pattern-matching system (1,000+ detection patterns), not an AI/LLM tool. Included for comparison as it competes in the same arena.

The pattern is stark. Octane and Sherlock demonstrate that funding can produce results when combined with the right team and domain expertise. But AlmanaxAI and QuillShield have yet to demonstrate publicly verifiable results matching their funding levels. Meanwhile, Hound and LightChaser prove that zero funding is no barrier to competitive performance.

What Matters More Than Money

Three factors consistently predict tool quality better than capital raised:

  1. Domain expertise of the creator. Bernhard Mueller created Mythril (4,207 stars, the standard symbolic execution tool for Solidity), then created Hound with a unique knowledge-graph architecture, then co-developed ScaBench (the field's standard benchmark scorer), and advises Sherlock AI. His unfunded solo project outperforms tools backed by millions because he has 15+ years of security experience and built the foundational tools the entire ecosystem depends on (Mythril; Hound; ScaBench). The researcher known as 0x52, one of Sherlock's top security researchers, had their auditing techniques directly encoded as heuristics into Sherlock AI — a direct transfer of human expertise into machine capability (Sherlock AI).

  2. Architecture choices. Specialization outperforms generalization. Sherlock AI combines static analysis, auditor-informed heuristics, and ML models routing vulnerabilities to domain-specific analyzers (Sherlock AI). Wake Arena deploys 108 specialized detectors combined with graph-driven reasoning. Hound builds knowledge graphs rather than relying on prompts. Tools that treat smart contract auditing as a generic LLM task — feed code in, get vulnerabilities out — consistently underperform those with purpose-built architectures.

  3. Training data quality. Octane trains on proprietary data from Code4rena, Sherlock, and Immunefi competitions — real findings from real audits validated by human experts (Octane). Sherlock AI trains on verified findings from its own contest platform plus exploited codebases (Sherlock AI). AgentLISA claims training on data from 10 audit platforms with knowledge distilled from 3,086 specialists. The tools with real, human-validated audit data consistently outperform those training on synthetic or scraped datasets.

The Key People Network

The field is remarkably small at the top. A handful of individuals connect the most important tools, benchmarks, and organizations:

| Person | Affiliations & Tools | Impact |
|---|---|---|
| Bernhard Mueller | Created Mythril, created Hound (solo), co-developed ScaBench (Nethermind), advises Sherlock AI | Most trusted individual in the space. Only person with open-source, independent, and advisory presence across multiple tools |
| Prof. Yang Liu (NTU) | AgentLISA, LISABench, PropertyGPT (NDSS 2025 Distinguished Paper) | Strongest academic pedigree: 600+ publications, h-index 60+ |
| 0x52 | Top Sherlock researcher, knowledge embedded in Sherlock AI | Expert knowledge transfer: human auditor instincts encoded into AI heuristics |
| Giovanni Vignone | Co-founder & CEO, Octane Security | Led the first AI tool to win a major audit competition |
| Antonio Viggiano | Created independent AI benchmark, Size Protocol | Key independent evaluator providing unbiased performance data |
| Lyuboslav Lyubenov | Created Solodit MCP, independent tool comparison | Only fully independent third-party evaluator of multiple commercial tools |

Mueller alone touches four of the most credible entities in the space. Yang Liu's academic lab produced both a top tool (AgentLISA/PropertyGPT) and a major benchmark (LISABench) — though that dual role is the pattern documented in Chapter 6. The people who built the foundational infrastructure — Mythril, Slither, the contest platforms — are the same people now building or advising the AI tools attempting to automate what they once did manually.

If expertise matters more than capital, and the field is evolving this fast, what does the trajectory look like? The pace of change is remarkable.

Chapter 9: The Frontier — Exponential Growth (Late 2025-2026)

Approximately one year before SCONE-bench's December 2025 publication, earlier-generation AI agents exploited 2% of its contracts, stealing $5K in simulated funds. By December 2025, 10 frontier models (most of which did not exist a year prior) collectively produced working exploits for 207 of 405 contracts (51.11%), generating $550M in simulated stolen funds. On the post-cutoff subset (contracts exploited after models' knowledge cutoffs), agents exploited 19 contracts (55.8%) and drained $4.6M (SCONE-bench).

An important caveat: part of this jump reflects the arrival of entirely new, more capable models – not just improvement of existing ones. The 2% baseline was set by weaker models that have since been superseded. Still, the improvement is dramatic – and the February 2026 EVMBench results suggest it has not plateaued. On a separate benchmark, GPT-5.3-Codex now exploits 72.2% of critical Code4rena bugs, up from <20% when that project started (EVMBench).

What Changed: Three Technical Drivers

The jump from single-digit to double-digit exploit rates was not driven by a single breakthrough. Three parallel developments converged in late 2025.

Reasoning models. DeepSeek R1 and similar chain-of-thought reasoning models introduced step-by-step reasoning over code, enabling models to trace multi-step state transitions across contract calls – exactly the kind of analysis that reentrancy and price manipulation bugs demand. On SolEval, a benchmark measuring Solidity code generation (not vulnerability detection), DeepSeek-V3 achieved 26.29% Pass@10 – the highest among all models – indicating strong structural understanding of smart contract syntax and semantics (SolEval). Note: DeepSeek-V3 is a different model from DeepSeek R1; both contribute to the broader trend of improved smart contract comprehension.

Agent scaffolding. The shift from "prompt a model" to "deploy an agent" – with function calling, tool use, and blockchain interaction – transformed exploit generation from a text completion task into an autonomous workflow. SCONE-bench specifically measures agents that fork mainnet state, deploy contracts, and execute transactions. The 10 frontier models tested produced working exploits for 207 of 405 contracts ($550M simulated stolen funds) and discovered 2 novel zero-day vulnerabilities across 2,849 previously unexamined contracts (SCONE-bench).

Context windows. Modern frontier models support large context windows (e.g., Claude's 200K tokens), enabling ingestion of entire protocols – multiple interacting contracts, libraries, and deployment configurations – in a single pass. Cross-contract reasoning, previously the domain of graph-based tools like Wake Arena (108 detectors), is now accessible to general-purpose LLMs.

Frontier Model Comparison

The EVMBench and SCONE-bench leaderboards reveal distinct model strengths:

| Model | EVMBench Exploit | EVMBench Detect | SCONE-bench Exploit | Context Window | Release |
|---|---|---|---|---|---|
| GPT-5.3-Codex | 72.2% | | | | Feb 2026 |
| Claude Opus 4.6 | | 45.6% | | 200K | 2026 |
| Claude Opus 4.5 | | | 65% (post-cutoff); 51.11% collective | 200K | 2025 |
| GPT-5 | Tested | Tested | Tested | | 2025 |
| Gemini 3 Pro | Tested | Tested | Tested | | 2025-2026 |
| DeepSeek R1 | | | Tested | | 2025 |

The specialization is notable (EVMBench; SCONE-bench). GPT-5.3-Codex leads on exploitation – writing and executing attack code – while Claude Opus 4.6 leads on detection – identifying vulnerabilities without necessarily exploiting them. These are different capabilities, and the gap between them matters for the offense-defense balance.

The Offensive-Defensive Asymmetry

On EVMBench, the highest exploit score (72.2%, measured on 24 vulnerabilities) exceeds the highest detection score (45.6%, measured on 120 vulnerabilities) — though these are different test sets of different sizes, so the comparison is directional rather than exact. The underlying logic still holds: an attacker needs to find one exploitable bug to drain millions; a defender needs near-100% coverage to prevent it.

This asymmetry is compounded by cost. A SCONE-bench agent run costs an average of $1.22 (SCONE-bench). Scanning every deployed contract on Ethereum for exploitable bugs is now economically trivial for any well-resourced attacker. OpenAI's announcement of $10M in API credits for open-source security as part of its broader security push signals that the industry recognizes this threat (OpenAI EVMBench announcement).
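The economics can be sketched in two lines: the attacker's hit probability compounds across contracts while cost stays linear at the benchmark's $1.22 per run. The 1% per-contract success rate below is an illustrative assumption, well below the benchmark's measured rates:

```python
def p_at_least_one(p_per_contract: float, n_contracts: int) -> float:
    # The attacker needs only one success: P(>=1 hit) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_per_contract) ** n_contracts

def scan_cost_usd(n_contracts: int, cost_per_run: float = 1.22) -> float:
    # Cost grows linearly with the number of contracts scanned.
    return n_contracts * cost_per_run

# Even a 1% per-contract success rate across 1,000 contracts makes at least
# one exploitable hit near-certain, for about $1,220 in compute.
print(round(p_at_least_one(0.01, 1000), 5), scan_cost_usd(1000))
```

The defender faces the mirror image: preventing loss requires pushing the per-contract miss probability toward zero, not merely matching the attacker's average hit rate.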

Projections (With Caveats)

The growth from 2% to 51.11% in one year is dramatic, but it's worth remembering how technology adoption curves typically work: early progress often looks exponential, then hits a plateau as the low-hanging fruit is exhausted and harder problems remain. We've seen this pattern with self-driving cars, machine translation, and protein folding — rapid initial gains followed by a long tail of diminishing returns on increasingly difficult edge cases. AI smart contract auditing is likely on the same trajectory. What we can say with confidence is that the field is firmly in its growth and improvement phase — the capability is real and getting better. But extrapolating the current rate indefinitely would be naive. Three specific caveats constrain any projection:

Known patterns only. SCONE-bench and EVMBench test against historically exploited contracts with documented vulnerability classes. Performance on truly novel attack vectors is unmeasured.

Detection ceiling. Industry practitioners estimate that ~80% of actual bugs are business logic issues not auto-detectable by machines, including AI. These include protocol-specific logic errors – mispriced oracles, flawed governance mechanisms, incentive misalignments – that require understanding intent, not just code. This ceiling is unlikely to break without fundamental advances in specification reasoning.

Benchmark contamination. On contracts exploited after model training cutoffs, SCONE-bench agents exploited 19 post-cutoff contracts (55.8%) – close to the overall 51.11% collective rate, but measured on a much smaller sample (SCONE-bench). True zero-shot capability remains uncertain due to sample size.

The trajectory is clear: we are in an active growth phase. The early exponential curve will eventually flatten — but right now, capability is improving faster than most of the industry has internalized. The question is no longer whether AI will transform smart contract security, but how practitioners should adapt while the field is still maturing.
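The exponential-then-plateau shape described above is the standard logistic curve. The sketch below is not fitted to any benchmark data (ceiling, rate, and midpoint are illustrative assumptions), but it shows how the same curve that looks exponential early flattens as it approaches its ceiling:

```python
import math

def capability(t: float, ceiling: float = 0.9, rate: float = 1.5, midpoint: float = 2.0) -> float:
    # Logistic S-curve: near-exponential growth early, saturation near `ceiling`.
    # All parameters are illustrative, not fitted to SCONE-bench or EVMBench.
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))

# Year-over-year multipliers shrink as the curve approaches its ceiling.
for year in range(1, 5):
    growth = capability(year) / capability(year - 1)
    print(f"year {year}: {capability(year):.3f} ({growth:.2f}x vs prior year)")
```

The early multipliers look like the 2% to 51% jump; the later ones look like diminishing returns on hard edge cases. Which part of the curve the field currently sits on is exactly the open question.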

Chapter 10: What It All Means — Conclusions and Practical Recommendations

We started this research to understand whether AI smart contract auditing is real or hype. After analyzing 84 tools, 16 benchmarks, and 60+ academic papers, here's what we found.

Five Key Findings

  1. From curiosity to competition win in 7 years. SmartEmbed (ICSME 2019) was the first code-embedding approach for smart contract bug detection. Octane Security won a $500K audit competition against 1,600+ researchers in October 2025 (SmartEmbed; Octane/Code4rena Monad). The gap between "interesting paper" and "production tool" has closed.

  2. A pair auditor, not a replacement. The best tools achieve 30-40% recall in independent evaluations – Nethermind AuditAgent scored 30% on its own 29-audit evaluation and 40% in Lyubenov's independent test (same tool, different context). Precision ranges from 4.1% (AuditAgent) to 17.9% (Savant Chat) (Lyubenov eval; Nethermind blog). This is useful. It is not sufficient.

  3. The ~80% detection gap. Industry practitioners estimate ~80% of actual bugs are business logic issues not auto-detectable by machines, including AI. These include protocol-specific logic errors, economic assumptions, and governance design flaws – vulnerabilities that require understanding the system's intended behavior, not just its code.

  4. Every major benchmark is self-evaluated. In all five major benchmarks examined, the creator's own tool or model ranks first — CTFBench/AuditDB, LISABench/AgentLISA, EVMBench/OpenAI, ScaBench/Nethermind, SCONE-bench/Anthropic. This is natural — the people who build benchmarks are often the domain experts. But it means independent evaluations carry extra weight. And those independent evaluations consistently show lower performance than self-reported numbers.

  5. Exploit capability is compounding fast. On SCONE-bench, AI agents went from 2% to 51.11% (collective) exploit rate in roughly one year. On the separate EVMBench benchmark, GPT-5.3-Codex exploits 72.2% of bugs. These are different benchmarks measuring different things, but both point in the same direction (SCONE-bench; EVMBench).

Current State Summary

| Dimension | Current State | Gap | Opportunity |
| --- | --- | --- | --- |
| Detection recall | 30-40% | 60-70% of bugs missed | Multi-tool ensembles |
| Precision | 4.1-17.9% | 82-96% of findings are false positives | Better training data, filtering |
| Auto-detectable bugs | ~20% detectable | ~80% not auto-detectable | Specification-aware reasoning |
| Exploit capability | 51-72% (different benchmarks) | Growing fast; offense may outpace defense | Defensive AI investment |
| Benchmarks | 16 benchmarks, incomparable | No standard metrics | Independent benchmark bodies |
| Cost | $0.01-$13 per AI audit | vs $150K+ manual | Hybrid workflows |
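To read the first two rows concretely: recall is the share of real vulnerabilities a tool reports, and precision is the share of its reports that are real vulnerabilities. A minimal sketch with hypothetical numbers, not taken from any cited evaluation:

```python
# Illustrative only: how recall and precision relate for an AI audit tool.
# The counts below are hypothetical, not from any evaluation cited here.

def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real vulnerabilities the tool actually flagged."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives: int, false_positives: int) -> float:
    """Share of the tool's findings that are real vulnerabilities."""
    return true_positives / (true_positives + false_positives)

# A tool that finds 12 of 30 real bugs while raising 200 false alarms:
r = recall(12, 18)       # 0.40 -> 40% recall
p = precision(12, 200)   # ~0.057 -> ~5.7% precision
print(f"recall={r:.0%} precision={p:.1%}")  # -> recall=40% precision=5.7%
```

At precision in the single digits, triaging false positives becomes the dominant cost of running the tool, which is why precision improvements matter as much as recall gains.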

For Protocol Teams

Use AI as a first-pass scan, not a final verdict. Run multiple AI tools: in Viggiano's benchmark, the AI tools collectively found the 4 issues that human auditors missed, because each tool covers a different subset of vulnerabilities. Then invest human auditor time where AI cannot reach: business logic, cross-protocol interactions, and economic incentive design.
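Mechanically, the ensemble idea reduces to taking the union of each tool's findings and deduplicating. A minimal sketch; the tool names, findings, and dedup key are hypothetical:

```python
# Sketch of a multi-tool ensemble: union the findings of several AI scanners,
# deduplicating by (contract, function, vulnerability class).
# All tool outputs below are invented for illustration.
from typing import NamedTuple

class Finding(NamedTuple):
    contract: str
    function: str
    vuln_class: str

tool_a = {Finding("Vault", "withdraw", "reentrancy"),
          Finding("Vault", "deposit", "rounding")}
tool_b = {Finding("Vault", "withdraw", "reentrancy"),   # overlaps with tool A
          Finding("Oracle", "update", "access-control")}

combined = tool_a | tool_b   # the union covers what each tool misses alone
print(f"{len(tool_a)} + {len(tool_b)} findings -> {len(combined)} unique")
# -> 2 + 2 findings -> 3 unique
```

Recall compounds across tools with partially disjoint coverage, but so does the false-positive pile, so the union still needs human triage.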

The cost equation favors hybrid approaches: an AI pre-scan (ranging from $0.01 per 1K lines to $13 per full scan, depending on the tool) followed by a focused human audit can reduce the scope — and cost — of manual review, while covering both AI-detectable patterns and human-only business logic (GPTScan; Veritas Protocol).
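The back-of-the-envelope arithmetic, using the article's cost figures; the 40% scope-reduction factor is an assumption for illustration, not a measured number:

```python
# Rough cost comparison for a hybrid workflow.
# The 40% scope reduction is an assumed illustration, not a measured figure.

MANUAL_AUDIT_COST = 150_000   # full manual audit (the article's industry figure)
AI_SCAN_COST = 13             # upper end of the $0.01-$13 per-scan range
SCOPE_REDUCTION = 0.40        # assumed: AI pre-scan clears 40% of review scope

hybrid_cost = AI_SCAN_COST + MANUAL_AUDIT_COST * (1 - SCOPE_REDUCTION)
savings = MANUAL_AUDIT_COST - hybrid_cost
print(f"hybrid: ${hybrid_cost:,.0f} (saves ${savings:,.0f})")
# -> hybrid: $90,013 (saves $59,987)
```

Even under a conservative scope-reduction assumption, the AI scan itself is a rounding error next to the manual line item; the savings come entirely from narrowing the human review.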

For Auditors

Adopt AI tools to widen coverage, not to reduce effort. Focus human time on business logic and novel attack patterns – the ~80% of actual bugs that current AI cannot auto-detect. Use multiple tools: AuditAgent, Savant Chat, and Hound find different bugs. Hound (31.2% recall on ScaBench, $0 funding) and Nethermind AuditAgent (30% on production audits) show recall in a similar range — though these are different measurement contexts and Hound's score is on its co-creator's benchmark (see Chapter 6). The broader pattern suggests that architecture and domain expertise can matter as much as funding (ScaBench; Nethermind blog).

For AI Tool Builders

Specialize. LightChaser competed in 60+ Code4rena Bot Races with 1,000+ detection patterns, consistently ranking at the top – no LLM required. Wake Arena found 43/94 high-severity issues with 108 specialized detectors (Wake Arena blog; LightChaser). The evidence says specialization beats generalization.

Invest in real audit training data, not synthetic. PropertyGPT trained on real CVEs found 12 zero-day vulnerabilities and won NDSS 2025 Distinguished Paper (PropertyGPT). And seek independent benchmark validation – results carry more weight when measured by a third party. In all five major benchmarks we examined, the creator's own tool ranked first (Chapter 6).

For Researchers

The highest-impact open problem is detecting the ~80% of bugs that are business logic issues – economic design, cross-protocol interaction flaws, and governance vulnerabilities – that machines, including AI, cannot currently auto-detect. Cross-contract reasoning and formal verification integration (as in PropertyGPT and SmartInv) show promise but remain academic. SmartInv alone identified 119 zero-days from 89,621 contracts (SmartInv). The field also needs independent benchmark bodies – individuals like Lyubenov and Viggiano are providing critical independent evaluation, but their work is ad hoc and unfunded.

Reference Table: Key Metrics with Sources

| Metric | Value | Source |
| --- | --- | --- |
| AI audit tools cataloged | 84 | This research |
| Total identified funding (tracked tools) | $38M+ (incl. $12M token launch) | Various sources |
| Best independent recall | 30-40% | Nethermind 30%, Lyubenov eval 40% |
| Best independent precision | 4.1-17.9% | Lyubenov eval |
| Exploit rate (EVMBench, best) | 72.2% | EVMBench, GPT-5.3-Codex |
| Exploit rate (SCONE-bench, collective) | 51.11% (207/405); Opus 4.5 65% post-cutoff | SCONE-bench |
| First AI competition win | $500K Monad | Octane/Code4rena, Oct 2025 |
| Auto-detection gap | ~80% of actual bugs are business logic issues, not auto-detectable | Industry estimate |
| Academic papers | 60+ | Survey |
| AI audit cost | $0.01-$13 | Various tools |
| Manual audit cost | $150,000+ | Industry standard |

The Bottom Line

AI will not replace security auditors. AI will replace auditors who don't use AI. The 30-40% recall ceiling for individual tools, the ~80% business-logic gap, and the lack of fully independent benchmarks all confirm that human expertise remains irreplaceable. But the $0.01-$13 vs $150,000+ cost gap, the complementary coverage across tools, and the exponential growth in exploit capability all confirm that AI is no longer optional. The practitioners who combine AI's exhaustive pattern matching with human judgment on intent, incentives, and architecture will define the next standard of smart contract security.

Methodology

This research was compiled from publicly available sources only. Our data collection covered:

- Academic papers: 60+ papers from ICSE, NDSS, IEEE S&P, ACM TOSEM, IJCAI, and other top venues
- GitHub repositories: Tool repos, benchmark datasets, and open-source implementations
- Company blogs: Nethermind, Ackee (Wake Arena), Sherlock, Anthropic, OpenAI, Paradigm
- Competition platforms: Code4rena, Sherlock, Immunefi
- Independent evaluations: Lyubenov's tool comparison, Viggiano's benchmark
- Benchmark websites: SCONE-bench, EVMBench, ScaBench, CTFBench, LISABench
- Funding data: Crunchbase, press releases, token launch announcements

All claims are linked to primary sources. Where the same team built both a benchmark and the top-scoring tool, we noted it. Where numbers couldn't be independently verified, we noted that too.

Data collection period: February 2026. The smart contract security space moves fast — some numbers may have changed by the time you read this.

Have corrections, additions, or feedback? We aim to keep this research accurate and up to date.

Get in Touch