84 Tools, 60 Papers, One Question: Is AI Auditing Ready?
This is a long read. Feel free to navigate using your AI setup — ask it to summarize a specific chapter, find a particular tool, or jump to conclusions.
Why We Wrote This
Our team set out to answer a question that kept coming up in conversations with auditors, protocol teams, and investors: can AI actually audit smart contracts, or is it all hype?
We couldn't find a single source that covered the full picture — tools, benchmarks, funding, real results, independence of evaluations — all in one place. Plenty of vendor blogs claim breakthroughs. Plenty of academic papers report 90%+ accuracy. But when you dig into independent evaluations, the numbers tell a very different story.
So we decided to do it ourselves. We collected everything publicly available: academic papers, GitHub repositories, benchmark leaderboards, audit firm disclosures, independent evaluations, competition results, and funding data. Nine primary research files. 4,716 lines of structured facts. 84 distinct tools. 16 benchmarks. 60+ academic papers.
What follows is the result — a comprehensive longread covering the AI smart contract auditing landscape as of early 2026. We tried to be fair, cite everything, and note where the same team built both the benchmark and the top-scoring tool. The picture that emerged is more nuanced than either the hype or the skepticism suggests.
The short version: AI has gone from academic toy (2019) to winning a $500K audit competition against 1,600+ human researchers (October 2025) in seven years. But the best tools still catch only 30-40% of bugs in independent tests, an estimated ~80% of real vulnerabilities are business logic issues that no AI can auto-detect, and every major benchmark was created by an entity whose own tool ranks first on it. The field is real. The limitations are also real.
What's Inside
| # | Section | What you'll learn |
|---|---|---|
| 1 | The $150K Problem | Why AI auditing exists — the cost gap, the supply shortage, and the first signals that automation works |
| 2 | Timeline: 2019-2026 | Seven years from academic paper to production win, era by era |
| 3 | How They Work | Eight architecture patterns across 84 tools — from rule-based to multi-agent |
| 4 | The Benchmark Landscape | 16 benchmarks, four metric families, and why no two numbers are comparable |
| 5 | The Numbers | Tool-by-tool performance data across every available benchmark |
| 6 | Who Benchmarks Whom | Every major benchmark creator's tool ranks first on their own benchmark |
| 7 | What Actually Works | Independent evaluations, production deployments, and the 30-40% reality |
| 8 | The Funding Paradox | Why a $0 solo project outperforms a $1M-funded startup |
| 9 | Exponential Growth | From 2% to 72% exploit rate in one year — and what it means |
| 10 | Conclusions | Five key findings and practical recommendations |
Key Numbers at a Glance
| Metric | Value |
|---|---|
| AI audit tools cataloged | 84 |
| Best independent recall | 30-40% |
| Best independent precision | 4.1-17.9% |
| Top exploit rate (EVMBench) | 72.2% (GPT-5.3-Codex) |
| Top exploit rate (SCONE-bench) | 51.11% collective; 65% Claude Opus 4.5 post-cutoff |
| First AI competition win | $500K Monad audit, Oct 2025 |
| Bugs AI can't auto-detect | ~80% (business logic, industry estimate) |
| AI audit cost | $0.01 - $13 |
Chapter 1: The $150K Problem
A professional smart contract audit for complex protocols can cost $150,000+ and take weeks to months (source). The supply of qualified auditors is finite; the demand is not. Meanwhile, DeFiHackLabs has reproduced 680+ hacking incidents dating back to 2017 across Ethereum, BSC, and Base (DeFiHackLabs) – each one a contract that was either unaudited or audited and still exploited.
This is not a theoretical problem. When Nethermind ran AuditAgent retroactively against ResupplyFi's contracts after the protocol lost $9.8M in a June 2025 exploit, the tool flagged the exact exchange rate logic flaw that caused the hack (Nethermind blog). A caveat: retroactive testing (running a tool after a vulnerability is known) is easier than prospective detection (scanning a full codebase without knowing where the bug is). Still, the result demonstrates the vulnerability was within the tool's detection capability.
When Anthropic's red team built SCONE-bench – a benchmark of 405 real-world exploited contracts – and pointed 10 frontier AI models at them, those models produced working exploits for 207 contracts, draining $550M in simulated stolen funds. On contracts exploited after the models' knowledge cutoff, AI agents still found $4.6M in exploitable value across 19 post-cutoff contracts (55.8%) (Source: SCONE-bench).
The economics are stark. Here is what an AI audit costs versus a human one:
| Approach | Cost | Source |
|---|---|---|
| GPTScan (hybrid GPT + static analysis) | ~$0.01 per 1K lines of code | GPTScan paper |
| Nethermind AuditAgent | $0.02-$0.10 per billable line of code | Nethermind pricing |
| SCONE-bench agent run (average) | $1.22 per run | SCONE-bench |
| Veritas Protocol (full audit) | $13.08 per audit | Veritas Protocol |
| Bunzz Audit | 90% cheaper than human audit | Bunzz |
| Traditional manual audit | $150,000+ | Veritas Protocol |
The gap between $0.01 and $150,000 is not a rounding error – though these figures measure different things (per-1K-lines cost vs. per-engagement cost). Even the most expensive AI audit tool in the table – Veritas at $13.08 per full audit – is ~11,500x cheaper than hiring a human team.
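The multiplier is straightforward to verify; a quick sketch using the figures from the table above:

```python
# Cost figures from the table above (USD).
ai_audit_cost = 13.08        # Veritas Protocol, full AI audit
human_audit_cost = 150_000   # traditional manual audit, lower bound

# How many full AI audits fit into one human engagement?
ratio = human_audit_cost / ai_audit_cost
print(f"{ratio:,.0f}x cheaper")  # ≈11,468x, i.e. the ~11,500x cited above
```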
This price asymmetry has attracted serious capital and research attention. The field now includes 84 distinct AI audit tools cataloged across production systems, research prototypes, MCP servers, and Claude Code skills. Total identified funding exceeds $38M across tools like Octane Security ($6.75M seed co-led by Archetype and Winklevoss Capital, with Gemini and Circle), AgentLISA ($12M via token launch), Sherlock ($5.5M), Olympix ($4.3M), and Cantina ($7.83M). Academic interest is equally deep: 60+ academic papers have been published on machine learning and deep learning for smart contract security, spanning venues from ICSE and NDSS to IEEE S&P (survey).
But cost reduction without accuracy has limited practical value. The central tension of this field – the reason this research exists – is the gap between what AI can detect and what still requires a human. As of August 2025, no single AI tool could find more than one-third of high-severity issues in a controlled benchmark (Source: Viggiano benchmark). By October 2025, an AI tool won a $500K audit competition against 1,600+ researchers (Source: Octane/Code4rena).
The field is changing fast. To understand where we are, we need to understand how we got here.
Chapter 2: Timeline – From Academic Paper to Audit Win (2019-2026)
The idea of using machine learning to find smart contract vulnerabilities is not new. It is seven years old. What has changed is whether the idea works.
Era 1: Academic Foundations (2019-2023)
The first generation of tools was purely academic. In 2019, SmartEmbed introduced structural code embeddings for clone detection and bug detection in smart contracts, published at ICSME. A year later, GNNSCVulDetector brought graph neural networks to vulnerability detection at IJCAI 2020, modeling contracts as graph structures to catch reentrancy, timestamp dependence, and infinite loop vulnerabilities.
These academic tools produced impressive numbers in controlled settings. Peculiar, using GraphCodeBERT, achieved 91.80% precision and 92.40% recall for reentrancy detection across 40,932 contracts (ISSRE 2021). ContractWard reported Micro-F1 and Macro-F1 scores above 96% on 49,502 contracts using XGBoost (IEEE TNSE 2021). On paper, the problem looked nearly solved.
Then GPT-4 arrived, and the community got excited – and then disappointed. In 2023, Zellic (one of the most respected audit firms, having reviewed Solana and LayerZero) published a blog post titled "Can GPT Audit Smart Contracts?" and concluded: no. GPT-4 failed all trials on a known bug. A separate academic study, "Do You Still Need a Manual Audit?", found that GPT-4 and Claude correctly identified vulnerability types in only 40% of 52 compromised DeFi contracts, with a high false positive rate.
Era 2: Hybrid Tools and Competition (2023-2024)
The breakthrough was not making LLMs smarter. It was combining them with existing tools. GPTScan (ICSE 2024) married GPT with static analysis and achieved >90% precision on token contracts, finding 9 bugs that human auditors had missed, at a cost of $0.01 per 1,000 lines of code. GPTLens (IEEE TPS 2023) introduced an adversarial two-stage framework – one LLM attacks, another validates – doubling detection success from 38.5% to 76.9%.
Meanwhile, the competition arena was producing its own signal. LightChaser – notably, a traditional pattern-matching system, not an AI/LLM tool – dominated Code4rena's Bot Races throughout 2024, competing in 60+ races with 1,000+ detection patterns and consistently placing at the top. This raised an important question: was AI actually better than well-crafted rules?
Era 3: Production Reality (2025-2026)
2025 is when the field graduated from benchmarks to production – with all the messiness that entails.
August 2025: Antonio Viggiano (Size Protocol) ran the first rigorous independent benchmark. Result: no single AI tool found more than 1/3 of high-severity issues. Humans found 8/12 High/Med issues; the best AI earned $1.7K in simulated payouts versus $5.4K for humans. But there was a silver lining: different AI tools collectively found the 4 issues that humans missed (Viggiano benchmark).
Mid 2025: Nethermind reported 30% average recall across 29 real completed audits (mean 11.6 contracts, 725 LOC per project), detecting valid issues in 62% of projects (Nethermind blog).
September-October 2025: The inflection point. Octane Security won the $500K Monad Audit Competition on Code4rena, placing #1 among 1,600+ researchers, catching 3 of 4 high-severity findings in a novel Rust/C++ codebase. This was the first time an AI tool had won a major audit competition outright.
December 2025: Anthropic's red team published SCONE-bench. On 405 real-world exploited contracts, 10 frontier models collectively produced working exploits for 207 contracts (51.11%), draining $550M in simulated stolen funds. Claude Opus 4.5 achieved the highest individual exploit rate on the post-cutoff subset: 65% (13/20 contracts). Two novel zero-day vulnerabilities were discovered across 2,849 unknown contracts (SCONE-bench).
February 2026: OpenAI and Paradigm released EVMBench, and GPT-5.3-Codex exploited 72.2% of bugs – up from under 20% when the project began. Claude Opus 4.6 scored the highest detection rate at 45.6% (EVMBench).
The Full Timeline
| Period | Milestone | Era |
|---|---|---|
| 2019 | SmartEmbed (ICSME) – first code embedding for SC bug detection | Academic |
| 2020 | GNNSCVulDetector (IJCAI) – first GNN for SC vulnerability detection | Academic |
| 2020 | SolidiFI benchmark released (ISSTA) – first systematic tool evaluation | Academic |
| 2021 | Peculiar achieves 91.8% precision for reentrancy (ISSRE) | Academic |
| 2023 | GPTLens – first adversarial LLM framework (IEEE TPS) | Hybrid |
| 2023 | Zellic: "ChatGPT cannot audit smart contracts" – GPT-4 fails all trials | Reality check |
| 2023 | "Do You Still Need Manual Audit?" – GPT-4/Claude correct 40% of the time | Reality check |
| 2024 | GPTScan (ICSE) – first GPT + static analysis hybrid, >90% precision | Hybrid |
| 2024 | LightChaser dominates Code4rena Bot Races (60+ races) | Hybrid |
| Early 2025 | PropertyGPT wins NDSS Distinguished Paper, finds 12 zero-days | Production |
| Aug 2025 | Viggiano benchmark: no AI finds >1/3 high-severity issues | Production |
| Mid 2025 | Nethermind reports 30% recall on 29 real audits | Production |
| Sep 2025 | Savant Chat achieves top 6 in Sherlock's Symbiotic contest | Production |
| Sep-Oct 2025 | Octane wins $500K Monad audit – first AI to win major competition | Production |
| Dec 2025 | SCONE-bench: 10 models exploit 51.11% of 405 contracts | Production |
| Feb 2026 | EVMBench: GPT-5.3-Codex exploits 72.2% of bugs | Production |
The trajectory from SmartEmbed's code embeddings in 2019 to GPT-5.3-Codex exploiting 72.2% of benchmark bugs in 2026 spans seven years and three distinct eras. The academic era proved the concept. The hybrid era showed that combining AI with traditional tools was the key. The production era revealed that AI can now compete with – and sometimes beat – human auditors in structured competitions.
With AI audit tools clearly evolving, the question becomes: how do these systems actually work under the hood?
Chapter 3: How They Work — Architecture Patterns
LightChaser, an anonymous bot with 1,000+ detection patterns, competed in 60+ Code4rena Bot Races in 2024 — consistently placing at the top — without a single neural network parameter. Meanwhile, PropertyGPT — a system combining retrieval-augmented generation with formal verification — discovered 12 zero-day vulnerabilities and earned $8,256 in bug bounties (NDSS 2025 Distinguished Paper). These two tools share almost nothing in their design, yet both outperform raw GPT-4 at finding smart contract bugs. The architecture matters more than the model.
Eight distinct architecture patterns have emerged across the 84 tools cataloged in this research. Understanding them is the difference between choosing a tool that finds reentrancy bugs in milliseconds and one that discovers novel logic flaws nobody has seen before.
- Rule-Based Pattern Matching The simplest and fastest approach. LightChaser deploys 1,000+ detection patterns and 100+ gas optimization checks, runs entirely locally with no API calls, and delivers reports within 24 hours. Slither, the foundational static analyzer by Trail of Bits, has accumulated 6,149 GitHub stars as the most widely used smart contract security tool.
Strength: Speed and low false positives. Weakness: Cannot find novel bugs — only patterns someone already wrote a rule for. Industry practitioners estimate ~80% of actual bugs are business logic issues not auto-detectable by current AI tools.
- Static Analysis + ML Ensemble Wake Arena combines 108 specialized detectors (87 from a private library battle-tested on billion-dollar audits) with multi-agent AI and graph-driven reasoning over data dependency graphs. The source data categorizes Wake Arena under "graph-driven reasoning" architectures; it is listed here because its detector ensemble also functions as a static analysis layer. Aderyn, a Rust-based analyzer with 733 GitHub stars, traverses ASTs with sub-second analysis times and integrates as an MCP server allowing AI models to use it as an external tool.
According to Ackee's own reporting: Wake Arena achieved 43/94 high-severity findings (45.7%) on its historical benchmark — the highest among all tested tools — and contributed 33% of all findings in real audits of Lido, Printr, and Everstake (October-December 2025). Note: these are self-reported results from the tool's developer; no independent verification was available (Wake Arena blog).
- Fine-Tuned LLM audit_gpt by FuzzLand fine-tunes GPT-3/4 on vulnerability data sourced from Solodit for approximately $16 in total fine-tuning cost. FTSmartAudit takes this further with multi-stage knowledge distillation: classical distillation from large teacher models, external domain knowledge from audit reports, and reward-guided learning. Its dataset includes 6,454 contracts from 72 Code4rena projects with 784 H/M findings across 120 distinct vulnerability labels.
Key finding: Distilled lightweight models outperform both commercial tools and larger models on complex vulnerabilities. The tradeoff is that fine-tuned models overfit to known patterns and lose generality on novel bug classes (FTSmartAudit).
- LLM + Static Analysis Hybrid GPTScan, published at ICSE 2024, was the first tool to combine GPT with static analysis for logic vulnerability detection. It breaks vulnerability types into scenarios and properties, uses GPT to match candidate vulnerable functions, then validates findings with static confirmation. Result: >90% precision on token contracts, >70% recall on ground-truth logic vulnerabilities, and 9 new bugs missed by human auditors — at a cost of ~$0.01 per 1K lines.
Nethermind AuditAgent follows a similar hybrid approach, combining ML models, symbolic execution, and a continuously-updated exploit knowledge base. On 29 completed real audits, it achieved 30% average recall and detected valid issues in 62% of projects. When tested retroactively against the ResupplyFi contract after the $9.8M hack (June 2025), it flagged the exact exchange rate logic flaw — demonstrating the vulnerability was within the tool's detection capability (Nethermind blog).
- Multi-Agent Systems Sherlock AI combines static analysis techniques, auditor-informed heuristics, and machine learning models trained on real vulnerabilities, routing findings to domain-specific analyzers (reentrancy, access control, price manipulation). Its training data comes from verified findings in Sherlock contests and exploited codebases, with knowledge transfer from 0x52, a top security researcher whose auditing techniques are encoded as heuristics. Octane Security uses AI models trained on millions of code/exploit instances and custom-tuned per codebase — and became the first AI tool to win a major audit competition, taking #1 in the $500K Monad competition among 1,600+ researchers (Code4rena Monad audit).
- RAG + Formal Verification AgentLISA operates a multi-agent pipeline: vulnerability scanning, invariant generation, and formal proof. PropertyGPT — from the same NTU research group — generates properties using an LLM, validates them with a fuzzer, and refines through counter-example guidance. It embeds existing human-written properties into a vector database and uses RAG for in-context learning. The broader AgentLISA ecosystem draws training data from 10 audit platforms and 3,086 security specialists' validated findings. PropertyGPT achieved 80% recall versus ground truth and detected 26/37 CVEs.
- Graph-Driven Reasoning Hound, built by Bernhard Mueller (creator of Mythril), uses a knowledge graph + belief system architecture. It builds a graph representation of the contract's state space, control flow, and data flow. The LLM generates hypotheses about vulnerabilities; the belief system validates or refutes them against the graph. Only hypotheses that survive graph validation are reported. GNNSCVulDetector, published at IJCAI 2020, constructs contract graphs capturing syntactic and semantic structures, then applies DR-GCN and TMP neural network models. This graph-first approach enables cross-contract and cross-function analysis that prompt-first tools miss.
The funding dynamics here are notable. Hound, developed solo and unfunded, achieves 31.2% recall on ScaBench — though it's worth noting that Mueller co-developed ScaBench, so this is partially a self-evaluation (see Chapter 6). Nethermind AuditAgent reports 30% recall on a separate 29-audit production evaluation — a different measurement context (curated benchmark vs. real-world audits), making direct comparison difficult. Nethermind is backed by significant Ethereum Foundation grants for client development at the company level, though that funding is not audit-tool-specific (Nethermind blog; Hound GitHub).
- Hypothesis Generator/Critic GPTLens, published at IEEE TPS 2023, introduced a two-stage adversarial framework: an AUDITOR role generates broad vulnerability candidates through LLM scanning; a CRITIC role evaluates and filters out false positives. Multiple auditor instances independently review the code, then the critic validates findings. This adversarial pattern achieves 76.9% success rate versus 38.5% for one-stage detection — the critic effectively doubles effectiveness. Savant Chat extends this with thousands of parallel LLM calls coordinating specialized models, plus PoC generation delegated to an open-source SWE agent.
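To make the generator/critic pattern concrete, here is a minimal sketch of the two-stage loop. This is an illustration of the idea, not GPTLens's actual code; `call_llm` is a hypothetical stand-in for whatever chat-completion API is in use, and the prompts are placeholders:

```python
from typing import Callable, List

def audit_with_critic(code: str, call_llm: Callable[[str], str],
                      n_auditors: int = 3) -> List[str]:
    """Two-stage adversarial audit, GPTLens-style (illustrative sketch)."""
    # Stage 1 -- AUDITOR: independent passes propose candidate findings
    # (tuned for recall; expect many false positives).
    candidates: List[str] = []
    for _ in range(n_auditors):
        reply = call_llm(
            "You are a smart contract auditor. List every potential "
            "vulnerability in this Solidity code:\n" + code
        )
        candidates.extend(line.strip() for line in reply.splitlines()
                          if line.strip())
    candidates = list(dict.fromkeys(candidates))  # merge duplicate findings

    # Stage 2 -- CRITIC: a separate pass keeps only findings it judges real
    # (tuned for precision; this is where false positives get filtered).
    confirmed: List[str] = []
    for finding in candidates:
        verdict = call_llm(
            "You are a skeptical critic. Answer YES only if this finding "
            "is a real, exploitable bug in the code below.\n"
            "Finding: " + finding + "\nCode:\n" + code
        )
        if verdict.strip().upper().startswith("YES"):
            confirmed.append(finding)
    return confirmed
```

The critic's value comes entirely from the filtering step: stage 1 maximizes recall, stage 2 recovers precision, which is the division of labor behind the 38.5% to 76.9% jump reported in the paper.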
Architecture Comparison
| Architecture | Examples | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Rule-based patterns | LightChaser (1000+ patterns), Slither | Speed, low FP, reproducible | Cannot find novel bugs | Known vulnerability classes, CI/CD gates |
| Static analysis + ML ensemble | Wake Arena (108 det.), Aderyn | High detection rate, cross-function | Complex setup, detector maintenance | Production audits, pair auditing |
| Fine-tuned LLM | audit_gpt, FTSmartAudit | Cheap ($16), domain-specific | Overfits to training data | Specific vulnerability classes |
| LLM + static hybrid | GPTScan, Nethermind AuditAgent | GPTScan >90% precision on token contracts; finds novel logic bugs | Relies on LLM quality, API cost | Logic vulnerability detection |
| Multi-agent systems | Sherlock AI, Octane | Specialist routing, competition-winning | Complexity, orchestration overhead | Broad audits, novel codebases |
| RAG + formal verification | AgentLISA, PropertyGPT | Zero-day finding, mathematical proof | Slow, requires formal specs | Critical infrastructure, formal guarantees |
| Graph-driven reasoning | Hound, GNNSCVulDetector | Cross-contract analysis, interpretable | Graph construction overhead | Complex multi-contract protocols |
| Hypothesis generator/critic | GPTLens, Savant Chat | Doubles detection vs naive LLM | High token cost, parallel compute | Systematic exploration, FP reduction |
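For contrast, the first row of the table is the easiest to picture. A rule-based tool is, at heart, a library of patterns run over source code. The sketch below is deliberately naive: production tools like Slither match on ASTs and intermediate representations, not regexes, and the two rules here are illustrative, not taken from any real rule set:

```python
import re

# Each rule: (id, description, regex over Solidity source text).
RULES = [
    ("TX-ORIGIN", "tx.origin used for authorization",
     re.compile(r"\btx\.origin\b")),
    ("TIMESTAMP", "block.timestamp used in contract logic",
     re.compile(r"\bblock\.timestamp\b")),
]

def scan(source: str):
    """Return (rule_id, line_number, description) for every pattern hit."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule_id, desc, pattern in RULES:
            if pattern.search(line):
                findings.append((rule_id, lineno, desc))
    return findings

sample = "require(tx.origin == owner);\nuint t = block.timestamp;"
for finding in scan(sample):
    print(finding)
```

The weakness column of the table follows directly: a bug class with no entry in RULES is invisible, no matter how severe.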
Understanding how tools work is one thing. Understanding how we know they work is another — and that brings us to the benchmarks.
Chapter 4: How We Measure — The Benchmark Landscape
GPT-4 detects 0.9% of violations on SC-Bench without an oracle. Savant Chat claims 87% accuracy on CTFBench. GPTScan reports >90% precision on token contracts. These numbers describe completely different things, measured on completely different datasets, using completely different metrics. There is no "MMLU for smart contract auditing" — everyone measures differently, and the results are not comparable.
Across the 84 tools and 60+ academic papers surveyed, we identified 16 distinct benchmarks used to evaluate smart contract audit tools. They differ in scale by three orders of magnitude, use at least four fundamentally different metric families, and are often built by entities that also produce competing tools.
Benchmark Types
Synthetic benchmarks inject known bugs into contracts to create controlled environments. SolidiFI (UBC, ISSTA 2020) systematically injected 9,369 bugs across 7 vulnerability types into 50 contracts. SmartBugs (ASE 2020) curates 143 contracts with 208 tagged vulnerabilities annotated with SWC tags directly in code comments. These provide clean, reproducible signals — but the bugs are artificial and may be easier to detect than real-world flaws.
CTF-based benchmarks use single-vulnerability contracts for focused testing. CTFBench (AuditDB) gives each contract exactly 1 injected vulnerability — providing clean signal but unrealistic simplicity. Ethernaut and Damn Vulnerable DeFi serve as educational CTF platforms that double as evaluation targets.
Real-audit benchmarks draw from actual vulnerability disclosures and represent the most demanding evaluation tier. EVMBench (OpenAI + Paradigm, February 2026) curates 120 vulnerabilities from 40 real audits, with three evaluation modes: Detect (120 vulns), Patch (45 vulns), and Exploit (24 vulns in Docker-sandboxed blockchain forks). ScaBench (Bernhard Mueller / Nethermind, 2025) covers 31 projects with 555 vulnerabilities curated from Code4rena, Cantina, and Sherlock. LISABench (CertiK + NTU) scales to 10,185 code-complete cases from 584 protocols — 25x larger than SCONE-bench — sourced from 10 audit platforms and validated by 3,086 security specialists.
Live competition benchmarks measure tools against humans in real time. Code4rena Bot Races have run 60+ races since 2024. Separately, the $500K Monad Audit Competition on Code4rena (a full audit competition, not a Bot Race) produced the first AI-wins-over-humans result when Octane placed #1 among 1,600+ researchers. Viggiano's benchmark tested 10 AI tools against a human audit of a single ERC4626 project (743 nSLOC, 12 known issues): humans earned $5.4K in simulated earnings; the best AI earned $1.7K. No single AI found more than 1/3 of high-severity issues, but different AI tools found the 4 issues that humans missed.
Metric Families
The field uses at least four distinct measurement approaches, and conflating them is one of the most common errors in evaluating these tools.
Recall/Precision/F1 dominates academic work. Of the 16 benchmarks, 14 report recall, 11 report precision, and 8 compute F1 scores. But these numbers are rarely comparable because ground-truth definitions vary. Lyubenov's independent evaluation found Nethermind AuditAgent at 40% recall / 4.1% precision and Savant Chat at 35% recall / 17.9% precision — meaning AuditAgent produces ~24 false positives per true finding (Lyubenov eval).
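Precision translates directly into triage burden. For a tool with precision p, every true finding arrives with (1 - p)/p false positives attached, which is where the ~24 figure comes from:

```python
def false_positives_per_true_finding(precision: float) -> float:
    """Expected false positives a reviewer must triage per real bug."""
    return (1 - precision) / precision

# Precision figures from Lyubenov's independent evaluation:
print(false_positives_per_true_finding(0.041))  # AuditAgent: ~23.4
print(false_positives_per_true_finding(0.179))  # Savant Chat: ~4.6
```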
VDR + OI (Vulnerability Detection Rate + Overreporting Index) is used by CTFBench. VDR measures matched vulnerabilities per total contracts; OI quantifies false positives per line of code. CTFBench is the only benchmark explicitly measuring overreporting (CTFBench paper).
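Assuming the straightforward reading of those definitions (the CTFBench paper's exact formulas may normalize differently), the pair reduces to two ratios:

```python
def vdr(matched_vulns: int, total_contracts: int) -> float:
    """Vulnerability Detection Rate: matched vulnerabilities per contract
    (CTFBench plants exactly one vulnerability per contract)."""
    return matched_vulns / total_contracts

def oi(false_positives: int, total_loc: int) -> float:
    """Overreporting Index: false positives per line of code scanned."""
    return false_positives / total_loc

# Hypothetical run: 19 of 20 single-bug contracts solved, 6 FPs over 3,000 LOC.
print(vdr(19, 20))   # 0.95
print(oi(6, 3000))   # 0.002
```

A high VDR paired with a high OI means the tool "finds" the bug by flagging everything, which is exactly the failure mode OI exists to expose.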
Exploit rate + dollar value represents the most adversarial measurement approach. SCONE-bench (Anthropic, December 2025) tested 10 frontier models on 405 contracts derived from DeFiHackLabs incidents. The results: agents produced working exploits for 207 of 405 contracts, simulating $550M in stolen funds. On post-knowledge-cutoff contracts, agents found $4.6M in exploits across 19 contracts (55.8%). Average cost per agent run: $1.22 (SCONE-bench).
Severity-stratified scoring is used by real-audit benchmarks like ScaBench and Wake Arena's internal evaluation, which reports detection broken down as 2H, 6M, 1L, 1W with only 2 false positives in a pure AI audit (Wake Arena blog).
Scale Variance
The sheer range of benchmark scale makes apples-to-apples comparison nearly impossible. The spectrum spans from 24 vulnerabilities (EVMBench Exploit mode) to 15,975 violations (SC-Bench ERC standard compliance). Between these extremes: SmartBugs curated (208 vulns), ACToolBench (180 access control vulns where all 6 evaluated tools achieved only 3-8% recall), ScaBench (555 vulns), SolEval (1,507 samples), SC-Bench (5,377 contracts), SolidiFI (9,369 injected bugs), and LISABench (10,185 cases).
Who Builds Them — and Why It Matters
Every major benchmark was created by a team that also builds one of the tools being evaluated — something worth keeping in mind when interpreting published results.
OpenAI + Paradigm built EVMBench — and OpenAI's GPT-5.3-Codex holds the 72.2% exploit score record on it. Anthropic built SCONE-bench — and Claude Opus 4.5 achieved the highest individual post-cutoff exploit rate at 65%, while the collective rate across all 10 models was 51.11%. AuditDB created CTFBench, and its own Savant Chat scores VDR 0.952 on it, while independent evaluations show only 35% recall. AgentLISA created LISABench and claims top performance on its own benchmark. ScaBench was co-developed by Nethermind's team, and Nethermind AuditAgent's scoring algorithm serves as the standard scorer.
Among all sources reviewed for this research, only two evaluations are fully independent of tool vendors: Lyubenov's evaluation (comparing AuditAgent, Savant Chat, and AlmanaxAI on the same test set) and Viggiano's benchmark (comparing 10 AI tools against a human firm's audit). Both consistently show lower numbers than self-reported results.
Complete Benchmark Comparison
| Benchmark | Scale | Primary Metrics | Creator | Year | Public? |
|---|---|---|---|---|---|
| EVMBench | 120 vulns / 40 audits | Recall, Patch rate, Exploit rate | OpenAI + Paradigm | 2026 | Yes |
| SCONE-bench | 405 contracts | Exploit %, $ value stolen | Anthropic Red Team | 2025 | Partial |
| ScaBench | 555 vulns / 31 projects | Recall, Precision by severity | Nethermind / Bernhard Mueller | 2025 | Yes |
| CTFBench | 1 vuln/contract series | VDR, OI (overreporting) | AuditDB (Igor Gulamov) | 2024-25 | Yes |
| LISABench | 10,185 cases / 584 protocols | Recall, Precision, Severity | CertiK + NTU (Prof. Yang Liu) | 2025 | Yes |
| SmartBugs | 208 vulns / 143 contracts | Recall/Precision by SWC | Academic (smartbugs.github.io) | 2020 | Yes |
| SolidiFI | 9,369 injected bugs / 50 contracts | Precision/Recall per vuln type | UBC (DependableSystemsLab) | 2020 | Yes |
| SC-Bench | 15,975 violations / 5,377 contracts | ERC violation recall | Purdue CS (system-pclub) | 2024-25 | Yes |
| SolEval | 1,507 samples / 28 repos | Pass@k, Gas@k, Vul@k | pzy2000 | 2025 | Yes |
| SolidityBench | 25 tasks + OZ specs | pass@1, pass@3 | BrainDAO / IQ | 2024-25 | Yes |
| ACToolBench | 180 AC vulns | Recall on 5 AC subtypes | ASE 2025 (Daoyuan Wu group) | 2025 | Partial |
| VeriSmart Benchmarks | 487 CVE contracts + suites | Recall on CVE vulns | Korea University (KUPL) | 2019-20 | Yes |
| SC Benchmark Suites | 46,186 contracts | Recall/Precision at scale | renardbebe (academic) | 2021 | Yes |
| Viggiano's Benchmark | 12 issues / 1 project | Score out of 12, $ equivalent | Antonio Viggiano (Size Protocol) | 2025 | Partial |
| Code4rena Bot Races | 60+ races, hundreds of findings | Competitive ranking vs humans | Code4rena | 2024+ | Yes |
| Lyubenov's Evaluation | Multiple tools, same test set | Recall, Precision | Lyuboslav Lyubenov (independent) | 2025 | Yes |
Now that we understand what the rulers look like, let's see what they actually measure.
Chapter 5: The Numbers — What Tools Actually Score
GPT-5.3-Codex exploits 72.2% of critical smart contract bugs on EVMBench. On SCONE-bench, 10 frontier models collectively exploited 51.11% of 405 real-world vulnerable contracts, with Claude Opus 4.5 achieving the highest individual post-cutoff exploit rate at 65%. Collectively, these models drained $4.6M in simulated funds from post-cutoff contracts. One year earlier, the best AI agents managed 2%. These are the headline numbers. Below them lies a far more complicated dataset, one that demands careful reading rather than quick conclusions.
EVMBench (February 2026)
Created by OpenAI and Paradigm (with frontend support from OtterSec), EVMBench evaluates models across three modes: detect (120 vulnerabilities), patch (45 vulnerabilities), and exploit (24 vulnerabilities from 16 repos). The benchmark uses Solidity repos with Foundry test harnesses in Docker-based sandboxed blockchain environments (EVMBench whitepaper).
| Model | Mode | Score |
|---|---|---|
| GPT-5.3-Codex | Exploit | 72.2% |
| Claude Opus 4.6 | Detect | 45.6% (highest detection rate) |
| Gemini 3 Pro | Detect | Tested, score not published |
| GPT-5 | Detect/Patch/Exploit | Tested, exact scores not published |
| GPT-4o | Detect | Tested, baseline comparison |
A critical distinction emerges from these results: GPT-5.3-Codex dominates on exploitation – writing working attack code – while Claude Opus 4.6 leads on detection, identifying that vulnerabilities exist. These are fundamentally different capabilities. OpenAI announced $10M in API credits for open-source security as part of the same broader security initiative (OpenAI announcement).
SCONE-bench (December 2025)
Anthropic's Red Team, working with MATS Fellows, built SCONE-bench from 405 smart contracts with real-world vulnerabilities exploited between 2020 and 2025 across Ethereum, BSC, and Base. Agents operate in Docker containers with locally forked blockchains, under a 60-minute time limit per contract, at an average cost of $1.22 per agent run (SCONE-bench).
| Model | Exploit Rate | Notes |
|---|---|---|
| 10 frontier models combined | 51.11% (207/405) | $550M simulated stolen |
| Claude Opus 4.5 | 65% (13/20 post-cutoff) | Highest individual post-cutoff rate |
| GPT-5 | Tested | Exact score not published |
| Claude Sonnet 4.5 | Tested | Lower than Opus |
The most striking data point: in one year, AI agents went from 2% to a collective 51.11% exploit rate (207/405), and from $5K to $4.6M in simulated stolen funds on post-cutoff contracts. On the post-cutoff subset specifically, agents exploited 19 contracts (55.8%). Additionally, 2 novel zero-day vulnerabilities were discovered across 2,849 previously unknown contracts (SCONE-bench).
ScaBench (2025)
Co-developed by Bernhard Mueller and Nethermind (lead researcher: Cristiano Silva, PhD), ScaBench contains 31 projects with 555 vulnerabilities curated from Code4rena, Cantina, and Sherlock audits (ScaBench).
| Tool | Recall |
|---|---|
| Hound (Bernhard Mueller) | 31.2% |
| Nethermind AuditAgent | 30% (from separate 29-audit evaluation) |
| 13 tools on leaderboard | All showing "TBD" as of research date |
A notable observation: Hound, a solo unfunded project, achieves 31.2% recall on ScaBench. Nethermind AuditAgent reports 30% on a separate 29-audit production evaluation – a different measurement context, but comparable magnitude. Nethermind's substantial Ethereum Foundation grants fund client development rather than this specific tool, yet the contrast remains striking: a well-resourced company and a one-person project land at near-identical recall (ScaBench; Nethermind blog).
CTFBench (2024-2025)
Created by AuditDB, CTFBench uses small contracts with exactly one injected vulnerability each, measuring VDR (Vulnerability Detection Rate) and OI (Overreporting Index) (CTFBench).
| Tool | VDR | Notes |
|---|---|---|
| Savant Chat | 0.952 | Note: benchmark created by same team |
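To make the two CTFBench metrics concrete, here is a minimal sketch of how such numbers could be computed. The formulas and counts are illustrative assumptions of ours, not CTFBench's published definitions, which may differ (particularly for OI).

```python
def vdr(detected_true: int, total_injected: int) -> float:
    """Vulnerability Detection Rate: share of injected bugs the tool found."""
    return detected_true / total_injected

def overreporting_index(false_findings: int, contracts: int) -> float:
    """One plausible reading of OI: spurious findings per contract audited.
    (Illustrative only; the official formula may differ.)"""
    return false_findings / contracts

# Toy numbers: 20 of 21 single-bug contracts cracked, 3 spurious findings total.
print(round(vdr(20, 21), 3))                 # 0.952 -- the magnitude of Savant Chat's reported VDR
print(round(overreporting_index(3, 21), 2))  # 0.14
```

The single-injected-bug design is what makes a near-1.0 VDR achievable here: each contract has exactly one known answer, unlike real codebases.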
Code4rena Bot Races
| Tool | Result |
|---|---|
| LightChaser | Top performer across 60+ races (2024) |
| Octane | #1 on Monad $500K competition (Sept-Oct 2025), 1,600+ researchers competing, 3/4 high-severity findings caught |
Independent Evaluations
Lyubenov's evaluation (Lyuboslav Lyubenov, independent researcher and Solodit MCP creator) represents the only fully independent third-party evaluator found in this research (Lyubenov's evaluation):
| Tool | Recall | Precision |
|---|---|---|
| Nethermind AuditAgent | 40% | 4.1% (~24 false positives per true finding) |
| Savant Chat | 35% | 17.9% |
| AlmanaxAI | 5% | 5.9% |
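Precision figures translate directly into reviewer workload. A small sketch of that arithmetic, using the precision values from the table above (the derivation is ours, not Lyubenov's):

```python
def false_positives_per_true(precision: float) -> float:
    """At precision p, each true finding arrives alongside (1 - p) / p false ones."""
    return (1 - precision) / precision

for tool, p in [("AuditAgent", 0.041), ("Savant Chat", 0.179), ("AlmanaxAI", 0.059)]:
    print(f"{tool}: ~{false_positives_per_true(p):.1f} false positives per true finding")
```

AuditAgent's 4.1% precision works out to roughly 23-24 false positives per genuine finding, which is the triage burden a human reviewer inherits for every real bug the tool surfaces.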
Viggiano's benchmark (Antonio Viggiano, Size Protocol, August 2025) tested a single project – Size Meta Vault, an ERC4626 contract with 743 nSLOC and 12 known High/Medium issues. Humans found 8/12 issues ($5.4K simulated earnings). The best AI tool found issues worth $1.7K. No single AI found more than 1/3 of high-severity issues. Yet different AI tools collectively found the 4 issues that humans missed, pointing toward complementary rather than replacement value (Viggiano's benchmark).
Wake Arena (October-December 2025) was deployed on three real audits: Lido, Printr, and Everstake. AI contributed 33% of all findings. Across the three protocols it found 5 of 10 critical vulnerabilities (50%) and added 5 unique findings beyond what human auditors discovered. A purely AI-driven audit produced 10 findings (2H, 6M, 1L, 1W), with only 2 false positives (Wake Arena blog).
Academic Papers: The 90%+ Club
| Paper | Key Result | Venue |
|---|---|---|
| iAudit | F1=91.21%, accuracy=91.11% on 263 real vulnerabilities | ICSE 2025 |
| LLM-SmartAudit | 98% accuracy on common vulnerabilities, 12/13 CVEs | IEEE TSE 2025 |
| SmartAuditFlow | 100% accuracy on common/critical, 41.2% on real-world, all 13 CVEs | ACM TOSEM 2025 |
| PropertyGPT | 26/37 CVEs, 12 zero-days ($8,256 bounties), 80% recall | NDSS 2025 |
| SCVHunter (HGAT) | Reentrancy 93.72%, nested call 85.41%, transaction state dependency 87.37%, block info 91.07% | ICSE 2024 |
Master Comparison Table
| Tool / Model | EVMBench Detect | EVMBench Exploit | SCONE-bench | ScaBench Recall | CTFBench VDR | Lyubenov Recall | Lyubenov Precision | Real Audit Coverage |
|---|---|---|---|---|---|---|---|---|
| GPT-5.3-Codex | – | 72.2% | – | – | – | – | – | – |
| Claude Opus 4.6 | 45.6% | – | – | – | – | – | – | – |
| Claude Opus 4.5 | – | – | 65% (post-cutoff); 51.11% collective | – | – | – | – | – |
| Gemini 3 Pro | Tested | – | Tested | – | – | – | – | – |
| GPT-5 | Tested | Tested | Tested | – | – | – | – | – |
| GPT-5 (plain, no tools) | – | – | – | – | – | – | – | 24/94 high-sev (25.5%) |
| Nethermind AuditAgent | – | – | – | –* | – | 40% | 4.1% | 30% avg on 29 audits |
| Hound | – | – | – | 31.2% | – | – | – | – |
| Savant Chat | – | – | – | – | 0.952 | 35% | 17.9% | Top 6 Sherlock contest |
| Wake Arena | – | – | – | – | – | – | – | 33% (Lido, Printr, Everstake) |
| Zellic V12 | – | – | – | – | – | – | – | 41/94 high-sev (43.6%) |
| AlmanaxAI | – | – | – | – | – | 5% | 5.9% | – |
| LightChaser | – | – | – | – | – | – | – | #1 C4 Bot Races (60+) |
| Octane | – | – | – | – | – | – | – | #1 Monad $500K |
*Nethermind AuditAgent's 30% recall is from its own 29-audit production evaluation, not from ScaBench directly. Hound's 31.2% is the ScaBench benchmark result.
These numbers paint a compelling but incomplete picture. The benchmarks themselves have structural nuances worth understanding — and the pattern is systematic.
Chapter 6: Benchmarks and Independence — Who Benchmarks Whom
A note before we begin: In every case below, the same team that built a benchmark also built a tool that scores well on it. This is natural in an emerging field — the people with enough domain expertise to design meaningful benchmarks are often the same people building the tools. We are not implying any manipulation or bad faith. Many of these benchmarks are genuinely valuable contributions. The point is simply that when a third party independently tests the same tools, the numbers tend to be lower — and that's useful context for interpreting results.
In five out of five major smart contract audit benchmarks released between 2024 and 2026, the creator's own tool or model ranks first. The pattern is consistent and, as noted above, understandable – but it does mean that independent evaluations carry extra weight when interpreting results.
The Pattern: Creator Tops Creator's Benchmark
Case 1: CTFBench and Savant Chat (Full overlap)
AuditDB, led by Igor Gulamov, created CTFBench. AuditDB's own product, Savant Chat, tops it with a VDR of 0.952 and claimed 87% accuracy. When Lyuboslav Lyubenov, an independent researcher, evaluated Savant Chat on his own test set, performance dropped to 35% recall and 17.9% precision. The gap between 87% claimed accuracy and 35% independently measured recall is significant; the two metrics are not directly comparable (accuracy vs. recall), but the gap illustrates how much evaluation context matters (CTFBench; Lyubenov eval).
Case 2: LISABench and AgentLISA (Full overlap)
AgentLISA, backed by CertiK and NTU's Prof. Yang Liu (600+ publications, h-index 60+), created LISABench – 10,185 code-complete cases from 584 protocols, 25x larger than SCONE-bench. AgentLISA claims $7.3M+ in real exploits detected since June 2025. No independent verification of these claims was found in any source reviewed for this research. The benchmark also sits adjacent to a $12M LISA token launch, adding a financial dimension worth considering when evaluating claims (AgentLISA).
Case 3: ScaBench and Nethermind AuditAgent (Partial overlap)
Bernhard Mueller and Nethermind co-developed ScaBench. Nethermind's AuditAgent scoring algorithm serves as ScaBench's standard scorer – meaning the tool's creator defines what counts as a correct answer. Mueller's separate solo project, Hound, scored 31.2% recall, offering partial independent data. But the fact that the tool's creator also defines the scoring rules is worth noting (ScaBench methodology).
Case 4: EVMBench and GPT-5.3-Codex (Partial overlap)
OpenAI and Paradigm released EVMBench (with frontend support from OtterSec). OpenAI's GPT-5.3-Codex achieved the top exploit score of 72.2%. This is a partially mitigated overlap: other models were tested on the same benchmark, including Claude Opus 4.6 at 45.6% detection and Gemini 3 Pro (score not published). The benchmark methodology, dataset, and code are public. The primary concern is selection bias in task design – exploitation-focused tasks may favor models optimized for agentic code generation, which is precisely what Codex was built for (EVMBench whitepaper; Paradigm blog).
Case 5: SCONE-bench and Claude Opus 4.5 (Partial overlap, most mitigated)
Anthropic's Red Team created SCONE-bench. Claude Opus 4.5 achieved the highest individual post-cutoff exploit rate at 65% (13/20), while the collective rate across all 10 models was 51.11% (207/405). This case has the strongest mitigating factors: the methodology is fully transparent, the benchmark is open-source, 10 frontier models were tested including GPT-5, and the paper explicitly frames the work as a safety evaluation rather than a marketing exercise. The benchmark was derived from the existing DefiHackLabs repository (680+ incidents), reducing the potential for cherry-picking (SCONE-bench).
Degrees of Overlap
Not all cases are equal. A useful framework for distinguishing them:
| Degree | Definition | Examples |
|---|---|---|
| Full | Creator's tool tops creator's benchmark, no independent testing, no third-party scores published | CTFBench/Savant Chat, LISABench/AgentLISA |
| Partial | Creator's model leads, but other models tested on same benchmark with published scores | EVMBench/GPT-5.3-Codex, ScaBench/AuditAgent |
| Mitigated | Creator's model leads, methodology transparent, open-source data, multiple third-party models tested | SCONE-bench/Claude Opus 4.5 |
The Correctors: Independent Evaluators
Two individuals stand out as independent correctors in a field short on them.
Lyuboslav Lyubenov tested Nethermind AuditAgent, Savant Chat, and AlmanaxAI against the same contract set. His results provide a useful independent baseline. Savant Chat drops from 87% accuracy (self-reported) to 35% recall / 17.9% precision — different metrics, but a notable difference. AuditAgent shows 40% recall but only 4.1% precision – roughly 24 false positives per genuine finding. AlmanaxAI, which raised $1M in pre-seed funding, scored 5% recall / 5.9% precision — suggesting the tool was still early in development at the time of testing (Lyubenov's evaluation).
Antonio Viggiano (Size Protocol) ran a single-project benchmark in August 2025: 743 nSLOC, 12 known High/Medium issues. His central finding: no single AI tool found more than 1/3 of high-severity issues. Humans found 8/12 issues. The best AI tool recovered $1.7K in simulated earnings against humans' $5.4K. Some AI submissions tended to overrate severity levels. But Viggiano also noted that different AI tools collectively found the 4 issues that humans missed entirely – evidence for complementary, not replacement, value (Viggiano's benchmark).
The "90% Academic" Problem
Academic papers routinely report metrics above 90%. iAudit achieves F1=91.21% (ICSE 2025). LLM-SmartAudit claims 98% accuracy (IEEE TSE 2025). SmartAuditFlow reports 100% accuracy on common vulnerability categories (ACM TOSEM 2025). PropertyGPT reaches 80% recall (NDSS 2025).
Meanwhile, real-world recall sits at 30-40% (Nethermind blog; Lyubenov eval). SmartAuditFlow itself acknowledges the gap: its 100% accuracy on "common/critical" vulnerabilities drops to 41.2% on "comprehensive real-world projects." The difference is the dataset. Academic benchmarks use curated vulnerability sets with known, often well-documented vulnerability types. Real-world codebases contain novel logic flaws, protocol-specific attack surfaces, and business logic errors that industry practitioners estimate represent approximately 80% of actual bugs – the portion that current tools cannot auto-detect (SmartAuditFlow).
Summary: Who Built What
| Benchmark | Creator | Top Scorer | Independence | Independent Verification Status |
|---|---|---|---|---|
| CTFBench | AuditDB (Igor Gulamov) | Savant Chat (AuditDB) – VDR 0.952 | Full | Lyubenov: 35% recall, 17.9% precision |
| LISABench | AgentLISA/NTU (Prof. Yang Liu) | AgentLISA – claims $7.3M | Full | No independent verification found |
| ScaBench | Nethermind/Bernhard Mueller | AuditAgent – scoring algo is theirs | Partial | Hound 31.2% provides some data |
| EVMBench | OpenAI + Paradigm (OtterSec frontend) | GPT-5.3-Codex – 72.2% exploit | Partial | Claude 45.6% detect; other models tested |
| SCONE-bench | Anthropic Red Team | Claude Opus 4.5 – 65% post-cutoff; 51.11% collective | Mitigated | GPT-5 tested, 10 models total, open methodology |
Side-by-Side: Claimed vs. Independently Verified
| Tool | Claimed Performance | Independent Performance | Gap |
|---|---|---|---|
| Savant Chat | 87% accuracy (CTFBench) | 35% recall / 17.9% precision (Lyubenov) | Different metrics; independent recall far below claimed accuracy |
| Nethermind AuditAgent | 62% of projects have valid issues | 40% recall / 4.1% precision (Lyubenov) | 62% measures project-level detection, not precision; actual precision is 4.1% |
| AlmanaxAI | "Various security detections" (website) | 5% recall / 5.9% precision (Lyubenov) | Early-stage, limited results at time of testing |
| AgentLISA | $7.3M in exploits detected | No independent verification | Unknown |
| Academic tools (avg) | 90-100% F1/accuracy on curated sets | 30-40% recall in real audits | Metrics not directly comparable (F1 vs recall); gap is real but multiplier is approximate |
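Why accuracy and recall diverge so sharply is easiest to see on an imbalanced test set. A toy confusion matrix (all numbers invented for illustration, not drawn from any of the papers above):

```python
# Curated benchmarks are dominated by easy cases: suppose 10 of 100
# samples are vulnerable and a tool catches only 3 of them.
tp, fn = 3, 7    # vulnerable samples: caught vs missed
tn, fp = 88, 2   # clean samples: correctly passed vs falsely flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)  # inflated by the many easy negatives
recall = tp / (tp + fn)                     # looks only at the vulnerable class

print(f"accuracy: {accuracy:.0%}, recall: {recall:.0%}")  # accuracy: 91%, recall: 30%
```

A tool can honestly report 91% accuracy while missing 70% of real bugs, which is how the 90%+ academic figures and the 30-40% independent recall numbers can both be true at once.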
If the benchmarks can't be fully trusted, what does real-world deployment actually tell us? Let's look at the production data.
Chapter 7: Reality Check — What Actually Works in Production
In Lyubenov's independent evaluation — the only benchmark conducted fully outside any vendor's control — the best-performing AI audit tool achieved 40% recall and 4.1% precision (Lyubenov eval). That precision figure means roughly 24 false positives for every genuine vulnerability found. The second-best tool managed 35% recall at 17.9% precision. The lowest-scoring tool managed 5% recall and 5.9% precision, suggesting it was still in an early development stage at the time of evaluation.
The Performance Ceiling
Across every independent measurement available, the realistic performance ceiling for AI smart contract audit tools sits at 30–40% recall and 4.1–17.9% precision in best-case scenarios. Nethermind's own data from 29 completed real-world audits confirms this range: 30% average recall, with some individual projects reaching 50%, and valid issues detected in 62% of projects (Nethermind blog).
These numbers are consistent. Viggiano's benchmark on Size Protocol's Meta Vault (743 nSLOC, 12 known high/medium issues) found that no single AI tool discovered more than one-third of high-severity issues. The best AI submission was worth $1.7K in simulated earnings, while human auditors found 8 of 12 issues for $5.4K (Viggiano's benchmark).
The ~80% Detection Gap
A widely cited estimate among industry practitioners — including organizations like Cyfrin (behind the Aderyn static analyzer and the Solodit vulnerability database with 50K+ findings) — is that ~80% of actual bugs are business logic issues not auto-detectable by machines, including AI. We were unable to find a rigorous study behind this specific number — it reflects practitioner consensus rather than a measured statistic. But the directional claim is supported by benchmark data: the best tools plateau at 30-40% recall in independent tests, consistent with a large category of bugs they cannot reach. The vulnerabilities that matter most in production — protocol-specific economic assumptions, governance manipulation vectors, incentive misalignments — require understanding the intent of the code, not just its syntax.
What AI Finds vs. What It Misses
The pattern is clear and consistent across tools: AI excels at detecting known vulnerability classes with well-documented signatures, and struggles with anything requiring protocol-specific reasoning.
| Category | AI Detects Reliably | AI Misses Consistently |
|---|---|---|
| Classic patterns | Reentrancy, integer overflow/underflow, unchecked external calls | Protocol-specific logic flaws |
| Access control | Missing access modifiers, unprotected functions | Subtle permission escalation chains |
| Known templates | Timestamp dependence, tx.origin misuse, delegatecall risks | Economic attack vectors (flash loan exploits, oracle manipulation) |
| Code-level | Dead code, gas inefficiencies, standard violations | Cross-protocol interaction bugs, governance manipulation |
This taxonomy holds across tool architectures. Mythril reliably catches reentrancy and integer overflows. Sherlock AI routes vulnerabilities to domain-specific analyzers for reentrancy, access control, and price manipulation. Yet none of these tools consistently identifies the business logic flaws that the ~80% industry estimate points to.
Production Deployments That Matter
Strip away the benchmarks and self-reported metrics. Here is what has actually been demonstrated in production or competition settings with verifiable results:
| Tool | Context | Results | Evidence |
|---|---|---|---|
| Wake Arena | 3 real audits: Lido, Printr, Everstake (Oct-Dec 2025) | 33% of all findings; across all 3 protocols: 5/10 critical vulns found + 5 unique findings beyond human auditors; pure AI audit (LUKSO): 10 findings (2H, 6M, 1L, 1W), only 2 false positives | Wake Arena blog |
| Nethermind AuditAgent | 29 completed production audits | 30% recall average; retroactively detected the exact ResupplyFi exchange rate flaw that caused the $9.8M hack | Nethermind blog |
| Octane Security | $500K Monad Audit Competition on Code4rena (Sept-Oct 2025) | #1 among 1,600+ researchers; caught 3/4 high-severity findings in novel Rust/C++ codebase; also a top performer on Immunefi leaderboard | Code4rena Monad |
| Sherlock AI | Internal evaluation | Combines static analysis, heuristics from 0x52, and ML on verified contest findings | Sherlock AI |
| Savant Chat | Sherlock Symbiotic/DeFi Audit Contest (Sept 2025) | Top 6 competing against dozens of human auditors | Savant announcement |
The Nethermind result deserves special attention: when tested retroactively against the ResupplyFi contract after the $9.8M hack (June 2025), AuditAgent flagged the exact exchange rate logic flaw — demonstrating the vulnerability was within the tool's detection capability and could have been prevented had the tool been applied pre-deployment (Nethermind blog).
The "Pair Auditor" Consensus
The most credible organizations in smart contract security consistently position AI as an augmentation tool, not a replacement. Nethermind explicitly designs AuditAgent as a "pair auditor" that augments manual review (Nethermind). Trail of Bits licenses its Claude Code skills under Creative Commons (CC BY-SA 4.0), positioning them as augmentation tools rather than standalone solutions. OpenZeppelin built an MCP server integrating its contract standards into AI workflows, positioning it as a development-time guardrail rather than a standalone audit solution.
No serious player claims full automation. The consensus is clear: AI handles the first pass, humans handle the judgment.
Complementary Coverage
Perhaps the most important finding from Viggiano's benchmark is not what any single tool found, but what the union of tools found: "Different AI tools found the 4 issues that humans missed" (Viggiano's benchmark). The tools that caught those issues were not the same tools that performed best overall. Different architectures — knowledge graphs, multi-agent systems, hypothesis-critic patterns — have different blind spots.
This points to the real value proposition: running multiple AI tools in parallel during the initial scan phase, widening the coverage surface, and letting human auditors focus their limited attention on the business logic and economic reasoning that AI cannot reach.
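The union effect behind this recommendation is plain set arithmetic. A sketch with invented finding IDs (not Viggiano's actual data):

```python
# Hypothetical per-tool findings against a known issue set (illustrative only).
known_issues = {"H-1", "H-2", "H-3", "M-1", "M-2", "M-3", "M-4", "M-5"}
tool_findings = {
    "tool_a": {"H-1", "M-2", "M-3"},
    "tool_b": {"H-1", "H-2", "M-4"},
    "tool_c": {"M-2", "M-5"},
}

best_single = max(len(f & known_issues) for f in tool_findings.values())
union_hits = set().union(*tool_findings.values()) & known_issues

print(f"best single tool: {best_single}/{len(known_issues)}")     # 3/8
print(f"union of tools:   {len(union_hits)}/{len(known_issues)}")  # 6/8
```

Because the tools' blind spots only partially overlap, the union can double single-tool coverage even when no individual tool improves, which is the complementarity pattern Viggiano observed.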
Production results also reveal a surprising secondary finding: the tools with the best results are not always the best funded. In fact, the correlation between money and performance is nearly zero.
Chapter 8: The Funding Paradox
Hound, a solo open-source project with $0 in funding, achieves 31.2% recall on ScaBench. AlmanaxAI, backed by $1M in pre-seed capital, scored 5% recall and 5.9% precision in Lyubenov's evaluation — the lowest among independently tested tools at the time. LightChaser, another unfunded tool from an anonymous developer, was the top-placing bot in Code4rena Bot Races across 60+ races in 2024 (Hound; AlmanaxAI; LightChaser). The relationship between money and results in AI smart contract auditing is, at best, nonexistent.
The Funding Landscape
The total identifiable funding in dedicated AI audit tooling exceeds $38M when Cantina's $7.83M Series A is included. The money is spread unevenly across a handful of players, and $12M of the total is a token launch rather than traditional VC:
| Funding | Tool | Best Verified Result | Verdict |
|---|---|---|---|
| $12M (token launch) | AgentLISA | Claims $7.3M in exploits detected (unverified); created own benchmark (LISABench) | Unverified |
| $6.75M (seed) | Octane Security | Won $500K Monad competition (#1 among 1,600+); #1 Immunefi | Delivers |
| $5.5M (seed) | Sherlock AI | 21 valid findings in eval (vs ChatGPT 5.2's 3) | Delivers |
| $4.3M (seed) | Olympix | Claims 30%+ of Solidity developers as users (self-reported); no public benchmark data | Unverified |
| ~$1.2M (reported) | QuillShield | CTFBench participant, no standout results reported | Limited public data |
| $1M (pre-seed) | AlmanaxAI | 5% recall, 5.9% precision in independent test | Early stage |
| $0 | Hound (Mueller) | 31.2% recall on ScaBench | Delivers |
| $0 | LightChaser* | Top-placing bot in Code4rena Bot Races 2024 (60+ races) | Delivers |
*LightChaser is likely a traditional pattern-matching system (1,000+ detection patterns), not an AI/LLM tool. Included for comparison as it competes in the same arena.
The pattern is stark. Octane and Sherlock demonstrate that funding can produce results when combined with the right team and domain expertise. But AlmanaxAI and QuillShield have yet to demonstrate publicly verifiable results matching their funding levels. Meanwhile, Hound and LightChaser prove that zero funding is no barrier to competitive performance.
What Matters More Than Money
Three factors consistently predict tool quality better than capital raised:
- Domain expertise of the creator. Bernhard Mueller created Mythril (4,207 stars, the standard symbolic execution tool for Solidity), then created Hound with a unique knowledge-graph architecture, then co-developed ScaBench (the field's standard benchmark scorer), and advises Sherlock AI. His unfunded solo project outperforms tools backed by millions because he has 15+ years of security experience and built the foundational tools the entire ecosystem depends on (Mythril; Hound; ScaBench). The researcher known as 0x52, one of Sherlock's top security researchers, had their auditing techniques directly encoded as heuristics into Sherlock AI — a direct transfer of human expertise into machine capability (Sherlock AI).
- Architecture choices. Specialization outperforms generalization. Sherlock AI combines static analysis, auditor-informed heuristics, and ML models routing vulnerabilities to domain-specific analyzers (Sherlock AI). Wake Arena deploys 108 specialized detectors combined with graph-driven reasoning. Hound builds knowledge graphs rather than relying on prompts. Tools that treat smart contract auditing as a generic LLM task — feed code in, get vulnerabilities out — consistently underperform those with purpose-built architectures.
- Training data quality. Octane trains on proprietary data from Code4rena, Sherlock, and Immunefi competitions — real findings from real audits validated by human experts (Octane). Sherlock AI trains on verified findings from its own contest platform plus exploited codebases (Sherlock AI). AgentLISA claims training on data from 10 audit platforms with knowledge distilled from 3,086 specialists. The tools with real, human-validated audit data consistently outperform those training on synthetic or scraped datasets.
The Key People Network
The field is remarkably small at the top. A handful of individuals connect the most important tools, benchmarks, and organizations:
| Person | Affiliations & Tools | Impact |
|---|---|---|
| Bernhard Mueller | Created Mythril, created Hound (solo), co-developed ScaBench (Nethermind), advises Sherlock AI | Most trusted individual in the space. Only person with open-source, independent, and advisory presence across multiple tools |
| Prof. Yang Liu (NTU) | AgentLISA, LISABench, PropertyGPT (NDSS 2025 Distinguished Paper) | Strongest academic pedigree: 600+ publications, h-index 60+ |
| 0x52 | Top Sherlock researcher, knowledge embedded in Sherlock AI | Expert knowledge transfer: human auditor instincts encoded into AI heuristics |
| Giovanni Vignone | Co-founder & CEO, Octane Security | Led the first AI tool to win a major audit competition |
| Antonio Viggiano | Created independent AI benchmark, Size Protocol | Key independent evaluator providing unbiased performance data |
| Lyuboslav Lyubenov | Created Solodit MCP, independent tool comparison | Only fully independent third-party evaluator of multiple commercial tools |
Mueller alone touches four of the most credible entities in the space. Yang Liu's academic lab produced both a top tool (AgentLISA/PropertyGPT) and a major benchmark (LISABench) — though that dual role is exactly the pattern documented in Chapter 6. The people who built the foundational infrastructure — Mythril, Slither, the contest platforms — are the same people now building or advising the AI tools attempting to automate what they once did manually.
If expertise matters more than capital, and the field is evolving this fast, what does the trajectory look like? The pace of change is remarkable.
Chapter 9: The Frontier — Exponential Growth (Late 2025-2026)
Approximately one year before SCONE-bench's December 2025 publication, earlier-generation AI agents exploited 2% of its contracts, stealing $5K in simulated funds. By December 2025, 10 frontier models (most of which did not exist a year prior) collectively produced working exploits for 207 of 405 contracts (51.11%), generating $550M in simulated stolen funds. On the post-cutoff subset (contracts exploited after models' knowledge cutoffs), agents exploited 19 contracts (55.8%) and drained $4.6M (SCONE-bench).
An important caveat: part of this jump reflects the arrival of entirely new, more capable models – not just improvement of existing ones. The 2% baseline was set by weaker models that have since been superseded. Still, the improvement is dramatic – and the February 2026 EVMBench results suggest it has not plateaued. On a separate benchmark, GPT-5.3-Codex now exploits 72.2% of critical Code4rena bugs, up from <20% when that project started (EVMBench).
What Changed: Three Technical Drivers
The jump from single-digit to double-digit exploit rates was not driven by a single breakthrough. Three parallel developments converged in late 2025.
Reasoning models. DeepSeek R1 and similar chain-of-thought reasoning models introduced step-by-step reasoning over code, enabling models to trace multi-step state transitions across contract calls – exactly the kind of analysis that reentrancy and price manipulation bugs demand. On SolEval, a benchmark measuring Solidity code generation (not vulnerability detection), DeepSeek-V3 achieved 26.29% Pass@10 – the highest among all models – indicating strong structural understanding of smart contract syntax and semantics (SolEval). Note: DeepSeek-V3 is a different model from DeepSeek R1; both contribute to the broader trend of improved smart contract comprehension.
Agent scaffolding. The shift from "prompt a model" to "deploy an agent" – with function calling, tool use, and blockchain interaction – transformed exploit generation from a text completion task into an autonomous workflow. SCONE-bench specifically measures agents that fork mainnet state, deploy contracts, and execute transactions. The 10 frontier models tested produced working exploits for 207 of 405 contracts ($550M simulated stolen funds) and discovered 2 novel zero-day vulnerabilities across 2,849 previously unexamined contracts (SCONE-bench).
Context windows. Modern frontier models support large context windows (e.g., Claude's 200K tokens), enabling ingestion of entire protocols – multiple interacting contracts, libraries, and deployment configurations – in a single pass. Cross-contract reasoning, previously the domain of graph-based tools like Wake Arena (108 detectors), is now accessible to general-purpose LLMs.
Frontier Model Comparison
The EVMBench and SCONE-bench leaderboards reveal distinct model strengths:
| Model | EVMBench Exploit | EVMBench Detect | SCONE-bench Exploit | Context Window | Release |
|---|---|---|---|---|---|
| GPT-5.3-Codex | 72.2% | – | – | – | Feb 2026 |
| Claude Opus 4.6 | – | 45.6% | – | 200K | 2026 |
| Claude Opus 4.5 | – | – | 65% (post-cutoff); 51.11% collective | 200K | 2025 |
| GPT-5 | Tested | Tested | Tested | – | 2025 |
| Gemini 3 Pro | Tested | Tested | Tested | – | 2025-2026 |
| DeepSeek R1 | – | – | Tested | – | 2025 |
The specialization is notable (EVMBench; SCONE-bench). GPT-5.3-Codex leads on exploitation – writing and executing attack code – while Claude Opus 4.6 leads on detection – identifying vulnerabilities without necessarily exploiting them. These are different capabilities, and the gap between them matters for the offense-defense balance.
The Offensive-Defensive Asymmetry
On EVMBench, the highest exploit score (72.2%, measured on 24 vulnerabilities) exceeds the highest detection score (45.6%, measured on 120 vulnerabilities) — though these are different test sets of different sizes, so the comparison is directional rather than exact. The underlying logic still holds: an attacker needs to find one exploitable bug to drain millions; a defender needs near-100% coverage to prevent it.
This asymmetry is compounded by cost. A SCONE-bench agent run costs an average of $1.22 (SCONE-bench). Scanning every deployed contract on Ethereum for exploitable bugs is now economically trivial for any well-resourced attacker. OpenAI's announcement of $10M in API credits for open-source security as part of its broader security push signals that the industry recognizes this threat (OpenAI EVMBench announcement).
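The economics can be reproduced from the paper's own numbers; the only assumption here is treating $1.22 as a flat per-contract cost:

```python
cost_per_run = 1.22   # average SCONE-bench agent cost per contract
scanned = 2_849       # previously unexamined contracts swept in the paper
zero_days = 2         # novel vulnerabilities found in that sweep

total = cost_per_run * scanned
print(f"total sweep cost:  ${total:,.2f}")              # $3,475.78
print(f"cost per zero-day: ${total / zero_days:,.2f}")  # $1,737.89
```

Under $3.5K to surface two previously unknown exploitable bugs is far below typical bug bounty payouts, let alone the value at risk in the contracts themselves, which is the asymmetry the defensive API-credit programs are responding to.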
Projections (With Caveats)
The growth from 2% to 51.11% in one year is dramatic, but it's worth remembering how technology adoption curves typically work: early progress often looks exponential, then hits a plateau as the low-hanging fruit is exhausted and harder problems remain. We've seen this pattern with self-driving cars, machine translation, and protein folding — rapid initial gains followed by a long tail of diminishing returns on increasingly difficult edge cases. AI smart contract auditing is likely on the same trajectory. What we can say with confidence is that the field is firmly in its growth and improvement phase — the capability is real and getting better. But extrapolating the current rate indefinitely would be naive. Three specific caveats constrain any projection:
Known patterns only. SCONE-bench and EVMBench test against historically exploited contracts with documented vulnerability classes. Performance on truly novel attack vectors is unmeasured.
Detection ceiling. Industry practitioners estimate that ~80% of actual bugs are business logic issues not auto-detectable by machines, including AI. These include protocol-specific logic errors – mispriced oracles, flawed governance mechanisms, incentive misalignments – that require understanding intent, not just code. This ceiling is unlikely to break without fundamental advances in specification reasoning.
Benchmark contamination. On contracts exploited after model training cutoffs, SCONE-bench agents exploited 19 post-cutoff contracts (55.8%) – close to the overall 51.11% collective rate, but measured on a much smaller sample (SCONE-bench). True zero-shot capability remains uncertain due to sample size.
The trajectory is clear: we are in an active growth phase. The early exponential curve will eventually flatten — but right now, capability is improving faster than most of the industry has internalized. The question is no longer whether AI will transform smart contract security, but how practitioners should adapt while the field is still maturing.
Chapter 10: What It All Means — Conclusions and Practical Recommendations
We started this research to understand whether AI smart contract auditing is real or hype. After analyzing 84 tools, 16 benchmarks, and 60+ academic papers, here's what we found.
Five Key Findings
- From curiosity to competition win in six years. SmartEmbed (ICSME 2019) was the first code-embedding approach for smart contract bug detection; Octane Security won a $500K audit competition against 1,600+ researchers in October 2025 (SmartEmbed; Octane/Code4rena Monad). The gap between "interesting paper" and "production tool" has closed.
- A pair auditor, not a replacement. The best tools achieve 30-40% recall in independent evaluations – Nethermind AuditAgent scored 30% on its own 29-audit evaluation and 40% in Lyubenov's independent test (same tool, different context). Precision ranges from 4.1% (AuditAgent) to 17.9% (Savant Chat) (Lyubenov eval; Nethermind blog). This is useful. It is not sufficient.
- The ~80% detection gap. Industry practitioners estimate ~80% of actual bugs are business logic issues not auto-detectable by machines, including AI. These include protocol-specific logic errors, economic assumptions, and governance design flaws – vulnerabilities that require understanding the system's intended behavior, not just its code.
- Every major benchmark is self-evaluated. In all five major benchmarks examined, the creator's own tool or model ranks first — CTFBench/AuditDB, LISABench/AgentLISA, EVMBench/OpenAI, ScaBench/Nethermind, SCONE-bench/Anthropic. This is natural — the people who build benchmarks are often the domain experts. But it means independent evaluations carry extra weight, and those independent evaluations consistently show lower performance than self-reported numbers.
- Exploit capability is compounding fast. On SCONE-bench, AI agents went from a 2% to a 51.11% collective exploit rate in roughly one year. On the separate EVMBench benchmark, GPT-5.3-Codex exploits 72.2% of bugs. These are different benchmarks measuring different things, but both point in the same direction (SCONE-bench; EVMBench).
Current State Summary
| Dimension | Current State | Gap | Opportunity |
|---|---|---|---|
| Detection recall | 30-40% | 60-70% of bugs missed | Multi-tool ensembles |
| Precision | 4.1-17.9% | 82-96% of findings are false positives | Better training data, filtering |
| Auto-detectable bugs | ~20% detectable | ~80% not auto-detectable | Specification-aware reasoning |
| Exploit capability | 51-72% (different benchmarks) | Growing fast; offense may outpace defense | Defensive AI investment |
| Benchmarks | 16 benchmarks, incomparable | No standard metrics | Independent benchmark bodies |
| Cost | $0.01-$13 per AI audit | vs $150K+ manual | Hybrid workflows |
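The precision row translates directly into triage workload: at 4.1% precision, roughly 24 findings must be reviewed for every real bug. A back-of-the-envelope sketch (the 10-bug target is illustrative; the precision figures come from the table above):

```python
def review_burden(precision: float, true_findings: int) -> int:
    """Total findings a reviewer must triage to surface `true_findings` real bugs."""
    return round(true_findings / precision)

# Precision range reported in the independent evaluation above
for name, precision in [("AuditAgent", 0.041), ("Savant Chat", 0.179)]:
    total = review_burden(precision, true_findings=10)
    print(f"{name}: ~{total} findings to triage for 10 real bugs")
```

That is roughly 244 findings at the low end and 56 at the high end, which is why the table lists filtering and better training data as the opportunity.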
For Protocol Teams
Use AI as a first-pass scan, not a final verdict. Run multiple AI tools – Viggiano's benchmark showed that different AI tools collectively found the 4 issues that humans missed, because each tool covers a different subset of vulnerabilities (Viggiano's benchmark). Then invest human auditor time where AI cannot reach: business logic, cross-protocol interactions, and economic incentive design.
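The complementary-coverage argument can be illustrated with a toy union-of-findings calculation. All bug IDs and per-tool findings below are invented for illustration, not taken from any real evaluation:

```python
# Hypothetical ground truth and per-tool findings (illustrative only)
all_bugs = {"B1", "B2", "B3", "B4", "B5"}
findings_by_tool = {
    "tool_a": {"B1", "B2"},
    "tool_b": {"B2", "B3"},
    "tool_c": {"B4"},
}

def recall(found: set[str], bugs: set[str]) -> float:
    """Fraction of known bugs present in a set of findings."""
    return len(found & bugs) / len(bugs)

union = set().union(*findings_by_tool.values())
for name, found in findings_by_tool.items():
    print(f"{name}: {recall(found, all_bugs):.0%}")
print(f"ensemble: {recall(union, all_bugs):.0%}")
```

Here no single tool exceeds 40% recall, but the union reaches 80% because the tools overlap only partially – the same mechanism behind the multi-tool results in Viggiano's benchmark.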
The cost equation favors hybrid approaches: an AI pre-scan (ranging from $0.01 per 1K lines to $13 per full scan, depending on the tool) followed by a focused human audit can reduce the scope — and cost — of manual review, while covering both AI-detectable patterns and human-only business logic (GPTScan; Veritas Protocol).
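A minimal sketch of that cost equation, assuming the $13 upper-end AI scan cost and the $150K manual figure cited above; the scope-reduction percentages are hypothetical assumptions, not measured values:

```python
def hybrid_cost(manual_full: float, ai_scan: float, scope_reduction: float) -> float:
    """Cost of an AI pre-scan followed by a human audit on the reduced scope."""
    return ai_scan + manual_full * (1 - scope_reduction)

manual_full = 150_000  # full manual audit (industry figure cited above)
ai_scan = 13           # upper end of the per-scan AI cost range

for reduction in (0.2, 0.3, 0.4):  # hypothetical scope reductions
    total = hybrid_cost(manual_full, ai_scan, reduction)
    print(f"{reduction:.0%} scope cut -> ${total:,.0f}")
```

Even under these rough assumptions the AI scan cost is negligible; the savings come entirely from how much manual scope the pre-scan can safely eliminate, which is why precision matters as much as recall here.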
For Auditors
Adopt AI tools to widen coverage, not to reduce effort. Focus human time on business logic and novel attack patterns – the ~80% of actual bugs that current AI cannot auto-detect. Use multiple tools: AuditAgent, Savant Chat, and Hound find different bugs. Hound (31.2% recall on ScaBench, $0 funding) and Nethermind AuditAgent (30% on production audits) show recall in a similar range — though these are different measurement contexts and Hound's score is on its co-creator's benchmark (see Chapter 6). The broader pattern suggests that architecture and domain expertise can matter as much as funding (ScaBench; Nethermind blog).
For AI Tool Builders
Specialize. LightChaser competed in 60+ Code4rena Bot Races with 1,000+ detection patterns, consistently ranking at the top – no LLM required. Wake Arena found 43/94 high-severity issues with 108 specialized detectors (Wake Arena blog; LightChaser). The evidence says specialization beats generalization.
Invest in real audit training data, not synthetic. PropertyGPT trained on real CVEs found 12 zero-day vulnerabilities and won NDSS 2025 Distinguished Paper (PropertyGPT). And seek independent benchmark validation – results carry more weight when measured by a third party. In all five major benchmarks we examined, the creator's own tool ranked first (Chapter 6).
For Researchers
The highest-impact open problem is detecting the ~80% of bugs that are business logic issues – economic design, cross-protocol interaction flaws, and governance vulnerabilities – that machines, including AI, cannot currently auto-detect. Cross-contract reasoning and formal verification integration (as in PropertyGPT and SmartInv) show promise but remain academic. SmartInv alone identified 119 zero-days from 89,621 contracts (SmartInv). The field also needs independent benchmark bodies – individuals like Lyubenov and Viggiano are providing critical independent evaluation, but their work is ad hoc and unfunded.
Reference Table: Key Metrics with Sources
| Metric | Value | Source |
|---|---|---|
| AI audit tools cataloged | 84 | This research |
| Total identified funding (tracked tools) | $38M+ (incl. $12M token launch) | Various sources |
| Best independent recall | 30-40% | Nethermind 30%, Lyubenov eval 40% |
| Best independent precision | 4.1-17.9% | Lyubenov eval |
| Exploit rate (EVMBench, best) | 72.2% | EVMBench, GPT-5.3-Codex |
| Exploit rate (SCONE-bench, collective) | 51.11% (207/405); Opus 4.5 65% post-cutoff | SCONE-bench |
| First AI competition win | $500K Monad | Octane/Code4rena, Oct 2025 |
| Auto-detection gap | ~80% of actual bugs are business logic issues, not auto-detectable | Industry estimate |
| Academic papers | 60+ | Survey |
| AI audit cost | $0.01-$13 | Various tools |
| Manual audit cost | $150,000+ | Industry standard |
The Bottom Line
AI will not replace security auditors. AI will replace auditors who don't use AI. The 30-40% recall ceiling for individual tools, the ~80% business-logic gap, and the lack of fully independent benchmarks all confirm that human expertise remains irreplaceable. But the $0.01-$13 vs $150,000+ cost gap, the complementary coverage across tools, and the compounding growth in exploit capability all confirm that AI is no longer optional. The practitioners who combine AI's exhaustive pattern matching with human judgment on intent, incentives, and architecture will define the next standard of smart contract security.
Methodology
This research was compiled from publicly available sources only. Our data collection covered:
- Academic papers: 60+ papers from ICSE, NDSS, IEEE S&P, ACM TOSEM, IJCAI, and other top venues
- GitHub repositories: Tool repos, benchmark datasets, and open-source implementations
- Company blogs: Nethermind, Ackee (Wake Arena), Sherlock, Anthropic, OpenAI, Paradigm
- Competition platforms: Code4rena, Sherlock, Immunefi
- Independent evaluations: Lyubenov's tool comparison, Viggiano's benchmark
- Benchmark websites: SCONE-bench, EVMBench, ScaBench, CTFBench, LISABench
- Funding data: Crunchbase, press releases, token launch announcements

All claims are linked to primary sources. Where the same team built both a benchmark and the top-scoring tool, we noted it. Where numbers couldn't be independently verified, we noted that too.
Data collection period: February 2026. The smart contract security space moves fast — some numbers may have changed by the time you read this.
Have corrections, additions, or feedback? We aim to keep this research accurate and up to date.
Get in Touch