Research Manifesto: AI Security at a Tipping Point
What we observe
Smart contract security is at a tipping point. Data from the past year shows simultaneous breakthrough and failure:
Breakthrough: AI can now exploit 72% of real Code4rena bugs (EVMbench, Feb 2026). Frontier models reproduce real-world attacks worth millions in simulated stolen funds (SCONE-bench, Feb 2026). The rate of improvement is exponential. Exploit capability roughly doubles every 1.3 months, though researchers expect this to plateau.
Failure: The same AI that's great at attacking is bad at defending. The best production AI audit tool is wrong nearly half the time: 55.3% precision (Sherlock Benchmark, 2026). On EVMbench, the detect rate is only 41% and the patch rate just 28%. And auditing as a whole, human or AI, isn't preventing hacks: 92% of exploited contracts in 2022 had already been audited (AnChain.AI, 2023).
AI is simultaneously powerful enough to be dangerous and unreliable enough to be unusable as a standalone defense.
We exist because we see this gap not as a verdict, but as a research problem with concrete, measurable steps toward a solution.
Three challenges that define our work
Challenge 1. Capability boundaries: where AI breaks and how to push them
AI excels at pattern matching: reentrancy, overflow, missing access control. Anything that looks like a known vulnerability pattern is detected with 73.7-100% F1-score on labeled benchmark datasets (LLM-SmartAudit, IEEE TSE 2025). But when the same system is tested on real-world Code4rena contracts, recall drops to ~48% (F1 not reported for this dataset). The gap between lab and production is enormous. GPTScan detects logic vulnerabilities for $0.01 per 1K lines in 14 seconds, though precision varies from 57% on complex projects to 90% on token contracts (Sun et al., ICSE 2024).
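The metrics quoted throughout this document follow the standard definitions. A minimal sketch of how they relate; the false-negative count below is hypothetical, chosen only to illustrate the arithmetic (the 21 valid / 38 total split echoes the Sherlock figures, but the ground-truth total is made up):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard detection metrics from true positives, false positives,
    and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative: a tool reports 38 findings, 21 of them valid, against a
# hypothetical ground truth of 51 real bugs (so 30 were missed).
p, r, f1 = precision_recall_f1(tp=21, fp=17, fn=30)
# p ≈ 0.553, r ≈ 0.412 — a tool can look decent on precision while
# still missing most real bugs, which is why we report both.
```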
But security is not about pattern matching. The most expensive hacks involve business logic, cross-protocol attacks, and economic exploits. This is where AI breaks:
- Business logic: The model cannot assess whether a protocol's economic model makes sense. This requires understanding intent, not code patterns.
- Cross-contract vulnerabilities: LLMs are limited by context windows and struggle to track data flow across multiple contracts. Most existing tools analyze contracts in isolation.
- New Solidity versions: Recall on v0.8 reentrancy drops from 92% to 13% for GPT-4o-mini, though results vary dramatically by model (Gemini 1.5 Pro retains 93%). Models trained on older Solidity patterns struggle with v0.8 (Xiao et al., 2025).
- Exhaustive audit: Models tend to stop after identifying a single issue rather than exhaustively auditing the codebase (EVMbench, Paradigm, 2026).
However, the boundary is not static. PropertyGPT (NDSS 2025 Distinguished Paper) combines an LLM with formal verification and found 12 previously unknown zero-day vulnerabilities (Liu et al., NDSS 2025). iAudit's multi-agent architecture with domain-specific fine-tuning achieves 91% F1-score on the same dataset where GPT-4 scores 68% (Ma et al., iAudit, ICSE 2025).
Our task: systematically explore the boundary between "AI can" and "AI cannot." Not wait for models to improve on their own, but find specific architectural solutions (hybrid LLM + formal verification, CPG slicing, cross-contract flow analysis, fine-tuning on domain data) that push this boundary now. For each vulnerability class, measure what works, what doesn't, and where the minimum viable approach for breakthrough lies.
Challenge 2. Reliability: where the industry stands and what's missing
The sharpest problem with AI in security is not that it misses bugs. It's that it generates too many false ones.
The current state of commercial AI audit tools (note: results come from different benchmarks and are not directly comparable):
Academic benchmark (Chen et al., TOSEM 2024), SmartBugs-curated, 142 contracts with labeled vulnerabilities:
| Tool | Precision | Recall | Context |
|---|---|---|---|
| Raw GPT-4 (no tools) | 22.6% | 88.2% | Classification across 9 vulnerability categories |
Sherlock independent benchmark (Sherlock, 2026), Flayer + Moongate audit scope, human-triaged findings (note: Sherlock cautions "this isn't a formal research study"):
| Tool | Precision | Findings | Context |
|---|---|---|---|
| Sherlock AI v2.2 | 55.3% | 21 valid / 38 total | Independently verified by senior researcher |
| ChatGPT 5.2 | ~50% | 4 valid / 8 total | Including 2 High severity |
| Claude Sonnet 4.5 | 6.25% | 1 valid / 16 total | Sherlock notes Claude's findings were "highly compressed"; format penalized, not necessarily detection |
Production audits:
| Tool | Metric | Context |
|---|---|---|
| Nethermind AuditAgent | 30% recall overall (42% critical, 43% high) | 29 real audits, precision not published (Nethermind, 2025) |
| Zellic V12 | Detection rates vary by project | Precision not published (Zellic, 2025) |
The highest independently verified precision we found for a production AI tool is 55.3% (Sherlock AI). ChatGPT 5.2 achieved ~50% on the same test. Raw LLMs range from 6% to ~50% precision depending on model and benchmark (GPT-4 at 22.6% on SmartBugs-curated, ChatGPT 5.2 at ~50% on Sherlock's test). Most firms don't publish precision at all.
Meanwhile, academic approaches show impressive numbers, but under sterile conditions:
| Approach | Result | Source |
|---|---|---|
| LLM + static analysis confirmation | 57-90% precision, 70-83% recall | GPTScan, ICSE 2024 |
| Multi-agent + domain fine-tuning | 91% F1 (precision + recall) | iAudit, ICSE 2025 |
| Structured prompts alone | 30-76% FP reduction (varies by model; >60% for 2 of 5 tested) | Xiao et al., 2025 |
| Model ensemble with weighted voting | Abstract claims 60% top-5 accuracy (+19% over baselines), but Table 4 shows ensemble at 56%, same as best single model (DeepSeek FT) | LLMBugScanner, 2025 |
These academic results are likely too optimistic. They are measured on curated datasets with known vulnerability labels, pre-classified contracts, and controlled conditions. Essentially, the model is tested on bugs someone has already found. Production reality is fundamentally different: novel code, novel bugs, novel protocol designs that no dataset has seen before. The gap between academic results (91% F1) and production tools (~55% precision) reflects this difference. We orient toward production metrics as our north star, because that's where the actual value is.
Key insight: self-correction without external feedback makes results worse. In general reasoning tasks (math, QA), LLMs cannot self-correct without external input; performance often degrades rather than improves (Huang et al., ICLR 2024). This finding applies specifically to "intrinsic self-correction" without tools; agentic systems with external verification are a different category. The implication for security: verification through an external oracle works (static analysis, formal verification, symbolic execution), but LLM-only self-review does not.
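The implication reduces to a simple gate: an LLM-proposed finding is reported only if an external oracle independently confirms it. The sketch below is hypothetical scaffolding, not any real tool's API; the `Finding` shape and the stub oracle (standing in for, say, a static-analysis check) are assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Finding:
    contract: str
    vuln_type: str   # e.g. "reentrancy"
    location: str    # e.g. "withdraw()"

def verified_findings(
    llm_findings: list[Finding],
    oracle: Callable[[Finding], bool],
) -> list[Finding]:
    """Keep only findings an external oracle (static analysis, symbolic
    execution, formal verification) independently confirms. LLM-only
    self-review is deliberately not in the loop (Huang et al., ICLR 2024)."""
    return [f for f in llm_findings if oracle(f)]

# Stub oracle: pretends only one (contract, vuln_type) pair is confirmable.
confirmed = {("Vault", "reentrancy")}
oracle = lambda f: (f.contract, f.vuln_type) in confirmed
out = verified_findings(
    [Finding("Vault", "reentrancy", "withdraw()"),
     Finding("Vault", "overflow", "deposit()")],
    oracle,
)
```

The design point is that the oracle is a separate system with its own failure modes, not the same model re-reading its own output.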
At the same time, the task is not simply "reduce FP." Aggressive false positive filtering kills recall: precision of 96.6% is achieved at recall of only 37.8%, meaning the model misses 2/3 of real bugs. On the positive side, the same study found ~60% of LLM-generated vulnerability reports included usable proof-of-concept exploits (Du & Tang, 2024).
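The tradeoff is mechanical: raising a confidence threshold on model-scored findings raises precision and lowers recall at the same time. A toy illustration with synthetic scores (no real benchmark data):

```python
def metrics_at_threshold(scored, threshold):
    """scored: list of (confidence, is_real_bug) pairs.
    Returns (precision, recall) for findings at or above the threshold."""
    flagged = [real for conf, real in scored if conf >= threshold]
    real_total = sum(real for _, real in scored)
    tp = sum(flagged)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / real_total if real_total else 0.0
    return precision, recall

# Synthetic findings: high-confidence ones are mostly real, low-confidence mixed.
scored = [(0.9, True), (0.85, True), (0.8, True), (0.6, True),
          (0.5, False), (0.4, True), (0.3, False), (0.2, True)]
lax = metrics_at_threshold(scored, 0.3)     # more recall, lower precision
strict = metrics_at_threshold(scored, 0.8)  # perfect precision, misses half
```

Here the strict threshold reaches 100% precision but misses half the real bugs, which is exactly the failure mode the Du & Tang numbers describe.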
Our task: improve production results on real-world codebases. Not chase academic scores on curated benchmarks, but raise actual precision and recall on live audits with novel code. Production tools today show ~55% precision at most. We take this as our starting point and work to push it higher by combining LLM reasoning with external tool verification, domain-specific fine-tuning, and systematic evaluation on real audits, not lab datasets.
Challenge 3. Fragmentation: 200+ tools, minimal integration
A 2024 survey of the field cataloged over 200 tools for smart contract security analysis (Iuliano & Di Nucci, 2024). And yet:
- F1-score of the best among 17 tested scanners is at most 73% (on reentrancy detection specifically; varies by vulnerability type). Most are below 50% (Sendner et al., 2023).
- Existing tools could have prevented only 8% of high-impact attacks across 127 incidents totaling $2.3B in losses, though 75% of those attacks involved vulnerability types entirely outside the tools' design scope. Semi-automated approaches could potentially have prevented 37% (Chaliasos et al., 2023).
- Almost 49% of findings resist automated detection entirely (Trail of Bits, 246 Findings, 2019).
Tools are not integrated. Trail of Bits' Crytic ecosystem comes closest: crytic-compile + Slither + Echidna + Medusa + Manticore (now in maintenance mode). But these are separate tools written in different languages (Python, Haskell, Go), with no unified pipeline, no common report format, no monitoring, and no knowledge base. Trail of Bits has since added AI tooling (Buttercup, 2nd place at DARPA AIxCC; Slither-MCP), but these remain separate from the core static/fuzzing stack.
There is no standard for exchanging results. FORGE (ICSE 2026) processed 6,454 audit reports and found 296 CWE categories. Classification is inconsistent across datasets (FORGE, 2026). FORGE itself demonstrates the problem is solvable (achieving 95.6% extraction precision with LLM-based classification), but no industry standard has been adopted yet.
Knowledge lives in individual auditors' heads. Sherlock AI is "trained on the knowledge and instincts" of top researchers, including "record-setting auditors like 0x52" (Sherlock AI). OpenZeppelin contributes massively through their open-source Contracts library (the industry standard, $26T+ in value transferred) and Code Inspector (static analysis + AI), but their internal research process is not fully public. Cyfrin Solodit has aggregated 50K+ vulnerability findings into a searchable database, but it's a structured dataset, not a knowledge graph.
The full stack (static analysis, fuzzing, formal verification, AI, monitoring, knowledge management) is spread across dozens of companies and open-source projects, each covering parts of it. No one has assembled a unified pipeline. The goal is not to build yet another standalone tool. It's to systematically test what already exists and integrate the right combinations. For every pairing of "vulnerability type x approach," measure precision/recall. Determine what works where. Build a pipeline that selects the right tool for the task, rather than throwing everything at the wall.
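The "right tool for the task" idea reduces to a routing table keyed on measured per-class performance. A sketch; the tool names and precision/recall numbers are placeholders, not real benchmark results:

```python
# Hypothetical measured results per (vulnerability class, tool) pairing.
MEASURED = {
    ("reentrancy", "static_analyzer"):     {"precision": 0.81, "recall": 0.74},
    ("reentrancy", "llm_agent"):           {"precision": 0.55, "recall": 0.90},
    ("business_logic", "static_analyzer"): {"precision": 0.10, "recall": 0.05},
    ("business_logic", "llm_agent"):       {"precision": 0.48, "recall": 0.40},
}

def route(vuln_class: str, min_precision: float = 0.4) -> list[str]:
    """Select tools whose measured precision on this class clears the bar,
    ordered best recall first."""
    candidates = [
        (tool, m) for (vc, tool), m in MEASURED.items()
        if vc == vuln_class and m["precision"] >= min_precision
    ]
    candidates.sort(key=lambda tm: tm[1]["recall"], reverse=True)
    return [tool for tool, _ in candidates]
```

With the placeholder numbers above, both tools clear the bar for reentrancy, but the static analyzer is dropped entirely for business logic, where its measured precision is noise.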
How we work
Principle 1: Research for practice, not for publications
Every research effort concludes with an artifact: a benchmark, an adapter, a pipeline stage, an evaluation report. Not papers for their own sake. If a research effort doesn't lead to a measurable improvement in the system, it has failed.
Principle 2: Measure everything
Without metrics there is no progress. Every change to the system goes through the evaluation pipeline:
- Precision/recall by vulnerability type
- False positive rate
- Cost ($ per scan)
- Latency (seconds per scan)
- Per-agent metrics
- A/B tests for every change
If we can't measure it, we don't deploy it.
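The deploy rule can be enforced mechanically: every candidate change is A/B-compared against the current pipeline on a labeled set, and deployment is gated on the deltas. A sketch; the metric names and tolerance values are illustrative, not our actual thresholds:

```python
def deploy_gate(baseline: dict, candidate: dict,
                max_precision_drop: float = 0.0,
                max_recall_drop: float = 0.02) -> bool:
    """Allow deployment only if the candidate does not regress precision
    and keeps any recall loss within tolerance."""
    return (candidate["precision"] >= baseline["precision"] - max_precision_drop
            and candidate["recall"] >= baseline["recall"] - max_recall_drop)

# A change that raises precision at a tiny recall cost passes the gate...
ok = deploy_gate({"precision": 0.55, "recall": 0.40},
                 {"precision": 0.62, "recall": 0.39})
# ...while one that trades precision away for recall is blocked.
blocked = deploy_gate({"precision": 0.55, "recall": 0.40},
                      {"precision": 0.50, "recall": 0.45})
```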
Principle 3: We may be wrong
Following Anthropic's example ("This whole picture may be completely wrongheaded," Core Views on AI Safety):
- Fine-tuning may turn out to be less important than RAG
- Multi-agent may prove worse than a single agent with a better prompt
- Formal verification may be too expensive for a production pipeline
- The entire "LLM for security" approach may hit a ceiling
We hedge: test multiple approaches in parallel, don't bet everything on one architecture. The modular system (Ports & Adapters) allows replacing any component without rewriting the rest.
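The Ports & Adapters idea in miniature: the pipeline depends only on a narrow port, and any backend (a raw LLM, a multi-agent system, a static analyzer) is an adapter behind it. The names and the toy keyword check below are illustrative, not our actual interfaces:

```python
from typing import Protocol

class DetectorPort(Protocol):
    """Port: anything that maps source code to a list of finding labels."""
    def detect(self, source: str) -> list[str]: ...

class KeywordDetector:
    """Toy adapter: flags code containing a low-level external call. A real
    adapter would wrap an LLM call or a static analyzer instead."""
    def detect(self, source: str) -> list[str]:
        return ["low-level-external-call"] if ".call{" in source else []

def run_pipeline(detector: DetectorPort, source: str) -> list[str]:
    # The pipeline knows only the port; swapping adapters needs no rewrite.
    return detector.detect(source)

findings = run_pipeline(KeywordDetector(), 'msg.sender.call{value: amt}("");')
```

Replacing `KeywordDetector` with a different adapter changes nothing in `run_pipeline`, which is the property that lets us drop a losing architecture without rewriting the rest.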
Principle 4: Use everything, not just LLMs
Trail of Bits showed in a 2019 study (pre-LLM era): 49% of findings resist automated detection, yet 78% of high-severity, low-difficulty findings (the most dangerous, easiest-to-exploit category) are caught in the best case by combining static + dynamic analysis (246 Findings, Trail of Bits, 2019). These proportions may have shifted with modern tooling, but no comparable study has been published since. Trail of Bits themselves gave LLMs access to Slither via MCP rather than trusting raw LLM output (Slither-MCP, Trail of Bits, 2025).
LLMs are a powerful reasoning tool. But static analysis, fuzzing, formal verification, symbolic execution are powerful verification tools. The best results come when LLM reasoning is confirmed by external tools, not when the LLM checks itself.
Principle 5: Concrete numbers instead of vague promises
We don't promise to "revolutionize smart contract security." We promise to:
- Publish precision/recall of our system on open benchmarks
- Compare against baselines (raw GPT-4, Slither alone, human auditors)
- Show what specifically each component adds to the result
- Acknowledge where we are worse than alternatives
What we don't promise
- We don't promise to replace auditors. Almost 49% of findings resist automated detection entirely (Trail of Bits, 2019). The goal is augmentation, not replacement.
- We don't promise zero false positives. Even the best approaches (iAudit, 91% F1) have ~9% error rate. The goal is to minimize while preserving recall.
- We don't promise to find every bug. Even 10 human audits from 6 firms didn't prevent the Euler Finance exploit: $197M was drained before the attacker ultimately returned the funds. Notably, the vulnerability was introduced in a code update (eIP-14) after most audits' scope (AnChain.AI). Security is a probabilistic process.
- We don't promise our approach is the only correct one. The industry moves fast. Exploit revenue doubles roughly every 1.3 months (Anthropic SCONE-bench), though this trend is expected to plateau. What works today may be obsolete in six months.
How we measure progress
| Metric | Current baseline (industry) | Our target |
|---|---|---|
| Precision (production AI) | ~55% (Sherlock AI) | Step 1: >70% / Step 2: >85% |
| Precision (raw LLM) | 22.6-50% depending on model and benchmark (Chen et al., Sherlock) | >50% (with hybrid pipeline) |
| Recall (overall) | 30% (Nethermind) | Step 1: >40% / Step 2: >60% |
| Recall (critical/high) | 42-43% (Nethermind) | Step 1: >55% / Step 2: >70% |
| False positive rate | 45-85% depending on model (Xiao et al.) | Step 1: <30% / Step 2: <15% |
| Cost per scan | $5K-$250K+ (manual audit, depending on complexity) | <$100 (automated) |
| Time per scan | Days-weeks (manual) | <1 hour (automated) |
Every metric is measured on open benchmarks. Results are published.
Summary
Three challenges:
Capability boundaries. AI catches patterns at 73.7-100% F1 on lab benchmarks, but real-world recall drops to ~48%. It breaks on business logic, cross-contract, and novel vulnerabilities. We systematically explore and push this boundary through hybrid architectures (LLM + formal verification + static analysis + CPG slicing).
Reliability. Production tools at ~55% precision. Academic systems claim 91% F1, but on sterile benchmarks, not real-world code. We focus on improving production results on live audits. Through external tool verification, fine-tuning, ensemble voting, constrained decoding. Each approach measured on real codebases, not lab datasets.
Fragmentation. 200+ tools, minimal integration. We test, catalog, and integrate the best of what exists into a unified pipeline with measurable results.
The path runs through knowing everything the industry has (AI and beyond), testing every existing approach, and research aimed at practice and immediate results.