[ > ] WEB3 AI RESEARCH GROUP

AI Web3 Research

Benchmarks, audit pipelines, and measurable results for tools and workflows

[ > ] WHAT WE OBSERVE

The tipping point

AI exploits 72% of known bugs, yet the best AI defender reaches only 55% precision. It is simultaneously powerful and unreliable.

AI audit precision

30pp gap to close

~55% Current best (Sherlock AI)

85% Target

92%

exploited post-audit

92% of exploited contracts in 2022 had already been audited by at least one firm. Audits alone don’t prevent exploits. (AnChain.AI, 2023)

~55%

best AI precision

The highest independently verified precision for a production AI audit tool is 55.3% (Sherlock AI v2.2, 2026). Raw LLMs range from 6% to ~50%.

200+

tools, minimal integration

Over 200 tools catalogued for smart contract security (Iuliano & Di Nucci, 2024). Best F1 among 17 tested scanners: 73%. Most below 50%.

Hypothesis

We exist because we see this gap not as a verdict but as a research problem. Hybrid architectures that combine LLM reasoning with external verification can cross boundaries that neither crosses alone.

Read Manifesto

[ > ] THREE CHALLENGES

What defines our work

CHALLENGE 01

Capability boundaries

AI catches patterns at 73.7–100% F1 on lab benchmarks, but real-world recall drops to ~48%. It breaks down on business logic, cross-contract attacks, and novel vulnerabilities.

Definition of Done

Systematically explore the boundary. Find specific architectural solutions (hybrid LLM + formal verification, CPG slicing) that extend capability today.

CHALLENGE 02

Reliability

Production tools sit at ~55% precision. Academic systems claim 91% F1, but only on sterile benchmarks. Self-correction without external feedback makes results worse.

Definition of Done

Precision >85% on production code. Every finding verified by an external oracle (static analysis, formal verification, symbolic execution).
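
One way to picture the "verified by an external oracle" requirement is a gate that drops any AI-reported finding no oracle confirms. The sketch below is illustrative only: `verify_findings`, the dictionary-shaped findings, and the stub oracles are hypothetical names, not part of any tool mentioned on this page. A real oracle would wrap a static analyzer, a formal verifier, or a symbolic execution engine.

```python
from typing import Callable, Dict, List

# Hypothetical shapes: a finding is a plain dict, an oracle is any
# callable that returns True when it independently confirms the finding.
Finding = Dict[str, object]
Oracle = Callable[[Finding], bool]


def verify_findings(findings: List[Finding], oracles: Dict[str, Oracle]) -> List[Finding]:
    """Keep only findings confirmed by at least one external oracle.

    Each surviving finding is copied and annotated with the names of the
    oracles that confirmed it, so the evidence trail stays visible.
    """
    confirmed = []
    for finding in findings:
        confirmed_by = [name for name, check in oracles.items() if check(finding)]
        if confirmed_by:
            confirmed.append({**finding, "confirmed_by": confirmed_by})
    return confirmed


# Usage with stub oracles (assumptions standing in for real tools).
llm_findings = [
    {"id": 1, "kind": "reentrancy"},
    {"id": 2, "kind": "overflow"},
]
stub_oracles = {
    "static_analysis": lambda f: f["kind"] == "reentrancy",
    "symbolic_execution": lambda f: False,
}
verified = verify_findings(llm_findings, stub_oracles)
```

Requiring at least one confirmation is the loosest possible policy. A stricter gate could demand agreement from several oracles, trading recall for precision.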

CHALLENGE 03

Fragmentation

200+ tools catalogued, minimal integration. Existing tools could have prevented only 8% of high-impact attacks. 49% of findings resist automated detection entirely.

Definition of Done

A unified pipeline that selects the right tool for the task. For every pairing of vulnerability type and approach, measure precision and recall.
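
The per-pairing measurement can be sketched as follows. This is a minimal illustration, assuming findings are matched to ground truth by exact (contract, location) equality. The `Finding` type, the tool names, and the labels are hypothetical, and real benchmarks usually need fuzzier matching.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    vuln_type: str  # e.g. "reentrancy" (hypothetical label)
    tool: str       # e.g. "tool_a" (hypothetical tool name)
    contract: str   # identifier of the audited contract
    location: str   # where the issue was reported


def precision_recall_by_pair(reported, ground_truth):
    """Return {(vuln_type, tool): (precision, recall)}.

    ground_truth is a set of (vuln_type, contract, location) tuples.
    Recall is None when the ground truth has no bug of that type.
    """
    truth_by_type = defaultdict(set)
    for vuln_type, contract, location in ground_truth:
        truth_by_type[vuln_type].add((contract, location))

    reported_by_pair = defaultdict(set)
    for f in reported:
        reported_by_pair[(f.vuln_type, f.tool)].add((f.contract, f.location))

    scores = {}
    for (vuln_type, tool), hits in reported_by_pair.items():
        truth = truth_by_type.get(vuln_type, set())
        true_positives = len(hits & truth)
        precision = true_positives / len(hits)
        recall = true_positives / len(truth) if truth else None
        scores[(vuln_type, tool)] = (precision, recall)
    return scores


# Usage with made-up data.
reported = [
    Finding("reentrancy", "tool_a", "C1", "f()"),
    Finding("reentrancy", "tool_a", "C2", "g()"),
    Finding("overflow", "tool_b", "C1", "h()"),
]
ground_truth = {
    ("reentrancy", "C1", "f()"),
    ("reentrancy", "C3", "k()"),
    ("overflow", "C1", "h()"),
}
scores = precision_recall_by_pair(reported, ground_truth)
```

A table built from these scores gives the pipeline a basis for routing each vulnerability type to the approach that measures best on it.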

Latest Publications

84 Tools, 60 Papers, One Question: Is AI Auditing Ready?

LONGREAD · Mar 2026

84 tools, 16 benchmarks, 60+ academic papers. Independent recall at 30–40%, exploit rate at 72.2%, ~80% of bugs beyond AI reach.

Research Manifesto: AI Security at a Tipping Point

MANIFESTO · Mar 2026

Reducing False Positives in Vulnerability Detection

SURVEY / PDF · Mar 2026

View All

Interested in our research?

View Research

Want to collaborate?

Get in Touch