Overview

The Sentry benchmark is a small, qualitative readout for Warden’s security review behavior. It compares runs against known vulnerabilities from the public getsentry/sentry repository.

This is not an exhaustive eval, and it does not prove that Warden will catch every future issue. It is a way to compare implementations, prompts, models, and runtimes against the same historical security corpus.

What It Is

The corpus currently contains 86 validated vulnerabilities across 79 files and 6 historical Sentry commits. A benchmark run checks out each commit and scans only the files tied to known vulnerabilities at that commit.

That keeps the run focused. We are measuring whether Warden can recognize the same root causes, not whether it can discover unrelated issues across the whole Sentry repository.
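The per-commit scoping described above can be sketched roughly as follows. This is an illustration only: the `CorpusEntry` fields and the `scan_targets` helper are hypothetical names, not Warden's actual data model, and the SHAs and paths are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    """One known vulnerability (hypothetical shape, not Warden's schema)."""
    sha: str      # commit the vulnerability is tied to
    path: str     # file tied to the vulnerability at that commit
    summary: str  # short description of the issue


def scan_targets(entries):
    """Group entries by commit and return the sorted unique files per commit."""
    files_by_sha = defaultdict(set)
    for entry in entries:
        files_by_sha[entry.sha].add(entry.path)
    return {sha: sorted(paths) for sha, paths in files_by_sha.items()}


# Placeholder data for illustration only.
corpus = [
    CorpusEntry("aaa111", "src/a.py", "example issue 1"),
    CorpusEntry("aaa111", "src/a.py", "example issue 2"),
    CorpusEntry("aaa111", "src/b.py", "example issue 3"),
    CorpusEntry("bbb222", "src/c.py", "example issue 4"),
]

for sha, paths in scan_targets(corpus).items():
    # A real run would check out `sha` here, then scan only `paths`.
    print(sha, paths)
```

Because several entries can share a file, the number of files to scan can be smaller than the number of corpus entries, which is consistent with 86 vulnerabilities spread across 79 files.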

Comparison Matrix

The score table is the headline and is sorted by known-corpus recall. The cost and timing tables show what each run costs to operate. The matrix includes only complete runs: those with no failed chunks and with per-chunk timing data recorded.

| Run | Known | Findings | Cost |
| --- | --- | --- | --- |
| GPT 5.5 (Pi), high | 41/86 (47.7%) | 72 | $148.63 |
| GPT 5.5 (Pi), low | 28/86 (32.6%) | 38 | $39.36 |
| Claude Sonnet 4.6 (Pi) | 25/86 (29.1%) | 32 | $19.84 |
| Claude Sonnet 4.6 (Claude SDK) | 24/86 (27.9%) | 32 | $103.59 |
| Claude Opus 4.6 (Pi), high | 23/86 (26.7%) | 24 | $36.86 |
| DeepSeek V4 Pro (Pi), xhigh | 23/86 (26.7%) | 30 | $18.70 |
| Claude Sonnet 5 (Pi) | 22/86 (25.6%) | 27 | $23.46 |
| Claude Opus 4.8 (Pi), high | 21/86 (24.4%) | 24 | $21.31 |
| Claude Opus 4.8 (Pi), medium | 18/86 (20.9%) | 19 | $14.50 |
| DeepSeek V4 Flash (Pi), xhigh | 18/86 (20.9%) | 27 | $10.11 |
| Claude Opus 4.8 (Claude SDK), high | 17/86 (19.8%) | 17 | $79.56 |
| GLM 5.2 (Pi), high | 15/86 (17.4%) | 18 | $5.26 |
| Claude Opus 4.7 (Pi), medium | 6/86 (7.0%) | 7 | $4.39 |

Cost

Lowest cost first.

| Run | Cost | Input tokens | Output tokens |
| --- | --- | --- | --- |
| Claude Opus 4.7 (Pi), medium | $4.39 | 1.53M | 20.77k |
| GLM 5.2 (Pi), high | $5.26 | 8.37M | 426.24k |
| DeepSeek V4 Flash (Pi), xhigh | $10.11 | 74.35M | 2.09M |
| Claude Opus 4.8 (Pi), medium | $14.50 | 4.62M | 225.33k |
| DeepSeek V4 Pro (Pi), xhigh | $18.70 | 65.51M | 1.85M |
| Claude Sonnet 4.6 (Pi) | $19.84 | 9.67M | 508.84k |
| Claude Opus 4.8 (Pi), high | $21.31 | 6.52M | 376.36k |
| Claude Sonnet 5 (Pi) | $23.46 | 20.37M | 800.84k |
| Claude Opus 4.6 (Pi), high | $36.86 | 16.14M | 585.84k |
| GPT 5.5 (Pi), low | $39.36 | 18.71M | 390.01k |
| Claude Opus 4.8 (Claude SDK), high | $79.56 | 31.84M | 386.17k |
| Claude Sonnet 4.6 (Claude SDK) | $103.59 | 65.67M | 1.09M |
| GPT 5.5 (Pi), high | $148.63 | 127.9M | 986.84k |

Timing

Lowest P50 chunk duration first.

| Run | P50 | P90 | Total |
| --- | --- | --- | --- |
| Claude Opus 4.7 (Pi), medium | 1.2s | 9.6s | 6.6m |
| Claude Opus 4.8 (Pi), medium | 11.9s | 51.7s | 42.4m |
| Claude Opus 4.8 (Pi), high | 20.9s | 1.1m | 31.4m |
| Claude Opus 4.8 (Claude SDK), high | 21.6s | 1.1m | 34.3m |
| GPT 5.5 (Pi), low | 34.2s | 56.4s | 55.2m |
| GLM 5.2 (Pi), high | 39.4s | 6.6m | 173.9m |
| Claude Sonnet 4.6 (Pi) | 41.9s | 1.9m | 53.6m |
| Claude Opus 4.6 (Pi), high | 49.6s | 2.5m | 67.6m |
| Claude Sonnet 5 (Pi) | 1.0m | 3.8m | 88.8m |
| Claude Sonnet 4.6 (Claude SDK) | 1.9m | 26.6m | 448.4m |
| DeepSeek V4 Flash (Pi), xhigh | 2.9m | 18.5m | 494.5m |
| GPT 5.5 (Pi), high | 3.0m | 5.6m | 163.9m |
| DeepSeek V4 Pro (Pi), xhigh | 3.8m | 19.3m | 1056.9m |
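For readers unfamiliar with the P50 and P90 columns, the sketch below shows one common way to derive them from per-chunk durations. The report does not say which percentile method Warden uses, so the nearest-rank definition here is an assumption, and the durations are made up for illustration.

```python
def nearest_rank_percentile(durations, pct):
    """Return the nearest-rank percentile of a non-empty list.

    `pct` is an integer from 1 to 100. Integer arithmetic avoids
    floating-point surprises when computing the rank.
    """
    if not durations:
        raise ValueError("durations must not be empty")
    ordered = sorted(durations)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100), 1-based
    return ordered[max(rank, 1) - 1]


# Hypothetical per-chunk durations in seconds, not taken from any recorded run.
chunk_seconds = [2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 12.0, 40.0, 90.0]

p50 = nearest_rank_percentile(chunk_seconds, 50)
p90 = nearest_rank_percentile(chunk_seconds, 90)
print(f"P50={p50}s P90={p90}s")
```

A large gap between P50 and P90 means a few slow chunks dominate the tail. Summing chunk durations this way would not necessarily reproduce the Total column, which also covers work outside the analysis chunks.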

Reading Results

Known is the headline score: corpus entries where scoring confirmed the same bug in roughly the same location.
Findings is review volume before benchmark scoring. More findings can improve recall, but they also create more review work.
Cost is the recorded provider cost for the run. It is not normalized pricing or cost per finding.
P50 and P90 are per-analysis-chunk durations. Total includes verifier work, provider latency, queueing, retries, and runtime overhead.
Scoring is semantic. Same-file findings about different bugs do not count, duplicates do not double-count, and one finding can cover multiple corpus entries when it catches the same root bug.
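The counting rules above can be illustrated with a small sketch. It assumes the semantic judgment has already been made elsewhere and only shows how matches roll up into the Known score. The `matches` mapping and the ids in it are hypothetical, not Warden's scoring format.

```python
CORPUS_SIZE = 86  # size of the known corpus used in this report


def known_score(matches, corpus_size=CORPUS_SIZE):
    """Return (distinct matched corpus entries, recall percentage).

    `matches` maps a finding id to the set of corpus entry ids that the
    scorer judged to be the same root bug. An empty set means no match.
    """
    matched_entries = set()
    for entry_ids in matches.values():
        matched_entries |= set(entry_ids)
    recall_pct = round(100 * len(matched_entries) / corpus_size, 1)
    return len(matched_entries), recall_pct


# Hypothetical scorer output.
matches = {
    "finding-1": {"corpus-07"},               # plain match
    "finding-2": {"corpus-07"},               # duplicate: no double count
    "finding-3": {"corpus-12", "corpus-13"},  # one finding, two corpus entries
    "finding-4": set(),                       # same file, different bug
}

known, recall_pct = known_score(matches)
print(f"Known {known}/{CORPUS_SIZE} ({recall_pct}%), findings {len(matches)}")
```

Taking the union of matched corpus ids is what keeps duplicates from inflating the score while still letting a single finding credit several entries.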

Analysis

Sonnet 5 High on Pi

Sonnet 5 high found 22 of 86 known entries and emitted 27 findings. That makes it competitive, but not better than Sonnet 4.6 high on this corpus. It costs more than Sonnet 4.6 on Pi ($23.46 versus $19.84), emits fewer final findings (27 versus 32), and trails Sonnet 4.6 by three known matches (22 versus 25).

Sonnet 4.6: Claude SDK vs Pi

This is the clearest runtime comparison. Pi found 25 of 86 known entries. The Claude SDK found 24 of 86. Both emitted 32 findings.

The quality result is close. The operating profile is not. Claude SDK recorded $103.59 total cost, compared to $19.84 for Pi. The trace summaries point to larger repeated context in Claude SDK runs, not a matching gain in recall.
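One way to put the two operating profiles side by side is to divide recorded cost by known matches. This is a derived figure that the report itself does not publish, and the `Run` class and `cost_per_known` helper are hypothetical names used only for this sketch. The numbers are copied from the Sonnet 4.6 rows of the comparison matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    known: int       # known-corpus matches
    cost_usd: float  # recorded provider cost for the run


def cost_per_known(run):
    """Recorded cost divided by known-corpus matches (a derived figure)."""
    return run.cost_usd / run.known


# Values copied from the Sonnet 4.6 rows of the comparison matrix.
pi = Run("Claude Sonnet 4.6 (Pi)", known=25, cost_usd=19.84)
sdk = Run("Claude Sonnet 4.6 (Claude SDK)", known=24, cost_usd=103.59)

for run in (pi, sdk):
    print(f"{run.name}: ${cost_per_known(run):.2f} per known match")
print(f"Claude SDK / Pi total cost: {sdk.cost_usd / pi.cost_usd:.1f}x")
```

Since recorded cost is not normalized pricing, a ratio like this is best read as a rough comparison between two runs on the same corpus rather than as a general price signal.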

Opus 4.8 High: Claude SDK vs Pi

Pi found 21 of 86 known entries and emitted 24 findings. Claude SDK found 17 and emitted 17. Pi was also cheaper: $21.31 total versus $79.56.

The trace shape differs from Sonnet 4.6. Pi does more turns and more tool executions here, but each turn carries much less input context. Claude SDK’s extra cost is mostly context volume, not more tool fanout.

Opus 4.8 High vs Opus 4.6 High

The direct Pi comparison favors Opus 4.6 high on recall. Opus 4.6 found 23 of 86 known entries. Opus 4.8 found 21. Both emitted 24 findings.

Opus 4.8 is more selective under the current prompt and corpus. It exits more investigations early, which lowers cost ($21.31 versus $36.86) and tool fanout, but it misses enough known vulnerabilities to trail Opus 4.6 here.

DeepSeek V4 XHigh

DeepSeek V4 Pro found 23 of 86 known entries and emitted 30 findings. V4 Flash found 18 and emitted 27.

Flash is cheaper ($10.11 versus $18.70) because the model price is lower, not because it does less work. It used more turns, more tool executions, and more scan input tokens than V4 Pro. The result is not just a cheaper Opus-shaped run; it explores much more context and lands on a different set of known findings.

GLM 5.2 High

GLM 5.2 runs on Pi through OpenRouter as openrouter/z-ai/glm-5.2 with an explicit --effort high. OpenRouter reports high as the model’s default reasoning effort, with xhigh also available. The recorded row scans the same 156 chunks and leaves Warden’s finding verifier enabled. It found 15 of 86 known corpus entries and emitted 18 total findings.

The main result is lower recall, not noisy output. Fifteen of the 18 emitted findings matched known corpus entries. The three non-matches were same-file or nearby security findings that did not match the corpus issue: a LaunchDarkly timing-unsafe compare rather than the Statsig timestamp freshness bug, a Bitbucket forwarded-IP/signature bypass rather than invalid-signature HMAC logging, and a Sentry App issue-link SSRF rather than the event-scope corpus issue.

Operationally, GLM 5.2 exposed a Warden compatibility problem. Many clean no-finding chunks returned prose instead of the required {"findings":[]} JSON. Those records had traces, usage, and zero findings, but Warden marked them as extraction_no_findings_json. Four shards therefore use combined-clean artifacts: traced zero-finding extraction failures were normalized to empty ok chunks, and targeted repair records were used where reruns produced cleaner records. One large seer_rpc.py chunk also exceeded OpenRouter’s effective 1M-token context limit in the full shard; rerunning the failed target set with --parallel 1 removed the context failure.
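The parser problem described above is the kind of thing a more tolerant extraction step can partly absorb. The sketch below is not Warden's parser: `extract_findings` is a hypothetical helper that accepts a findings object embedded in surrounding prose and still flags responses that contain no such object. The `ok` and `extraction_no_findings_json` status strings are taken from the description above.

```python
import json


def extract_findings(response_text):
    """Return (findings, status) for a model response.

    Tries every "{" in the text as the start of a JSON object and accepts
    the first object that has a list under the "findings" key.
    """
    decoder = json.JSONDecoder()
    starts = [i for i, ch in enumerate(response_text) if ch == "{"]
    for start in starts:
        try:
            payload, _end = decoder.raw_decode(response_text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position, try the next brace
        if isinstance(payload, dict) and isinstance(payload.get("findings"), list):
            return payload["findings"], "ok"
    # Pure prose is still flagged rather than silently treated as clean.
    return [], "extraction_no_findings_json"


print(extract_findings('{"findings":[]}'))
print(extract_findings('I found nothing to report. {"findings": []}'))
print(extract_findings("No vulnerabilities were identified in this chunk."))
```

Keeping the failure status for pure-prose responses is a deliberate choice in this sketch: a missing JSON object could mean a clean chunk or a truncated answer, so treating it as clean is left to a reviewed normalization step like the one described above.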

Recorded cost for the validated artifacts is $5.26: $4.94 scan cost plus $0.32 post-processing and verification overhead. That excludes the abandoned xhigh attempt and dirty failed rerun artifacts. GLM 5.2 used 8.3M input tokens and 422k output tokens across the validated row, with a 39.4-second P50 chunk time and a 6.6-minute P90. The row is useful, but the parser issue should be fixed before treating GLM 5.2 as a routine unattended benchmark target.

Corpus

The Sentry vulnerability corpus lists the known issues used for scoring. Each entry includes the repository SHA, the affected file, a short vulnerability description, and the relevant code snippet.

Run It

Use the running guide to reproduce the benchmark, add a new model run, and record sanitized result metadata.