No LLM-judge in the success path. Exact match (canary string extraction), span-IoU (CUAD), McNemar chi-squared (ASR comparisons). The judge is a function, not a model.
I build agentic systems. I evaluate them honestly.
Graphic designer turned AI-native engineer. Four production artifacts, one shared orchestration kernel, deterministic scoring, adversarial verification, and honest nulls in every case study.
Reasoning model is more robust pre-defense. The full defense stack erases that gap. That is the finding.
The substrate thesis
Builds agentic systems AND evaluates them honestly. The four artifacts below all vendor a shared orchestration kernel (Quorum core/): a real "I built a substrate and proved it on multiple problems" narrative.
Target roles: frontier-lab Applied AI / Forward-Deployed Engineer / Agent Engineer / Design Engineer. Tone: confident, precise, low-ego, technical without jargon-soup.
- Deterministic scoring in the success path. No LLM-judge in the hot loop. Exact match, span-IoU, McNemar.
- Adversarial verification: K skeptic agents per finding, prompt-injection traps included in every labeled set.
- Cost-gated reproducible runs.
make eval-dryreproduces Quorum offline. ~$0.25 per production run. - Honest nulls in every case study. The FieldAgent agentic-chunking lift collapsed from +0.45 to +0.07 on fair rerun. This honesty is the point.
Quorum
Task-aware agent orchestrator with cost-aware model routing, adversarial multi-agent verification, and a full trace UI.
K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps (DeepSeek-v4-pro).
| Metric | Baseline | K=3 verify | Note |
|---|---|---|---|
| False positives | 27.8% | 0.0% | 95% CI [0, 0] |
| Recall | 100% | 77.8% | Precision-recall trade |
| Held-out real target | N/A | 3/3 bugs found | 0 surviving FP |
| Cost per run | ~$0.25 | Cost-routing harness committed; live multi-tier number gated on Anthropic key | |
| Test suite | 58 tests | ruff + mypy + CI green |
Fans out finders per file, then K skeptic agents per finding (concurrency cap 8). Reproduces offline: make eval-dry. Prompt-injection traps included in the labeled set.
Aegis
Adaptive attacker agent red-teams a target on two harmless proxies. Scored deterministically (exact match, no LLM judge). Vendors Quorum core/.
A reasoning model is significantly more robust pre-defense: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. BUT the full defense stack erases the gap: 1.7% vs 2.8% (p=0.40, not significant). Defense is the equalizer.
| Metric | Standard model | Reasoning model | p-value |
|---|---|---|---|
| Injection ASR (pre-defense) | 68.1% | 49.3% | 0.0012 (sig.) |
| Canary ASR (pre-defense) | 21.5% | 10.4% | 0.010 (sig.) |
| Injection ASR (post-defense) | 2.8% | 1.7% | 0.40 (n.s.) |
| Overall ASR (pre-defense) | 0.0002 (sig.) | ||
| Defense reduction | 29.2% to 4.2% | input-classifier workhorse | |
| Adaptation lift | 24.0% to 29.9% | significant only at scale (McNemar b=17/c=0) | |
| Test suite | 78 tests | CI + Pages green |
FieldAgent
Agent reads a real commercial contract, flags risk-bearing clauses (span + severity + plain-English risk), graded span-IoU against CUAD gold. No LLM judge. Vendors Quorum core/.
The agentic-chunking lift looked like +0.45 F1 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap). It ties on Claude Sonnet. This honesty is the point.
| Metric | Value | Note |
|---|---|---|
| Detection F1 | 0.548 | P=0.741, R=0.435; 95% CI [0.460, 0.637] |
| Lift over keyword floor | +0.21 F1 | Robust, baseline-independent |
| Agentic chunking lift (fair) | +0.07 | Was +0.45; collapsed on truncation audit |
| Held-out CUAD contracts | 20 | Party names / $ figures redacted in demo |
| Test suite | 47 tests | CI green |
Skill-Tuning Council
A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary, editors, merger, council, escalate-on-disagreement. 576 tests. Internal infra, no public URL. A methodology and systems-design piece.
- Adversary generates candidate improvement
- Editors refine and scope the change
- Merger produces a clean diff
- Council votes (4 proxies, majority required)
- Escalate-on-disagreement to arbiter
How I build and evaluate
Four constraints apply to every artifact. They are not aspirational. They are enforced in CI.
K skeptic agents per finding. Prompt-injection traps included in every labeled set. Quorum cut FP 27.8% to 0% on the injected set. Aegis uses an adaptive attacker that improves across rounds.
make eval-dry reproduces Quorum offline. ~$0.25 per production run. Cost-routing harness committed; live multi-tier number gated on an Anthropic key and presented honestly as such.
The FieldAgent agentic-chunking lift was +0.45 on DeepSeek due to a truncation artifact. The fair rerun: +0.07, CIs overlap. Published. Nulls read as more credible at a frontier lab, not less.