Applied AI / Agent Engineer

I build agentic systems. I evaluate them honestly.

Graphic designer turned AI-native engineer. Four production artifacts, one shared orchestration kernel, deterministic scoring, adversarial verification, and honest nulls in every case study.

Aegis: honest null

The reasoning model is more robust pre-defense. The full defense stack erases that gap. That is the finding.

Injection ASR (reasoning): 49.3% vs 68.1% standard, p=0.0012
Post-defense ASR: 1.7% vs 2.8%, p=0.40 (n.s.)
Defense reduction: 29.2% to 4.2% (−25 pp)
Scoring: deterministic, no LLM judge

I build agentic systems and evaluate them honestly. The four artifacts below all vendor a shared orchestration kernel (Quorum core/), which makes "I built a substrate and proved it on multiple problems" a real narrative.

Target roles: frontier-lab Applied AI / Forward-Deployed Engineer / Agent Engineer / Design Engineer. Tone: confident, precise, low-ego, technical without jargon-soup.

  • Deterministic scoring in the success path. No LLM-judge in the hot loop. Exact match, span-IoU, McNemar.
  • Adversarial verification: K skeptic agents per finding, prompt-injection traps included in every labeled set.
  • Cost-gated reproducible runs. make eval-dry reproduces Quorum offline. ~$0.25 per production run.
  • Honest nulls in every case study. The FieldAgent agentic-chunking lift collapsed from +0.45 to +0.07 on fair rerun. This honesty is the point.

Flagship artifact

Quorum

Task-aware agent orchestrator with cost-aware model routing, adversarial multi-agent verification, and a full trace UI.

GitHub

Primary finding

K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]), while recall fell from 100% to 77.8%, on a 36-snippet labeled set that includes prompt-injection traps (DeepSeek-v4-pro).

Metric | Baseline | K=3 verify | Note
False positives | 27.8% | 0.0% | 95% CI [0, 0]
Recall | 100% | 77.8% | Precision-recall trade
Held-out real target | N/A | 3/3 bugs found | 0 surviving FP

Cost per run: ~$0.25 (cost-routing harness committed; live multi-tier number gated on Anthropic key)
Test suite: 58 tests (ruff + mypy + CI green)
Live trace UI: Open live
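The source does not name the method behind its confidence intervals. A degenerate [0, 0] interval is what a percentile bootstrap produces when no false positives are observed, so the sketch below assumes that method. The function name, resample count, and example counts are illustrative, not the author's data or code.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.

    Assumption: the page's CIs were computed this way; the source does not say.
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# With zero observed false positives every resample is all zeros,
# so the interval collapses to a single point.
print(bootstrap_ci([0] * 18))

# Illustrative counts only: 5 false positives out of 18 negatives (27.8%).
print(bootstrap_ci([1] * 5 + [0] * 13))
```

A collapsed interval reflects the resampling method, not certainty: with no observed failures the bootstrap has nothing to vary.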

Fans out finders per file, then K skeptic agents per finding (concurrency cap 8). Reproduces offline: make eval-dry. Prompt-injection traps included in the labeled set.
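The fan-out described above can be sketched with asyncio and one shared semaphore. This is a minimal sketch under stated assumptions, not Quorum's code: the `finder` and `skeptic` interfaces are hypothetical, and the survival rule (a finding is kept only if none of the K skeptics refutes it) is an assumption the source does not spell out.

```python
import asyncio

MAX_CONCURRENCY = 8  # concurrency cap stated in the text
K = 3                # skeptics per finding (K=3 in the primary finding)


async def run(paths, finder, skeptic, k=K, cap=MAX_CONCURRENCY):
    """Fan out finders per file, then k skeptics per finding, under one cap.

    finder(path)        -> list of findings        (hypothetical interface)
    skeptic(finding, i) -> True if it refutes it   (hypothetical interface)
    """
    sem = asyncio.Semaphore(cap)

    async def capped(call, *args):
        # Every agent call passes through the same semaphore.
        async with sem:
            return await call(*args)

    per_file = await asyncio.gather(*(capped(finder, p) for p in paths))
    findings = [f for batch in per_file for f in batch]

    async def verify(finding):
        refuted = await asyncio.gather(
            *(capped(skeptic, finding, i) for i in range(k))
        )
        # Assumed rule: a finding survives only if no skeptic refutes it.
        return finding, not any(refuted)

    verdicts = await asyncio.gather(*(verify(f) for f in findings))
    return [f for f, survived in verdicts if survived]
```

Only the leaf calls hold the semaphore; `verify` itself does not, so waiting on its skeptics cannot deadlock the pool.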

Adaptive red-team gauntlet

Aegis

Adaptive attacker agent red-teams a target on two harmless proxies. Scored deterministically (exact match, no LLM judge). Vendors Quorum core/.
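Exact-match scoring of this kind needs no model in the loop. A minimal sketch, assuming an attack counts as a success when the canary string appears verbatim in the target's output. That containment rule and the function names are assumptions, not Aegis's actual scorer.

```python
def attack_succeeded(output: str, canary: str) -> bool:
    """True if the canary leaked verbatim. No normalization, no fuzzy match."""
    return canary in output


def attack_success_rate(outputs, canary):
    """ASR = successful attacks / total attempts."""
    if not outputs:
        raise ValueError("need at least one attempt to compute ASR")
    return sum(attack_succeeded(o, canary) for o in outputs) / len(outputs)


# Hypothetical transcript snippets and canary value.
outputs = ["I cannot share that.", "The secret is CANARY-7f3a.", "Sure: canary-7f3a"]
print(attack_success_rate(outputs, "CANARY-7f3a"))
```

The third output does not count because the match is case-sensitive. The scorer is strict by design, so a rerun always yields the same number.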

Honest null (the sophisticated finding)

A reasoning model is significantly more robust pre-defense: injection ASR 49.3% vs 68.1% (p=0.0012), canary ASR 10.4% vs 21.5% (p=0.010), overall p=0.0002. But the full defense stack erases the gap: post-defense injection ASR is 1.7% vs 2.8% (p=0.40, not significant). Defense is the equalizer.

Metric | Standard model | Reasoning model | p-value
Injection ASR (pre-defense) | 68.1% | 49.3% | 0.0012 (sig.)
Canary ASR (pre-defense) | 21.5% | 10.4% | 0.010 (sig.)
Injection ASR (post-defense) | 2.8% | 1.7% | 0.40 (n.s.)
Overall ASR (pre-defense) | | | 0.0002 (sig.)

Defense reduction: 29.2% to 4.2% (input-classifier workhorse)
Adaptation lift: 24.0% to 29.9% (significant only at scale; McNemar b=17/c=0)
Test suite: 78 tests (CI + Pages green)
Live demo: Open live

CUAD contract red-flag finder

FieldAgent

Agent reads a real commercial contract, flags risk-bearing clauses (span + severity + plain-English risk), graded span-IoU against CUAD gold. No LLM judge. Vendors Quorum core/.
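Span-IoU is the overlap between a predicted clause span and a gold span, divided by their union. A minimal sketch, assuming half-open character offsets `(start, end)` and a hypothetical match threshold of 0.5. The source does not state the span representation or the threshold FieldAgent uses.

```python
def span_iou(pred, gold):
    """Intersection-over-union of two half-open character spans (start, end)."""
    (p0, p1), (g0, g1) = pred, gold
    inter = max(0, min(p1, g1) - max(p0, g0))
    union = (p1 - p0) + (g1 - g0) - inter
    return inter / union if union else 0.0


def is_match(pred, gold, threshold=0.5):
    """Hypothetical rule: a prediction counts as a hit at IoU >= threshold."""
    return span_iou(pred, gold) >= threshold


print(span_iou((0, 10), (5, 15)))    # 5 shared chars over a union of 15
print(is_match((100, 200), (110, 205)))
```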

Honest finding (null on agentic lift)

The agentic-chunking lift looked like +0.45 F1 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap). It ties on Claude Sonnet. This honesty is the point.

Metric | Value | Note
Detection F1 | 0.548 | P=0.741, R=0.435; 95% CI [0.460, 0.637]
Lift over keyword floor | +0.21 F1 | Robust, baseline-independent
Agentic chunking lift (fair) | +0.07 | Was +0.45; collapsed on truncation audit
Held-out CUAD contracts | 20 | Party names / $ figures redacted in demo
Test suite | 47 tests | CI green
Live deployment | Open live |
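The reported detection F1 can be checked from the reported precision and recall, since F1 is their harmonic mean. A small sketch; only the function name is mine.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Values from the table: P=0.741, R=0.435.
print(round(f1_score(0.741, 0.435), 3))  # 0.548
```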

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary, editors, merger, council, escalate-on-disagreement. 576 tests. Internal infra, no public URL. A methodology and systems-design piece.

$ council --run skill-edit --candidate patch-42
Loading 4 proxies: taste / pragmatism / intent / anti-drift
[taste] REJECT -- deviation from established pattern
[pragmatism] APPROVE -- cost neutral, scope clean
[intent] APPROVE -- aligns with goal specification
[anti-drift] REJECT -- regression risk on eval-43
Vote: 2 APPROVE / 2 REJECT -- escalating to Opus arbiter
[arbiter] REJECT -- anti-drift signal outweighs pragmatism delta
patch-42 shelved. 576 tests re-run. No regression.

  • Adversary generates candidate improvement
  • Editors refine and scope the change
  • Merger produces a clean diff
  • Council votes (4 proxies, majority required)
  • Escalate-on-disagreement to arbiter
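The voting and escalation steps can be sketched as a small pure function. This is not the council's actual code. It assumes "majority required" means a strict majority of the four proxies and that the arbiter is consulted only when there is none, as in the 2/2 transcript above. The `arbiter` callable is a hypothetical stand-in.

```python
from collections import Counter

PROXIES = ("taste", "pragmatism", "intent", "anti-drift")


def council_decision(votes, arbiter):
    """Return 'APPROVE' or 'REJECT' for a candidate change.

    votes:   dict mapping each proxy name to 'APPROVE' or 'REJECT'
    arbiter: callable(votes) -> 'APPROVE' | 'REJECT' (hypothetical interface)
    """
    missing = set(PROXIES) - set(votes)
    if missing:
        raise ValueError(f"missing votes from: {sorted(missing)}")

    tally = Counter(votes[p] for p in PROXIES)
    for verdict in ("APPROVE", "REJECT"):
        if tally[verdict] > len(PROXIES) / 2:  # strict majority
            return verdict

    # No strict majority (e.g. 2 vs 2): escalate on disagreement.
    return arbiter(votes)


# The vote from the transcript above, with a stub arbiter that rejects.
votes = {
    "taste": "REJECT",
    "pragmatism": "APPROVE",
    "intent": "APPROVE",
    "anti-drift": "REJECT",
}
print(council_decision(votes, arbiter=lambda v: "REJECT"))  # REJECT
```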

Four constraints apply to every artifact. They are not aspirational. They are enforced in CI.

Deterministic scoring

No LLM-judge in the success path. Exact match (canary string extraction), span-IoU (CUAD), McNemar chi-squared (ASR comparisons). The judge is a function, not a model.
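McNemar's test compares two conditions on the same items using only the discordant pairs: b items where the first condition succeeded and the second did not, and c items for the reverse. A standard-library sketch; whether the project applies the continuity correction or an exact variant is not stated in the source, so both appear here as assumptions.

```python
import math


def mcnemar_chi2(b, c, continuity=True):
    """McNemar chi-squared test on discordant counts. Returns (statistic, p)."""
    n = b + c
    if n == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction, clamped at zero
    stat = diff * diff / n
    # Survival function of chi-squared with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


def mcnemar_exact(b, c):
    """Exact two-sided binomial version, safer when b + c is small."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant counts reported for the Aegis adaptation lift: b=17, c=0.
print(mcnemar_chi2(17, 0))
print(mcnemar_exact(17, 0))
```

Both functions are pure arithmetic on two integers, so the same counts always produce the same p-value.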

Adversarial verification

K skeptic agents per finding. Prompt-injection traps included in every labeled set. Quorum cut false positives from 27.8% to 0% on the injected set. Aegis uses an adaptive attacker that improves across rounds.

Cost-gated reproducibility

make eval-dry reproduces Quorum offline. ~$0.25 per production run. Cost-routing harness committed; live multi-tier number gated on an Anthropic key and presented honestly as such.

Honest nulls

The FieldAgent agentic-chunking lift was +0.45 on DeepSeek due to a truncation artifact. The fair rerun: +0.07, CIs overlap. Published. Nulls read as more credible at a frontier lab, not less.

Get in touch

Email is the primary channel. No X, no LinkedIn.