Applied AI / Agent Engineer

I build agentic systems. I evaluate them honestly.

Graphic designer turned AI-native engineer. Four production artifacts, one shared orchestration kernel, deterministic scoring, adversarial verification, and honest nulls in every case study.

thomas@thomaspeng.ca GitHub View work

Aegis: honest null

Reasoning model is more robust pre-defense. The full defense stack erases that gap. That is the finding.

Injection ASR (reasoning)49.3%vs 68.1% standard, p=0.0012

Post-defense ASR1.7% vs 2.8%p=0.40 (n.s.)

Defense reduction29.2% to 4.2%−25 pp

ScoringDeterministicno LLM judge

Scroll to compress the cluster. The null lands at p=0.40.

The substrate thesis

Builds agentic systems AND evaluates them honestly. The four artifacts below all vendor a shared orchestration kernel (Quorum core/): a real "I built a substrate and proved it on multiple problems" narrative.

Target roles: frontier-lab Applied AI / Forward-Deployed Engineer / Agent Engineer / Design Engineer. Tone: confident, precise, low-ego, technical without jargon-soup.

Deterministic scoring in the success path. No LLM-judge in the hot loop. Exact match, span-IoU, McNemar.
Adversarial verification: K skeptic agents per finding, prompt-injection traps included in every labeled set.
Cost-gated reproducible runs. make eval-dry reproduces Quorum offline. ~$0.25 per production run.
Honest nulls in every case study. The FieldAgent agentic-chunking lift collapsed from +0.45 to +0.07 on fair rerun. This honesty is the point.

Flagship artifact

Quorum

Task-aware agent orchestrator with cost-aware model routing, adversarial multi-agent verification, and a full trace UI.

GitHub

Primary finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps (DeepSeek-v4-pro).

Metric	Baseline	K=3 verify	Note
False positives	27.8%	0.0%	95% CI [0, 0]
Recall	100%	77.8%	Precision-recall trade
Held-out real target	N/A	3/3 bugs found	0 surviving FP
Cost per run		~$0.25	Cost-routing harness committed; live multi-tier number gated on Anthropic key
Test suite		58 tests	ruff + mypy + CI green

Live trace UIOpen live

Fans out finders per file, then K skeptic agents per finding (concurrency cap 8). Reproduces offline: make eval-dry. Prompt-injection traps included in the labeled set.

Adaptive red-team gauntlet

Aegis

Adaptive attacker agent red-teams a target on two harmless proxies. Scored deterministically (exact match, no LLM judge). Vendors Quorum core/.

GitHub Demo

Honest null (the sophisticated finding)

A reasoning model is significantly more robust pre-defense: injection ASR 49.3% vs 68.1% (p=0.0012), canary 10.4% vs 21.5% (p=0.010), overall p=0.0002. BUT the full defense stack erases the gap: 1.7% vs 2.8% (p=0.40, not significant). Defense is the equalizer.

Metric	Standard model	Reasoning model	p-value
Injection ASR (pre-defense)	68.1%	49.3%	0.0012 (sig.)
Canary ASR (pre-defense)	21.5%	10.4%	0.010 (sig.)
Injection ASR (post-defense)	2.8%	1.7%	0.40 (n.s.)
Overall ASR (pre-defense)			0.0002 (sig.)
Defense reduction	29.2% to 4.2%		input-classifier workhorse
Adaptation lift	24.0% to 29.9%		significant only at scale (McNemar b=17/c=0)
Test suite		78 tests	CI + Pages green

Live demoOpen live

CUAD contract red-flag finder

FieldAgent

Agent reads a real commercial contract, flags risk-bearing clauses (span + severity + plain-English risk), graded span-IoU against CUAD gold. No LLM judge. Vendors Quorum core/.

GitHub Live

Honest finding (null on agentic lift)

The agentic-chunking lift looked like +0.45 F1 on DeepSeek only because of a truncation artifact. A fair rerun collapses it to +0.07 (CIs overlap). It ties on Claude Sonnet. This honesty is the point.

Metric	Value	Note
Detection F1	0.548	P=0.741, R=0.435; 95% CI [0.460, 0.637]
Lift over keyword floor	+0.21 F1	Robust, baseline-independent
Agentic chunking lift (fair)	+0.07	Was +0.45; collapsed on truncation audit
Held-out CUAD contracts	20	Party names / $ figures redacted in demo
Test suite	47 tests	CI green

Live deploymentOpen live

Skill-Tuning Council

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary, editors, merger, council, escalate-on-disagreement. 576 tests. Internal infra, no public URL. A methodology and systems-design piece.

$ council --run skill-edit --candidate patch-42

Loading 4 proxies: taste / pragmatism / intent / anti-drift

[taste] REJECT -- deviation from established pattern

[pragmatism] APPROVE -- cost neutral, scope clean

[intent] APPROVE -- aligns with goal specification

[anti-drift] REJECT -- regression risk on eval-43

Vote: 2 APPROVE / 2 REJECT -- escalating to Opus arbiter

[arbiter] REJECT -- anti-drift signal outweighs pragmatism delta

patch-42 shelved. 576 tests re-run. No regression.

Adversary generates candidate improvement
Editors refine and scope the change
Merger produces a clean diff
Council votes (4 proxies, majority required)
Escalate-on-disagreement to arbiter

How I build and evaluate

Four constraints apply to every artifact. They are not aspirational. They are enforced in CI.

Deterministic scoring

No LLM-judge in the success path. Exact match (canary string extraction), span-IoU (CUAD), McNemar chi-squared (ASR comparisons). The judge is a function, not a model.

Adversarial verification

K skeptic agents per finding. Prompt-injection traps included in every labeled set. Quorum cut FP 27.8% to 0% on the injected set. Aegis uses an adaptive attacker that improves across rounds.

Cost-gated reproducibility

make eval-dry reproduces Quorum offline. ~$0.25 per production run. Cost-routing harness committed; live multi-tier number gated on an Anthropic key and presented honestly as such.

Honest nulls

The FieldAgent agentic-chunking lift was +0.45 on DeepSeek due to a truncation artifact. The fair rerun: +0.07, CIs overlap. Published. Nulls read as more credible at a frontier lab, not less.

Get in touch

Email is the primary channel. No X, no LinkedIn.

thomas@thomaspeng.ca github.com/7P3ng