03 — Ship an Agent That Does a Real Job (Week 4)¶

Mission¶

Ship an agent that completes a genuine multi-step job you currently do by hand — research and summarize, triage and route, monitor and act, gather and reconcile — using tools you designed, with every run traced, and a task completion rate measured over ≥20 scenarios in the README.

Why this rung¶

Agents are where the market is, and where the hype-to-competence gap is widest. Most "agent" content is framework tutorials; almost nobody building them can answer "what's your completion rate?" The durable skills are exactly three: the loop (model → tool call → result → model, until done), tool design (the API you give the model is the product), and observability (when a 30-step run fails, you need the trace, not vibes). Frameworks rot; these three transfer to everything.

Build the loop yourself this week — raw API with tool calling, or a minimal SDK at most. Not because frameworks are bad, but because you're buying the mental model, and it costs about 100 lines.

Pick a job with real stakes but low blast radius: log/alert investigation with a written verdict, dependency-vuln triage across your repos (read manifests, cross-check advisories, propose bumps), CVE-watch for your stack, or a cloud posture sweep (read resource configs, flag the public bucket / open security group). Read the world with one tool, act with a gated one.

The mental model¶

Strip the mystique and an agent is a while-loop with a model choosing the next tool call. All the "agency" lives in three design surfaces, and this week is about owning each one. First, the loop itself: what ends it (goal met, budget spent, model says done), what bounds it, what state carries forward. Second, the tools: this is API design where the consumer is a model — the description is documentation the model actually reads, the schema is the affordance, and the output is everything the model will ever perceive about what happened. Tool outputs are the agent's senses; return a 50KB JSON dump and you've blindfolded it with noise, return a crisp summary plus the salient error and you've given it eyes. Third, the accumulating context: the transcript is the program state. There is no hidden logic anywhere else — which is why debugging an agent means reading the trace, the way debugging a program means reading the stack, and why the spec makes tracing non-negotiable.

The math that should shape your design: per-step reliability compounds. A model that's right 95% of the time per decision completes a 10-step chain about 60% of the time. You don't fix that with a better prompt; you fix it structurally — fewer, chunkier steps; tools that accomplish more per call; checkpoints where the agent (or a gate) verifies before proceeding. When your completion rate disappoints, count the steps before you blame the model.

The practitioner translation for tools: you are building a UI for a non-human user. The same empathy you'd spend on a human interface — what do they see, what would confuse them, what's the error message like — is spent on the model. Most agent failures that look like "the model is dumb" are actually "the tool told it nothing useful," and the fix is in your code.

The gotcha — agents fail forward: a model uncertain what to do next will do something — call a plausible tool, invent a step, declare victory — rather than stop and say so. Silence is not in its nature; that's why the spec includes a nothing-to-do scenario and why "knows when to quit" is graded. Design the exits as deliberately as the loop, or the agent will find its own.

The path¶

Start here (the first hour): the bare loop running — model, one trivial tool (get_time is fine), dispatch, transcript printed. Watching the model choose to call your tool and act on the result is the moment agents stop being magic; get there in hour one, before any real tools exist.

Default pick (if you haven't chosen in 30 minutes): the dependency-vuln triage agent — point it at a repo; tools to read the dependency manifest, look up a package's known advisories, and write a triaged report (what's vulnerable, severity, safe bump). Real stakes (you'll act on it), naturally multi-step, and the acting tool (opening an issue / writing the report) is the perfect thing to gate.

Build order — each step feeds the next:

[ ] Mon — the loop, for real. Conversation state, tool dispatch, max-iteration guard, stop condition, JSONL trace of every call. ~150 lines, still toy tools.
[ ] Tue — the three tools. One that reads the world, one that acts on it, one support. First full end-to-end run on the real job. (Hint: write each description as if it's the only documentation the model will ever see — because it is.)
[ ] Wed — first 10 scenarios. Checkable outcomes, run through the eval template, completion rate recorded. It will be humbling; that's the baseline, not the verdict.
[ ] Thu — read traces, fix tools. Build the failure taxonomy from Wednesday, fix the top failure (usually a tool output the model couldn't use), re-measure. This before/after is the documented tool-iteration the spec demands.
[ ] Fri — 20 scenarios, adversarial included. Ambiguous ask, tool failure mid-run, nothing-to-do. Guardrails on the acting tool (dry-run / confirm / allowlist).
[ ] Sat — final measured run. Completion rate, cost, median steps; annotate one interesting trace end to end while you still remember why it's interesting.
[ ] Sun — publish. Architecture sketch, the taxonomy, the numbers, build-log entry.

Spec — must-haves¶

[ ] The loop, written by you (~100–200 lines): conversation state, tool dispatch, max-iteration guard, a stop condition.
[ ] ≥3 tools you designed — descriptions, schemas, and outputs shaped for the model (concise, informative errors; no 50KB JSON dumps). At least one tool that reads the world and one that acts on it.
[ ] Guardrails: the acting tool is gated (dry-run mode, confirm-before-execute, or an allowlist). A runaway loop hard-stops.
[ ] Secure it — the lethal trifecta. An agent that combines untrusted input + access to private data + an outbound channel can be turned against its owner. Name which of the three your agent has, and cut one: least-privilege tools, no raw secrets in context, and the acting/exfil-capable tool gated. One scenario in the suite is an injection attempt (a poisoned log line / advisory) that must not trip a real action.
[ ] Tracing on every run — hosted (Langfuse) or structured JSONL you can actually read: every model call, tool call, token count, timing.
[ ] A scenario suite of ≥20 cases with objectively checkable outcomes, run through your eval template. Include ≥3 adversarial/edge scenarios (ambiguous ask, tool failure mid-run, nothing-to-do).
[ ] The failure taxonomy in the README: for every failed scenario, which step and why (bad plan? bad tool call? bad tool output? gave up early?).

Eval bar¶

Completion rate over the ≥20 scenarios in the README, with cost and median steps-per-run.
One concrete tool-design iteration documented: the before description, the failure it caused, the after — and the completion-rate delta it bought.
The nothing-to-do scenario passes: the agent says so and stops, rather than inventing work.

JIT learning — pull when stuck¶

Anthropic — Building effective agents — the reread. This time the workflow-vs-agent distinction and "start simple" will land differently (~20 min).
Anthropic — Writing tools for agents — the best concrete guidance in print on tool descriptions, output shaping, and evaluating tools (~25 min).
Lilian Weng — LLM-powered autonomous agents — the conceptual map (planning, memory, tools); skim for vocabulary, not implementation (~30 min).
Langfuse docs — tracing with a generous free tier; the observability quickstart is a 30-minute retrofit.
Simon Willison — the lethal trifecta — the clearest statement of the agent-security failure mode you're designing against; read before you give any tool write-access (~15 min).
MCP docs — if you expose your Week-1 server's tools to this agent, you get tool reuse for free and you'll feel why the protocol exists.

Key ideas¶

An agent = a while-loop + a model choosing tool calls; the magic is all in design surfaces you own.
Tool design is API design for a model consumer: description = docs, output = its senses.
The transcript is the program state; debugging an agent means reading the trace.
Reliability compounds: 95% per step ≈ 60% over ten — fix step count, not just prompts.
Most "dumb model" failures are "mute tool" failures; the fix is in your code.
Agents fail forward — design the exits (stop conditions, nothing-to-do) deliberately.

Check yourself¶

Your agent's completion rate is 55% on 12-step runs. What structural fixes do you try before touching the prompt?
What makes a tool output good? Give the two properties that matter most to the model.
Why is "the agent did something reasonable-looking" a failure mode rather than a comfort?

Publish¶

The agent repo: architecture sketch (10 lines of ASCII beats a diagram you won't draw), the completion-rate table, the failure taxonomy, one full annotated trace of an interesting run.
Build-log entry.

Stretch¶

Multi-agent. Add a second, cheaper model as a subagent for a mechanical subtask — the orchestrator-worker pattern. Measure the cost delta at equal completion rate, and notice the new failure mode: coordination overhead and context lost at the hand-off.
Memory. Give the agent persistent state across runs (a scratchpad file or a small vector store of past findings) so run N benefits from run N-1. Measure whether it actually helps or just accumulates noise — memory that isn't pruned is a liability.
Framework rep (marketable). Now that you've built the loop raw, rebuild this same agent in LangGraph and write up the tradeoffs: what it gave you (state machine, checkpointing, streaming) versus what it hid (the control flow you now understand cold). This is the honest way to earn the field's most JD-common agent framework — you can name it and say where it helps and where it gets in the way. Keep both versions in the repo.
Schedule it (cron) and let it run unattended for the rest of the program — every later build-log entry gets a line on how it behaved with nobody watching.

Proof¶

"I've built an agent from the raw loop up — my own tools, traces on every run, and a measured completion rate over 20 scenarios, including the adversarial ones."