Toolkit — the Python you're assumed to use¶
The stack the ship ladder runs on, grouped by layer and mapped to the weeks that need it. Three principles shape the whole list, and all are deliberate:
- We only name what we dig into. If no ship uses a tool hands-on, it's not here — naming marketable modules you never actually touch (Pinecone, MLflow, Airflow, LangChain, CrewAI…) is résumé cosplay an interviewer unpicks in one question. Depth over keyword breadth: better to own the twenty tools we use than to recognize two hundred.
- No framework in the core path — you build the raw loop. The agent loop, the RAG orchestration, the eval harness are ~100–200 lines you write against the model SDK. Frameworks rot; the mental model doesn't. Frameworks (LangGraph, cloud AI platforms) get one genuine rep as optional stretches, after you've built the thing raw — see Not in the core path.
- Python by ecosystem, agnostic at the app layer. The model-touching weeks (fine-tune, multimodal, serving) are Python-only because the libraries are; the app layer (LLM feature, agent, browser, MCP) is genuinely polyglot. See Language stance.
Versions are pinned per ship-repo at build time and date-stamped, not here — this file is the map, not the lockfile. Model recommendations rot fastest; teach the durable mechanism, date the specific pick.
Language stance¶
- Model layer (Weeks 2 embeddings, 5, 6, serving): Python, no real alternative. PEFT,
TRL, Unsloth,
transformers, Whisper, diffusion — these ecosystems don't have first-class ports. You will read and run Python here regardless of who types it. - App layer (Weeks 1, 3, 4, MCP, big-swing product): polyglot. The Anthropic/OpenAI SDKs, the MCP SDK, and the agent loop have first-class TypeScript/Go equivalents. If your production stack is TS, do these there with no loss.
- The real prerequisite is reading Python, not writing it. The standing rule is "AI writes, you review every line" — so you need enough fluency to judge a training loop, not to author one cold. That is a lower bar, and it's why an engineer from any language can start.
Environment & the loop's success signal¶
The self-verifying agentic loop (Week 0) only closes on its own when the task has a cheap, honest check to run. These are that check — they belong in the toolkit as much as any model library.
| Tool | Role | Where |
|---|---|---|
uv |
env + dependency manager; the thing everything installs through | all |
ruff |
lint + format — a fast, honest pass/fail the agent loops against | all |
pyright / mypy |
type-check — catches a whole class of AI slop for free, and it's the loop's other signal | all |
pytest |
the eval-template runner and the Week-7 regression gates | 00, 07, all |
Core / cross-cutting (essentially every ship)¶
anthropicand/oropenai— the frontier SDK: chat, streaming, tool use, vision.litellm— one interface across Anthropic / OpenAI / Ollama. Makes "swap the model, re-run the eval" a one-liner; the backbone of the Week-1 comparison and the Week-7 cross-model gauntlet.pydantic— schema-validated structured outputs and eval-case schemas.httpx/requests,python-dotenv— HTTP for tools/feeds; key handling per the.envhygiene rule.tenacity— retry/backoff around flaky API calls; you'll want it by Week 1.
Agentic / MCP / observability (Weeks 0, 3, 4, 7)¶
mcp— the Python MCP SDK (your host-recon server; tool reuse later).langfuse— tracing/observability (or hand-rolled JSONL).sqlite(stdlib) — persistence for the agent-memory stretch (Week 3).- The loop itself is hand-written against the SDK — no agent framework.
LLM app + RAG (Weeks 1, 2)¶
instructor— Pydantic-validated outputs with retries (alternative to raw tool-use).gradio— UIs and HF Spaces deploys.sentence-transformers— embeddings and the cross-encoder reranker.- A vector store:
qdrant-clientorpgvector(via psycopg) orchromadb— pick one and go deep; you don't need three. rank_bm25— keyword search for the hybrid step.trafilatura— HTML → clean text when your corpus is web pages (Week 2 ingestion, Week 4 DOM route).ragas— RAG metric definitions (Week 2); you can also reimplement them in the eval template.
Security & data hygiene (the "Secure it" thread; Weeks 2, 3, 4, 6, 8)¶
presidio— PII detection/scrubbing before indexing or committing (the License-&-data rule).garak— LLM red-teaming / vuln scanner for the Week-8 OWASP LLM Top 10 pass.
Local & open models (Week 1 fallback, 5, 6, 8)¶
ollama(Python client) — local serving and the fallback path.huggingface_hub— pulling and publishing models/datasets/Spaces.transformers(+accelerate) — loading/running open models.
Fine-tuning (Week 5) — the unavoidably-Python layer¶
unsloth— QLoRA on free GPUs, wrappingpeft(LoRA),trl(SFT/DPO),transformers,bitsandbytes(4-bit),datasets.torch— the substrate under all of it.wandb(or tensorboard) — log the training run; keep the loss curves.lm-eval(lm-eval-harness) — standardized eval of the tuned model.- GGUF export →
llama.cpp/ Ollama for the local quantized run.
Multimodal (Week 6)¶
openai-whisper/faster-whisper— STT.- A TTS library (voice out) +
Pillowand a VLM viatransformers(vision in). gradioAudio/Image components.
Serving, perf & production (big swing; JIT-only)¶
fastapi/uvicorn— the product's real API (Gradio is for demos).vllm,torch.compile— pulled only if a ship outgrows Ollama.
Data & write-ups (light; no data-science track here)¶
numpy/pandas— assumed baseline literacy, light wrangling only.pyarrow/ Parquet — dataset formats (Week 5).matplotlib— charts for the eval tables and latency histograms in write-ups.
Not in the core path (and why)¶
We don't do cm-deep coverage. Marketable tools we don't dig into aren't listed as keywords — you'd pick them up on the job faster than a name-drop here would help, and an interviewer can tell the difference between "I've used X" and "I've heard of X." Two of them, though, are worth one genuine rep as optional stretches, done after the raw build so you leave with an opinion, not just a line:
- LangChain / LangGraph — the raw agent loop (Week 3) is what you dig into. The Week-3 stretch then has you rebuild that same loop in LangGraph and write up the tradeoffs — now you can name the most JD-common agent framework and say where it helps and where it hides things. Frameworks-as-crutch is out; frameworks-from-understanding is the rep.
- A cloud AI platform (AWS Bedrock / Azure AI / Google Vertex) — the big-swing stretch deploys your product behind one, for whoever has credits or a job that uses it. On-theme with the cloud domain focus and the single most marketable enterprise line — earned by actually shipping on it, not by listing it.
Genuinely out of scope (named so their absence is a choice, not an oversight):
- LlamaIndex and other RAG frameworks — you build retrieval from
sentence-transformers+ a vector store so the mechanics stay visible. - The classical-ML stack — scikit-learn, XGBoost, Optuna, SHAP. The classical-ML / Kaggle track was cut; these return only if you deepen there on day 91.
- MLflow, Pinecone/Weaviate, Airflow/Prefect, LangSmith, Ray, BentoML, DSPy, CrewAI/ AutoGen and the rest of the marketable long tail — real tools, but nothing here digs into them, so they're not listed. Add one to your vocabulary the week a real project needs it, the same JIT rule as everything else.
Per-week module fragments¶
Starting points for each ship's requirements.txt (pin + date-stamp at build). Every ship
also gets the Environment row (uv, ruff, pyright, pytest) and the Core set.
- 00 Arm yourself —
mcp,anthropic/openai,pytest(+ env & core) - 01 LLM feature —
pydantic,instructor,gradio,litellm,ollama - 02 RAG —
sentence-transformers,qdrant-client,rank_bm25,ragas,presidio,trafilatura - 03 Agent —
langfuse,sqlite(memory stretch); LangGraph only if you take the framework stretch - 04 Browser agent —
playwright,trafilatura - 05 Fine-tune —
unsloth,peft,trl,transformers,bitsandbytes,datasets,accelerate,torch,wandb,lm-eval,huggingface_hub - 06 Multimodal —
faster-whisper, a TTS lib,Pillow,transformers,gradio - 07 Eval gauntlet —
litellm,pytest,matplotlib(extends the Week-0 template) - 08 Big swing —
fastapi,uvicorn,garak, plus whatever the chosen product pulls