JW

Designing Agents and Skills for Writing a Physics Paper

4/20/2026

The Problem: Single-Thread Collapse

Matthew Schwartz’s “Vibe Physics” experiment showed what a single-thread approach can accomplish: Claude Opus 4.5 completed a year-long theoretical physics calculation in two weeks, producing a published paper on C-parameter resummation through 110+ drafts and 36 million tokens. Schwartz guided Claude entirely through text prompts, never editing files directly. The results were impressive, but Schwartz also reported that Claude fabricated results and took mathematical shortcuts that required his domain expertise to catch. He was the verification layer, and everything depended on him staying in the loop for every step. A math PhD student documenting his daily experience with GPT Pro’s extended thinking mode reached similar conclusions: the verification burden dominates the workflow, hour-long thinking times create idle periods, and the output has “pure intellect rather than a wise or insightful interpretation.” In a follow-up, he fed GPT Pro his own research on random matrix theory, waited 35 minutes, and got back what he’d given it rewritten in polar coordinates. The model couldn’t make progress, but it also couldn’t say so; it produced an expensive paraphrase instead.

Some of these are genuine model limitations: LLMs cannot yet do novel mathematics autonomously. But many are workflow limitations: no automated verification, no persistent context across sessions, no parallelism, no structured way to detect when the model is regurgitating rather than reasoning. The workflow limitations are solvable.

Writing a physics paper involves distinct cognitive tasks: deriving equations, searching literature, writing prose, checking consistency, investigating subtle physics questions. Each requires different tools and different amounts of context. A single conversation thread struggles to manage all of them. You lose context as the window fills. You duplicate work because the model doesn’t remember what it searched yesterday. You can’t parallelize. And when the conversation ends, everything the model learned about your paper vanishes.

We hit these problems while writing a 49-page paper on atom interferometry. The paper had 80+ references, 9 figures, 4 tables, 3 appendices with original calculations, and equations that needed symbolic verification. Rather than working in a single thread with manual oversight, we decomposed the cognitive tasks into specialized agents with persistent documentation and automated verification, the same way a research group divides labor among people with different roles.

Architecture: Two Layers and a Notes Directory

Two Claude Code concepts matter here. An agent is a markdown file that defines a specialized role: what it does, which tools it can use, and what instructions it follows. When you invoke an agent, Claude Code spawns a separate conversation with that agent’s prompt loaded. A skill is a reusable protocol (also a markdown file) that gets preloaded into an agent’s context, giving it a step-by-step procedure for a specific task without spawning a separate conversation. Agents are actors; skills are recipes.

Claude Code supports agent definitions at two levels: user-global (~/.claude/agents/) and project-specific (.claude/agents/). We added a middle tier for academic writing: ~/Write/.claude/agents/, containing agents and skills shared across all writing projects. Each project symlinks to this shared layer and adds its own project-specific agents.

~/Write/.claude/              # Shared across all writing projects
  CLAUDE.md                    # Shared workflow rules
  agents/
    editor.md                  # Full-paper editorial review
    grad-student.md            # Deep physics investigation
    researcher.md              # Literature search
    agent-auditor.md           # Audits the agent setup itself
  skills/
    verify-equation/           # SymPy verification protocol
    make-figure/               # matplotlib figure protocol
    research-notes/            # Literature documentation protocol

Draft2026/.claude/             # Project-specific
  agents/
    writer.md                  # References MirroredAI.tex specifically
    editor.md → ~/Write        # Symlinks to shared agents
    grad-student.md → ~/Write
    researcher.md → ~/Write
  skills/
    verify-equation → ~/Write  # Symlinks to shared skills
    make-figure → ~/Write
    research-notes → ~/Write

The shared layer defines agents that work on any academic paper: an editor that reads full documents and produces structured reports, a researcher that maintains a persistent literature database, a grad student that investigates physics claims. The project layer has a writer that knows the specific LaTeX file, notation conventions, and compilation commands for this paper. This separation means starting a new paper requires writing one project-specific agent definition and a set of symlinks, not rebuilding the entire system.
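The linking step is mechanical enough to script. Below is a minimal sketch, assuming the shared layer is laid out like the ~/Write tree above. The function name link_shared_layer is hypothetical, and unlike the tree shown it links every shared entry it finds (including agent-auditor.md), which may or may not be what you want.

```python
from pathlib import Path


def link_shared_layer(shared: Path, project: Path) -> list[Path]:
    """Symlink every shared agent and skill into a project's .claude directory.

    Entries that already exist in the project (such as a project-specific
    writer.md) are left alone. Returns the links that were created.
    """
    created = []
    for kind in ("agents", "skills"):
        source_dir = shared / ".claude" / kind
        if not source_dir.is_dir():
            continue  # nothing shared for this kind
        target_dir = project / ".claude" / kind
        target_dir.mkdir(parents=True, exist_ok=True)
        for source in sorted(source_dir.iterdir()):
            link = target_dir / source.name
            if link.exists() or link.is_symlink():
                continue  # never overwrite project-specific files
            link.symlink_to(source.resolve(), target_is_directory=source.is_dir())
            created.append(link)
    return created


# Example call (paths are assumptions matching the layout above):
# link_shared_layer(Path.home() / "Write", Path("Draft2026"))
```

Because existing entries are skipped, running it a second time is a no-op, so it should be safe to rerun whenever a new shared agent or skill is added.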

Everything an agent produces goes into a structured notes/ directory:

notes/
  editor/          # Timestamped editorial review logs
  grad/            # Investigation notebooks
  research/        # Search logs + cumulative paper database
  verification/    # SymPy scripts + README
  plots_figures/   # Figure scripts + README
  plans/           # Implementation plans
  audits/          # Agent setup audit reports
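
A layout like this can be created up front so that every agent finds its directory on the first run. The sketch below is one way to do that; the function name scaffold_notes and the README stub format are assumptions, while the directory names and descriptions are copied from the listing above.

```python
from pathlib import Path

# Subdirectory names and descriptions copied from the listing above.
NOTES_LAYOUT = {
    "editor": "Timestamped editorial review logs",
    "grad": "Investigation notebooks",
    "research": "Search logs + cumulative paper database",
    "verification": "SymPy scripts + README",
    "plots_figures": "Figure scripts + README",
    "plans": "Implementation plans",
    "audits": "Agent setup audit reports",
}


def scaffold_notes(project: Path) -> Path:
    """Create notes/ with a README stub per subdirectory, keeping existing files."""
    notes = project / "notes"
    for name, description in NOTES_LAYOUT.items():
        subdir = notes / name
        subdir.mkdir(parents=True, exist_ok=True)
        readme = subdir / "README.md"
        if not readme.exists():  # do not clobber a README an agent already wrote
            readme.write_text(f"# {name}\n\n{description}\n", encoding="utf-8")
    return notes
```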

The Agents

Editor (Opus)

The editor reads the entire LaTeX source in sequential chunks, writing notes incrementally to a timestamped log file as it reads rather than summarizing at the end. The output is a structured report with 9 sections: redundancy, logical gaps, ordering issues, sections that are too terse, sections that are too bloated, missing content, notation inconsistencies, figures and tables, and unverified calculations. It preloads the verify-equation skill so it can spot-check suspicious formulas directly.

The key usage pattern: run the editor after every batch of changes, treating it as a regression test. Over 10 editorial passes (each taking roughly 8 minutes and producing a 200-400 line report), the editor caught a wrong Taylor expansion coefficient (1/3 that should have been 1/6), an inconsistent value of Φ_eff across sections, a leftover “butterfly future” reference that should have been “rose future” after a figure update, duplicate LaTeX labels, and missing discussions of Zeeman shifts and mean-field collisions. A human re-read tends to miss these because you remember what you intended to write rather than what you actually wrote.

Grad Student (Opus)

Anthropic titled Schwartz’s blog post “The AI grad student,” and the metaphor is apt. Our grad student agent formalizes that pattern. Its job is to work through a physics problem until it understands it well enough to explain it back. The difference from Schwartz’s approach: the grad student runs in its own context, produces a documented notebook, and uses the verify-equation skill to check its own work symbolically rather than depending on the human to catch fabrications. The typical trigger is an impasse: the writer or editor agents can’t understand the physical geometry of a setup, or they keep making the same error because they lack the conceptual framework to avoid it. You send the grad student away with a specific claim to investigate, the same way you might tell an obstinate graduate student to “go away and think about it until you get it.”

The grad student restates the claim precisely, works through it systematically (using the verify-equation skill for calculations, web search for literature), checks limiting cases, follows implications, and returns a structured verdict. It maintains a lab notebook in notes/grad/. It is instructed to disagree with the advisor when the physics demands it, backed by calculation, and to quantify rather than qualify: “this effect contributes ~10⁻⁹ rad, below the systematic floor” rather than “this effect might matter.”

The MZ interferometer investigation illustrates the pattern. The writer and editor couldn’t correctly reason about why a standard Mach-Zehnder (MZ) interferometer behaves differently from the butterfly/rose configurations near a surface. The physical geometry was immediately obvious to me, but I couldn’t get the agents past their misunderstanding through direct instruction. So I sent the grad student off to figure out the geometry from first principles. It came back with a clear analysis showing that the atom-shield distance is a free parameter for the standard MZ (it is constrained only for butterfly/rose), which means the 10⁴ advantage of the new configurations comes entirely from systematic cancellation, not signal geometry. That understanding, formalized in the grad student’s notebook, could then be imported into the writer’s context to resolve the impasse.

A second investigation into atom loss decoherence produced a path-by-path analysis, a factorization proof, and the identification of orphaned amplitudes that dilute fringe contrast. This became Appendix C of the paper.

Researcher

The researcher searches arXiv and journal websites, provides full citations with arXiv IDs, and summarizes key results with specific numbers. It maintains two files using the research-notes skill: per-session search logs and a cumulative paper database (all_papers_reviewed.md) that persists across sessions. Each entry has a status field (cited, not cited, already cited) that gets updated when papers are added to the bibliography. The researcher reads the existing bibliography before suggesting new citations, so it doesn’t re-recommend papers you’ve already referenced.
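
The de-duplication logic is simple to state in code. The sketch below is illustrative only: the real database is a markdown file whose format is not shown here, so this version works on in-memory records, and keying entries by arXiv ID, the PaperEntry class, and the function name new_candidates are all assumptions. The three status values are the ones named above.

```python
from dataclasses import dataclass

STATUSES = ("cited", "not cited", "already cited")  # values named in the text


@dataclass
class PaperEntry:
    arxiv_id: str
    title: str
    status: str = "not cited"


def new_candidates(candidates, database, bibliography_ids):
    """Return candidates worth suggesting and record every one in the database.

    candidates: list of PaperEntry from the current search.
    database: dict mapping arXiv ID -> PaperEntry (the cumulative record).
    bibliography_ids: set of arXiv IDs already present in the bibliography.
    """
    fresh = []
    for paper in candidates:
        if paper.arxiv_id in database:
            continue  # reviewed in an earlier session; do not re-surface
        if paper.arxiv_id in bibliography_ids:
            paper.status = "already cited"
        database[paper.arxiv_id] = paper
        if paper.status != "already cited":
            fresh.append(paper)
    return fresh
```

Papers found in the bibliography are still written to the database, so a later search skips them without re-reading the bibliography entry.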

In one session, it found roughly 35 new references across atom interferometer geometries and short-range force motivations. All were integrated into the paper with natural prose. The cumulative database meant that a second literature search weeks later didn’t re-surface the same 35 papers.

Writer

The writer is the only project-specific agent. It reads relevant sections of the LaTeX source, maintains consistent notation, compiles with pdflatex, and follows the project’s style rules. It preloads the verify-equation skill to check any new equations it writes and the make-figure skill for figure updates. It is told never to guess algebraic results; it must always verify with SymPy. The reason this agent is project-specific while the others are shared: it references the exact LaTeX file, the exact notation conventions (which variables are bold, which are hatted), and the exact compilation pipeline for this paper. Starting a new project means writing a new writer agent; the editor, researcher, and grad student carry over unchanged.

Agent Auditor

The agent auditor is a meta-agent that reads all agent definitions, CLAUDE.md files, the notes structure, and the symlinks, then produces a structured report on the health of the system. It searches the web for current Claude Code best practices and flags discrepancies. We ran it once; it discovered the subagent delegation limitation that led to the most important architectural change in the project.

Skills Solve the Delegation Problem

The original design had a calculator agent that other agents could call to verify equations. This failed because of a Claude Code platform constraint: subagents cannot spawn other subagents. The editor couldn’t delegate a spot-check to the calculator. The grad student couldn’t call the calculator mid-investigation. The agent auditor flagged this as affecting three of our five agents.

The fix was to convert shared protocols from agents into skills. Instead of delegating to another agent, each agent does the work itself, following the skill’s protocol. The verify-equation skill provides a step-by-step protocol:

  1. Write a self-contained SymPy script in notes/verification/verify_<topic>.py
  2. Use assert statements with PASS/FAIL output
  3. Run the script
  4. Update notes/verification/README.md with the new entry
  5. Add a %% Verified: notes/verification/<script>.py comment in the LaTeX source
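
Steps 1 to 3 might produce a script along these lines. This is a sketch, not one of the project's actual scripts: the identity being checked (the cubic Taylor coefficient of sin x) is a stand-in chosen to echo the 1/3-versus-1/6 error the editor caught, and the check helper and the file name are hypothetical.

```python
# notes/verification/verify_taylor_example.py  (hypothetical file name)
import sympy as sp


def check(label, claimed, derived):
    """Print PASS/FAIL for one claim and return whether it held."""
    ok = sp.simplify(claimed - derived) == 0
    print(f"{'PASS' if ok else 'FAIL'}: {label}")
    return ok


x = sp.symbols("x")

# Derive the expansion symbolically instead of trusting a hand-written one.
expansion = sp.series(sp.sin(x), x, 0, 6).removeO()
cubic = expansion.coeff(x, 3)

results = [
    check("cubic coefficient of sin(x) is -1/6", sp.Rational(-1, 6), cubic),
    check("expansion matches x - x^3/6 + x^5/120",
          x - x**3 / 6 + x**5 / 120, expansion),
]

# A failing assert gives a non-zero exit code, so the calling agent cannot miss it.
assert all(results), "at least one verification failed"
```

Steps 4 and 5 would then record the script in notes/verification/README.md and add the matching %% Verified: comment next to the equation in the LaTeX source.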

Any agent that preloads this skill follows the same steps, producing the same artifacts. The editor, grad student, and writer all verify equations the same way, building a consistent audit trail. The editor checks for %% Verified: tags and flags unverified equations in its report.
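
The tag check itself is easy to automate outside the editor. The following sketch assumes the %% Verified: comment sits inside the equation environment, on the same line as its \end, or on the line directly after it; the text above does not specify the placement, so treat that as a guess. It also only looks at equation and align environments, and the function name unverified_equations is hypothetical.

```python
import re

EQUATION_ENV = re.compile(
    r"\\begin\{(equation|align)\*?\}.*?\\end\{\1\*?\}", re.DOTALL
)
VERIFIED_TAG = re.compile(r"%% Verified:\s*(\S+)")
LABEL = re.compile(r"\\label\{([^}]*)\}")


def unverified_equations(latex: str) -> list[tuple[int, str]]:
    """Return (line_number, label) for each equation lacking a Verified tag."""
    missing = []
    for match in EQUATION_ENV.finditer(latex):
        body = match.group(0)
        # Rest of the \end line plus the line that follows it.
        following = "\n".join(latex[match.end():].split("\n", 2)[:2])
        if VERIFIED_TAG.search(body) or VERIFIED_TAG.search(following):
            continue
        label = LABEL.search(body)
        line_number = latex.count("\n", 0, match.start()) + 1
        missing.append((line_number, label.group(1) if label else "(no label)"))
    return missing
```

Running this over the LaTeX source gives a list of line numbers and labels that still need a verification script.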

Compared with a separate calculator agent, skills have three advantages. There’s no context-switching overhead: the agent has full access to its own conversation history while running the verification. The artifacts are uniform. And any new agent inherits the protocol by adding one line to its frontmatter.

We built three skills: verify-equation (SymPy verification), make-figure (matplotlib with consistent styling and colorblind-friendly palettes), and research-notes (literature search documentation with cumulative database).

What the Notes Directory Buys You

Every agent writes to notes/, and each subdirectory has a README explaining its contents and naming conventions. The practical benefit: future sessions start with full context. The researcher checks the cumulative database before searching and skips papers already reviewed. The editor compares against previous review logs and tracks which issues persist versus which were fixed. Verification scripts connect paper claims to symbolic proofs. When a conversation runs out of context, the notes survive.

We navigated the notes directory in Obsidian, which makes the connections between files visible as a graph rather than a flat listing. An investigation notebook links to the verification scripts it produced; those scripts link back to the LaTeX sections they validate; the editor’s report references both. With 30+ files across 6 subdirectories, a browsable knowledge graph is more useful than ls.

What Worked, What Didn’t

Editor as regression test. Running the editor after every batch of changes catches inconsistencies that accumulate during multi-day writing. The editor reads the entire document systematically every time, which you won’t.

Grad student as context-isolation pattern. When the main conversation can’t get past a conceptual block, offloading the problem to a separate agent with its own context window lets it work through the reasoning without the baggage of the failed attempts. The result comes back as a clean notebook entry that the writer can import. The grad student doesn’t need to be smarter than you; it needs to be able to think about one thing without distraction.

Verification audit trail. Every important equation has a %% Verified: tag pointing to a SymPy script. When the editor flags an “unverified calculation,” there’s a clear path to verification. When a formula needs to change, the verification script catches the error before it propagates.

Should have started with skills. We built the calculator and figure-maker as agents first, then restructured them as skills after the auditor discovered the subagent limitation. Starting with skills for shared protocols would have avoided the restructuring.

Editor needs persistent issue tracking. The editor re-reads the entire paper each pass and sometimes re-flags resolved issues. A persistent “known issues” file that the editor reads at the start and updates at the end would reduce false positives. The V_CP/ℏ value was flagged 4 times despite being correct each time.
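
We have not built this, but a known-issues file could be as small as the sketch below. Everything here is an assumption: the JSON format, the status names, the issue keys, and the function names. Only findings a human has marked "confirmed-correct" are suppressed; anything else, including an issue previously marked resolved, is still reported so that a real regression is not hidden.

```python
import json
from pathlib import Path

SUPPRESSED = {"confirmed-correct"}  # hypothetical status set by a human reviewer


def load_known_issues(path: Path) -> dict[str, str]:
    """Read {issue_key: status}; a missing file means no history yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def triage(findings: list[str], known: dict[str, str]) -> list[str]:
    """Return findings worth reporting and record unseen ones as 'open'."""
    report = []
    for key in findings:
        status = known.setdefault(key, "open")
        if status not in SUPPRESSED:
            report.append(key)
    return report


def save_known_issues(path: Path, known: dict[str, str]) -> None:
    """Write the updated history back for the next editorial pass."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(known, indent=2, sort_keys=True), encoding="utf-8")
```

With a file like this, marking the V_CP/ℏ finding as confirmed-correct once would keep it out of later reports.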

Grad student’s numerical iteration was messy. The atom loss investigation spawned 10 verification scripts (v1 through v10) as the grad student iterated toward the correct velocity-dependent phase formula. A more structured numerical workflow, with clear intermediate results saved and tested, would have been more efficient.

The Numbers

  Paper: 49 pages, ~3,300 lines of LaTeX, 80+ references, 9 figures, 4 tables, 3 appendices.
  Editorial passes: 10 (each ~8 minutes, ~300-line report).
  Grad student investigations: 3 (MZ distance, atom loss decoherence, negative standoff).
  Verification scripts: 8+ (all passing).
  Literature search: ~35 new references found and integrated.
  Agents: 5 (editor, grad student, researcher, writer, agent auditor).
  Skills: 3 (verify-equation, make-figure, research-notes).
  Notes files: 30+ across 6 subdirectories.

None of this required a single conversation that held everything in context simultaneously; each agent worked in its own window, and the notes directory held the project together.