How I use Claude agents in a 200K-line Xcode project
Six months ago we turned on autonomous AI agents in our main iOS codebase — 200K lines, eight years of accumulated architecture, eleven engineers — and let them ship real code. This is what worked, what didn't, and the five workflow patterns I've kept.
The setup, briefly
Our app is a mature B2B iOS product. Swift, UIKit + a growing SwiftUI surface, a custom dependency-injection layer, three feature modules each with their own owner. The kind of codebase where "make a small change" is rarely small.
The agent setup: Claude as the primary, DeepSeek as the cost-conscious fallback for routine work, and an explicit hand-off protocol I'll describe below. Agents run from the chat panel during business hours; I don't let them run unattended overnight.
Lesson 1: the project file is the danger zone
The first thing that broke was project.pbxproj. The agent added a new file, the file references landed in the wrong target, and a teammate's build started failing two days later because their target was missing the file. The agent's reasoning was correct; its edit to a complex, undocumented Apple property-list format was not.
What I do now:
- Snapshots on every .pbxproj touch. I never accept a project-file edit without the snapshot being green. Recovery is one click; debugging an unexplained build break is half a day.
- Manual review for new files. When the agent adds a file, I check the Xcode target membership manually before merging. Trust but verify; the verify part is fast.
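The target-membership check can be partly mechanized. The sketch below is a heuristic, not a parser: it assumes the project lists files explicitly (no synchronized folders) and relies on the comment annotations Xcode conventionally writes next to each object ID. The function name and the sample object IDs are invented for illustration.

```python
import re


def sources_build_entries(pbxproj_text: str, filename: str) -> list:
    """Return the IDs of PBXBuildFile entries that compile `filename`.

    Heuristic: a file compiled into N targets normally has N
    "<name> in Sources" PBXBuildFile entries, one per target.
    """
    pattern = re.compile(
        r"^\s*([0-9A-Fa-f]+) /\* "
        + re.escape(filename)
        + r" in Sources \*/ = \{isa = PBXBuildFile;",
        re.MULTILINE,
    )
    return pattern.findall(pbxproj_text)


# Hypothetical excerpt; the object IDs are made up for the example.
SAMPLE = """\
/* Begin PBXBuildFile section */
    AA0000000000000000000001 /* Foo.swift in Sources */ = {isa = PBXBuildFile; fileRef = FF0000000000000000000001 /* Foo.swift */; };
    AA0000000000000000000002 /* Foo.swift in Sources */ = {isa = PBXBuildFile; fileRef = FF0000000000000000000001 /* Foo.swift */; };
    AA0000000000000000000003 /* Bar.swift in Sources */ = {isa = PBXBuildFile; fileRef = FF0000000000000000000002 /* Bar.swift */; };
/* End PBXBuildFile section */
"""

foo_entries = sources_build_entries(SAMPLE, "Foo.swift")
new_file_entries = sources_build_entries(SAMPLE, "NewFile.swift")
```

An empty result for a file the agent just added suggests it is in no compile phase; more entries than expected suggests it landed in extra targets. The count does not say which targets, so this narrows the manual check rather than replacing it.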
Lesson 2: separate the "think" from the "do"
The single biggest workflow win was making "plan first, execute second" a habit. Plan mode → read-only analysis → review the proposed plan → switch to acceptEdits mode → let it execute.
The plan step catches the kind of bad decision that's hard to undo: choosing the wrong abstraction, splitting a refactor at the wrong boundary, deciding to "just rewrite this." The execute step then runs fast because you're past the judgment calls.
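The gate between the two phases can be pictured as a tiny state machine. This is a minimal sketch of the idea, not the IDE's API: the two mode names come from the workflow above, while the Session class and its methods are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    PLAN = "plan"                 # read-only analysis
    ACCEPT_EDITS = "acceptEdits"  # edits may be applied


@dataclass
class Session:
    """Hypothetical session that refuses edits until a plan is approved."""

    mode: Mode = Mode.PLAN
    plan: Optional[str] = None
    applied: List[str] = field(default_factory=list)

    def propose(self, plan: str) -> None:
        # Any new plan drops the session back to read-only.
        self.plan = plan
        self.mode = Mode.PLAN

    def approve(self) -> None:
        # Human sign-off: only now does the session leave plan mode.
        if self.plan is None:
            raise ValueError("nothing to approve: no plan proposed")
        self.mode = Mode.ACCEPT_EDITS

    def apply_edit(self, edit: str) -> None:
        if self.mode is not Mode.ACCEPT_EDITS:
            raise PermissionError("edit refused: plan not approved yet")
        self.applied.append(edit)
```

The design choice worth noting is that proposing a new plan resets the mode, so a change of direction mid-task always passes through review again.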
"Show me what you'd do before you do it" is the prompt I type more than any other.
Lesson 3: providers aren't interchangeable
I ran the same six prompts on Claude and DeepSeek and graded the results. Claude won on complex refactors, multi-file edits, and anything that required inferring project conventions. DeepSeek won on raw code generation and on mechanical work (apply this pattern across 40 files) where the constraint is throughput, not judgment.
The split I ended up with:
- Claude for design conversations, architecture choices, anything that touches the dependency-injection layer.
- DeepSeek for "lint and migrate" tasks: format passes, naming consistency, adopting a new utility everywhere it should be used.
- Switch mid-conversation when an agent stalls; usually the second model gets unstuck faster than I can rewrite the prompt.
Cost matters at our usage levels but it's not the main factor. Match the model to the task and you reduce both cost and latency.
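Written down as a routing table, the split looks roughly like this. It is a sketch under assumptions: the task labels are my own shorthand, the provider strings are plain labels rather than real API model identifiers, and defaulting unknown tasks to Claude is a choice made for the example.

```python
from typing import Optional

CLAUDE = "claude"      # label only, not an API model identifier
DEEPSEEK = "deepseek"  # label only, not an API model identifier

# Hypothetical task labels mapped to the provider that handled them best.
ROUTES = {
    "design": CLAUDE,
    "architecture": CLAUDE,
    "dependency-injection": CLAUDE,
    "format-pass": DEEPSEEK,
    "naming-consistency": DEEPSEEK,
    "utility-adoption": DEEPSEEK,
}


def pick_model(task: str, stalled_on: Optional[str] = None) -> str:
    """Pick a provider for `task`; switch if the current one has stalled."""
    if stalled_on is not None:
        # The agent stalled mid-conversation: hand off to the other provider.
        return DEEPSEEK if stalled_on == CLAUDE else CLAUDE
    # Assumption: tasks not in the table default to Claude.
    return ROUTES.get(task, CLAUDE)
```

For example, pick_model("format-pass") selects the throughput-oriented provider, while pick_model("format-pass", stalled_on=DEEPSEEK) hands the same task to the other one.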
Lesson 4: subagents are for the boring scary stuff
I write more subagents than I expected. The pattern: any task that the main chat keeps getting wrong because of context dilution becomes a subagent with a tight system prompt and a restricted tool set.
Two subagents I rely on weekly:
- pre-merge-review: read-only review with tools Read, Grep, Glob, and Bash only. Can't propose edits. Surfaces flaky tests, weird error handling, and the "you forgot to add a test for X" gap.
- doc-aligner: when our README and code drift, this subagent fixes the README. Its only tools touch markdown files. Can't break code by definition.
The narrower the surface, the more useful the subagent. Counterintuitive at first; obvious after you've written three.
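The "narrow surface" idea reduces to an allowlist check. The sketch below models it in plain Python and is not the IDE's actual subagent configuration format: the Subagent class is hypothetical, and the "Edit" tool name and the exact tool list for doc-aligner are assumptions (the only stated fact is that its tools touch markdown files).

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Subagent:
    """Hypothetical model of a subagent's permitted surface."""

    name: str
    tools: FrozenSet[str]
    writable_suffixes: Tuple[str, ...] = ()  # empty means no writes at all

    def may_use(self, tool: str) -> bool:
        return tool in self.tools

    def may_write(self, path: str) -> bool:
        # Writing needs both an edit tool and an allowed file suffix.
        # str.endswith with an empty tuple is always False.
        return "Edit" in self.tools and path.endswith(self.writable_suffixes)


pre_merge_review = Subagent(
    name="pre-merge-review",
    tools=frozenset({"Read", "Grep", "Glob", "Bash"}),
)

doc_aligner = Subagent(
    name="doc-aligner",
    tools=frozenset({"Read", "Edit"}),  # assumed tool list
    writable_suffixes=(".md",),
)
```

With these definitions, the reviewer has no edit tool to call, and the doc subagent is refused on any path that is not a markdown file.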
Lesson 5: worktrees changed how I run experiments
I used to fight the impulse to let the agent try something speculative because "what if it makes a mess." Now I fork a worktree, let it run, and either keep what it built or delete the whole tree.
The unlock isn't the git plumbing — I knew worktrees existed before LingCode. It's that the IDE makes the fork-experiment-merge loop fast enough that I'll actually do it. Five minutes from "I want to try X" to "I've tried X and decided."
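For reference, the plumbing the IDE wraps is only a few git commands. This sketch builds them as argument lists so they can be inspected before running; the exp/ branch prefix, the sibling-directory layout, and the function names are assumptions made for the example, not what the IDE does internally.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_path(repo: Path, name: str) -> Path:
    """Sibling directory for the experiment, e.g. ../app-try-x."""
    return repo.parent / f"{repo.name}-{name}"


def fork_cmd(repo: Path, name: str, base: str = "HEAD") -> List[str]:
    """Command that creates a new branch checked out in its own worktree."""
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", f"exp/{name}", str(worktree_path(repo, name)), base]


def discard_cmds(repo: Path, name: str) -> List[List[str]]:
    """Commands that throw the experiment away: the tree, then the branch."""
    return [
        ["git", "-C", str(repo), "worktree", "remove", "--force",
         str(worktree_path(repo, name))],
        ["git", "-C", str(repo), "branch", "-D", f"exp/{name}"],
    ]


def run(cmd: List[str]) -> None:
    """Execute one command, raising if git reports a failure."""
    subprocess.run(cmd, check=True)
```

Keeping the experiment means merging the exp/ branch as usual and removing only the worktree; discarding it means running both commands from discard_cmds.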
Where it still breaks
Not everything works. Three categories where I still drive manually:
- Performance work. The agent will optimize the wrong thing. "Make this faster" is a question about your production traces, not your source code, and the agent doesn't have those.
- Cross-team boundary changes. When the change spans modules owned by different people, the social and code review surface area exceeds what the agent can usefully reason about.
- Anything depending on a third-party SDK in flux. The agent's prior knowledge of the SDK wins out over what the SDK currently does. Always verify against the SDK's current docs.
What I'd tell a teammate starting tomorrow
Three rules:
- Plan mode first; never accept a non-trivial edit without a plan you read.
- Snapshots are not optional on config files. The cost is zero; the upside is one full day of work you don't lose.
- When the agent gets stuck, switch models. It's faster than re-prompting in the same model.
The mental model that took the longest to internalize: the agent isn't a junior engineer who happens to type fast. It's a fundamentally different kind of collaborator with different strengths, different blind spots, and a different cost profile. Treating it like a junior leads to over-supervision; treating it like infrastructure leads to under-supervision. The right setting is somewhere in between, and it took me three months to find it.