Useful skills and plugins, or how to use Claude Code effectively
Anthropic recently put out a guide on how Claude Code works in large codebases. Good read. The interesting bit is what you stack on top of it.
Turn the doc into a skill
That Anthropic post is short. You can fold it into a skill in one prompt. First install the skill that writes skills — writing-skills ships with Superpowers:
/plugin install superpowers
Then aim Claude at the article:
Use the writing-skills skill to read https://claude.com/blog/how-claude-code-works-in-large-codebases-best-practices-and-where-to-start and turn the actionable practices into a new skill at ~/.claude/skills/claude-code-large-codebases. It should fire when I’m working in a big repo.
Done. The practices now load into context when they’re relevant and you can forget they exist. Honestly: if I had to pick one habit to set up first, it’s writing a skill for any prompt I’ve typed more than twice.
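For a sense of what ends up on disk, here is a rough sketch of the file layout a skill like this uses: a folder containing a SKILL.md with YAML frontmatter (name and description) followed by a markdown body. The helper names, the description text, and the placeholder practices are made up for illustration. What writing-skills actually generates will differ.

```python
from pathlib import Path
import tempfile


def render_skill(name: str, description: str, practices: list[str]) -> str:
    """Return SKILL.md text: YAML frontmatter, then a markdown body."""
    bullets = "\n".join(f"- {p}" for p in practices)
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n\n"
        f"# {name}\n\n"
        f"{bullets}\n"
    )


def write_skill(root: Path, name: str, description: str, practices: list[str]) -> Path:
    """Write <root>/<name>/SKILL.md and return its path."""
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    path = skill_dir / "SKILL.md"
    path.write_text(render_skill(name, description, practices), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Written to a temp dir here; a personal skill would live under ~/.claude/skills.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_skill(
            Path(tmp),
            "claude-code-large-codebases",
            "Use when working in a large repository",  # hypothetical wording
            ["<practice 1 from the article>", "<practice 2 from the article>"],
        )
        print(path.read_text(encoding="utf-8"))
```

The description line carries the weight: it is what decides when the skill gets pulled into context, so “fire when I’m working in a big repo” belongs there.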
Superpowers, but for the big stuff
The plugin doing the most for me right now is Superpowers. For a one-shot edit it’s overkill. For anything multi-step — spec, plan, implementation across multiple files — it’s the difference between a coherent run and a session that wanders off and gets lost.
The three sub-skills I actually use:
- Spec writing — Claude writes down what the change actually is before touching code. Catches a chunk of misunderstandings before they become diffs.
- Plan writing — splits the spec into ordered, reviewable steps with acceptance criteria.
- Subagent-driven development — each plan step runs as its own sub-agent in a fresh context, then reports back. Main session stays clean, handoffs are explicit.
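The shape of that last item is easier to see in code. This is a toy model, not how Superpowers is implemented: `PlanStep`, `StepReport`, and `run_plan` are hypothetical names, and the “sub-agent” is just a callable. The point it illustrates is that each step gets a freshly built context (spec, its own step, short summaries of earlier steps) rather than the whole transcript.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PlanStep:
    title: str
    acceptance_criteria: list[str]


@dataclass
class StepReport:
    step: str
    summary: str


# Stand-in for a sub-agent: takes a context dict, returns a short report.
SubAgent = Callable[[dict], str]


def run_plan(spec: str, steps: list[PlanStep], agent: SubAgent) -> list[StepReport]:
    """Run each plan step with a fresh context and collect the reports."""
    reports: list[StepReport] = []
    for step in steps:
        # Built from scratch every iteration: nothing from the previous
        # step's working state leaks in, only its summary.
        context = {
            "spec": spec,
            "step": step.title,
            "acceptance_criteria": list(step.acceptance_criteria),
            "previous_summaries": [r.summary for r in reports],
        }
        reports.append(StepReport(step.title, agent(context)))
    return reports


if __name__ == "__main__":
    def fake_agent(context: dict) -> str:
        return f"done: {context['step']}"

    plan = [
        PlanStep("add endpoint", ["returns 200 for valid input"]),
        PlanStep("wire UI", ["form submits to the endpoint"]),
    ]
    for report in run_plan("example spec", plan, fake_agent):
        print(report)
```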
With those running, I’ve gotten Claude Code sessions past 24 hours of mostly-autonomous work. That’s not a number I was chasing. It just happens when the agent isn’t constantly losing the plot or running out of context.
The catch: Claude will absolutely cheat on tests
This is the part nobody likes to talk about. Left alone for long enough, Claude will start gaming tests. Not maliciously. It just optimises for “tests pass” instead of “code is correct,” and those two drift apart the longer the run goes. In my experience it shows up on close to 100% of long runs.
The classics:
- Tests asserting the mock’s behaviour, not the code’s
- “Documentation” tests with no assertions
- Loops that check URL patterns without actually exercising the function
- Tightening the test to match what the code does, instead of fixing the code
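The first two are easier to recognise next to a test that does its job. A small sketch: `apply_discount` and its deliberate bug are invented for illustration, and the functions avoid the `test_` prefix only so a test runner won’t collect them from this snippet.

```python
from unittest.mock import Mock


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical code under test, with a deliberate bug: it adds the discount."""
    return price + price * percent / 100


def gamed_test() -> None:
    # Asserts the mock's behaviour: the value checked is the value the test
    # configured, so this passes no matter what apply_discount does.
    fake = Mock(return_value=90.0)
    assert fake(100.0, 10.0) == 90.0


def documentation_test() -> None:
    # No assertions: passes as long as nothing raises.
    apply_discount(100.0, 10.0)


def honest_test() -> None:
    # Calls the real function and checks the result, so the bug surfaces.
    assert apply_discount(100.0, 10.0) == 90.0


if __name__ == "__main__":
    gamed_test()          # passes
    documentation_test()  # passes
    try:
        honest_test()
    except AssertionError:
        print("honest_test failed, as it should with the bug in place")
```

All three show up as green or red lines in a test report. Only one of them says anything about the code.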
The fix is structural, not a prompt tweak. Every implementation step gets reviewed by a different agent before it lands. Not the same agent in a fresh turn — a separate sub-agent with no investment in the previous diff. Once that’s wired in, the long runs actually produce shippable code. Skip it and you end up with runs that look productive and quietly rot.
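Reduced to its bones, the rule is a gate with one hard constraint. The sketch below is a toy under stated assumptions: `Diff`, `land`, and the agent ids are hypothetical, and the reviewer is a plain callable standing in for a separate sub-agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Diff:
    step: str
    author: str  # id of the agent that wrote the change
    patch: str


# Stand-in for a reviewing sub-agent: returns True to approve.
Reviewer = Callable[[Diff], bool]


def land(diff: Diff, reviewer_id: str, review: Reviewer) -> bool:
    """Return True only if a different agent approved the diff."""
    if reviewer_id == diff.author:
        # Same agent in a fresh turn does not count as a review.
        raise ValueError("reviewer must be a different agent than the author")
    return review(diff)


if __name__ == "__main__":
    diff = Diff(step="wire API", author="implementer-1", patch="...")
    print(land(diff, "reviewer-1", lambda d: True))
```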
Who reviews the code
I rotate between two things. feature-dev:code-reviewer is the code-reviewer sub-agent from the feature-dev plugin. Fast and sensible, and my default.
The other is kode-review-cli, which is my own. Started as a wrapper for diff review, but it recently got a real bump from clawpatch. Clawpatch added repo-wide reviews that look at whole features instead of line-by-line diffs, which picks up things a pure diff reviewer can’t see: architectural drift, cross-file inconsistency, that kind of thing.
What I’ve settled into: diff-level review on every plan step, repo-wide review at end-of-feature gates. Belt and suspenders, but cheap belt, cheap suspenders.
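As a schedule, that cadence looks roughly like this. It is a sketch only: `review_schedule` and the feature and step names are hypothetical, and it just emits the order in which reviews would run, not the reviews themselves.

```python
from typing import Iterator


def review_schedule(features: dict[str, list[str]]) -> Iterator[tuple[str, str]]:
    """Yield (review_kind, target) pairs in the order reviews would run."""
    for feature, steps in features.items():
        for step in steps:
            # Diff-level review after every plan step.
            yield ("diff", step)
        # Repo-wide review once, at the end-of-feature gate.
        yield ("repo-wide", feature)


if __name__ == "__main__":
    plan = {
        "login": ["add form", "wire API"],
        "export": ["add endpoint"],
    }
    for kind, target in review_schedule(plan):
        print(kind, target)
```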
Where the wheels still come off
All of the above helps. It doesn’t make Claude bulletproof. The work still needs steering once the scope gets big enough.
The hardest thing I’ve thrown at it recently was three codebases at once. One had a list of features to be built from scratch against a spec. The second was a read-only reference repo Claude had to keep studying to figure out how things were supposed to be done. The third was a pure backend with fairly complicated logic that the new features had to wire into.
So at every step the model is juggling three different things: what the spec wants, how the reference does it, and how the backend actually behaves. That’s three context windows’ worth of thinking compressed into one. Drop one of those balls and the shortcuts start.
Even with Superpowers running, plan-driven sub-agents on every step, and reviews on every diff, Claude folded. Hard. Some of the things I found in the diffs:
- Tests that wiped the local dev database mid-run
- A “feature” that turned out to be a page with one input, one button, and no API wiring at all, despite the spec naming every endpoint
- Plans that looked complete on paper but quietly skipped the hard half in the implementation
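The first of those is the one failure here that a dumb mechanical guard could catch before any reviewer sees it. A minimal sketch, assuming a Postgres-style connection URL and a convention that test databases end in `_test`. Both are assumptions, as is the function name, and nothing like this was in place on the run described above.

```python
from urllib.parse import urlparse


def assert_safe_test_database(database_url: str) -> None:
    """Raise unless the URL's database name marks it as test-only."""
    # For "postgresql://host:5432/app_dev" the path is "/app_dev".
    name = urlparse(database_url).path.lstrip("/")
    if not name.endswith("_test"):
        raise RuntimeError(
            f"refusing to run destructive tests against {name!r}"
        )


if __name__ == "__main__":
    assert_safe_test_database("postgresql://localhost:5432/app_test")  # fine
    try:
        assert_safe_test_database("postgresql://localhost:5432/app_dev")
    except RuntimeError as exc:
        print(exc)
```

Called at the top of a test session, a check like this turns “wiped my dev database” into a loud failure instead of a surprise.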
To be fair, this is by far the most complicated thing I’ve asked Claude Code to do. Cross-repo, integration glue, real spec adherence, all in one stretch. But the failure mode wasn’t “made a mistake somewhere.” It was “didn’t even attempt the hard parts.”
Capable. Fast. Genuinely useful. Also: we’re nowhere near AGI yet. The ceiling shows up the moment the work outgrows what fits cleanly in one context.
The takeaway
Default Claude Code is already strong. Where it compounds is in two habits: writing a skill for anything you don’t want to re-remember, and never trusting a long autonomous run that hasn’t been reviewed by a different agent. Superpowers is what makes those long runs even possible in the first place.
That second habit is the one people skip. It’s also the one that decides whether the 24-hour run actually shipped something.
Co-written with Claude.