An 8-Person Engineering Team Shipped 3.1x More Features After a 2-Week Agentic Coding Setup

The real cost of inconsistency

This team was not slow. In a good sprint they shipped 4-6 features. In a bad sprint, when someone was blocked, or a PR had three rounds of review comments, or a new hire was ramping up, they shipped 1-2.

The CTO described it as “feast or famine.” The product backlog was growing faster than they could clear it. Their sprint velocity had a standard deviation more than half its mean (2.1 against an average of 3.8 features per sprint), which made roadmap commitments nearly impossible.

They had tried a few things: stricter PR standards, better sprint planning, mob programming sessions. All helped marginally. The underlying problem was that code quality was person-dependent. When the senior engineers wrote code, it was clean, well-tested, and merged in one round of review. When anyone else wrote code, it was not. That was not because they were bad engineers, but because the standards lived in the seniors’ heads.

What “agentic coding setup” actually means

An agentic coding setup is not “install Cursor and call it done.” The tool is only as good as the context you give it. Cursor without project rules, or Claude Code without a CLAUDE.md, is like a new hire on their first day with no onboarding: technically capable but missing everything that makes them productive in your specific codebase.

The two-week engagement:

Week 1: Audit and codify

We spent three days reading code and interviewing engineers. What were the most common PR review comments? Six recurring ones: missing input validation, no error handling on async calls, hardcoded strings instead of constants, no loading states on mutations, inconsistent prop naming, missing test coverage on utils. What did the senior engineers know that no one had written down? Quite a lot: specific patterns for their tRPC setup, how they structured Prisma transactions, the conventions around their custom React hooks.
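To make three of those recurring comments concrete, here is a minimal sketch of what code that avoids them can look like. It is not the team's code: every name (ERROR_MESSAGES, validateNewUser, createUser, the save callback) is hypothetical, and the email check is deliberately simplistic. It covers input validation, error handling on an async call, and constants in place of hardcoded strings.

```typescript
// Hypothetical example; names and messages are illustrative only.

// Constants instead of hardcoded strings scattered through the code.
const ERROR_MESSAGES = {
  invalidEmail: "Email address is not valid",
  saveFailed: "Could not save the user",
} as const;

type Result<T> = { ok: true; value: T } | { ok: false; error: string };

interface NewUser {
  email: string;
}

// Input validation: reject bad input before any async work starts.
function validateNewUser(input: unknown): Result<NewUser> {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: ERROR_MESSAGES.invalidEmail };
  }
  const email = (input as { email?: unknown }).email;
  // Simplistic shape check, assumed here only for illustration.
  if (typeof email !== "string" || !/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(email)) {
    return { ok: false, error: ERROR_MESSAGES.invalidEmail };
  }
  return { ok: true, value: { email } };
}

// Error handling on async calls: a failed save becomes a typed result
// instead of an unhandled rejection.
async function createUser(
  input: unknown,
  save: (user: NewUser) => Promise<void>,
): Promise<Result<NewUser>> {
  const validated = validateNewUser(input);
  if (!validated.ok) {
    return validated;
  }
  try {
    await save(validated.value);
    return validated;
  } catch {
    return { ok: false, error: ERROR_MESSAGES.saveFailed };
  }
}
```

Passing the persistence step in as a callback keeps the sketch self-contained. In a real tRPC and Prisma codebase the same shape would sit inside a procedure and call the database client directly.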

That knowledge became:

A CLAUDE.md covering the full stack (Next.js 15, tRPC, Prisma, Tailwind, Vitest)
Six .cursor/rules/ files for common patterns (one per major code area)
A SECURITY.md for auth and data access conventions
A TESTING.md with the testing philosophy and example patterns

Week 2: Tooling + handover

We set up Claude Code on every engineer's machine with the repo-specific config and configured the Cursor workspace rules. We then built three custom Claude Code slash commands for the team's most common tasks: /pr-review (checks for the six recurring issues before the human reviews), /add-tests (generates Vitest tests for a specified file), /update-types (propagates type changes through the codebase).
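The command definitions themselves are not reproduced in this write-up. As a rough illustration of the idea behind /pr-review, the sketch below turns the six recurring review comments from week 1 into a reusable checklist prompt. The function name buildReviewPrompt, the prompt wording, and the file paths in the usage line are all assumptions made for the example, not the team's actual command.

```typescript
// Hypothetical sketch: the six recurring review comments, codified once.
const RECURRING_ISSUES = [
  "missing input validation",
  "no error handling on async calls",
  "hardcoded strings instead of constants",
  "no loading states on mutations",
  "inconsistent prop naming",
  "missing test coverage on utils",
] as const;

// Builds the text an AI reviewer would be given for a set of changed files.
function buildReviewPrompt(changedFiles: readonly string[]): string {
  const checklist = RECURRING_ISSUES
    .map((issue, index) => `${index + 1}. ${issue}`)
    .join("\n");
  const files = changedFiles.map((file) => `- ${file}`).join("\n");
  return [
    "Review the changed files below before a human reviewer sees them.",
    "Report each occurrence of the following issues with file and line:",
    checklist,
    "Changed files:",
    files,
  ].join("\n");
}

// Example usage with made-up paths.
const prompt = buildReviewPrompt(["src/server/user.ts", "src/utils/format.ts"]);
console.log(prompt);
```

Keeping the list in one place means a new recurring comment is added once and every later review picks it up.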

We also ran two pairing sessions, one with the most experienced engineer and one with the most junior, to calibrate the prompting approach. Different engineers needed different mental models for working with AI coding tools.

The 8-week retrospective numbers

We measured baseline velocity for 4 weeks before the engagement and tracked for 8 weeks after.

Before: average 3.8 features/sprint, standard deviation 2.1, average PR review cycles 2.8
After (weeks 1-4): average 6.2 features/sprint, standard deviation 1.4, average PR review cycles 1.9
After (weeks 5-8): average 11.8 features/sprint, standard deviation 0.9, average PR review cycles 1.1
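The headline multiple and the change in predictability both follow directly from these figures. The short sketch below recomputes them. The Period type and the helper names are ours, introduced only for illustration. The coefficient of variation (standard deviation divided by mean) is not reported in the retrospective; we add it as one common way to compare spread across periods with different averages.

```typescript
interface Period {
  label: string;
  meanFeatures: number; // average features per sprint
  stdDev: number; // standard deviation of features per sprint
  reviewCycles: number; // average PR review cycles
}

const baseline: Period = { label: "Before", meanFeatures: 3.8, stdDev: 2.1, reviewCycles: 2.8 };
const firstMonth: Period = { label: "After (weeks 1-4)", meanFeatures: 6.2, stdDev: 1.4, reviewCycles: 1.9 };
const secondMonth: Period = { label: "After (weeks 5-8)", meanFeatures: 11.8, stdDev: 0.9, reviewCycles: 1.1 };

// Ratio of average velocity between two periods.
function speedup(before: Period, after: Period): number {
  return after.meanFeatures / before.meanFeatures;
}

// Spread relative to the mean; lower means more predictable sprints.
function coefficientOfVariation(period: Period): number {
  return period.stdDev / period.meanFeatures;
}

// 11.8 / 3.8 is about 3.1, the multiple in the title.
console.log(`Speedup: ${speedup(baseline, secondMonth).toFixed(1)}x`);

for (const period of [baseline, firstMonth, secondMonth]) {
  const cv = coefficientOfVariation(period);
  console.log(`${period.label}: spread is ${(cv * 100).toFixed(0)}% of the mean`);
}
```

On these numbers the spread falls from roughly 55% of the mean before the engagement to under 8% in weeks 5-8, so sprints became more predictable in relative as well as absolute terms.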

The acceleration in weeks 5-8 surprised everyone. The hypothesis: the first month was engineers learning to work with the tools. The second month was the tools paying compound interest as engineers built intuitions about which tasks were right to delegate to an AI agent and which needed full human attention.

The specific things that drove the numbers

The CLAUDE.md solved the onboarding problem first. Within 3 days of deployment, their two junior engineers were writing code that cleared the linter, passed tests, and had error handling. Not because they got smarter, but because the AI now enforced the standards automatically.

The /pr-review command changed the culture. Engineers started running it before requesting human review. The first-pass human review cycle dropped because the mechanical issues were already caught. Reviewers could focus on logic and architecture.

The test coverage jump was the surprise. Before: test coverage on new code was approximately 40% (highly variable). After: 78% average. The /add-tests command made writing tests faster than skipping them.

The standard deviation collapse mattered as much as the mean. Predictable output is more valuable than occasional bursts. With velocity standard deviation below 1.0, the CTO made quarterly roadmap commitments he had not been able to make in 18 months.

What did not work

Not all engineers adopted the tools at the same pace. Two of the eight resisted for the first three weeks: one because they had a bad experience with Copilot two years earlier and assumed this was the same, the other because they felt their coding was already fast and did not see the point. Both came around, but it took a specific use case that clicked for each of them. For the skeptic, that moment was watching Claude Code refactor, in under 4 minutes, a 400-line file that would have taken him two hours. You cannot force adoption. You can only find the moment that makes the value obvious.

Tech used

Cursor · Claude Code · custom CLAUDE.md and SKILL.md configuration · repository-specific .cursor/rules/ · custom slash commands for PR review, test generation, and type propagation