Claude vs GPT-4 for Code Generation: Our Internal Benchmark
We ran both models on 200 real tasks from our projects. Here's where each one wins — and where both fail.
We use both Claude and GPT-4 across our projects. Instead of relying on vibes, we ran a proper benchmark on 200 real tasks pulled from our actual codebase.
The test
We ran 200 tasks across five categories: React components, API endpoints, database queries, bug fixes, and code reviews. Each task was graded on correctness, code quality, and whether the output ran on the first attempt.
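For a concrete sense of how results were tallied, here is a minimal sketch of a per-task result record and a per-category win-rate helper in TypeScript. The field names and the 0–2 scoring scale are illustrative assumptions, not our exact rubric.

```typescript
// Sketch only: field names and the 0-2 scale are assumptions, not the exact rubric.

type Category =
  | "react-component"
  | "api-endpoint"
  | "database-query"
  | "bug-fix"
  | "code-review";

interface TaskResult {
  id: string;
  category: Category;
  model: "claude" | "gpt-4";
  correctness: 0 | 1 | 2;  // 0 = wrong, 1 = partially right, 2 = correct
  codeQuality: 0 | 1 | 2;  // reviewer-assigned score
  ranFirstTry: boolean;    // did the output run without any edits?
}

// Share of tasks in one category that a model got fully right on the first try.
function winRate(
  results: TaskResult[],
  model: TaskResult["model"],
  category: Category
): number {
  const relevant = results.filter(
    (r) => r.model === model && r.category === category
  );
  if (relevant.length === 0) return 0;
  const wins = relevant.filter((r) => r.correctness === 2 && r.ranFirstTry).length;
  return wins / relevant.length;
}
```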
Results summary
Claude wins on:
- Long-context tasks (reading 500+ lines of code and making targeted changes)
- Following complex instructions with multiple constraints
- Code review and identifying subtle bugs
- TypeScript and type-heavy code
GPT-4 wins on:
- Quick utility functions and one-liners
- Python data science code (pandas, numpy)
- Regex and string manipulation
Both struggle with:
- Complex state management (Redux, Zustand patterns)
- Database migrations with data transformations
- Code that depends on undocumented internal APIs
What we actually do
We use Claude for our main development workflow — code generation, reviews, refactoring, and architecture discussions. We use GPT-4 for quick data scripts and regex. Neither replaces senior engineering judgment.
The real takeaway
The model matters less than the prompt. A well-structured prompt with context, constraints, and examples gets better results from either model than a lazy prompt gets from the best model.
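To make that concrete, here is a minimal sketch of the kind of structured prompt we mean, using a hypothetical buildPrompt helper; the section names and the sample task are illustrative, not a prescribed format.

```typescript
// Sketch of a "context + constraints + example" prompt. buildPrompt and the
// sample task are hypothetical, shown only to illustrate the structure.

interface PromptSpec {
  context: string;        // relevant code, stack, and background
  task: string;           // what to produce
  constraints: string[];  // hard requirements the output must satisfy
  example?: string;       // a sample of the expected output shape
}

function buildPrompt(spec: PromptSpec): string {
  const sections = [
    `Context:\n${spec.context}`,
    `Task:\n${spec.task}`,
    `Constraints:\n${spec.constraints.map((c) => `- ${c}`).join("\n")}`,
  ];
  if (spec.example) {
    sections.push(`Example of expected output:\n${spec.example}`);
  }
  return sections.join("\n\n");
}

const prompt = buildPrompt({
  context: "React 18 + TypeScript app; forms use a shared useForm hook.",
  task: "Write a form component that invites a user by email.",
  constraints: [
    "Use a controlled input",
    "Validate the email before submitting",
    "Do not add new dependencies",
  ],
  example: "<InviteUserForm onSubmit={handleInvite} />",
});
```

The specifics matter less than the shape: context, constraints, and an example of the expected output all stated explicitly, for whichever model you send it to.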
