Claude vs GPT-4 for Code Generation: Our Internal Benchmark
We ran both models on 200 real tasks from our projects. Here's where each one wins — and where both fail.
We use both Claude and GPT-4 across our projects. Instead of relying on vibes, we ran a proper benchmark on 200 real tasks pulled from our actual codebase.
The test
We ran 200 tasks across five categories: React components, API endpoints, database queries, bug fixes, and code reviews. Each task was graded on correctness, code quality, and whether the output ran on the first attempt.
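For a concrete sense of how results were tallied, here is a minimal sketch of a per-task result record and a per-category win-rate helper in TypeScript. The field names and the 0–2 scoring scale are illustrative assumptions, not our exact rubric.

```typescript
// Sketch only: field names and the 0-2 scale are assumptions, not the exact rubric.

type Category =
  | "react-component"
  | "api-endpoint"
  | "database-query"
  | "bug-fix"
  | "code-review";

interface TaskResult {
  id: string;
  category: Category;
  model: "claude" | "gpt-4";
  correctness: 0 | 1 | 2;  // 0 = wrong, 1 = partially right, 2 = correct
  codeQuality: 0 | 1 | 2;  // reviewer-assigned score
  ranFirstTry: boolean;    // did the output run without any edits?
}

// Share of tasks in one category that a model got fully right on the first try.
function winRate(
  results: TaskResult[],
  model: TaskResult["model"],
  category: Category
): number {
  const relevant = results.filter(
    (r) => r.model === model && r.category === category
  );
  if (relevant.length === 0) return 0;
  const wins = relevant.filter((r) => r.correctness === 2 && r.ranFirstTry).length;
  return wins / relevant.length;
}
```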
Results summary
Claude wins on:
- Long-context tasks (reading 500+ lines of code and making targeted changes)
- Following complex instructions with multiple constraints
- Code review and identifying subtle bugs
- TypeScript and type-heavy code
GPT-4 wins on:
- Quick utility functions and one-liners
- Python data science code (pandas, numpy)
- Regex and string manipulation
Both struggle with:
- Complex state management (Redux, Zustand patterns)
- Database migrations with data transformations
- Code that depends on undocumented internal APIs
What we actually do
We use Claude for our main development workflow — code generation, reviews, refactoring, and architecture discussions. We use GPT-4 for quick data scripts and regex. Neither replaces senior engineering judgment.
The real takeaway
The model matters less than the prompt. A well-structured prompt with context, constraints, and examples gets better results from either model than a lazy prompt gets from the best model.
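To make that concrete, here is a minimal sketch of the kind of structured prompt we mean, using a hypothetical buildPrompt helper; the section names and the sample task are illustrative, not a prescribed format.

```typescript
// Sketch of a "context + constraints + example" prompt. buildPrompt and the
// sample task are hypothetical, shown only to illustrate the structure.

interface PromptSpec {
  context: string;        // relevant code, stack, and background
  task: string;           // what to produce
  constraints: string[];  // hard requirements the output must satisfy
  example?: string;       // a sample of the expected output shape
}

function buildPrompt(spec: PromptSpec): string {
  const sections = [
    `Context:\n${spec.context}`,
    `Task:\n${spec.task}`,
    `Constraints:\n${spec.constraints.map((c) => `- ${c}`).join("\n")}`,
  ];
  if (spec.example) {
    sections.push(`Example of expected output:\n${spec.example}`);
  }
  return sections.join("\n\n");
}

const prompt = buildPrompt({
  context: "React 18 + TypeScript app; forms use a shared useForm hook.",
  task: "Write a form component that invites a user by email.",
  constraints: [
    "Use a controlled input",
    "Validate the email before submitting",
    "Do not add new dependencies",
  ],
  example: "<InviteUserForm onSubmit={handleInvite} />",
});
```

The specifics matter less than the shape: context, constraints, and an example of the expected output all stated explicitly, for whichever model you send it to.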
