Nathan Fennel

Same Model, Different Results

Why Claude Sonnet feels completely different in Cursor, Claude Code, and Antigravity. The toolchain is the product.

You can run Claude Sonnet in three different tools today and get three wildly different experiences. Same weights, same training data, completely different results.

This is the part that confuses people. If the model is identical, why does it write better code in one tool than another? Why does it seem to "understand" your project in Cursor but fumble in a raw API call? The answer is not about the model at all. The model is only one layer of the stack. The toolchain wrapping it, the context strategy feeding it, and the agent loop orchestrating it are what actually shape the output.

As Builder.io put it: in 2026, the best LLM for coding is not a model. It is a stack. That framing changes everything about how you should evaluate your tools.

Three Tools, One Model

Consider three tools that all support Claude Sonnet under the hood. Each one takes a fundamentally different approach to the same problem.

Cursor is IDE-native. It lives inside your editor and builds a vector index of your entire codebase. When you ask it a question, it retrieves the most relevant files and feeds them to the model alongside your prompt. Its Composer agent mode can generate code at roughly 250 tokens per second and apply multi-file edits directly in your workspace. You never leave the editor.
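To make that retrieval step concrete, here is a rough sketch. The toy word-count embedding stands in for a real embedding model and vector index, and none of the names reflect Cursor's actual implementation:

```python
# Illustrative sketch of retrieval-augmented context assembly.
# The "embedding" here is a toy word-count vector; a real tool would
# use a learned embedding model and a proper vector index.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_context(question: str, files: dict[str, str], k: int = 3) -> str:
    # Score every file against the question and keep only the top k.
    q = embed(question)
    ranked = sorted(files, key=lambda path: cosine(q, embed(files[path])), reverse=True)
    chunks = [f"# {path}\n{files[path]}" for path in ranked[:k]]
    # The model never sees the whole repo, only this selection plus the prompt.
    return "\n\n".join(chunks) + f"\n\nQuestion: {question}"
```

The point is not the scoring. The point is that the model only ever sees this selection, so the retrieval decision shapes the answer before the model is even involved.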

Claude Code is terminal-first. It uses Anthropic's Model Context Protocol for tool integration, giving it the ability to read files, run commands, search your codebase, and call external services. It supports extended thinking for complex reasoning and can use the full 200k context window. You work in your terminal, and the agent figures out what it needs to read and do.
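To give a flavor of what that integration looks like, here is a minimal custom MCP server, assuming the FastMCP helper from the official MCP Python SDK; the count_todos tool is an invented example:

```python
# Minimal MCP server exposing one custom tool, assuming the FastMCP
# helper from the official MCP Python SDK (pip install mcp).
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("repo-tools")

@mcp.tool()
def count_todos(directory: str = ".") -> int:
    """Count TODO markers across Python files in a directory."""
    return sum(p.read_text(errors="ignore").count("TODO")
               for p in Path(directory).rglob("*.py"))

if __name__ == "__main__":
    # An MCP client such as Claude Code can launch this process and
    # call count_todos alongside its built-in tools.
    mcp.run()
```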

Antigravity is Google's agent-first platform. It takes a documentation-driven approach, using artifact-based context and a skills/workflows system to produce structured deliverables. Instead of operating inside your editor or terminal, it builds an understanding of your project through explicit documentation and then generates code against that spec.

Same Claude Sonnet model underneath all three. The experience of using each one is radically different.

Why the Difference?

Four factors explain most of the divergence.

Context strategy is the biggest one. Each tool decides what code to show the model, and that decision changes everything. Cursor uses vector indexing to retrieve semantically relevant files. Claude Code scopes context dynamically, reading files on demand as the agent explores your project. Antigravity uses artifact-based context, building understanding through structured documentation rather than raw code. As Coderide.ai notes: "Without context, the AI guesses. With context, it knows."

Tool harness determines what actions the model can take. Cursor gives the model over ten tools for searching, editing, and running terminal commands inside the IDE. Claude Code uses MCP with dynamic tool loading, meaning the model can call out to external services, databases, or custom integrations. Each toolbox constrains and enables different kinds of work. A model that can only edit files will produce different results than one that can also run your test suite and read the output.
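A toy harness makes the point concrete: the model can only act through whatever handlers the product registers. The tool names, descriptions, and the pytest call below are illustrative assumptions, not any real product's toolbox:

```python
# Toy tool harness: the model only ever "does" what the registry allows.
import json
import subprocess

TOOLS = {
    "read_file": {
        "description": "Return the contents of a file in the workspace.",
        "handler": lambda args: open(args["path"]).read(),
    },
    "run_tests": {
        "description": "Run the test suite and return its output.",
        "handler": lambda args: subprocess.run(
            ["pytest", "-q"], capture_output=True, text=True
        ).stdout,  # assumes pytest is installed in the workspace
    },
}

def tool_schema() -> str:
    # This listing is what the model sees; it defines the action space.
    return json.dumps({name: t["description"] for name, t in TOOLS.items()}, indent=2)

def dispatch(call: dict) -> str:
    # Handle a tool call the model emits,
    # e.g. {"name": "read_file", "args": {"path": "app.py"}}
    tool = TOOLS.get(call["name"])
    if tool is None:
        return f"error: unknown tool {call['name']!r}"
    return tool["handler"](call["args"])
```

Remove run_tests from that registry and the agent can no longer verify its own work; the capability lives in the harness, not the model.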

Prompt engineering is the invisible layer. Each product wraps your request in its own system prompts, instructions, and formatting guidelines before it ever reaches the model. Slight wording changes in these system prompts produce significantly different outputs. This is well-documented in OpenAI's prompt engineering guide, and every serious AI tool invests heavily in tuning these hidden prompts. You are never talking directly to the model. You are talking to the product's interpretation of the model.
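A caricature of that wrapping, with both hidden system prompts invented for illustration:

```python
# Caricature of how two products might wrap the same user request
# before it ever reaches the same model. Both system prompts are invented.
def wrap_for_tool_a(user_request: str, context: str) -> list[dict]:
    system = ("You are a careful pair programmer. Always propose a plan "
              "before editing. Output unified diffs only.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": f"{context}\n\n{user_request}"}]

def wrap_for_tool_b(user_request: str, context: str) -> list[dict]:
    system = ("You are an autonomous coding agent. Make the change directly "
              "and run the tests. Be concise.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": f"Task: {user_request}\n\nRelevant code:\n{context}"}]

# Same request, same model, two different conversations.
```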

Agent loop design governs how the orchestrator decides what to do next, when to stop, and how to recover from errors. A tool that retries failed edits three times will produce better results than one that gives up immediately. A tool that validates its own output by running your linter will catch mistakes that a single-pass tool misses. These orchestration decisions compound across a session.
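In sketch form, here is the difference between a single-pass tool and one that validates and retries. The propose_edit function is a placeholder for a model call, and ruff is an assumed linter:

```python
# Sketch of an agent loop that validates its own output and retries.
# propose_edit() stands in for a model call; ruff is an assumed linter.
import subprocess

def propose_edit(task: str, feedback: str | None) -> str:
    # Placeholder: a real loop would call the model here, passing the
    # linter output back as feedback on the previous attempt.
    return f"# edit for: {task} (feedback: {feedback})\n"

def lint(path: str) -> tuple[bool, str]:
    result = subprocess.run(["ruff", "check", path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout

def run_task(task: str, path: str, max_attempts: int = 3) -> bool:
    feedback = None
    for attempt in range(max_attempts):
        edit = propose_edit(task, feedback)
        with open(path, "a") as f:      # apply the proposed edit
            f.write(edit)
        ok, feedback = lint(path)       # validate before declaring success
        if ok:
            return True                 # stop condition: the linter passes
    return False                        # give up only after max_attempts
```

Swap the linter for a test runner or a type checker and the same loop catches a different class of mistakes. The orchestration, not the model, decides which.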

Finding What Fits

The practical takeaway is straightforward: stop asking "which tool is best" and start asking "which tool fits how I work."

Think about your own patterns. Do you think in terminals or GUIs? Do you prefer to stay inside your editor, or do you like working across tools? Is your codebase a monorepo or a collection of small services? Are you exploring a new idea or converging on a specific implementation? Each of these preferences points toward a different tool.

The Qodo comparison of Claude Code and Cursor found something worth internalizing: "output quality is mostly determined by how clearly and structured you plan and describe the task." The tool matters, but the way you communicate with it matters more. A clear, well-structured prompt in any of these tools will outperform a vague request in the "best" one.

If you have not already, try running the same task through multiple tools. Pick a real task from your current project, something non-trivial but well-defined, and run it through two or three of these options. Pay attention to where each tool excels and where it struggles. That direct comparison will teach you more than any benchmark.

The Toolchain is the Product

The model race will continue. There will be a new release every few months, each one slightly better than the last. But the tool you use to harness that model is the actual product you are buying. The context strategy, the agent loop, the tool integrations, the prompt engineering: that is where the real differentiation lives.

Figuring out what works for you, your projects, and your goals is the real skill in this environment. The model is the engine, but the toolchain is the car. And nobody picks a car based solely on the engine spec sheet.