Gemini 3.5 Flash beats last year's Pro on the work builders ship

Google made Gemini 3.5 Flash generally available at I/O on May 19, and the number that matters isn’t a speed claim. It’s that the cheap, fast model now beats the previous generation’s flagship. On Google’s own model card, 3.5 Flash outscores Gemini 3.1 Pro on agentic coding and tool use — the benchmarks that track what production systems actually do — while costing about 40% less per token. Teams that reach for Pro by default are now paying a premium for headroom their workload may not use.

That is the shift worth understanding, and it is not a one-off. Frontier-grade capability is collapsing into the cheap-and-fast tier inside a single release cycle, which turns “which model” back into an economics question — one Google is happy to force, because it sells compute either way.

Flash used to be the compromise tier

The Flash tier has always been the budget option you accepted a quality hit for: fast and cheap, fine for classification and summarization, not where you sent the hard work. Gemini 3.5 Flash breaks that framing. Against Gemini 3.1 Pro on Google’s model card, it wins the benchmarks that map to deployed software:

Terminal-Bench 2.1 — 76.2% vs 70.3%. Agentic coding in a real terminal: editing files, running commands, fixing what breaks.
MCP Atlas — 83.6% vs 78.2%. Multi-step tool use, the core loop of any agent that calls APIs and chains results.
SWE-Bench Pro (public) — 55.1% vs 54.2%. Resolving real software-engineering issues end to end.
CharXiv 84.2% vs 83.3%, MMMU-Pro 83.6% vs 80.5% — chart and document reasoning, the multimodal work that shows up in real document pipelines.

Google also puts output throughput at roughly 4x comparable frontier models. Taken together, the tier you used to route the easy work to now handles the agentic and coding work you used to protect.

The cost math is the real story

List price is $1.50 per million input tokens and $9 per million output, against $2.50 / $15 for Gemini 3.1 Pro — about 40% cheaper on both sides. But the headline rate undersells it, because the workloads where Flash now competes are agent loops, and agent loops pay on every step. A coding agent that makes dozens of tool calls per task multiplies both the token cost and the latency of each one. A model that is ~4x faster and ~40% cheaper compounds across that loop in a way a single-shot price comparison hides.

The other lever is thinking levels. 3.5 Flash exposes adjustable reasoning effort — minimal, low, medium, high — defaulting to medium, and it allocates more compute to harder problems on its own. Used deliberately, that is a cost knob, not just a quality slider: route routine calls to minimal thinking and reserve the expensive reasoning for the requests that earn it. For high-volume production traffic, that per-request control is often worth more than the headline discount.

Where 3.1 Pro still wins

The win is real but narrow, and the launch framing won’t tell you where it ends. The same model card shows last year’s Pro still ahead on the hardest reasoning: Humanity’s Last Exam, 44.4% vs 40.2%, and ARC-AGI-2, 77.1% vs 72.1%. If your product leans on novel, frontier-level reasoning — deep research, hard math, genuinely unfamiliar problems — Flash is a step down, not a free upgrade.

Long context deserves the same skepticism. On MRCR v2 at the full 1M-token window, both models score around 26%. The million-token window is capacity, not recall: it will accept the tokens, but it will not reliably find the one fact you need buried in the middle. For retrieval-critical work over long documents, that number is the one to internalize — keep a real retrieval layer and test hard before you trust the window.

How to actually use it

The practical read is specific, not “switch everything.” Move coding agents, tool-use and agentic workloads, high-volume routine inference, and multimodal document work to 3.5 Flash, and benchmark them on your own tasks before you commit — the model card is Google’s workload, not yours. Hold the line on Pro, or a dedicated reasoning model, for frontier reasoning and for long-context jobs where retrieval accuracy is the product. And wire the thinking levels into your routing from day one, rather than leaving everything at the default.

It is available where you already build: the Gemini app, AI Mode in Search, Google Antigravity, the Gemini API in AI Studio and Android Studio, and Gemini Enterprise. You can see how it lands against the rest of the field on our model leaderboard, and where it fits Google’s wider strategy on our Google company profile.

That strategy is the quiet part. Google built a model cheap and fast enough to run across products used by billions, and good enough to take work away from its own flagship — which also squeezes every rival selling a pricier mid-tier model. For builders, the lesson is operational: re-run your model-selection benchmark every quarter. The price of “good enough” is falling fast, and this cycle it turned last year’s flagship into a budget line item.