Google Just Made Every Other AI Lab Look Slow
Published February 19, 2026
Three months ago Gemini 3 Pro was a perfectly fine model that nobody was excited about. Solid benchmarks. Nice features. The kind of product that gets described as “competitive” in press releases, which is corporate for “we are losing.”
Today Google dropped Gemini 3.1 Pro and the conversation is different.
77.1 percent on ARC-AGI-2. That is more than double what Gemini 3 Pro scored. Not a 10 percent improvement. Not a “modest gain.” They doubled it. In three months. Whatever DeepMind was doing between November and now, it worked, and it probably involved someone finally telling a project manager to stop scheduling alignment meetings and let the researchers cook.
The Numbers That Matter
Gemini 3.1 Pro leads on 13 of the 16 benchmarks Google reported. That is not cherry-picking. That is showing up to a 16-event track meet and winning 13 gold medals.
The highlights:
- ARC-AGI-2: 77.1 percent. Claude Opus 4.6 sits at 68.8. This is the test that measures whether a model can reason about patterns it has literally never seen before. Not memorization. Not vibes. Actual thinking. Google ate this one.
- GPQA Diamond: 94.3 percent. Expert-level science questions. Opus 4.6 gets 91.3. GPT-5.2 gets 92.4. Google wins.
- SWE-Bench Verified: 80.6 percent. Agentic coding. The test where the model has to actually fix real bugs in real codebases. First place.
- APEX-Agents: 33.5 percent, nearly double their own previous score of 18.4. Still a low number, but professional task completion is hard and nobody else is publishing better.
The 1 million token context window stays. The output limit jumped to 65,000 tokens. So you can feed it an entire codebase and it will write you back a novel about why your architecture is wrong.
Where Google Still Gets Cooked
Let’s not pretend this is a clean sweep.
Claude still owns GDPval expert tasks by a mile — 1633 Elo versus Google’s 1317. That is not close. On “Humanity’s Last Exam” with tool use, Opus 4.6 edges Gemini 53.1 to 51.4. Small margin, but it is Claude’s win.
And coding? OpenAI’s GPT-5.3-Codex still holds Terminal-Bench 2.0 at 77.3 percent and SWE-Bench Pro at 56.8 versus Gemini’s 54.2. If your life is writing code all day, OpenAI has not been dethroned.
So the picture is: Google dominates reasoning and general intelligence. Anthropic holds expert-level tasks and careful analysis. OpenAI holds the hardest coding benchmarks. Everybody has a thing. Nobody has everything.
Why This Actually Matters
The interesting part is not the benchmarks. It is the velocity.
Google went from “also-ran” to “leading 13 of 16 benchmarks” in a single point release. Not a whole number jump. A 0.1 increment. They called it 3.1, not 4, which is either admirable restraint or the most understated flex in AI history.
This should worry everyone at Anthropic and OpenAI. Not because Gemini 3.1 Pro is unbeatable — it is clearly not — but because of what it says about the rate of improvement. If Google can double their reasoning benchmark in one quarter, what does the next quarter look like? And the one after that?
The model race has been framed as OpenAI vs Anthropic for the last year, with Google as the well-funded underperformer everyone politely nods at during conferences. That framing is dead as of today. Google has the compute. Google has DeepMind. Google has been quietly hiring while Sam Altman does podcast tours. And now Google has the benchmarks to back it up.
The .1 Flex
I want to talk about the versioning for a second because it says a lot about confidence.
Anthropic calls their models Opus 4.6, Sonnet 4.5. Big numbers. OpenAI is on GPT-5.2, GPT-5.3-Codex. Also big numbers. Google looked at their model that beats both of them on 13 benchmarks and called it 3.1. Not even 3.5. Not 4. Three point one.
Either Google’s versioning team is completely disconnected from their benchmarking team, or this is the most disrespectful thing anyone has done in AI since DeepSeek published their training costs. “Yeah we beat your flagship model. We consider it a minor update.”
What You Should Do With This Information
If you are building on top of LLM APIs, the correct response to today is relief. Three companies are now genuinely competitive at the frontier. That means pricing pressure. That means you are not locked into one provider. That means if Claude has an outage, you switch to Gemini. If OpenAI jacks up GPT-5 rates, you switch to Gemini. Competition is the only thing protecting you from monopoly pricing, and today Google reminded everyone it is actually in this race.
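The "not locked into one provider" point is worth making concrete. A minimal sketch of the fallback pattern, assuming nothing about any real SDK: the `call_gemini` / `call_claude` / `call_gpt` functions below are hypothetical placeholders you would replace with your actual client calls.

```python
from typing import Callable

def call_gemini(prompt: str) -> str:
    # Hypothetical placeholder; simulate an outage for the demo.
    raise RuntimeError("simulated outage")

def call_claude(prompt: str) -> str:
    # Hypothetical placeholder for a real provider client.
    return f"claude: {prompt}"

def call_gpt(prompt: str) -> str:
    # Hypothetical placeholder for a real provider client.
    return f"gpt: {prompt}"

def complete(prompt: str, providers: list[Callable[[str], str]]) -> str:
    """Try each provider in order; fall through on any error."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as err:  # outage, rate limit, timeout, etc.
            last_error = err
    raise RuntimeError("all providers failed") from last_error

# Usage: prefer Gemini, fall back to Claude, then GPT.
print(complete("hello", [call_gemini, call_claude, call_gpt]))
```

The ordering of the list is your pricing and quality policy; swapping providers is a one-line change, which is exactly the leverage competition gives you.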
If you are an AI researcher, start paying attention to what DeepMind is publishing. They have been quieter than OpenAI and Anthropic for the last year and apparently they were using that quiet time to actually do research instead of tweeting about it.
If you are Sam Altman, you should probably cancel the next podcast appearance and go talk to your engineering team. Just a thought.
The Bottom Line
Gemini 3.1 Pro is not the best model at everything. But it is the best model at most things, and it got there faster than anyone expected. Google’s strategy of “shut up and ship” just produced the single biggest benchmark jump any lab has shown in 2026.
The three-way race is real now. Anyone telling you one company has this locked up is selling you something. Probably a subscription.