Google Gemini 3.1 Pro Review and Analysis
Gemini 3.1 Pro is Google’s most advanced reasoning model yet, built for deep agentic workflows, large-scale code generation, and multimodal tasks. With 65K output tokens and major benchmark gains, it shifts AI from conversation to autonomous execution.
On February 19, 2026, Google released its first-ever ".1" increment model, a deliberate departure from its old 0.5-step release cycle. This is not a half-step forward. It is a focused, high-precision reasoning upgrade built for a single purpose: tasks where a simple answer is not enough.
Gemini 3 Pro was already good at answering questions. Gemini 3.1 Pro is built to do things. It writes multi-module applications in a single turn. It configures live aerospace dashboards from an API.
It watches a YouTube video and analyzes it, no upload required. It is the shift from a conversational assistant to an autonomous agent engine.
If you build with AI, this release changes what is possible. Here is everything you need to know.
What Is New in Gemini 3.1 Pro
The upgrade is not a single breakthrough. It is a set of tightly connected improvements that, together, make the model far more reliable for real-world, agentic work.
1. Massive Output Capacity
The model retains its 1 million token input context window. The big change is on the output side: Gemini 3.1 Pro now supports up to 65,000 output tokens in a single response. That means entire technical manuals, full codebases with multiple modules, or exhaustive research reports can be generated without splitting into multiple calls.
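To make the output ceiling concrete, here is a minimal sketch of a generateContent request body that asks for the full 65,000-token budget. The field names follow the public Gemini REST style, but the exact model identifier and schema are assumptions, not confirmed by this release note.

```python
import json

# Hypothetical request body for a single long-form generation.
# "generationConfig" / "maxOutputTokens" follow the public REST naming;
# treat the model name as a placeholder.
request_body = {
    "model": "gemini-3.1-pro-preview",
    "contents": [
        {"role": "user", "parts": [{"text": "Generate a complete multi-module web app."}]}
    ],
    "generationConfig": {
        # Up to 65,000 output tokens in a single response,
        # so the result does not need to be split across calls.
        "maxOutputTokens": 65000,
    },
}

print(json.dumps(request_body["generationConfig"]))
```

In practice you would send this body to the API over HTTPS; the point is that one call can now return what previously took several stitched-together responses.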
2. Thinking Levels
Google introduced four configurable thinking modes: Minimal, Low, Medium, and High. This gives developers direct control over cost and latency.
The new Medium level handles moderately complex tasks without burning through tokens on full deep reasoning. High remains the default for the hardest problems. It is a practical addition for teams optimizing production pipelines.
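A small helper makes the cost/latency trade-off explicit. The four level names come from the release; the config field names here are assumptions modeled on the public thinking-config style.

```python
# The four configurable thinking modes named in the release.
THINKING_LEVELS = ("minimal", "low", "medium", "high")

def thinking_config(level: str) -> dict:
    """Return a hypothetical generationConfig fragment selecting a thinking level.

    'medium' suits moderately complex production tasks; 'high' (the default)
    is reserved for the hardest problems.
    """
    if level not in THINKING_LEVELS:
        raise ValueError(f"unknown thinking level: {level}")
    return {"thinkingConfig": {"thinkingLevel": level}}

print(thinking_config("medium"))
```

Pinning the level per request, rather than globally, lets a pipeline pay for deep reasoning only on the calls that need it.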
3. Agentic Reliability with Custom Tools
A new specialized endpoint, gemini-3.1-pro-preview-customtools, was built specifically to reduce hallucinations when the model interacts with local file systems.
It prioritizes tools like bash commands, view_file, and search_code. For developers building coding agents or file-management workflows, this is one of the most important additions in the release.
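As a sketch, here is how the three prioritized tools might be declared when calling the customtools endpoint. The declaration shape mirrors the public functionDeclarations format; the parameter schemas are illustrative assumptions.

```python
# Hypothetical tool declarations for the customtools endpoint.
# Names (bash, view_file, search_code) come from the release;
# the parameter shapes are assumptions for illustration.
tools = [{
    "functionDeclarations": [
        {
            "name": "bash",
            "description": "Run a shell command and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
        {
            "name": "view_file",
            "description": "Read a file from the local workspace.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
        {
            "name": "search_code",
            "description": "Search the codebase for a pattern.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    ]
}]

request = {"model": "gemini-3.1-pro-preview-customtools", "tools": tools}
```

The specialized endpoint is chosen by model name, so an existing agent loop can adopt it by swapping one identifier.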
4. Thought Signatures
Thought signatures are encrypted tokens that preserve the model's internal reasoning state during multi-turn tool calls. When the model pauses to execute an external function and then resumes, it does not lose context.
It picks up exactly where it left off. This directly solves one of the core reliability problems in long agentic workflows.
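A toy round trip shows the mechanic: the model's function-call turn carries an opaque signature, and the client keeps that signed turn in history unchanged when it returns the tool result. Field names and the signature placeholder are assumptions; the real token is encrypted and must be echoed back verbatim.

```python
# Toy illustration of a thought-signature round trip in a multi-turn tool call.
history = []

# 1. Model turn: a function call carrying an encrypted thought signature
#    (placeholder value; the real token is opaque to the client).
model_turn = {
    "role": "model",
    "parts": [{
        "functionCall": {"name": "view_file", "args": {"path": "app.py"}},
        "thoughtSignature": "<opaque-encrypted-token>",
    }],
}
history.append(model_turn)

# 2. Client turn: append the tool output while leaving the signed model turn
#    intact in history, so the model resumes with its reasoning state restored.
tool_turn = {
    "role": "user",
    "parts": [{
        "functionResponse": {
            "name": "view_file",
            "response": {"content": "print('hello')"},
        }
    }],
}
history.append(tool_turn)
```

The key discipline is never stripping or rewriting the signed part between turns; the signature is what lets the model pick up exactly where it left off.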
5. Media and File Upgrades
The API file upload limit has grown from 20 MB to 100 MB, a fivefold increase. More importantly, the model now accepts direct YouTube URLs as media input.
You can feed it a video link and it will watch, process, and reason about the content without you uploading anything. This opens up a new class of multimodal tasks around video analysis.
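A video link slots into the request as an ordinary media part. The fileData/fileUri shape follows the public Gemini REST format for media input; the video ID below is a placeholder, not a real URL.

```python
# Hypothetical sketch: passing a YouTube URL as direct media input.
# The video ID is a placeholder.
video_part = {"fileData": {"fileUri": "https://www.youtube.com/watch?v=VIDEO_ID"}}

request_body = {
    "contents": [{
        "role": "user",
        "parts": [
            video_part,
            {"text": "Summarize the key arguments made in this video."},
        ],
    }]
}
```

No upload step, no transcoding: the model fetches and reasons over the video itself.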
Gemini 3.1 Pro Benchmark Scores
Numbers can be gamed. These cannot. Google's strongest claim for 3.1 Pro is its performance on ARC-AGI-2, a benchmark that tests whether a model can solve logic problems it has never seen before.
Gemini 3.1 Pro scored 77.1%. Its predecessor, Gemini 3 Pro, scored 31.1%. That is more than double the reasoning performance in a single release.
| Benchmark | Gemini 3.1 Pro Thinking (High) | Gemini 3 Pro Thinking (High) | Sonnet 4.6 Thinking (Max) | Opus 4.6 Thinking (Max) | GPT-5.2 Thinking (xhigh) | GPT-5.3-Codex Thinking (xhigh) |
|---|---|---|---|---|---|---|
| Humanity’s Last Exam<br>Academic reasoning (full set, text + MM)<br>No tools / Search (blocklist) + Code | 44.4% / 51.4% | 37.5% / 45.8% | 33.2% / 49.0% | 40.0% / 53.1% | 34.5% / 45.5% | — |
| ARC-AGI-2<br>Abstract reasoning puzzles<br>ARC Prize Verified | 77.1% | 31.1% | 58.3% | 68.8% | 52.9% | — |
| GPQA Diamond<br>Scientific knowledge<br>No tools | 94.3% | 91.9% | 89.9% | 91.3% | 92.4% | — |
| Terminal-Bench 2.0<br>Agentic terminal coding<br>Terminus-2 harness / best self-reported harness | 68.5% / — | 56.9% / — | 59.1% / — | 65.4% / — | 54.0% / 62.2% (Codex) | 64.7% / 77.3% (Codex) |
| SWE-Bench Verified<br>Single attempt | 80.6% | 76.2% | 79.6% | 80.8% | 80.0% | — |
| SWE-Bench Pro (Public)<br>Single attempt | 54.2% | 43.3% | — | — | 55.6% | 56.8% |
| LiveCodeBench Pro<br>Elo | 2887 | 2439 | — | — | 2393 | — |
| SciCode<br>Scientific research coding | 59% | 56% | 47% | 52% | 52% | — |
| APEX-Agents<br>Long-horizon professional tasks | 33.5% | 18.4% | — | 29.8% | 23.0% | — |
| GDPval-AA Elo<br>Expert tasks | 1317 | 1195 | 1633 | 1606 | 1462 | — |
| t2-bench<br>Retail / Telecom | 90.8% / 99.3% | 85.3% / 98.0% | 91.7% / 97.9% | 91.9% / 99.3% | 82.0% / 98.7% | — |
| MCP Atlas<br>Multi-step workflows using MCP | 69.2% | 54.1% | 61.3% | 59.5% | 60.6% | — |
| BrowseComp<br>Search + Python + Browse | 85.9% | 59.2% | 74.7% | 84.0% | 65.8% | — |
| MMMU Pro<br>Multimodal understanding and reasoning | 80.5% | 81.0% | 74.5% | 73.9% | 79.5% | — |
| MMMLU<br>Multilingual Q&A | 92.6% | 91.8% | 89.3% | 91.1% | 89.6% | — |
| MRCR v2 (8-needle)<br>128k (average) / 1M (pointwise) | 84.9% / 26.3% | 77.0% / 26.3% | 84.9% / Not supported | 84.0% / Not supported | 83.8% / Not supported | — |

Methodology: deepmind.google/models/evals-methodology/gemini-3-1-pro