This is Hamamoto from TIMEWELL.
In 2026, Gemini 2.5 Pro Deep Think reached the top of the reasoning AI benchmark rankings. The model scored 34.8% on Humanity's Last Exam — significantly ahead of Grok 4 (25.4%) and o3 (20.3%) — and achieved gold medal-level performance at the International Mathematical Olympiad 2025. This article covers the architecture behind these results, the benchmark details, a comparison with competing models, and where Deep Think fits in practical business workflows.
Current Status at a Glance
| Item | Details |
|---|---|
| Model name | Gemini 2.5 Pro Deep Think |
| General availability | August 1, 2025 |
| Architecture | Sparse Mixture-of-Experts Transformer |
| Max input tokens | 1 million |
| Max output tokens | 192,000 |
| Humanity's Last Exam | 34.8% (no tools) |
| IMO 2025 | Gold medal level (research version) |
| Pricing | Google Ultra $250/month |
| Key features | Parallel thinking, multi-agent |
Deep Think: What Parallel Thinking Actually Means
How Standard AI Reasons vs. Deep Think
Standard AI models reason sequentially — following a single chain of thought from start to finish, evaluating one approach at a time.
Deep Think operates differently:
- Generates multiple candidate approaches simultaneously
- Evaluates different reasoning paths in parallel
- Refines and merges approaches over time
- Selects the optimal final answer from the parallel search
The analogy: a standard model is one person thinking through a problem step by step. Deep Think is multiple people working the problem simultaneously and comparing notes.
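The four steps above amount to a best-of-n search over reasoning paths. As a purely conceptual sketch (this is not Google's implementation — `generate_candidate` and `score` are hypothetical stand-ins for the model's internal sampling and self-evaluation):

```python
# Conceptual best-of-n sketch of "parallel thinking".
# generate_candidate / score are hypothetical placeholders, not a real API.

STRATEGIES = ["direct computation", "proof by induction", "case analysis"]

def generate_candidate(problem: str, strategy: str) -> str:
    """Stand-in for sampling one reasoning path."""
    return f"{strategy}: solve {problem}"

def score(candidate: str) -> float:
    """Stand-in for the model's self-evaluation of a path."""
    return 1.0 / len(candidate)  # placeholder heuristic only

def parallel_think(problem: str) -> str:
    # 1. Generate multiple candidate approaches
    candidates = [generate_candidate(problem, s) for s in STRATEGIES]
    # 2./3. Evaluate every path and keep the best-scoring answer
    return max(candidates, key=score)

print(parallel_think("sum of first n cubes"))
```

The key idea is that quality comes from breadth of search plus a selection step, not from a single longer chain of thought.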
Multi-Agent Architecture
Gemini 2.5 Pro Deep Think is the first multi-agent model Google has released publicly.
How multi-agent differs from single-agent:
- A single question spawns multiple AI agent instances
- Each agent works on the problem independently and in parallel
- Results are compared and consolidated
- Higher computational cost than single-agent reasoning
- Higher output quality on complex problems
Best suited for:
- Iterative design and development tasks
- Scientific and mathematical research
- Complex competitive programming problems
- Business analysis requiring multiple analytical angles
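The fan-out/consolidate pattern described above can be sketched with plain Python concurrency. The agent and consolidation functions here are hypothetical placeholders — Gemini's actual orchestration has not been published:

```python
# Hypothetical sketch of a multi-agent fan-out: N agents answer the same
# question independently, then one answer is selected during consolidation.
from concurrent.futures import ThreadPoolExecutor

def run_agent(agent_id: int, question: str) -> dict:
    """Stand-in for one independent agent instance answering the question."""
    answer = f"agent-{agent_id} answer to: {question}"
    confidence = 0.5 + 0.1 * agent_id  # placeholder self-reported confidence
    return {"agent": agent_id, "answer": answer, "confidence": confidence}

def multi_agent(question: str, n_agents: int = 3) -> str:
    # A single question spawns multiple agent instances, run in parallel
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        results = list(pool.map(lambda i: run_agent(i, question), range(n_agents)))
    # Consolidate: here, simply keep the highest-confidence answer
    best = max(results, key=lambda r: r["confidence"])
    return best["answer"]

print(multi_agent("design a rate limiter"))
```

This also makes the cost trade-off concrete: n_agents times the compute of a single-agent query, in exchange for multiple independent attempts at the problem.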

Benchmark Results
Humanity's Last Exam (HLE)
HLE is a benchmark covering difficult problems across mathematics, humanities, and sciences — designed to test the outer limits of current AI reasoning capability.
| Model | Score (no tools) |
|---|---|
| Gemini 2.5 Pro Deep Think | 34.8% |
| xAI Grok 4 | 25.4% |
| OpenAI o3 | 20.3% |
Google describes this result as the current state of the art on the benchmark.
International Mathematical Olympiad (IMO) 2025
- Research version: Gold medal level
- Generally available version: Bronze level (multi-hour reasoning processes removed for practical responsiveness)
The generally available model trades the most computationally intensive reasoning steps for more practical response times. Users requiring maximum mathematical reasoning can access the research-tier capability through the API.
Additional Benchmarks
| Benchmark | Result |
|---|---|
| 2025 USAMO | Highest score (mathematics) |
| LiveCodeBench 6 | Highest score (competitive programming) |
| MMMU | 84.0% (multimodal reasoning) |
Technical Specifications
| Item | Specification |
|---|---|
| Architecture | Sparse Mixture-of-Experts Transformer |
| Input modalities | Text, images, audio |
| Max input tokens | 1,000,000 |
| Max output tokens | 192,000 |
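The "Sparse Mixture-of-Experts" entry refers to a layer design in which a router activates only a few expert sub-networks per token, so most parameters sit idle on any given forward pass. A toy top-k routing step — all dimensions and expert counts below are made up, since Google has not published Deep Think's internals:

```python
# Toy sparse MoE routing step. Sizes are illustrative only; Deep Think's
# actual expert count and dimensions are not public.
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K, D = 8, 2, 16  # made-up sizes for illustration
router_w = rng.normal(size=(D, N_EXPERTS))
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w                 # router score per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over the selected experts
    # Only TOP_K of N_EXPERTS experts actually run: that is the "sparse" part
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=D)
out = moe_layer(token)
print(out.shape)  # (16,)
```

Sparsity is why a model with a very large total parameter count can still serve requests at practical latency: compute per token scales with the active experts, not the full parameter set.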
Safety testing results:
- Content safety: improved over Gemini 2.5 Pro
- Tone objectivity: improved
- Note: slightly higher tendency to refuse benign requests; prompt adjustment may be needed
Pricing and Access
| Plan | Price | Deep Think Access |
|---|---|---|
| Google Ultra | $250/month | Available |
| Standard Gemini plans | Free tier and above | Limited |
Access steps:
- Open Gemini app (Web, Android, iOS)
- Select "Gemini 2.5 Pro" from the model dropdown
- Toggle "Deep Think" on in the prompt bar
- Note: daily prompt limits apply
API access: Available through Vertex AI and Google AI Studio. Higher computational cost means API usage is priced accordingly. Best used for complex tasks where the quality improvement justifies the cost.
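As a rough sketch of what an API call looks like with the google-genai Python SDK: the model identifier below is an assumption (verify the actual Deep Think ID in Google AI Studio or Vertex AI), and the network call itself is commented out because it requires an API key:

```python
# Hedged sketch: request shape for calling Deep Think via the google-genai SDK.
# The model ID "gemini-2.5-deep-think" is an ASSUMPTION -- check the real
# identifier in Google AI Studio / Vertex AI before use.

def build_request(prompt: str) -> dict:
    return {
        "model": "gemini-2.5-deep-think",  # assumed ID, not confirmed
        "contents": prompt,
    }

req = build_request("Prove that sqrt(2) is irrational.")

# Actual call (requires `pip install google-genai` and an API key):
# from google import genai
# client = genai.Client(api_key="...")
# resp = client.models.generate_content(**req)
# print(resp.text)

print(req["model"])
```

Given the pricing note above, a sensible pattern is to route only the hardest queries to Deep Think and keep routine traffic on a cheaper model.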
Evolution: Then vs. Now
| Item | Feb 2024 (Gemini 1.0 Ultra) | Jan 2026 (Gemini 2.5 Pro Deep Think) |
|---|---|---|
| Reasoning approach | Single-pass | Parallel thinking + multi-agent |
| HLE score | Not measured | 34.8% (top score) |
| IMO | Not entered | Gold medal level |
| Max input tokens | 128K | 1M |
| Max output tokens | 8K | 192K |
| Multimodal | Limited | Text, images, audio |
| Price | Gemini Advanced $20/month | Ultra $250/month |
Competitor Comparison
vs. OpenAI o3
| Item | Gemini 2.5 Pro Deep Think | OpenAI o3 |
|---|---|---|
| HLE | 34.8% | 20.3% |
| IMO | Gold medal level | Not disclosed |
| Reasoning approach | Multi-agent | Single-agent |
| Max input tokens | 1M | 200K |
| Max output tokens | 192K | 100K |
| Pricing | Ultra $250/month | Pro $200/month |
vs. Claude Opus 4.5
| Item | Gemini 2.5 Pro Deep Think | Claude Opus 4.5 |
|---|---|---|
| Strength | Math and scientific reasoning | Long-running tasks, code generation |
| Architecture | Multi-agent | Extended thinking |
| Max input tokens | 1M | 1M |
| Multimodal | Text, images, audio | Text, images |
| Ecosystem | Google Workspace | Claude Code |
When to Use Each
Gemini 2.5 Pro Deep Think is the better choice for:
- Complex mathematical and scientific problems
- Tasks requiring multi-angle analysis
- Competitive programming-level code generation
- Google Workspace integration workflows
Other models may be better for:
- Long-running autonomous tasks (Claude Opus 4.5)
- General-purpose conversation (GPT-5.2)
- Cost-efficiency priority (Gemini 2.5 Flash)
Google Workspace Integration
Gemini 2.5 Pro integrates across the Google Workspace suite:
- Gmail: AI-assisted email drafting and replies
- Google Docs: Document summarization, generation, editing
- Google Sheets: Data analysis and formula generation
- Google Slides: Automatic presentation generation
- Google Meet: Meeting summarization and action item extraction
Business Use Cases for Deep Think
- Complex analysis reports: Multi-angle financial data analysis
- Technical design: Evaluating multiple architectural options
- Strategic planning: Competitive analysis and strategy option evaluation
- R&D: Scientific hypothesis validation
Adoption Considerations
Advantages
- Highest-tier reasoning: top scores on HLE, IMO, LiveCodeBench
- Multi-agent flexibility: multiple analytical perspectives in a single query
- Google ecosystem integration: seamless Workspace and NotebookLM connectivity
- Massive context window: 1M input tokens for complex, long-context tasks
Limitations
- Cost: Ultra at $250/month is higher than competing plans
- Response speed: Deep Think takes longer than standard generation; unsuitable for real-time applications
- Over-refusal: Slightly elevated tendency to refuse benign requests; prompt engineering may be required
Summary
Gemini 2.5 Pro Deep Think leads the 2026 reasoning AI benchmark rankings:
- Humanity's Last Exam: 34.8% — ahead of Grok 4 (25.4%) and o3 (20.3%)
- IMO 2025: gold medal level (research version)
- LiveCodeBench 6 and 2025 USAMO: top scores
- Architecture: parallel thinking across multiple simultaneous reasoning paths
- Multi-agent: multiple AI agents working the same problem concurrently
- Context: 1M input tokens, 192K output tokens
- Available on Google Ultra at $250/month
- Deep Google Workspace integration
From Gemini 1.0 Ultra in February 2024 to the 2026 release, Google has moved to the front of the reasoning AI competition through architectural innovation. For tasks requiring complex problem analysis across multiple angles — mathematics, science, competitive programming — Gemini 2.5 Pro Deep Think is currently one of the strongest available options.
