Google has introduced Gemini 3 Deep Think mode, an enhanced version of its Gemini 3 Pro model aimed at users who need stronger reasoning performance for complex math, science, and logic tasks. The launch follows a steady string of model updates from the company, positioning Deep Think as a higher-compute option within the broader Gemini 3 lineup. It is now available to Google AI Ultra subscribers through the Gemini app.
According to Google’s benchmarks, Deep Think posts improved results on several well-known reasoning tests. On Humanity’s Last Exam, a difficult evaluation used in AI research circles, it reached 41 percent. It also scored 45.1 percent on ARC-AGI-2 when allowed to execute code, and 93.8 percent on GPQA Diamond, a benchmark of graduate-level scientific knowledge. These numbers place the mode near the top of publicly reported results, though benchmarks capture only part of how such systems perform in practical use.
Google attributes the gains to advanced parallel reasoning, which lets the model explore multiple hypotheses at once rather than follow a single linear chain of thought. The company says related variants of the system earned gold-level performance at the International Mathematical Olympiad and the International Collegiate Programming Contest World Finals, completing multi-hour exams without external tools and generating full natural-language proofs. These results point to the growing overlap between competitive academic problem-solving and the direction of large-scale AI research, even as real-world applicability will depend on how consistently the model performs outside controlled evaluation settings.
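Google has not published implementation details, so the following is only a rough sketch of the general pattern the company describes: generate several candidate reasoning chains concurrently, score each one, and keep the best. It is not Deep Think's actual mechanism, and the functions propose_hypothesis and score_hypothesis are hypothetical stand-ins for a model's candidate generation and self-evaluation steps.

```python
# Illustrative sketch of "parallel reasoning" as a best-of-N search.
# NOT Google's implementation; all function names here are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import random


def propose_hypothesis(problem: str, branch: int) -> str:
    """Hypothetical stand-in for generating one candidate reasoning chain."""
    return f"candidate chain {branch} for: {problem}"


def score_hypothesis(hypothesis: str) -> float:
    """Toy scorer: a real system would use a learned verifier or the
    model's own confidence estimate, not a seeded random number."""
    return random.Random(hypothesis).random()


def parallel_reasoning(problem: str, n_branches: int = 4) -> str:
    """Explore several candidate chains concurrently; keep the best-scoring one."""
    with ThreadPoolExecutor(max_workers=n_branches) as pool:
        candidates = list(
            pool.map(lambda b: propose_hypothesis(problem, b), range(n_branches))
        )
    return max(candidates, key=score_hypothesis)


if __name__ == "__main__":
    print(parallel_reasoning("prove the inequality holds for all n"))
```

The selection logic is the cheap part of such a scheme; in practice, the cost lies in generating and verifying each branch with the model itself, which is why this mode is offered as a higher-compute tier.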
Deep Think can be accessed in the Gemini app by selecting the Gemini 3 Pro model and toggling the new option in the prompt bar. Its release also raises the stakes in the competitive AI landscape. Earlier this year, OpenAI said an experimental reasoning model of its own had reached similar gold-medal capability, though that system has not yet been made available to the public. With Google choosing to release a version of its high-performing model, the open questions are how quickly competitors will respond and whether peak benchmark scores will translate into dependable day-to-day use for subscribers.
