Quick Read
- Google’s Gemini 3 Deep Think has set new SOTA records, outperforming Claude Opus 4.6 in reasoning and competitive programming.
- MiniMax’s M2.5 and M2.5 Lightning models offer comparable or superior performance to Claude Opus 4.6 at significantly lower costs.
- Gemini 3 Deep Think achieved 84.6% on ARC-AGI-2 and a 3455 Elo score on Codeforces, surpassing Opus 4.6’s 68.8% on ARC-AGI-2.
- MiniMax M2.5 matches Claude Opus 4.6’s score on SWE-Bench Verified at as little as 1/20th of the price.
- These developments signal a shift towards more affordable and specialized AI agents for enterprise applications.
The landscape of artificial intelligence is shifting, with established high-performance models like Claude Opus 4.6 now facing formidable challengers on both capability and cost-efficiency. Recent announcements from Google and Chinese AI startup MiniMax mark a critical turn: their latest models, Gemini 3 Deep Think and M2.5, are not only setting state-of-the-art benchmarks but also drastically reducing the operational cost of advanced AI, directly threatening the market position previously held by models like Claude Opus 4.6.
This intensified competition, particularly evident in the past few months, highlights a maturing AI market where raw performance is increasingly being paired with economic viability and specialized application. The focus is shifting from simply achieving intelligence to making that intelligence accessible and affordable for widespread enterprise deployment, moving AI from sophisticated chatbot functionality to a powerful, cost-effective workforce.
Google’s Gemini 3 Deep Think Outperforms Claude Opus 4.6
Google has made a significant stride in the AI race with the major upgrade of its Gemini 3 Deep Think model, which has shattered several state-of-the-art (SOTA) records, positioning it as a direct challenger to existing top-tier models like Claude Opus 4.6. Launched amid what Google officials described as ‘fierce attacks’ from rivals, the new Deep Think is a reasoning mode specifically developed to push the frontiers of intelligence in science, research, and engineering.
On Codeforces, a competitive programming benchmark, Gemini 3 Deep Think achieved an astonishing 3455 Elo score, placing it among the top eight programmers globally. This score significantly surpasses the previous highest of 2727 Elo, achieved a year prior. Its capabilities extend to complex reasoning as well, setting a record of 84.6% on ARC-AGI-2, a leading benchmark for AI reasoning ability. This performance markedly outshines Claude Opus 4.6, which scored 68.8% on the same test, and even the first-generation Deep Think’s 45.1% just three months prior, as reported by 36kr.com. Gemini 3 Deep Think also refreshed the SOTA on Humanity’s Last Exam (HLE) with a score of 48.4% and achieved gold-medal-level results in the 2025 International Olympiads in Mathematics, Physics, and Chemistry.
Beyond benchmarks, Deep Think’s practical applications are already being demonstrated. Rutgers University mathematician Lisa Carbone utilized the model to review a specialized mathematical paper, where it successfully identified a subtle logical flaw missed in previous manual peer reviews. At Duke University’s Wang An Laboratory, Gemini 3 Deep Think optimized crystal growth methods, designing a process capable of growing thin films over 100 microns thick—a precision difficult to achieve with prior methods. Researchers from the DeepSeek multimodal team also lauded its ability to handle long-tail scientific tasks, accurately calculating molecular formulas from complex structural images.
Remarkably, even as performance has soared, reasoning cost has dropped sharply. The cost per task for the first-generation Deep Think was $77.16; the new upgrade reduces it by 82%, to just $13.62 per task, making advanced reasoning far more economically viable.
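As a quick sanity check, the two per-task dollar figures reported above do imply a reduction of roughly 82%:

```python
# Cost-per-task figures for Deep Think, as reported in the article.
first_gen_cost = 77.16  # USD per task, first-generation Deep Think
upgraded_cost = 13.62   # USD per task, upgraded Deep Think

reduction = 1 - upgraded_cost / first_gen_cost
print(f"Cost reduction: {reduction:.1%}")  # → Cost reduction: 82.3%
```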
MiniMax M2.5: Cost-Effective Rival to Claude Opus 4.6
Adding another layer to the competitive landscape, Chinese AI startup MiniMax has unveiled its M2.5 and M2.5 Lightning models, positioned as direct rivals to Claude Opus 4.6 on the basis of cost-efficiency and specialized agentic capabilities. MiniMax claims M2.5 delivers near state-of-the-art performance at as little as 1/20th the cost of Claude Opus 4.6, according to VentureBeat.
The M2.5 model, with only 10 billion activated parameters under a Mixture of Experts (MoE) architecture, demonstrates flagship-level performance. On SWE-Bench Verified, it scored 80.2%, matching Claude Opus 4.6’s score, and it achieved a SOTA of 51.3% on Multi-SWE-Bench, surpassing its rival in multi-language, complex coding environments. It also posts 76.3% on BrowseComp for industry-leading search and tool use, and 76.8% on BFCL for high-precision agentic workflows, as cited by Pandaily.
MiniMax emphasizes the shift from AI as a ‘chatbot’ to AI as a ‘worker,’ with M2.5 designed for ‘production-grade native agent models.’ The model exhibits an ‘Architect Mindset,’ proactively planning project structures before coding. This efficiency is reflected in its pricing: the M2.5-Lightning version delivers 100 tokens per second for $0.30 per 1M input tokens and $2.40 per 1M output tokens, while the standard M2.5, optimized for cost, runs at 50 tokens per second for $0.15 per 1M input tokens and $1.20 per 1M output tokens. This translates to running four AI agents continuously for an entire year for approximately $10,000, making it significantly more affordable than proprietary models like Claude Opus 4.6, which costs $30.00 for comparable tasks.
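The annual figure can be sanity-checked from the published per-token rates. The sketch below uses the standard M2.5 prices quoted above; the input-to-output token ratio is an assumption of ours (the article does not specify the workload mix behind the ~$10,000 claim), chosen here as 3:1 for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Published rates for the standard (cost-optimized) M2.5, per the article.
output_tokens_per_sec = 50
input_price_per_m = 0.15   # USD per 1M input tokens
output_price_per_m = 1.20  # USD per 1M output tokens

# Assumption (not from the article): each agent consumes about
# 3 input tokens for every output token it generates.
input_to_output_ratio = 3

output_tokens = output_tokens_per_sec * SECONDS_PER_YEAR  # per agent, per year
input_tokens = output_tokens * input_to_output_ratio

cost_per_agent = (
    (output_tokens / 1e6) * output_price_per_m
    + (input_tokens / 1e6) * input_price_per_m
)
print(f"Four agents, one year: ${4 * cost_per_agent:,.0f}")
# → Four agents, one year: $10,407
```

Under these assumptions the total lands near $10,400, consistent with the approximately $10,000 MiniMax cites; a lighter input mix would bring it in slightly under that figure.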
Internally, MiniMax has already integrated M2.5 into its operations, with 30% of all tasks at MiniMax HQ completed by the model, and a staggering 80% of newly committed code generated by M2.5. This internal deployment underscores the model’s practical utility and cost-saving potential for enterprises.
Shifting AI Landscape: Beyond Claude Opus 4.6’s Dominance
The emergence of Google’s Gemini 3 Deep Think and MiniMax’s M2.5 models signifies a critical evolution in the artificial intelligence sector. These developments challenge the established benchmarks set by models like Claude Opus 4.6, not merely by incremental improvements but by fundamentally reshaping the value proposition of advanced AI. The focus is expanding beyond raw computational power to encompass efficiency, cost-effectiveness, and specialized agentic capabilities.
For technical leaders and enterprises, this means a new operational playbook. The pressure to ‘optimize’ prompts to save money is diminishing, allowing for the deployment of high-context, high-reasoning models for routine tasks that were previously cost-prohibitive. The speed improvements, such as M2.5’s 37% faster end-to-end task completion, enable ‘agentic’ pipelines—where models interact with other models—to operate fast enough for real-time user applications. Moreover, M2.5’s strong performance in financial modeling suggests its capability to handle the ‘tacit knowledge’ of specialized industries like law and finance with minimal human oversight.
This new wave of AI models, led by Google and MiniMax, is redefining the competitive landscape by offering unparalleled performance-to-cost ratios and specialized intelligence, effectively shifting the industry’s focus towards more accessible and integrated AI solutions for a broader range of real-world applications.