
Claude Opus 4.5

Claude Opus 4.5 is Anthropic's most powerful frontier model, delivering record-breaking 80.9% SWE-bench performance and advanced autonomous agency for coding.

Anthropic · Claude 4 · November 24, 2025
Context: 200K tokens
Max Output: 64K tokens
Input Price: $5.00 / 1M tokens
Output Price: $25.00 / 1M tokens
Modality: Text, Image
Capabilities: Vision, Tools, Streaming, Reasoning
Benchmarks
GPQA
87%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Claude Opus 4.5 scored 87% on this benchmark.
HLE
20%
HLE: Humanity's Last Exam. A frontier benchmark of expert-written questions spanning dozens of specialized academic domains, designed to remain challenging after earlier benchmarks saturated. Even top models score well below expert level. Claude Opus 4.5 scored 20% on this benchmark.
MMLU
90.8%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Claude Opus 4.5 scored 90.8% on this benchmark.
MMLU Pro
89.5%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Claude Opus 4.5 scored 89.5% on this benchmark.
SimpleQA
36%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and reduces hallucinations in knowledge retrieval tasks. Claude Opus 4.5 scored 36% on this benchmark.
IFEval
92%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Claude Opus 4.5 scored 92% on this benchmark.
AIME 2025
87%
AIME 2025: American Invitational Math Exam. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Claude Opus 4.5 scored 87% on this benchmark.
MATH
95.2%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Claude Opus 4.5 scored 95.2% on this benchmark.
GSM8k
95%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Claude Opus 4.5 scored 95% on this benchmark.
MGSM
92.5%
MGSM: Multilingual Grade School Math. The GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Claude Opus 4.5 scored 92.5% on this benchmark.
MathVista
51.9%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. Claude Opus 4.5 scored 51.9% on this benchmark.
SWE-Bench
80.9%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Claude Opus 4.5 scored 80.9% on this benchmark.
HumanEval
92%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Claude Opus 4.5 scored 92% on this benchmark.
LiveCodeBench
70.3%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Claude Opus 4.5 scored 70.3% on this benchmark.
MMMU
80.7%
MMMU: Multimodal Understanding. Massive Multi-discipline Multimodal Understanding benchmark testing vision-language models on college-level problems across 30 subjects requiring both image understanding and expert knowledge. Claude Opus 4.5 scored 80.7% on this benchmark.
MMMU Pro
33%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. Claude Opus 4.5 scored 33% on this benchmark.
ChartQA
83.4%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. Claude Opus 4.5 scored 83.4% on this benchmark.
DocVQA
88.4%
DocVQA: Document Visual Q&A. Document Visual Question Answering benchmark testing the ability to extract and reason about information from document images including forms, reports, and scanned text. Claude Opus 4.5 scored 88.4% on this benchmark.
Terminal-Bench
59.3%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Claude Opus 4.5 scored 59.3% on this benchmark.
ARC-AGI
37.6%
ARC-AGI: Abstraction & Reasoning. Abstraction and Reasoning Corpus for AGI - tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Claude Opus 4.5 scored 37.6% on this benchmark.

About Claude Opus 4.5

Learn about Claude Opus 4.5's capabilities, features, and how it can help you achieve better results.

Claude Opus 4.5 is the flagship model from Anthropic, released in late 2025. It is specifically designed for complex software engineering and high-stakes reasoning. The model achieved a record-breaking 80.9% on the SWE-bench Verified benchmark, making it a primary choice for autonomous debugging and system refactoring. It introduces a refined persona emphasizing diplomatic honesty and nuanced helpfulness.

Multimodal and Agentic Optimization

The architecture supports a 200,000-token context window and a 64,000-token output limit. Developers can use a specialized effort parameter to scale reasoning depth against computational costs. This flexibility allows for high-intensity logic tasks or faster, more economical creative drafting. The model is multimodal, excelling at interpreting architectural diagrams and dense UI layouts.
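The multimodal claim above maps onto the Messages API's content-block format, where an image block and a text block share a single user turn. A minimal sketch of assembling such a request for diagram review (the helper name and base64 payload are illustrative placeholders, not part of any official SDK):

```typescript
// Sketch: a multimodal request body for the Anthropic Messages API.
// The content-block shapes below follow the published API; the helper
// itself is a hypothetical convenience, not an SDK function.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'image'; source: { type: 'base64'; media_type: string; data: string } };

function buildDiagramReview(pngBase64: string, question: string) {
  const content: ContentBlock[] = [
    // Image first, then the question referring to it.
    { type: 'image', source: { type: 'base64', media_type: 'image/png', data: pngBase64 } },
    { type: 'text', text: question },
  ];
  return {
    model: 'claude-opus-4-5-20251101',
    max_tokens: 2048,
    messages: [{ role: 'user' as const, content }],
  };
}

const req = buildDiagramReview('iVBORw0KGgo...', 'Identify race conditions in this architecture.');
console.log(req.messages[0].content.length); // two content blocks: image + text
```

The resulting object can be passed to `anthropic.messages.create(...)` as shown in the API Quick Start section.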

Engineering and Tool Use

Optimized for agentic workflows, it navigates terminal environments via Claude Code to perform system-wide audits. Its input and output pricing is significantly lower than that of earlier flagship iterations. Its ability to maintain coherence across long-horizon tasks positions it as a reliable partner for professional engineering teams and complex data analysis.


Use Cases

Discover the different ways you can use Claude Opus 4.5 to achieve great results.

Autonomous Software Engineering

Automating end-to-end debugging and system-wide refactoring with a record-breaking 80.9% SWE-bench score.

Agentic Research Workflows

Synthesizing vast amounts of technical data into actionable business strategies using the 200k context window.

High-Fidelity UI/UX Vision

Converting complex Figma designs and architectural diagrams into production-ready frontend code with pixel-perfect accuracy.

Multi-Agent Orchestration

Serving as the central brain for teams of sub-agents to manage long-horizon projects across disparate codebases.

Advanced Data Analysis

Automating complex financial modeling and Excel workflows with high precision and reasoning depth.

Literary and Creative Drafting

Producing nuanced prose that adheres to specific writerly tastes and complex human-centric design principles.

Strengths

Elite Coding Performance: The first model to break the 80% barrier on SWE-bench Verified (80.9%), outperforming all other frontier models.
Flexible Reasoning Control: The effort parameter gives developers granular control over computational cost and reasoning depth for specific workflows.
Natural Conversational Nuance: Recognized for a refined persona that handles ambiguity and follows complex background instructions without robotic guidance.
Significant Cost Efficiency: The $5/$25 pricing makes Opus-level intelligence accessible for high-volume enterprise production.

Limitations

Math Benchmark Gaps: While elite at coding, it trails slightly behind specialized models in PhD-level mathematics.
Planning Latency: Setting the effort parameter to high can result in significantly longer thinking phases before the first token.
Context Token Caps: System prompts and tool definitions can consume a large portion of the window before processing begins.
Factual Recall Gaps: On specialized accuracy tests like SimpleQA, it can still occasionally fabricate details compared to search-heavy competitors.

API Quick Start

anthropic/claude-opus-4.5

View Documentation
Anthropic SDK
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const msg = await anthropic.messages.create({
  model: 'claude-opus-4-5-20251101',
  max_tokens: 4096,
  effort: 'high', // reasoning-depth control; confirm the field's placement against current SDK docs
  messages: [{ role: 'user', content: 'Analyze this system architecture for race conditions.' }],
});

console.log(msg.content[0].text);

Install the SDK and start making API calls in minutes.

Community Feedback

See what the community thinks about Claude Opus 4.5

Claude Opus 4.5 feels less like a stateless assistant and more like a persistent teammate. It can trace assumptions across multiple files in a way that feels clearly stronger.
Federal-Piano8695
reddit
Watching your AI agent develop a social media persona that resonates with real people in ways you cannot explain. Infrastructure matters more than prompts.
auxten
twitter
Opus is the best performing model in this aspect. Its discussion is most natural, and it truly follows along with you in discussion.
ArchMeta1868
reddit
Opus 4.5 hits the most little nuances. It is the only model to successfully include an inline trailer mechanism in the first pass.
Matt Berman
youtube
The 80.9% SWE-bench score is probably real but also kinda misleading. It requires clear environment setup to hit those numbers consistently.
testingcatalog
twitter
SWE-bench Verified: 80.9% (Opus 4.5) vs 71.3% (Claude 3-Opus). This is a massive jump for real-world reliability.
Daniel Garcia
medium

Related Videos

Watch tutorials, reviews, and discussions about Claude Opus 4.5

Opus 4.5 hits the most little nuances

It was the only model to successfully include an inline trailer mechanism in the first pass

An agent-driven code evaluation confirms this subjective feeling, scoring Opus at 7/10 for feature completeness

The reasoning is far more logical than previous versions when handling edge cases

It maintains codebase consistency over 30 minute sessions

The price is now three times cheaper. It is only going to be $5 for a million input tokens

Input is $5 and output is $25 for a million tokens

Opus 4.5 scored higher than any human candidate has ever scored on Anthropic's own take-home exam

This is the first model to break the 80 percent barrier on SWE-bench

It handles autonomous 30-minute coding sessions without human intervention

Think of Claude Opus 4.5 as a persuasion layer and an absolute agentic monster

It is an absolute agentic and harness coding monster

Engineers end up preferring working with Claude Opus 4.5 because they get those tight feedback loops

The reasoning effort parameter is the standout feature for developers

It feels more like a collaborator than a tool in long-form discussions

More than just prompts

Supercharge your workflow with AI Automation

Automatio combines the power of AI agents, web automation, and smart integrations to help you accomplish more in less time.

AI Agents
Web Automation
Smart Workflows

Pro Tips

Expert tips to help you get the most out of Claude Opus 4.5 and achieve better results.

Toggle Reasoning Effort

Use the effort parameter to select high for complex logic or coding tasks and medium for standard creative writing.
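The tip above can be sketched as a small routing helper. The task categories and their mapping are illustrative assumptions; the returned string is meant to feed the effort parameter shown in the API Quick Start:

```typescript
// Sketch: pick a reasoning-effort level per task type before calling the API.
// The mapping is a suggested default, not an official recommendation.
type Effort = 'low' | 'medium' | 'high';
type Task = 'coding' | 'analysis' | 'creative' | 'chat';

function effortFor(task: Task): Effort {
  switch (task) {
    case 'coding':
    case 'analysis':
      return 'high'; // complex logic: spend more reasoning tokens
    case 'creative':
      return 'medium'; // standard drafting: balance cost and depth
    default:
      return 'low'; // quick conversational turns: minimize latency
  }
}
```

A caller would then pass `effort: effortFor('coding')` into the request body, accepting the longer pre-first-token latency noted under Limitations.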

Vision-Native Design

Upload high-resolution screenshots of UI bugs as the model is tuned to identify visual discrepancies that text descriptions miss.

Structured System Prompts

Define clear agentic roles and effort levels in your system prompts to prevent the model from overthinking simpler procedural tasks.

Context Compaction

Summarize history in long-running sessions to keep the 200k context window focused on the most relevant information.
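One way to sketch this compaction, assuming a caller-supplied summarize function (in practice often a cheap model call). The helper and its signature are illustrative, not an SDK feature:

```typescript
// Sketch: keep the most recent turns verbatim and collapse older history
// into a single summary message, so the context window stays focused.
interface Turn {
  role: 'user' | 'assistant';
  content: string;
}

function compact(
  history: Turn[],
  keepRecent: number,
  summarize: (older: Turn[]) => string, // placeholder: e.g. a cheap summarization call
): Turn[] {
  if (history.length <= keepRecent) return history;
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  const summary: Turn = {
    role: 'user',
    content: `Summary of earlier discussion: ${summarize(older)}`,
  };
  return [summary, ...recent];
}
```

Running compaction whenever the history approaches the 200K window keeps token usage bounded while preserving the recent turns the model needs for coherence.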

Testimonials

What Our Users Say

Join thousands of satisfied users who have transformed their workflow

Jonathan Kogan

Co-Founder/CEO, rpatools.io

Automatio is one of the most used for RPA Tools both internally and externally. It saves us countless hours of work and we realized this could do the same for other startups and so we choose Automatio for most of our automation needs.

Mohammed Ibrahim

CEO, qannas.pro

I have used many tools over the past 5 years, Automatio is the Jack of All trades.. !! it could be your scraping bot in the morning and then it becomes your VA by the noon and in the evening it does your automations.. its amazing!

Ben Bressington

CTO, AiChatSolutions

Automatio is fantastic and simple to use to extract data from any website. This allowed me to replace a developer and do tasks myself as they only take a few minutes to setup and forget about it. Automatio is a game changer!

Sarah Chen

Head of Growth, ScaleUp Labs

We've tried dozens of automation tools, but Automatio stands out for its flexibility and ease of use. Our team productivity increased by 40% within the first month of adoption.

David Park

Founder, DataDriven.io

The AI-powered features in Automatio are incredible. It understands context and adapts to changes in websites automatically. No more broken scrapers!

Emily Rodriguez

Marketing Director, GrowthMetrics

Automatio transformed our lead generation process. What used to take our team days now happens automatically in minutes. The ROI is incredible.


Related AI Models


Gemini 3.1 Flash-Lite

Google

Gemini 3.1 Flash-Lite is Google's fastest, most cost-efficient model. Features 1M context, native multimodality, and 363 tokens/sec speed for scale.

1M context
$0.25/$1.50/1M

Grok-4

xAI

Grok-4 by xAI is a frontier model featuring a 2M token context window, real-time X platform integration, and world-record reasoning capabilities.

2M context
$3.00/$15.00/1M

Kimi K2.5

Moonshot

Discover Moonshot AI's Kimi K2.5, a 1T-parameter open-source agentic model featuring native multimodal capabilities, a 262K context window, and SOTA reasoning.

256K context
$0.60/$3.00/1M

GLM-5

Zhipu (GLM)

GLM-5 is Zhipu AI's 744B parameter open-weight powerhouse, excelling in long-horizon agentic tasks, coding, and factual accuracy with a 200k context window.

200K context
$1.00/$3.20/1M

GPT-5.1

OpenAI

GPT-5.1 is OpenAI’s advanced reasoning flagship featuring adaptive thinking, native multimodality, and state-of-the-art performance in math and technical...

400K context
$1.25/$10.00/1M

GPT-5.2

OpenAI

GPT-5.2 is OpenAI's flagship model for professional tasks, featuring a 400K context window, elite coding, and deep multi-step reasoning capabilities.

400K context
$1.75/$14.00/1M

Qwen3.5-397B-A17B

Alibaba

Qwen3.5-397B-A17B is Alibaba's flagship open-weight MoE model. It features native multimodal reasoning, a 1M context window, and a 19x decoding throughput...

1M context
$0.40/$2.40/1M

Kimi K2 Thinking

Moonshot

Kimi K2 Thinking is Moonshot AI's trillion-parameter reasoning model. It outperforms GPT-5 on HLE and supports 300 sequential tool calls autonomously for...

256K context
$0.60/$2.50/1M

Frequently Asked Questions

Find answers to common questions about Claude Opus 4.5