anthropic

Claude 3.7 Sonnet

Claude 3.7 Sonnet is Anthropic's first hybrid reasoning model, delivering state-of-the-art coding capabilities, a 200k context window, and visible thinking.

Anthropic · Claude 3 family · February 24, 2025
Context: 200K tokens
Max Output: 128K tokens
Input Price: $3.00 / 1M tokens
Output Price: $15.00 / 1M tokens
Modality: Text, Image
Capabilities: Vision, Tools, Streaming, Reasoning
Benchmarks
GPQA
70.3%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Claude 3.7 Sonnet scored 70.3% on this benchmark.
HLE
34.1%
HLE: Humanity's Last Exam. A frontier benchmark of expert-written questions spanning a wide range of specialized academic domains, designed to test reasoning at the edge of human knowledge. Claude 3.7 Sonnet scored 34.1% on this benchmark.
MMLU
88.8%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Claude 3.7 Sonnet scored 88.8% on this benchmark.
MMLU Pro
78.4%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Claude 3.7 Sonnet scored 78.4% on this benchmark.
SimpleQA
40.5%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and reduces hallucinations in knowledge retrieval tasks. Claude 3.7 Sonnet scored 40.5% on this benchmark.
IFEval
88.5%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Claude 3.7 Sonnet scored 88.5% on this benchmark.
AIME 2025
80%
AIME 2025: American Invitational Math Exam. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Claude 3.7 Sonnet scored 80% on this benchmark.
MATH
90%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Claude 3.7 Sonnet scored 90% on this benchmark.
GSM8k
96.4%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Claude 3.7 Sonnet scored 96.4% on this benchmark.
MGSM
94%
MGSM: Multilingual Grade School Math. The GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Claude 3.7 Sonnet scored 94% on this benchmark.
MathVista
72.1%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. Claude 3.7 Sonnet scored 72.1% on this benchmark.
SWE-Bench
62.1%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Claude 3.7 Sonnet scored 62.1% on this benchmark.
HumanEval
94%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Claude 3.7 Sonnet scored 94% on this benchmark.
LiveCodeBench
65.4%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Claude 3.7 Sonnet scored 65.4% on this benchmark.
MMMU
69.1%
MMMU: Multimodal Understanding. Massive Multi-discipline Multimodal Understanding benchmark testing vision-language models on college-level problems across 30 subjects requiring both image understanding and expert knowledge. Claude 3.7 Sonnet scored 69.1% on this benchmark.
MMMU Pro
55.4%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. Claude 3.7 Sonnet scored 55.4% on this benchmark.
ChartQA
91.2%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. Claude 3.7 Sonnet scored 91.2% on this benchmark.
DocVQA
93.5%
DocVQA: Document Visual Q&A. Document Visual Question Answering benchmark testing the ability to extract and reason about information from document images including forms, reports, and scanned text. Claude 3.7 Sonnet scored 93.5% on this benchmark.
Terminal-Bench
58.2%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Claude 3.7 Sonnet scored 58.2% on this benchmark.
ARC-AGI
12%
ARC-AGI: Abstraction & Reasoning. Abstraction and Reasoning Corpus for AGI - tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Claude 3.7 Sonnet scored 12% on this benchmark.

About Claude 3.7 Sonnet

Learn about Claude 3.7 Sonnet's capabilities, features, and how it can help you achieve better results.

Hybrid Reasoning Design

Claude 3.7 Sonnet uses a new architecture that lets users choose between speed and depth. It is the first model to offer a toggle for extended thinking, allowing the system to work through complex logic before providing an answer. This transparency lets developers see exactly how the model reaches a conclusion, reducing the chance of hidden errors in technical work.

Technical Problem Solving

This model is built for high-level software engineering. It scores 62.1% on the SWE-bench Verified benchmark, showing a strong ability to fix real GitHub issues. When used with tools like Claude Code, it manages file editing and command execution across large repositories. It handles math and coding tasks with a level of precision that matches or exceeds current top-tier reasoning models.

Massive Context Capacity

With a 200,000-token context window, the model processes large sets of documentation or codebases in one go. It supports up to 128,000 tokens of output when the thinking mode is active, making it useful for generating long scripts or detailed reports. The model is also multimodal, meaning it can interpret charts and diagrams alongside text.
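Before sending a large codebase or document set in one request, it helps to sanity-check that it fits the 200K window. The sketch below uses a rough ~4 characters-per-token heuristic, which is an assumption for quick estimation, not an official Anthropic figure; for exact counts, the API's token-counting endpoint is the authoritative source.

```typescript
// Rough heuristic: ~4 characters per token for English text.
// This ratio is an estimation convention, not an official figure.
const CHARS_PER_TOKEN = 4;
const CONTEXT_WINDOW = 200_000; // Claude 3.7 Sonnet's context window

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Returns true if the combined documents likely fit in one request.
function fitsInContext(docs: string[]): boolean {
  const total = docs.reduce((sum, d) => sum + estimateTokens(d), 0);
  return total <= CONTEXT_WINDOW;
}
```

A pre-flight check like this avoids a failed API call; when the estimate is close to the limit, splitting the input or summarizing older material is the safer path.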

Use Cases

Discover the different ways you can use Claude 3.7 Sonnet to achieve great results.

Agentic Software Engineering

Using the terminal tool to fix bugs and refactor code across massive file structures.

Math Proof Verification

Solving difficult math problems by letting the model think through logical steps.

Repository Analysis

Extracting data and identifying patterns from entire technical codebases in one prompt.

Visual Data Parsing

Converting complex charts, flowcharts, and technical diagrams into structured JSON data.

System Architecture Planning

Designing software systems with detailed logic checks using the extended thinking mode.

Automated Git Workflows

Managing commit messages, code reviews, and test execution through agentic tool use.
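For the visual data parsing use case, images are passed to the model as base64-encoded content blocks alongside a text prompt. The sketch below builds a request body following the Messages API content-block shape; the prompt text and function name are illustrative, not part of any official API.

```typescript
// Sketch of a chart-to-JSON request body for the Anthropic Messages API.
// The content-block shape (type: "image" with a base64 source) follows
// the documented Messages API format; buildChartRequest is a helper
// invented here for illustration.
function buildChartRequest(base64Png: string) {
  return {
    model: "claude-3-7-sonnet-20250219",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/png", data: base64Png },
          },
          { type: "text", text: "Extract this chart's data series as structured JSON." },
        ],
      },
    ],
  };
}
```

Placing the image block before the text instruction is a common convention; the model sees both in order and can reference the chart directly when producing the JSON.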

Strengths

Hybrid Thinking Options: The first model allowing users to toggle between fast standard responses and deep reasoning modes.
Premier Coding Agent: Top-tier performance on SWE-bench Verified with a 62.1% score for fixing production issues.
Extreme Output Capacity: Generates up to 128,000 tokens in a single response, facilitating massive code and document generation.
Transparent Logic: Externalized chain-of-thought allows users to audit and debug the model's internal reasoning process.

Limitations

Reasoning Latency: Enabling thinking mode significantly increases the time it takes for the model to respond.
Thinking Cost: Internal reasoning tokens are billed at the $15 per million output rate, which adds up during long tasks.
No Video Support: Unlike some competitors, it cannot natively ingest or analyze raw video files through the API.
Knowledge Cutoff: The training data only extends to October 2024, missing more recent industry developments.
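Because thinking tokens are billed at the output rate, it is worth estimating cost per request before enabling long reasoning runs. This sketch uses the published rates from the spec table above; the function itself is a simple illustration, not an official billing calculator.

```typescript
// Published rates for Claude 3.7 Sonnet (USD per million tokens).
const INPUT_PER_M = 3.0;
const OUTPUT_PER_M = 15.0;

// Thinking tokens are billed at the output rate, so extended
// reasoning adds real cost on top of the visible answer.
function requestCost(
  inputTokens: number,
  outputTokens: number,
  thinkingTokens: number
): number {
  const input = (inputTokens / 1_000_000) * INPUT_PER_M;
  const output = ((outputTokens + thinkingTokens) / 1_000_000) * OUTPUT_PER_M;
  return input + output;
}
```

For example, a request that reasons for 100K tokens costs as much in thinking alone as roughly 500K tokens of input, which is why capping the thinking budget matters on long agentic tasks.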

API Quick Start

anthropic/claude-3-7-sonnet

View Documentation
Anthropic TypeScript SDK
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const message = await anthropic.messages.create({
  model: "claude-3-7-sonnet-20250219",
  max_tokens: 4096,
  thinking: {
    type: "enabled",
    budget_tokens: 2048
  },
  messages: [{ role: "user", content: "Analyze this architectural flaw..." }],
});

console.log(message.content);

Install the SDK and start making API calls in minutes.

Community Feedback

See what the community thinks about Claude 3.7 Sonnet

Claude Code plus 3.7 Sonnet is basically a junior developer on steroids in my terminal. It's the first time agentic AI felt real.
dev_guru_99
reddit
The hybrid reasoning is a major update. I don't always need it to think for 30 seconds, but when I'm debugging, it's incredible.
TechLead_X
twitter
Anthropic managed to make a model that competes with o1 on math while staying useful for everyday chat.
logic_fanatic
hackernews
Claude delivers comprehensive, beautifully formatted reports with citations in under five minutes.
ThinkingDeeplyAI_mod
reddit
The 128k output limit is a sleeper feature. Finally a model that doesn't cut off halfway through a long script.
code_monk_42
reddit
Claude 3.7 + MCP is the closest thing to Jarvis right now. It actually uses my local tools correctly.
julie_codes_it
twitter

Related Videos

Watch tutorials, reviews, and discussions about Claude 3.7 Sonnet

Claude 3.7 is straight gas. The new base model beat itself to become even better at programming.

The new 3.7 model absolutely crushed all other models including OpenAI o3 mini.

It is capable of solving 70% of GitHub issues.

Extended thinking allows the model to ponder a problem before outputting code.

This is a massive win for the developer experience.

Chat bots give you advice, but Claude Code takes actions. It can create files, build websites, and install packages.

Extended thinking is Claude reasoning before it actually takes any actions.

The tool is optimized for the terminal environment.

MCP connectivity is what really separates this from standard ChatGPT.

The model understands intent behind vague terminal commands.

The integration with the terminal via Claude Code is a level of agency we haven't seen yet.

Claude 3.7 Sonnet's ability to show its thought process is far more transparent than competitors.

On SWE-bench Verified, it hits a remarkable 62%.

Hybrid reasoning means you don't pay the latency penalty when you don't need it.

It maintains the high-quality writing style of previous Claude models.

More than just prompts

Supercharge your workflow with AI Automation

Automatio combines the power of AI agents, web automation, and smart integrations to help you accomplish more in less time.

AI Agents
Web Automation
Smart Workflows

Pro Tips

Expert tips to help you get the most out of Claude 3.7 Sonnet and achieve better results.

Set Reasoning Budgets

Use the API thinking parameter to limit the number of reasoning tokens to manage costs.
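As a concrete illustration of this tip, the sketch below builds request parameters with a capped budget, mirroring the `thinking` block in the SDK quick-start example. The validation rules encoded here (a 1,024-token minimum for `budget_tokens`, and a budget below `max_tokens`) reflect the API's documented constraints at the time of writing; the helper function itself is invented for illustration.

```typescript
// Build request parameters with a capped reasoning budget.
// withThinkingBudget is an illustrative helper, not part of the SDK.
function withThinkingBudget(maxTokens: number, budget: number) {
  // The API enforces a minimum thinking budget (1,024 tokens).
  if (budget < 1024) throw new Error("budget_tokens must be at least 1024");
  // budget_tokens must leave room for the visible answer.
  if (budget >= maxTokens) throw new Error("budget_tokens must be below max_tokens");
  return {
    model: "claude-3-7-sonnet-20250219",
    max_tokens: maxTokens,
    thinking: { type: "enabled" as const, budget_tokens: budget },
  };
}
```

Starting with a modest budget and raising it only when answers show incomplete reasoning is usually cheaper than defaulting to a large one.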

Review Thought Blocks

Check the internal chain-of-thought in responses to verify the logic of complex answers.

Use MCP Connectors

Connect the model to local databases and cloud storage for real-time project context.

Context Refreshing

Use summary commands in long agentic loops to keep the context window focused on relevant data.
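The refresh idea above can be sketched as a simple compaction step: once a conversation grows past a chosen length, older turns are folded into one summary message while recent turns stay verbatim. The threshold and the `summarize` callback are assumptions for illustration; in practice the summary would come from a model call.

```typescript
// Minimal sketch of context compaction for long agentic loops.
// keepRecent and summarize() are illustrative knobs, not API features.
type Turn = { role: "user" | "assistant"; content: string };

function compact(
  history: Turn[],
  keepRecent: number,
  summarize: (turns: Turn[]) => string
): Turn[] {
  if (history.length <= keepRecent) return history;
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  // Replace the older turns with a single summary message.
  return [
    { role: "user", content: `Summary of earlier conversation: ${summarize(older)}` },
    ...recent,
  ];
}
```

Running this whenever the estimated token count nears the window keeps the agent focused on recent, relevant state instead of paying to re-read stale turns.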

Testimonials

What Our Users Say

Join thousands of satisfied users who have transformed their workflow

Jonathan Kogan

Co-Founder/CEO, rpatools.io

Automatio is one of the most used for RPA Tools both internally and externally. It saves us countless hours of work and we realized this could do the same for other startups and so we choose Automatio for most of our automation needs.

Mohammed Ibrahim

CEO, qannas.pro

I have used many tools over the past 5 years, Automatio is the Jack of All trades.. !! it could be your scraping bot in the morning and then it becomes your VA by the noon and in the evening it does your automations.. its amazing!

Ben Bressington

CTO, AiChatSolutions

Automatio is fantastic and simple to use to extract data from any website. This allowed me to replace a developer and do tasks myself as they only take a few minutes to setup and forget about it. Automatio is a game changer!

Sarah Chen

Head of Growth, ScaleUp Labs

We've tried dozens of automation tools, but Automatio stands out for its flexibility and ease of use. Our team productivity increased by 40% within the first month of adoption.

David Park

Founder, DataDriven.io

The AI-powered features in Automatio are incredible. It understands context and adapts to changes in websites automatically. No more broken scrapers!

Emily Rodriguez

Marketing Director, GrowthMetrics

Automatio transformed our lead generation process. What used to take our team days now happens automatically in minutes. The ROI is incredible.

Related AI Models

anthropic

Claude Sonnet 4.5

Anthropic

Anthropic's Claude Sonnet 4.5 delivers world-leading coding (77.2% SWE-bench) and a 200K context window, optimized for the next generation of autonomous agents.

200K context
$3.00/$15.00/1M
openai

GPT-5.3 Codex

OpenAI

GPT-5.3 Codex is OpenAI's 2026 frontier coding agent, featuring a 400K context window, 77.3% Terminal-Bench score, and superior logic for complex software...

400K context
$1.75/$14.00/1M
deepseek

DeepSeek-V3.2-Speciale

DeepSeek

DeepSeek-V3.2-Speciale is a reasoning-first LLM featuring gold-medal math performance, DeepSeek Sparse Attention, and a 131K context window. Rivaling GPT-5...

131K context
$0.28/$0.42/1M
openai

GPT-5.4

OpenAI

GPT-5.4 is OpenAI's frontier model featuring a 1.05M context window and Extreme Reasoning. It excels at autonomous UI interaction and long-form data analysis.

1M context
$2.50/$15.00/1M
moonshot

Kimi K2 Thinking

Moonshot

Kimi K2 Thinking is Moonshot AI's trillion-parameter reasoning model. It outperforms GPT-5 on HLE and supports 300 sequential tool calls autonomously for...

256K context
$0.60/$2.50/1M
google

Gemini 3.1 Flash Live Preview

Google

Gemini 3.1 Flash Live Preview is Google's ultra-low-latency, audio-to-audio model featuring a 131K context window, high-fidelity multimodal reasoning, and...

131K context
$0.75/$4.50/1M
openai

GPT-4o mini

OpenAI

OpenAI's most cost-efficient small model, GPT-4o mini offers multimodal intelligence and high-speed performance at a significantly lower price point.

128K context
$0.15/$0.60/1M
google

Gemini 3.1 Flash-Lite

Google

Gemini 3.1 Flash-Lite is Google's fastest, most cost-efficient model. Features 1M context, native multimodality, and 363 tokens/sec speed for scale.

1M context
$0.25/$1.50/1M

Frequently Asked Questions

Find answers to common questions about Claude 3.7 Sonnet