openai

GPT-5.1

GPT-5.1 is OpenAI’s advanced reasoning flagship featuring adaptive thinking, native multimodality, and state-of-the-art performance in math and technical...

OpenAI | GPT-5 | November 12, 2025
Context: 400K tokens
Max Output: 128K tokens
Input Price: $1.25 / 1M tokens
Output Price: $10.00 / 1M tokens
Modality: Text, Image
Capabilities: Vision, Tools, Streaming, Reasoning
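
As a rough illustration of how the prices listed above translate into per-request cost, the sketch below multiplies token counts by the per-million rates. The helper name `estimateCostUsd` and the example token counts are hypothetical, and actual billing may differ (for example, cached input may be priced differently).

```typescript
// Prices copied from the listing above (USD per 1M tokens).
const INPUT_PRICE_PER_MILLION = 1.25;
const OUTPUT_PRICE_PER_MILLION = 10.0;

// Hypothetical helper: estimates the cost of one request in USD.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1_000_000) * INPUT_PRICE_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
  return inputCost + outputCost;
}

// Example: a 200,000-token prompt with a 4,000-token answer.
console.log(estimateCostUsd(200_000, 4_000).toFixed(2)); // "0.29"
```

Output tokens dominate quickly at these rates: one million output tokens cost eight times as much as one million input tokens.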
Benchmarks
GPQA
88.1%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD-level experts achieve only 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence "Google-proof"). GPT-5.1 scored 88.1% on this benchmark.
HLE
68%
HLE: Humanity's Last Exam. Tests a model's ability to demonstrate expert-level reasoning across specialized domains, with questions that require deep, professional-level knowledge of complex topics. GPT-5.1 scored 68% on this benchmark.
MMLU
87.3%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. GPT-5.1 scored 87.3% on this benchmark.
MMLU Pro
85%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. GPT-5.1 scored 85% on this benchmark.
SimpleQA
54%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and the tendency to hallucinate in knowledge retrieval tasks. GPT-5.1 scored 54% on this benchmark.
IFEval
93%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. GPT-5.1 scored 93% on this benchmark.
AIME 2025
99.6%
AIME 2025: American Invitational Mathematics Examination. Competition-level mathematics problems from the AIME, an exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. GPT-5.1 scored 99.6% on this benchmark.
MATH
94%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. GPT-5.1 scored 94% on this benchmark.
GSM8k
97.1%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. GPT-5.1 scored 97.1% on this benchmark.
MGSM
96%
MGSM: Multilingual Grade School Math. The GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. GPT-5.1 scored 96% on this benchmark.
MathVista
75%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. GPT-5.1 scored 75% on this benchmark.
SWE-Bench
76.3%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. GPT-5.1 scored 76.3% on this benchmark.
HumanEval
94.2%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. GPT-5.1 scored 94.2% on this benchmark.
LiveCodeBench
94%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. GPT-5.1 scored 94% on this benchmark.
MMMU
76.4%
MMMU: Massive Multi-discipline Multimodal Understanding. Tests vision-language models on college-level problems across 30 subjects requiring both image understanding and expert knowledge. GPT-5.1 scored 76.4% on this benchmark.
MMMU Pro
62%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. GPT-5.1 scored 62% on this benchmark.
ChartQA
83%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. GPT-5.1 scored 83% on this benchmark.
DocVQA
84%
DocVQA: Document Visual Question Answering. Tests the ability to extract and reason about information from document images including forms, reports, and scanned text. GPT-5.1 scored 84% on this benchmark.
Terminal-Bench
55%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. GPT-5.1 scored 55% on this benchmark.
ARC-AGI
90.5%
ARC-AGI: Abstraction and Reasoning Corpus for AGI. Tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. GPT-5.1 scored 90.5% on this benchmark.

About GPT-5.1

Learn about GPT-5.1's capabilities, features, and how it can help you achieve better results.

Reasoning Architecture

GPT-5.1 features a System 2 thinking architecture. This allows the model to adjust its processing time based on the complexity of the query. For mathematical proofs, it applies deep logical deductions, while simple conversational tasks maintain low latency. The adaptive reasoning system ensures compute is allocated where it provides the most value.

Multimodal Performance

The model uses an omni multimodal framework for text and vision inputs. It provides 84% lower latency on enterprise document extraction tasks compared to its predecessor. Improved memory retention ensures that context is maintained throughout long-horizon agentic workflows, making it suitable for large-scale software engineering projects.
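
To make the text-plus-vision input concrete, here is a minimal sketch of a user message that pairs an instruction with a document image, using the content-part format of the Chat Completions API (`text` and `image_url` parts). The helper name `buildDocumentExtractionMessage`, the prompt wording, and the image URL are hypothetical placeholders.

```typescript
// Content-part shapes used by the Chat Completions API for mixed text and image input.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: pairs an instruction with one document image.
function buildDocumentExtractionMessage(instruction: string, imageUrl: string): UserMessage {
  return {
    role: "user",
    content: [
      { type: "text", text: instruction },
      { type: "image_url", image_url: { url: imageUrl } },
    ],
  };
}

const invoiceMessage = buildDocumentExtractionMessage(
  "Extract the invoice number and total as JSON.",
  "https://example.com/invoice.png", // placeholder URL
);
console.log(JSON.stringify(invoiceMessage, null, 2));
```

The resulting object would go in the `messages` array of a request, in place of a plain-string user message.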

Personalization Systems

A new engine enables tone and trait steering. Users can configure the model to be professional, casual, or expressive through explicit system instructions. These traits allow developers to deploy bots that better match specific brand identities and user preferences without extensive few-shot prompting.

Use Cases

Discover the different ways you can use GPT-5.1 to achieve great results.

Agentic Software Engineering

The model automates complex refactors across large codebases using high-accuracy reasoning.

PhD-Level Research

It solves intricate problems in biology and physics that require verified multi-step deductions.

Enterprise Document Analysis

The system extracts structured data from massive sets of tabular documents with high visual precision.

Personalized Customer Support

Developers can deploy bots with specific brand traits, such as quirky or professional, to match user sentiment.

Mathematical Problem Solving

With a 99.6% score on AIME 2025, the model can help verify proofs and tutor students in advanced mathematics.

Vision-Based Business Intelligence

It analyzes complex charts and financial reports to generate executive summaries with visual context.

Strengths

Elite Mathematical Reasoning: The model achieved a 99.6% score on AIME 2025, outperforming almost all previous competitive models.
Adaptive Processing: Dynamic compute scaling reduces latency by 84% on simple enterprise document tasks.
Enhanced Personality Control: Native tone steering makes interactions feel warmer and more human than the original GPT-5.
Large Scale Context: A 400,000 token window combined with 24-hour caching allows for massive agentic workflows.

Limitations

High Output Latency: High-effort reasoning can extend response times to over 20 seconds for complex queries.
No Native Audio: It lacks the built-in speech-to-speech capabilities found in competitors like Gemini 2.0.
Output Pricing: At $10 per million tokens, the cost of long-form reasoning outputs is significantly higher than instant models.
Persistent Stylistic Quirks: Users report the model still struggles to avoid specific punctuation patterns despite explicit memory instructions.

API Quick Start

openai/gpt-5.1

openai SDK
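// Requires the openai package (npm install openai) and an ES module context for top-level await.
import OpenAI from 'openai';

// By default the client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-5.1",
  messages: [{ role: "user", content: "Analyze the security of this smart contract." }],
  reasoning_effort: "high", // deeper reasoning; expect higher latency
});

// Print the text of the first completion choice.
console.log(response.choices[0].message.content);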

Install the SDK and start making API calls in minutes.
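
Streaming is listed among the model's capabilities. With the openai SDK, passing `stream: true` to `chat.completions.create` returns an async iterable of chunks whose text arrives in `choices[0].delta.content`. The sketch below shows only the accumulation step, using a simplified chunk type and a fake stream in place of a real API call; `collectText`, `fakeStream`, and `ChunkLike` are hypothetical names.

```typescript
// Simplified shape of a streamed chunk; the real SDK type has more fields.
type ChunkLike = { choices: Array<{ delta: { content?: string | null } }> };

// Hypothetical helper: concatenates the text deltas from a chunk stream.
async function collectText(stream: AsyncIterable<ChunkLike>): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta.content ?? "";
  }
  return text;
}

// Stand-in for the stream the SDK returns when `stream: true` is set.
async function* fakeStream(): AsyncGenerator<ChunkLike> {
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: { content: "lo" } }] };
  yield { choices: [{ delta: {} }] }; // some chunks carry no text
}

collectText(fakeStream()).then((text) => console.log(text)); // prints "Hello"
```

In a real application you would typically render each delta as it arrives instead of waiting for the full string, which helps hide the latency of high-effort reasoning.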

Community Feedback

See what the community thinks about GPT-5.1

GPT-5.1 etc in Codex is still the best reviewer for planning and code review tasks.
darrenjr
twitter
Our evals found GPT-5 performed up to 190% better than other leading models in complex reasoning.
CodeRabbit
twitter
GPT-5.1 is better calibrated to prompt difficulty, consuming far fewer tokens on easy inputs.
Tech Titans
facebook
This release is all about the personality and making ChatGPT feel less clinical and sterile.
Theo
youtube
The 400k context window is a lifesaver for our entire repo analysis.
RedditUser99
reddit
Still no native audio is a bummer, but the reasoning gains are real.
HackerNewsGuy
hackernews

Related Videos

Watch tutorials, reviews, and discussions about GPT-5.1

GPT 5.1 is here. It is faster. It is more accurate. It is more conversational.

For the first time, GPT 5.1 Instant can use adaptive reasoning to decide when to think.

The logic here is significantly better than the standard GPT 5 model.

It manages to maintain a warmer tone than we saw in the previous preview versions.

If you are a developer, the extended prompt caching is going to save you a ton of money.

It's even more personalizable than ever before.

The tone sounds a lot more natural... 5.1 is much better for energy.

I noticed it doesn't hallucinate as much during complex workflow steps.

The speed of the instant mode is almost equivalent to GPT 4o mini but with more smarts.

Personalization features mean you can actually tell it to stop being so formal.

This is probably one of the most relaxed iterative updates to a Frontier AI model.

It produced a successful bumper car game result compared to GPT5 thinking.

The vision processing on handwritten documents is noticeably sharper.

I think the reasoning effort toggle is the best feature for managing API costs.

It finally feels like a model you can talk to without it sounding like a textbook.

Pro Tips

Expert tips to help you get the most out of GPT-5.1 and achieve better results.

Adjust Reasoning Effort

Set the reasoning_effort parameter to "high" for math and to "none" for simple chat to reduce latency.
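
One way to apply this tip is to choose the effort level per request. The sketch below is a minimal illustration; the task categories and the helper names `pickReasoningEffort` and `buildRequest` are hypothetical, and only the two effort values mentioned on this page are modeled.

```typescript
// Effort values mentioned on this page; the API may accept others.
type ReasoningEffort = "none" | "high";
type TaskKind = "chat" | "math";

// Hypothetical mapping from task kind to effort level.
function pickReasoningEffort(task: TaskKind): ReasoningEffort {
  return task === "math" ? "high" : "none";
}

// Builds a request body in the shape used by the quick-start example.
function buildRequest(task: TaskKind, prompt: string) {
  return {
    model: "gpt-5.1",
    messages: [{ role: "user" as const, content: prompt }],
    reasoning_effort: pickReasoningEffort(task),
  };
}

console.log(buildRequest("math", "Prove that the square root of 2 is irrational.").reasoning_effort); // "high"
console.log(buildRequest("chat", "Say hello.").reasoning_effort); // "none"
```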

Leverage Large Context

Use the 400K-token context window to load entire project folders; the model retains information well across long prompts.
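
Before sending a whole project, it helps to check roughly whether it fits. The sketch below is a back-of-the-envelope check under two loudly stated assumptions: about 4 characters per token (a rule of thumb, not a real tokenizer), and input and output sharing the same 400K window (not confirmed on this page). The names `estimateTokens` and `fitsInContext` are hypothetical.

```typescript
// Limits from the model listing on this page.
const CONTEXT_WINDOW_TOKENS = 400_000;
const MAX_OUTPUT_TOKENS = 128_000;

// Assumption: ~4 characters per token. Use a real tokenizer for exact counts.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical check: does the project fit while leaving room for the reply?
// Assumes input and output share one window, which this page does not confirm.
function fitsInContext(files: string[], reservedOutputTokens: number = MAX_OUTPUT_TOKENS): boolean {
  const inputTokens = files.reduce((sum, file) => sum + estimateTokens(file), 0);
  return inputTokens + reservedOutputTokens <= CONTEXT_WINDOW_TOKENS;
}

const smallProject = ["a".repeat(40_000), "b".repeat(80_000)]; // ~30,000 tokens
console.log(fitsInContext(smallProject)); // true
```

If the check fails, reserving less output space or sending only the relevant subdirectories are the obvious levers.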

Tone Steering

Enable tone traits in your system instructions to make the model sound less clinical and more like a teammate.
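
A minimal sketch of tone steering through the system message follows. The three tone names come from the Personalization Systems section of this page; the instruction wording and the helper name `buildMessages` are hypothetical and should be tuned to your own brand voice.

```typescript
type Tone = "professional" | "casual" | "expressive";

// Hypothetical instruction text for each tone.
const TONE_INSTRUCTIONS: Record<Tone, string> = {
  professional: "Respond in a concise, professional tone.",
  casual: "Respond in a relaxed, friendly tone, like a helpful teammate.",
  expressive: "Respond with warmth and enthusiasm.",
};

// Puts the tone instruction in the system message, ahead of the user prompt.
function buildMessages(tone: Tone, userPrompt: string) {
  return [
    { role: "system" as const, content: TONE_INSTRUCTIONS[tone] },
    { role: "user" as const, content: userPrompt },
  ];
}

console.log(buildMessages("casual", "Why did my build fail?"));
```

Keeping the tone text in one table makes it easy to switch personas per deployment without touching the request code.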

Prompt Caching

Take advantage of 24-hour prompt caching to reduce costs when running repetitive agentic loops on the same codebase.
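
Assuming caching matches on the beginning of the prompt, as OpenAI's prompt caching documentation describes, an agentic loop benefits from placing the large, unchanging context first and the per-iteration task last. The sketch below illustrates that ordering; `buildLoopMessages` and `sharedPrefixLength` are hypothetical helpers, and the second exists only to show that consecutive prompts share a prefix.

```typescript
type Message = { role: "system" | "user"; content: string };

// Hypothetical helper: keeps the large, unchanging context at the start of the
// prompt so that repeated calls share an identical prefix.
function buildLoopMessages(systemPrompt: string, codebaseSnapshot: string, task: string): Message[] {
  return [
    { role: "system", content: systemPrompt },
    { role: "user", content: codebaseSnapshot }, // stable across iterations
    { role: "user", content: task }, // changes each iteration
  ];
}

// Counts how many leading messages two prompts have in common.
function sharedPrefixLength(a: Message[], b: Message[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i].role === b[i].role && a[i].content === b[i].content) {
    i++;
  }
  return i;
}

const firstCall = buildLoopMessages("You are a code reviewer.", "<repo contents>", "Review module A.");
const secondCall = buildLoopMessages("You are a code reviewer.", "<repo contents>", "Review module B.");
console.log(sharedPrefixLength(firstCall, secondCall)); // 2
```

Putting the changing task first instead would give the two prompts no common prefix beyond the system message, so the codebase snapshot would not be reusable.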

Related AI Models

Qwen3.5-397B-A17B

Alibaba

Qwen3.5-397B-A17B is Alibaba's flagship open-weight MoE model. It features native multimodal reasoning, a 1M context window, and a 19x decoding throughput...

1M context
$0.40 / $2.40 per 1M tokens (input / output)

Kimi K2.5

Moonshot

Discover Moonshot AI's Kimi K2.5, a 1T-parameter open-source agentic model featuring native multimodal capabilities, a 262K context window, and SOTA reasoning.

256K context
$0.60 / $3.00 per 1M tokens (input / output)

Grok-4

xAI

Grok-4 by xAI is a frontier model featuring a 2M token context window, real-time X platform integration, and world-record reasoning capabilities.

2M context
$3.00 / $15.00 per 1M tokens (input / output)

Claude Opus 4.5

Anthropic

Claude Opus 4.5 is Anthropic's most powerful frontier model, delivering record-breaking 80.9% SWE-bench performance and advanced autonomous agency for coding.

200K context
$5.00 / $25.00 per 1M tokens (input / output)

Gemini 3.1 Flash-Lite

Google

Gemini 3.1 Flash-Lite is Google's fastest, most cost-efficient model. Features 1M context, native multimodality, and 363 tokens/sec speed for scale.

1M context
$0.25 / $1.50 per 1M tokens (input / output)

Claude Sonnet 4.6

Anthropic

Claude Sonnet 4.6 offers frontier performance for coding and computer use with a massive 1M token context window for only $3/1M tokens.

1M context
$3.00 / $15.00 per 1M tokens (input / output)

GLM-5

Zhipu (GLM)

GLM-5 is Zhipu AI's 744B parameter open-weight powerhouse, excelling in long-horizon agentic tasks, coding, and factual accuracy with a 200k context window.

200K context
$1.00 / $3.20 per 1M tokens (input / output)

Gemini 3 Flash

Google

Gemini 3 Flash is Google's high-speed multimodal model featuring a 1M token context window, elite 90.4% GPQA reasoning, and autonomous browser automation tools.

1M context
$0.50 / $3.00 per 1M tokens (input / output)
