
GPT-5.4

GPT-5.4 is OpenAI's frontier model featuring a 1.05M context window and Extreme Reasoning. It excels at autonomous UI interaction and long-form data analysis.

OpenAI · GPT-5 series · 1M Context · Reasoning · Multimodal
Released March 4, 2026
Context: 1.05M tokens
Max Output: 128K tokens
Input Price: $2.50 / 1M tokens
Output Price: $15.00 / 1M tokens
Modality: Text, Image
Capabilities: Vision, Tools, Streaming, Reasoning
Benchmarks
GPQA
84.2%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). GPT-5.4 scored 84.2% on this benchmark.
HLE
42%
HLE: Humanity's Last Exam. A frontier benchmark of expert-written questions spanning dozens of specialized academic domains, designed to remain challenging after models saturated earlier exams. Tests professional-level knowledge and multi-step reasoning. GPT-5.4 scored 42% on this benchmark.
MMLU
91%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. GPT-5.4 scored 91% on this benchmark.
MMLU Pro
76%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. GPT-5.4 scored 76% on this benchmark.
SimpleQA
56.7%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and reduces hallucinations in knowledge retrieval tasks. GPT-5.4 scored 56.7% on this benchmark.
IFEval
92%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. GPT-5.4 scored 92% on this benchmark.
AIME 2025
100%
AIME 2025: American Invitational Math Exam. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. GPT-5.4 scored 100% on this benchmark.
MATH
88.6%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. GPT-5.4 scored 88.6% on this benchmark.
GSM8k
99%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. GPT-5.4 scored 99% on this benchmark.
MGSM
96%
MGSM: Multilingual Grade School Math. The GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. GPT-5.4 scored 96% on this benchmark.
MathVista
74%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. GPT-5.4 scored 74% on this benchmark.
SWE-Bench
52.8%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. GPT-5.4 scored 52.8% on this benchmark.
HumanEval
85.1%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. GPT-5.4 scored 85.1% on this benchmark.
LiveCodeBench
72.5%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. GPT-5.4 scored 72.5% on this benchmark.
MMMU
84.2%
MMMU: Multimodal Understanding. Massive Multi-discipline Multimodal Understanding benchmark testing vision-language models on college-level problems across 30 subjects requiring both image understanding and expert knowledge. GPT-5.4 scored 84.2% on this benchmark.
MMMU Pro
61%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. GPT-5.4 scored 61% on this benchmark.
ChartQA
89%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. GPT-5.4 scored 89% on this benchmark.
DocVQA
94%
DocVQA: Document Visual Q&A. Document Visual Question Answering benchmark testing the ability to extract and reason about information from document images including forms, reports, and scanned text. GPT-5.4 scored 94% on this benchmark.
Terminal-Bench
55%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. GPT-5.4 scored 55% on this benchmark.
ARC-AGI
52.9%
ARC-AGI: Abstraction & Reasoning. Abstraction and Reasoning Corpus for AGI - tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. GPT-5.4 scored 52.9% on this benchmark.

About GPT-5.4

Learn about GPT-5.4's capabilities, features, and how it can help you achieve better results.

The Frontier of Long-Context Reasoning

GPT-5.4 represents the high-performance evolution of the GPT-5 series, characterized by its industry-leading 1.05-million-token context window. This model is specifically engineered to handle expansive datasets, such as massive code repositories or multi-year historical logs, without losing the ability to perform high-fidelity reasoning. A standout feature is the interactive "Mid-Response Steering," which allows users to visually monitor and adjust the model's thinking plan in real-time, ensuring the output perfectly aligns with complex, multi-step intents.

Unified Intelligence and Autonomous Action

Technically, GPT-5.4 unifies the world-class coding strengths of the previous Codex-specific branches with the creative nuances of the standard GPT-5 series. It features a specialized "Thinking" mode with adjustable effort levels (Standard, Extended, and Heavy) that utilizes reinforced chain-of-thought processing to solve PhD-level science and logic problems. Beyond text, GPT-5.4 introduces native computer use capabilities, achieving a 75% score on OSWorld-Verified tasks by interpreting high-fidelity visual screenshots and executing coordinate-based clicks.

Efficiency and Reliability

OpenAI reports a significant 33% decrease in claim-level errors compared to its predecessors, making GPT-5.4 a premier choice for autonomous agents and high-stakes decision support. Despite its power, it is engineered for token and energy efficiency, allowing for cheaper long-context processing than previous iterations. Whether managing an entire enterprise codebase or acting as an autonomous scheduling agent, GPT-5.4 sets a new standard for reliability and agentic performance in the generative AI landscape.


Use Cases for GPT-5.4

Discover the different ways you can use GPT-5.4 to achieve great results.

Large Codebase Refactoring

Ingesting and analyzing hundreds of source files simultaneously to ensure cross-module consistency and identify deep semantic bugs across entire repositories.

Autonomous Agentic Scheduling

Interacting with email and calendars via visual grounding to autonomously coordinate complex event schedules and send follow-up communications.

High-Fidelity Architectural Design

Generating intricate 3D scenes and structural plans, such as functional subway stations, using over 1,000 lines of precise, simulation-ready code.

Long-Horizon Scientific Planning

Utilizing Extreme Reasoning to solve PhD-level scientific problems and perform multi-step analysis requiring hours of consistent state management.

Cybersecurity Incident Investigation

Processing vast quantities of raw log data within a single 1.05M context session to autonomously identify, investigate, and report security breaches.
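Fitting raw logs into a single session still means respecting the context budget. The sketch below packs log files into one prompt until a token budget is reached; the 4-characters-per-token estimate is a rough heuristic (not OpenAI's tokenizer), and the `packLogs` helper and file shape are illustrative assumptions.

```javascript
// Rough heuristic: ~4 characters per token. Not the real tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Pack log files into one prompt without overflowing the token budget.
function packLogs(logFiles, budgetTokens) {
  const included = [];
  let used = 0;
  for (const file of logFiles) {
    const cost = estimateTokens(file.content);
    if (used + cost > budgetTokens) break; // stop before exceeding the window
    included.push(`--- ${file.name} ---\n${file.content}`);
    used += cost;
  }
  return { prompt: included.join('\n'), tokensUsed: used };
}

// Reserve headroom below the 1.05M window for the question and the reply.
const BUDGET = 1_050_000 - 130_000;
```

The resulting `prompt` string would then be sent as the user message in a single long-context request.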

Interactive Mid-Response Steering

Correcting the model's course during the internal 'thinking' phase to adjust architectural choices or logic paths without needing to restart the prompt.

Strengths

Frontier 1.05M Context Window: Provides industry-leading capacity to reason over massive datasets and codebases in a single prompt without immediate loss of coherence.
Extreme Reasoning Accuracy: Achieves PhD-level science knowledge (84.2% on GPQA) and perfect math scores (100% on AIME 2025) using its high-effort reasoning mode.
Autonomous UI Interaction: State-of-the-art visual grounding allows the model to interact with software and browsers with 75% accuracy on the OSWorld benchmark.
Token and Energy Efficiency: Engineered as OpenAI's most efficient frontier model yet, reducing the energy cost required for complex reasoning compared to the GPT-5.2 release.

Limitations

Long Context Degradation: Performance on high-complexity reasoning tasks is noted to drop significantly once the context window exceeds the 256K token mark.
Confusing Versioning Scheme: The complex lineup of 5.1, 5.2 Thinking, 5.3 Codex, and 5.4 variants creates significant cognitive load for API developers and Chat users.
High Latency in Heavy Mode: The highest reasoning effort modes can take over 8 minutes to process internal CoT, making them unsuitable for real-time interactive tasks.
Neurotic Alignment: Aggressive safety fine-tuning can lead to contrarian behavior where the model unnecessarily contradicts the user on harmless factual topics.

API Quick Start

openai/gpt-5.4

View Documentation
OpenAI SDK (Node.js)
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function main() {
  // Stream a chat completion at the highest reasoning effort level.
  const completion = await openai.chat.completions.create({
    model: "gpt-5.4",
    messages: [{ role: "user", content: "Analyze this 1.05M token log file for security threats." }],
    reasoning_effort: "heavy",
    stream: true,
  });

  // Print tokens as they arrive.
  for await (const chunk of completion) {
    process.stdout.write(chunk.choices[0]?.delta?.content || '');
  }
}

main();

Install the SDK and start making API calls in minutes.

What People Are Saying About GPT-5.4

See what the community thinks about GPT-5.4

GPT-5 is making a brutally-crushing comeback... every single line of code it generated was fully working.
immortalsol
reddit
The marquee feature is obviously the 1M context window, compared to the ~200k other models support.
Developer
hackernews
Wow, GPT 5.4 is insanely good. It should be a step bump 6.0. Hard to believe Codex has come this far.
Rahul Sood
twitter
GPT-5.4 extra high scores 94.0 on NYT Connections. It just gets stuff right, first try.
senko
hackernews
GPT-5.4 is now on the Artificial Analysis Intelligence Index... Tied with Gemini 3.1 Pro.
AiBattle
twitter
The reasoning depth is finally at the level where it can handle enterprise-scale architectural problems.
CloudArchitect99
reddit

Videos About GPT-5.4

Watch tutorials, reviews, and discussions about GPT-5.4

A 1,050,000-token context window... this is a very long context window.

In 5 minutes and 22 seconds of thinking, we then received our result... it did test this more in an agentic manner.

Updating the ability of this to look at high-fidelity images... up to 10.24 million total pixels.

The model actually performs research across the web to verify its own logic.

This is a massive leap for agentic workflows where state needs to persist.

GPT 5.4 has everything... they basically said, okay, 5.2 and GPT 5.3 Codex. Go ahead, have a baby.

The coding capabilities are ridiculous. It's essentially flawless.

Front-end taste is far behind Opus 4.6 and Gemini 3.1 Pro.

It feels like it has a much better understanding of nuanced developer intent.

The price point is competitive considering the 1M token window size.

It's clearly putting pressure on OpenAI to respond with a model that matches that 1 million context capability.

In a single shot, the fact that this model is able to create this Minecraft clone is just remarkable.

We are seeing a 33 percent reduction in factual hallucination rates.

The reasoning modes are categorized into Standard, Extended, and Heavy levels.

The visual grounding on the OSWorld benchmark is just industry leading right now.


Pro Tips for GPT-5.4

Expert tips to help you get the most out of GPT-5.4 and achieve better results.

Toggle Reasoning Effort

Use Standard, Extended, or Heavy reasoning efforts depending on the task's complexity to balance computational cost and output quality.
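One way to apply this tip programmatically is a small dispatcher that scores task complexity and picks an effort level. The scoring heuristic and the `pickReasoningEffort` helper below are illustrative assumptions; only the three effort names come from the model's documented modes.

```javascript
// Sketch: choose a reasoning effort per request to balance cost and quality.
// The complexity scoring here is an illustrative heuristic, not an API feature.
function pickReasoningEffort(task) {
  const score =
    (task.steps ?? 1) +            // how many reasoning steps the task needs
    (task.crossFileRefs ? 2 : 0) + // multi-file / cross-module analysis
    (task.phdLevel ? 3 : 0);       // expert-level science or math
  if (score <= 2) return "standard"; // lookups, formatting, short answers
  if (score <= 4) return "extended"; // multi-step analysis
  return "heavy";                    // deep proofs, large refactors
}
```

The chosen value would be passed as `reasoning_effort` in the request shown in the Quick Start above.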

Monitor the Upfront Plan

When using the Thinking variant, watch the upfront plan; you can intervene mid-generation if the model's proposed logic path seems flawed.

Strategic Prompt Caching

Place large, static context blocks at the beginning of your prompt to take advantage of OpenAI's automatic prompt caching for cost savings.
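A minimal sketch of that ordering principle: keep the large static block at the front so the prefix is byte-identical across calls, and append only the dynamic part at the end. The `buildCachedMessages` helper is a hypothetical name; only the ordering rule comes from the tip above.

```javascript
// Sketch: order messages so the static prefix stays identical across calls,
// which is what makes automatic prompt caching effective.
function buildCachedMessages(staticContext, question) {
  return [
    // Large, unchanging prefix — eligible for prefix-based caching.
    { role: "system", content: staticContext },
    // Dynamic part last, so it never invalidates the cached prefix.
    { role: "user", content: question },
  ];
}
```

Each new question reuses the same `staticContext` string unmodified, so repeated calls share the cached prefix.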

Manage Context Stability

While the 1.05M window is robust, performance is reported to be most stable within the first 256K tokens; keep critical summaries near the prompt end.
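The sketch below applies this tip: it places the critical summary near the end of the prompt and flags requests that exceed the reported 256K stability range. The chars/4 token estimate is a rough heuristic, and `assemblePrompt` is a hypothetical helper, not an SDK function.

```javascript
// Threshold taken from the stability note above.
const STABLE_TOKENS = 256_000;
// Rough heuristic: ~4 characters per token. Not the real tokenizer.
const roughTokens = (s) => Math.ceil(s.length / 4);

// Assemble a long-context prompt with the critical summary near the end,
// and flag prompts that exceed the reported stable range.
function assemblePrompt(bulkContext, criticalSummary, question) {
  const total =
    roughTokens(bulkContext) + roughTokens(criticalSummary) + roughTokens(question);
  return {
    prompt: `${bulkContext}\n\n${criticalSummary}\n\n${question}`,
    beyondStableRange: total > STABLE_TOKENS, // degraded-accuracy risk flag
  };
}
```

When `beyondStableRange` is true, a caller might summarize or trim the bulk context before sending the request.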

Testimonials

What Our Users Say

Join thousands of satisfied users who have transformed their workflow

Jonathan Kogan

Co-Founder/CEO, rpatools.io

Automatio is one of the most used for RPA Tools both internally and externally. It saves us countless hours of work and we realized this could do the same for other startups and so we choose Automatio for most of our automation needs.

Mohammed Ibrahim

CEO, qannas.pro

I have used many tools over the past 5 years, Automatio is the Jack of All trades.. !! it could be your scraping bot in the morning and then it becomes your VA by the noon and in the evening it does your automations.. its amazing!

Ben Bressington

CTO, AiChatSolutions

Automatio is fantastic and simple to use to extract data from any website. This allowed me to replace a developer and do tasks myself as they only take a few minutes to setup and forget about it. Automatio is a game changer!

Sarah Chen

Head of Growth, ScaleUp Labs

We've tried dozens of automation tools, but Automatio stands out for its flexibility and ease of use. Our team productivity increased by 40% within the first month of adoption.

David Park

Founder, DataDriven.io

The AI-powered features in Automatio are incredible. It understands context and adapts to changes in websites automatically. No more broken scrapers!

Emily Rodriguez

Marketing Director, GrowthMetrics

Automatio transformed our lead generation process. What used to take our team days now happens automatically in minutes. The ROI is incredible.


Related AI Models

xai

Grok-3

xAI

Grok-3 is xAI's flagship reasoning model, featuring deep logic deduction, a 128k context window, and real-time integration with X for live research and coding.

128K context
$3.00/$15.00/1M
anthropic

Claude 3.7 Sonnet

Anthropic

Claude 3.7 Sonnet is Anthropic's first hybrid reasoning model, delivering state-of-the-art coding capabilities, a 200k context window, and visible thinking.

200K context
$3.00/$15.00/1M
anthropic

Claude Sonnet 4.5

Anthropic

Anthropic's Claude Sonnet 4.5 delivers world-leading coding (77.2% SWE-bench) and a 200K context window, optimized for the next generation of autonomous agents.

200K context
$3.00/$15.00/1M
zhipu

GLM-4.7

Zhipu (GLM)

GLM-4.7 by Zhipu AI is a flagship 358B MoE model featuring a 200K context window, elite 73.8% SWE-bench performance, and native Deep Thinking for agentic...

200K context
$0.60/$2.20/1M
google

Gemini 3.1 Flash-Lite

Google

Gemini 3.1 Flash-Lite is Google's fastest, most cost-efficient model. Features 1M context, native multimodality, and 363 tokens/sec speed for scale.

1M context
$0.25/$1.50/1M
anthropic

Claude Opus 4.5

Anthropic

Claude Opus 4.5 is Anthropic's most powerful frontier model, delivering record-breaking 80.9% SWE-bench performance and advanced autonomous agency for coding.

200K context
$5.00/$25.00/1M
openai

GPT-5.3 Codex

OpenAI

GPT-5.3 Codex is OpenAI's 2026 frontier coding agent, featuring a 400K context window, 77.3% Terminal-Bench score, and superior logic for complex software...

400K context
$1.75/$14.00/1M
xai

Grok-4

xAI

Grok-4 by xAI is a frontier model featuring a 2M token context window, real-time X platform integration, and world-record reasoning capabilities.

2M context
$3.00/$15.00/1M

Frequently Asked Questions About GPT-5.4

Find answers to common questions about GPT-5.4