xai

Grok-4

Grok-4 by xAI is a frontier model featuring a 2M-token context window, real-time X platform integration, and record-setting reasoning benchmark results.

xAI · Grok · Released July 9, 2025
Context: 2.0M tokens
Max Output: 8K tokens
Input Price: $3.00 / 1M tokens
Output Price: $15.00 / 1M tokens
Modality: Text, Image
Capabilities: Vision, Tools, Streaming, Reasoning
Benchmarks
GPQA
87.5%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Grok-4 scored 87.5% on this benchmark.
HLE
44.4%
HLE: Humanity's Last Exam. A frontier benchmark of expert-written questions spanning a wide range of specialized academic domains, designed to remain difficult even for models that saturate older tests. Grok-4 scored 44.4% on this benchmark.
MMLU
94%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Grok-4 scored 94% on this benchmark.
MMLU Pro
81.2%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Grok-4 scored 81.2% on this benchmark.
SimpleQA
48%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and reduces hallucinations in knowledge retrieval tasks. Grok-4 scored 48% on this benchmark.
IFEval
89.2%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Grok-4 scored 89.2% on this benchmark.
AIME 2025
100%
AIME 2025: American Invitational Math Exam. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Grok-4 scored 100% on this benchmark.
MATH
92%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Grok-4 scored 92% on this benchmark.
GSM8k
98.4%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Grok-4 scored 98.4% on this benchmark.
MGSM
92.1%
MGSM: Multilingual Grade School Math. The GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Grok-4 scored 92.1% on this benchmark.
MathVista
72.4%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. Grok-4 scored 72.4% on this benchmark.
SWE-Bench
81%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Grok-4 scored 81% on this benchmark.
HumanEval
88%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Grok-4 scored 88% on this benchmark.
LiveCodeBench
79.4%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Grok-4 scored 79.4% on this benchmark.
MMMU
75%
MMMU: Multimodal Understanding. Massive Multi-discipline Multimodal Understanding benchmark testing vision-language models on college-level problems across 30 subjects requiring both image understanding and expert knowledge. Grok-4 scored 75% on this benchmark.
MMMU Pro
59.2%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. Grok-4 scored 59.2% on this benchmark.
ChartQA
90.5%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. Grok-4 scored 90.5% on this benchmark.
DocVQA
93.2%
DocVQA: Document Visual Q&A. Document Visual Question Answering benchmark testing the ability to extract and reason about information from document images including forms, reports, and scanned text. Grok-4 scored 93.2% on this benchmark.
Terminal-Bench
54.2%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Grok-4 scored 54.2% on this benchmark.
ARC-AGI
15.9%
ARC-AGI: Abstraction & Reasoning. Abstraction and Reasoning Corpus for AGI - tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Grok-4 scored 15.9% on this benchmark.

Try Grok-4 Free

Chat with Grok-4 for free. Test its capabilities, ask questions, and explore what this AI model can do.


About Grok-4

Learn about Grok-4's capabilities, features, and how it can help you achieve better results.

Overview

Grok-4 is the latest frontier AI model from xAI, designed to be a truth-seeking assistant with real-time access to the X platform. Built on the Colossus supercomputer cluster with over 200,000 GPUs, it represents a massive leap in reasoning, mathematical problem-solving, and coding capabilities. It features a unified dual-mode architecture, allowing users to switch between a deep-thinking reasoning mode for complex puzzles and a high-velocity mode for immediate responses.
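
As a rough illustration of this dual-mode workflow, the sketch below routes prompts to either mode using the same @xai/sdk client shown in the API Quick Start section further down. The fast-mode model identifier is an assumption for illustration, not a confirmed xAI model ID; check xAI's documentation for the identifiers your account exposes.

import { xAI } from '@xai/sdk';

const client = new xAI({ apiKey: process.env.XAI_API_KEY });

// Route hard problems to the deep-thinking mode and quick lookups to the
// high-velocity mode. 'grok-4-fast' below is a hypothetical identifier.
async function ask(prompt: string, deepThinking: boolean): Promise<string> {
  const response = await client.chat.completions.create({
    model: deepThinking ? 'grok-4' : 'grok-4-fast',
    messages: [{ role: 'user', content: prompt }],
  });
  return response.choices[0]?.message?.content ?? '';
}

async function main() {
  // Complex proof: accept the reasoning-mode latency for a careful answer.
  console.log(await ask('Prove that the product of two odd integers is odd.', true));
  // Simple lookup: prefer the immediate-response mode.
  console.log(await ask('What does GPQA stand for?', false));
}

main();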

Technical Innovations

This generational jump in compute has enabled PhD-level performance across all academic disciplines simultaneously. The model is uniquely characterized by its anti-woke alignment strategy, prioritizing objective information over standard safety guardrails. Its massive 2-million-token context window and integration into the Musk ecosystem, including X and Tesla vehicles, provide a distinct competitive moat. While it excels in STEM and technical reasoning, it remains highly efficient for everyday creative tasks and real-time news analysis.

Performance Philosophy

Grok-4 prioritizes first-principles thinking and objective data synthesis. By utilizing the Quasarflux reasoning engine, it can navigate multi-step logical chains that typically derail traditional LLMs. This makes it an essential tool for developers and researchers who require high-fidelity outputs in high-stakes environments where factual accuracy is non-negotiable.


Use Cases for Grok-4

Discover the different ways you can use Grok-4 to achieve great results.

Graduate-Level STEM Research

Utilizing the Thinking mode to solve PhD-level physics problems and verify complex mathematical proofs.

Massive Repository Debugging

Leveraging the 2M context window to ingest entire codebases and identify subtle race conditions; a sketch of this workflow follows the list below.

Real-Time Financial Intelligence

Monitoring the X Firehose to analyze market sentiment and breaking news for trading insights.

Autonomous Agent Workflows

Powering complex agentic tasks through robust function calling for logistics and automation.

Multi-Modal Legal Analysis

Reviewing thousands of pages of discovery documents while analyzing scanned evidentiary photos.

Advanced Academic Tutoring

Providing personalized, first-principles-based tutoring in STEM subjects adapted to student progress.
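
The repository-debugging use case above can be sketched roughly as follows: walk the repository, concatenate source files into one large prompt, and ask for a concurrency audit. This is an illustrative sketch, not an official workflow; the directory path, file extensions, and prompt framing are placeholders, and the SDK usage mirrors the quick-start example below.

import { xAI } from '@xai/sdk';
import { promises as fs } from 'fs';
import path from 'path';

const client = new xAI({ apiKey: process.env.XAI_API_KEY });

// Recursively collect source files so a small-to-medium repository can be
// placed into a single large-context prompt.
async function collectSources(dir: string, exts = ['.ts', '.js', '.py']): Promise<string> {
  let out = '';
  for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory() && entry.name !== 'node_modules' && entry.name !== '.git') {
      out += await collectSources(full, exts);
    } else if (exts.includes(path.extname(entry.name))) {
      out += `\n\n// FILE: ${full}\n` + (await fs.readFile(full, 'utf8'));
    }
  }
  return out;
}

async function main() {
  const codebase = await collectSources('./my-repo'); // hypothetical repository path
  const response = await client.chat.completions.create({
    model: 'grok-4',
    messages: [
      { role: 'system', content: 'You are a senior engineer auditing for concurrency bugs.' },
      { role: 'user', content: `Identify potential race conditions in this codebase:\n${codebase}` },
    ],
  });
  console.log(response.choices[0]?.message?.content);
}

main();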

Strengths

Unmatched Math & Logic: Achieved a world-record 100% score on the AIME 2025, making it the premier choice for technical tasks.
Market-Leading Context: The 2-million-token window allows for the analysis of roughly 1,500 pages of text in a single prompt.
Live Data Pipeline: Exclusive access to the X platform's real-time data stream keeps responses current on global events.
Emotional Intelligence: High performance on EQ-Bench3 indicates a superior ability to understand nuanced human emotions.

Limitations

Spiky Basic Logic: Despite acing graduate exams, the model can occasionally fail at trivial tasks like counting letters in a word.
High Entry Barrier: Access to the full-power Grok-4 Heavy model and reasoning capabilities requires a premium subscription.
Creative Nuance Gaps: It lags behind Claude 4.5 in creative storytelling, often adopting a more utilitarian or edgy tone.
Image Generation Consistency: Internal tools struggle with maintaining visual consistency across multiple panels.

API Quick Start

xAI SDK (TypeScript)
import { xAI } from '@xai/sdk';

// Read the API key from the environment rather than hard-coding it.
const client = new xAI({
  apiKey: process.env.XAI_API_KEY,
});

async function main() {
  // Request a streamed chat completion from grok-4.
  const response = await client.chat.completions.create({
    model: 'grok-4',
    messages: [{ role: 'user', content: 'Analyze the latest news about xAI from the Firehose.' }],
    stream: true,
  });

  // Print each streamed token as it arrives.
  for await (const chunk of response) {
    process.stdout.write(chunk.choices[0]?.delta?.content || '');
  }
}

main();

Install the SDK and start making API calls in minutes.
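
For the agent workflows described under Use Cases, tool definitions can be sent alongside the messages. The sketch below assumes the SDK follows the OpenAI-style tools/tool_calls convention and uses a hypothetical get_shipment_status function purely for illustration; verify the exact request shape against xAI's documentation.

import { xAI } from '@xai/sdk';

const client = new xAI({ apiKey: process.env.XAI_API_KEY });

// A single hypothetical tool the model may call; the schema format assumes
// the OpenAI-compatible "tools" convention.
const tools = [
  {
    type: 'function',
    function: {
      name: 'get_shipment_status',
      description: 'Look up the current status of a shipment by tracking ID',
      parameters: {
        type: 'object',
        properties: { trackingId: { type: 'string' } },
        required: ['trackingId'],
      },
    },
  },
];

async function main() {
  const response = await client.chat.completions.create({
    model: 'grok-4',
    messages: [{ role: 'user', content: 'Where is shipment TRK-1042?' }],
    tools,
  });

  // If the model decided to call the tool, log the requested call so the
  // application can execute it and return the result in a follow-up message.
  const call = response.choices[0]?.message?.tool_calls?.[0];
  if (call) {
    console.log('Model requested:', call.function.name, call.function.arguments);
  }
}

main();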

What People Are Saying About Grok-4

See what the community thinks about Grok-4

"Grok 4 is officially schooling the competition... proving xAI has built a model that thinks like a predator."
Mario Nawfal
x/twitter
"Grok 4 is a benchmark-slaying, PhD-level genius that occasionally can't count. The duality is wild."
Beginning-Willow-801
reddit
"The jump to 2 million tokens isn't just a gimmick; it fundamentally changes repository debugging."
AI Tech Reviews
youtube
"Grok 4 is clearly the best model in terms of general comprehension, far ahead of GPT-5."
YMist_
reddit
"Usage will spike with Grok 4.20. It's coming out in 3 or 4 weeks."
Elon Musk
x/twitter
"The real-time X integration is the only thing keeping my research relevant in this news cycle."
DataScientist_Alpha
hackernews

Videos About Grok-4

Watch tutorials, reviews, and discussions about Grok-4

The number of words in this response is exactly 43... Super impressive.

Not only was it able to solve the Tower of Hanoi in its chain of thought, but it actually proved it and visualized it with code.

I love this answer. To the point, direct. No sugar coating at all.

The reasoning capabilities here are clearly a step above what we saw in the previous generation.

It's finally a model that doesn't feel like it's holding back on the truth to be polite.

The experimental thinking toggle for Grok was recently removed... leading to characterization as potentially antiquated.

Grok OS was the least impressive, featuring a basic white background and broken icons.

In terms of raw knowledge retrieval, Grok-4 is consistently hitting the mark where GPT-5 misses.

The latency in the reasoning mode is higher, but the quality of the output justifies the wait.

If you are in the Musk ecosystem, the integration here is a massive productivity multiplier.

Nobody wants a super fast model if it can't solve the logic. I can tell you that for free, boys.

I would give this a minus one out of 10... Complete trash. Can't even build a simple Next.js website.

The speed is there, but if the logic is broken, what is the point of the tokens per second?

It feels like they rushed the coder variant just to hit the release cycle.

Stick to the standard reasoning model if you actually want something that works.

More than just prompts

Supercharge your workflow with AI Automation

Automatio combines the power of AI agents, web automation, and smart integrations to help you accomplish more in less time.

AI Agents
Web Automation
Smart Workflows

Pro Tips

Expert tips to help you get the most out of this model and achieve better results.

Mode Toggling

Use Quasarflux mode for complex logic and Tensor mode for speed to optimize cost and performance.

Real-Time Queries

Explicitly prompt for trending topics on X to leverage the live data pipeline and bypass training cutoffs.

STEM Focus

Prioritize Grok for graduate-level math where it significantly outperforms competitors on zero-shot tasks.

Verify Basic Logic

Double-check simple counting or list ordering, as the model can be inconsistent on trivial tasks; a quick local check like the sketch below catches these slips.
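
A minimal sketch of that local check: the counting logic runs deterministically in your own code, independent of the model, and the model's claimed answer here is a hypothetical value for illustration.

// Cross-check a trivial claim (e.g. a letter count) locally instead of
// trusting the model's answer.
function countLetter(word: string, letter: string): number {
  return [...word.toLowerCase()].filter((c) => c === letter.toLowerCase()).length;
}

const modelClaim = 2;                          // hypothetical answer from the model
const actual = countLetter('strawberry', 'r'); // deterministic local count (3)
if (modelClaim !== actual) {
  console.warn(`Model said ${modelClaim}, local count is ${actual}; re-prompt or correct.`);
}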

Testimonials

What Our Users Say

Join thousands of satisfied users who have transformed their workflow

Jonathan Kogan
Co-Founder/CEO, rpatools.io

Automatio is one of the most used for RPA Tools both internally and externally. It saves us countless hours of work and we realized this could do the same for other startups and so we choose Automatio for most of our automation needs.

Mohammed Ibrahim
CEO, qannas.pro

I have used many tools over the past 5 years, Automatio is the Jack of All trades.. !! it could be your scraping bot in the morning and then it becomes your VA by the noon and in the evening it does your automations.. its amazing!

Ben Bressington
CTO, AiChatSolutions

Automatio is fantastic and simple to use to extract data from any website. This allowed me to replace a developer and do tasks myself as they only take a few minutes to setup and forget about it. Automatio is a game changer!

Sarah Chen
Head of Growth, ScaleUp Labs

We've tried dozens of automation tools, but Automatio stands out for its flexibility and ease of use. Our team productivity increased by 40% within the first month of adoption.

David Park
Founder, DataDriven.io

The AI-powered features in Automatio are incredible. It understands context and adapts to changes in websites automatically. No more broken scrapers!

Emily Rodriguez
Marketing Director, GrowthMetrics

Automatio transformed our lead generation process. What used to take our team days now happens automatically in minutes. The ROI is incredible.


Related AI Models

Claude Opus 4.5 (Anthropic)
Claude Opus 4.5 is Anthropic's most powerful frontier model, delivering record-breaking 80.9% SWE-bench performance and advanced autonomous agency for coding.
200K context · $5.00 input / $25.00 output per 1M tokens

GPT-5.1 (OpenAI)
GPT-5.1 is OpenAI's advanced reasoning flagship featuring adaptive thinking, native multimodality, and state-of-the-art performance in math and technical...
400K context · $1.25 input / $10.00 output per 1M tokens

GLM-4.7 (Zhipu AI)
GLM-4.7 by Zhipu AI is a flagship 358B MoE model featuring a 200K context window, elite 73.8% SWE-bench performance, and native Deep Thinking for agentic...
200K context · $0.60 input / $2.20 output per 1M tokens

Claude 3.7 Sonnet (Anthropic)
Claude 3.7 Sonnet is Anthropic's first hybrid reasoning model, delivering state-of-the-art coding capabilities, a 200K context window, and visible thinking.
200K context · $3.00 input / $15.00 output per 1M tokens

Gemini 3 Flash (Google)
Gemini 3 Flash is Google's high-speed multimodal model featuring a 1M token context window, elite 90.4% GPQA reasoning, and autonomous browser automation tools.
1M context · $0.50 input / $3.00 output per 1M tokens

Grok-3 (xAI)
Grok-3 is xAI's flagship reasoning model, featuring deep logic deduction, a 128K context window, and real-time integration with X for live research and coding.
128K context · $3.00 input / $15.00 output per 1M tokens

Gemini 3 Pro (Google)
Google's Gemini 3 Pro is a multimodal powerhouse featuring a 1M token context window, native video processing, and industry-leading reasoning performance.
1M context · $2.00 input / $12.00 output per 1M tokens

Claude Sonnet 4.5 (Anthropic)
Anthropic's Claude Sonnet 4.5 delivers world-leading coding (77.2% SWE-bench) and a 200K context window, optimized for the next generation of autonomous agents.
200K context · $3.00 input / $15.00 output per 1M tokens

Frequently Asked Questions

Find answers to common questions about this model