Grok-4

Grok-4 by xAI is a frontier model featuring a 2M token context window, real-time X platform integration, and world-record reasoning capabilities.

xAI | Grok | July 9, 2025
Context: 2.0M tokens
Max Output: 4K tokens
Input Price: $3.00 / 1M tokens
Output Price: $15.00 / 1M tokens
Modality: Text, Image, Audio
Capabilities: Vision, Tools, Streaming, Reasoning
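
The prices listed above translate into a per-request cost with simple arithmetic. The sketch below is illustrative only: the helper name `estimateCostUSD` and the example token counts are assumptions, and it applies the flat $3.00 and $15.00 per 1M token rates from this page without accounting for any caching or tiered pricing that may apply.

```typescript
// Rates taken from the listing above (USD per 1M tokens).
const INPUT_USD_PER_MILLION = 3.0;
const OUTPUT_USD_PER_MILLION = 15.0;

// Hypothetical helper: estimates the cost of a single request in USD.
function estimateCostUSD(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION;
  return inputCost + outputCost;
}

// Example: a 200,000-token prompt with a 4,000-token reply.
console.log(estimateCostUSD(200_000, 4_000).toFixed(2)); // 0.66
```

At these rates the prompt dominates the bill for long-context work: filling the full 2.0M token window once would cost about $6.00 in input tokens alone.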
Benchmarks
GPQA
87.5%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Grok-4 scored 87.5% on this benchmark.
HLE
51%
HLE: Humanity's Last Exam. A benchmark of expert-written questions spanning many specialized domains. Evaluates deep understanding of complex topics that require professional-level knowledge. Grok-4 scored 51% on this benchmark.
MMLU
92.1%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Grok-4 scored 92.1% on this benchmark.
MMLU Pro
83%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Grok-4 scored 83% on this benchmark.
SimpleQA
42%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and reduces hallucinations in knowledge retrieval tasks. Grok-4 scored 42% on this benchmark.
IFEval
95%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Grok-4 scored 95% on this benchmark.
AIME 2025
100%
AIME 2025: American Invitational Math Exam. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Grok-4 scored 100% on this benchmark.
MATH
94%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Grok-4 scored 94% on this benchmark.
GSM8k
99%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Grok-4 scored 99% on this benchmark.
MGSM
98%
MGSM: Multilingual Grade School Math. A set of 250 GSM8k problems translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Grok-4 scored 98% on this benchmark.
MathVista
78%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. Grok-4 scored 78% on this benchmark.
SWE-Bench
75%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Grok-4 scored 75% on this benchmark.
HumanEval
97%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Grok-4 scored 97% on this benchmark.
LiveCodeBench
79%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Grok-4 scored 79% on this benchmark.
MMMU
87.5%
MMMU: Massive Multi-discipline Multimodal Understanding. Tests vision-language models on college-level problems across 30 subjects that require both image understanding and expert knowledge. Grok-4 scored 87.5% on this benchmark.
MMMU Pro
65%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. Grok-4 scored 65% on this benchmark.
ChartQA
93%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. Grok-4 scored 93% on this benchmark.
DocVQA
95%
DocVQA: Document Visual Question Answering. Tests the ability to extract and reason about information from document images, including forms, reports, and scanned text. Grok-4 scored 95% on this benchmark.
Terminal-Bench
60%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Grok-4 scored 60% on this benchmark.
ARC-AGI
15.9%
ARC-AGI: Abstraction and Reasoning Corpus for AGI. Tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Grok-4 scored 15.9% on this benchmark.

About Grok-4

Learn about Grok-4's capabilities, features, and how it can help you achieve better results.

Model Overview

Grok-4 is xAI's frontier multimodal model, built to prioritize first-principles reasoning and real-time information retrieval. Native integration with the X social media platform gives it a competitive edge: it can analyze live global conversations and news as they happen. It was trained on xAI's Colossus supercomputer and posts strong results across mathematical and technical benchmarks, including 100% on AIME 2025 and 94% on MATH.

Technical Capabilities

The architecture supports a 2-million-token context window in its reasoning variants, enough to process very large codebases and dense technical documentation in a single request without chunking. It offers two modes: a high-velocity mode for quick interactions and a deep-thinking mode for multi-step logical tasks. In its Heavy configuration, the model employs a multi-agent consensus mechanism, which is reported to hold the hallucination rate to roughly 4%.
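
This page does not describe how the Heavy consensus mechanism works internally, so the following is only a generic sketch of one way several independent answers can be reconciled: a simple majority vote. The function name `majorityVote` and the normalization step are assumptions made for illustration, not Grok-4's actual implementation.

```typescript
// Generic majority vote over candidate answers from independent runs.
// Answers are trimmed and lowercased before counting; ties go to the answer seen first.
function majorityVote(answers: string[]): string | undefined {
  const counts = new Map<string, number>();
  for (const raw of answers) {
    const key = raw.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  let best: string | undefined;
  let bestCount = 0;
  // Map iterates in insertion order, so a strict ">" keeps the earliest answer on ties.
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  }
  return best;
}

console.log(majorityVote(["42", " 42 ", "41", "42"])); // 42
```

A vote like this trades latency for reliability, since every extra candidate is another full model run. That is consistent with the Heavy mode latency noted elsewhere on this page.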

Ecosystem Integration

Beyond simple text generation, Grok-4 is designed for native tool use and complex function calling. It supports image and audio processing, making it a versatile choice for developers building multimodal applications. Its alignment strategy focuses on objective truth-seeking rather than standard industry safety guardrails. This results in fewer refusals for controversial or edgy topics compared to other frontier models.
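
To make the function-calling flow concrete, here is a minimal sketch in the OpenAI-compatible chat completions format. The tool `get_temperature`, its stubbed handler, and the helper `runToolCall` are hypothetical and exist only for illustration; the interfaces cover just the fields used here.

```typescript
// Shape of a tool definition in the OpenAI-compatible chat completions format.
interface ToolDefinition {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>; // JSON Schema describing the arguments
  };
}

// Hypothetical tool: look up the current temperature for a city.
const getTemperatureTool: ToolDefinition = {
  type: "function",
  function: {
    name: "get_temperature",
    description: "Return the current temperature in Celsius for a city.",
    parameters: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"],
    },
  },
};

// Fields of a tool call, as found on an assistant message, that this sketch reads.
interface ToolCall {
  id: string;
  function: { name: string; arguments: string }; // arguments is a JSON string
}

// Local handlers keyed by tool name (stubbed data, no network access).
const handlers: Record<string, (args: any) => unknown> = {
  get_temperature: (args: { city: string }) => ({ city: args.city, celsius: 21 }),
};

// Runs the matching handler and builds the "tool" message to send back to the model.
function runToolCall(call: ToolCall) {
  const handler = handlers[call.function.name];
  if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
  const result = handler(JSON.parse(call.function.arguments));
  return { role: "tool" as const, tool_call_id: call.id, content: JSON.stringify(result) };
}
```

In a real request, `getTemperatureTool` would be passed in the `tools` array of the chat completions call, and the message returned by `runToolCall` would be appended to `messages` before asking the model to continue.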

Use Cases

Discover the different ways you can use Grok-4 to achieve great results.

Real-Time Sentiment Analysis

Analyzes live posts on X to determine public reaction to breaking news or product launches.

Large-Scale Repository Auditing

Evaluates entire software repositories using the 2M token window to find architectural flaws.

Olympiad-Level Math Solving

Provides step-by-step solutions for complex mathematical proofs and AIME-level problems.

Unfiltered Creative Content

Generates character-driven scripts and humor without the restrictive filters of other AI providers.

Scientific Research Synthesis

Summarizes multiple PhD-level academic papers simultaneously while maintaining technical accuracy.

Technical Debugging

Identifies obscure bugs in production code and suggests fixes based on current best practices.
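
For the repository auditing use case above, it helps to check whether a set of files fits the context window before sending it. The sketch below is a rough pre-flight estimate only: the 4-characters-per-token ratio is a common rule of thumb and an assumption here (real tokenizers vary with content), and the names `estimateTokens` and `selectFilesWithinBudget` are hypothetical.

```typescript
// Assumption: roughly 4 characters per token; real tokenizers vary by content.
const APPROX_CHARS_PER_TOKEN = 4;
const CONTEXT_WINDOW_TOKENS = 2_000_000; // context figure quoted on this page

interface SourceFile {
  path: string;
  content: string;
}

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Walks the files in order and keeps each one that still fits in the budget
// (context window minus tokens reserved for the model's reply).
function selectFilesWithinBudget(files: SourceFile[], reservedForOutput = 4_000) {
  const budget = CONTEXT_WINDOW_TOKENS - reservedForOutput;
  const selected: SourceFile[] = [];
  let used = 0;
  for (const file of files) {
    const cost = estimateTokens(file.content);
    if (used + cost > budget) continue; // skip files that do not fit
    selected.push(file);
    used += cost;
  }
  return { selected, usedTokens: used, remainingTokens: budget - used };
}

const result = selectFilesWithinBudget([
  { path: "src/index.ts", content: "export const x = 1;\n" },
]);
console.log(result.usedTokens); // 5
```

Sorting the input by importance before calling the helper decides which files survive when the repository is larger than the window.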

Strengths

Elite Mathematical Reasoning: Achieved a perfect 100% on the AIME 2025 benchmark, outclassing most frontier models in logic.
Industry-Leading Context: The 2M token window allows for unprecedented depth in document analysis and large-scale coding projects.
Live Social Intelligence: Direct access to the X platform provides real-time information that static training data cannot replicate.
Low Refusal Rate: A more permissive safety architecture allows for honest, objective dialogue on controversial subjects.

Limitations

Heavy Mode Latency: The multi-agent reasoning mode can take several minutes to produce a single high-accuracy response.
Incomplete Video Support: While text and image capabilities are top-tier, native frame-by-frame video processing is not yet available.
Restricted Regional Access: Persistent memory features are currently disabled in the European Union due to regulatory requirements.
Vision Precision Limits: Creators acknowledge the model remains partially blind when interpreting extremely high-fidelity visual details.

API Quick Start

xai/grok-4

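The example uses the OpenAI JavaScript SDK pointed at the xAI endpoint.

import OpenAI from "openai";

// The xAI API is OpenAI-compatible, so the OpenAI SDK works with a custom baseURL.
const grok = new OpenAI({
  apiKey: process.env.XAI_API_KEY, // read the key from the environment
  baseURL: "https://api.x.ai/v1",
});

async function main() {
  // stream: true returns an async iterable of partial chunks.
  const completion = await grok.chat.completions.create({
    model: "grok-4",
    messages: [{ role: "user", content: "Search X for the latest news on SpaceX." }],
    stream: true,
  });

  // Print each text delta as it arrives; some chunks carry no content.
  for await (const chunk of completion) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

// Report request errors instead of leaving an unhandled rejection.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});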

Install the SDK and start making API calls in minutes.
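
When the full reply is needed as a single string rather than printed piece by piece, the streaming loop from the quick start can be wrapped in a small helper. This is a sketch under assumptions: `collectStream` and `fakeStream` are hypothetical names, the `StreamChunk` interface covers only the fields read here, and the stand-in generator replaces a real API call so the example runs offline.

```typescript
// Minimal shape of a streamed chunk, covering only the fields read below.
interface StreamChunk {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Concatenates the text deltas from a stream into one string.
async function collectStream(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  return text;
}

// Stand-in stream for demonstration; a real one would come from the SDK call with stream: true.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
  yield { choices: [{ delta: { content: "Hello" } }] };
  yield { choices: [{ delta: {} }] }; // chunks without content are skipped
  yield { choices: [{ delta: { content: ", world" } }] };
}

collectStream(fakeStream()).then((text) => console.log(text)); // Hello, world
```

Because the helper accepts any async iterable with this shape, the same function can be unit-tested with a fake stream and then handed the stream returned by the SDK.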

Community Feedback

See what the community thinks about Grok-4

"Grok 4 fast has a 2M token window!!! Why have we been struggling and settling with ChatGPT I really don't know anymore."
myfuturewifee, Reddit

"15.88% on the ARC-AGI v2 private subset is insane. Grok 4 is the first model to break that 10% barrier in months."
Greg (ARC-AGI Lead), Twitter

"The multi-agent study group approach in Grok 4 Heavy is the right way to use test-time compute. It actually finds the trick to the problem."
Tony_xAI, Twitter

"Grok 4: 79 on LiveCodeBench... benchmarks don't tell you how it feels to code with a model, but this feels trustworthy."
thankzr3ddit, Reddit

"The model is postgraduate like PhD level in everything. It's scarily smart and faster than any human can learn."
Elon Musk, YouTube

"The real-time search isn't just scraping headlines; it's analyzing content across multiple sources."
BitBiasedAI, YouTube

Related Videos

Watch tutorials, reviews, and discussions about Grok-4

Grok 4 heavy is for more logic and reasoning intensive tasks, while regular Grok 4 handles others.

It completely accurately tracked my hand and fingers to draw on the screen.

Grok 4 found the password I hid deep in the context window after only 15 seconds of thinking.

The accuracy on the 2 million token needle in a haystack test was 100%.

This model is finally a real alternative for those who found Gemini's context window unreliable.

Grok 4 is postgraduate like PhD level in everything, better than most PhDs.

Grok 4 Heavy spawns multiple agents in parallel... it's like a study group.

It's on the API and has a 256k context length, with plans for much more.

The training on the Colossus cluster has given it a reasoning capability we haven't seen.

It's designed to be the most truth-seeking AI that currently exists.

Grok 4 Heavy runs up to 32 parallel AI models on your single prompt.

Think Mode spends additional computational time planning and catching potential errors before responding.

You can actually see the agents debating each other in the logs if you have API access.

The multimodal performance with audio is noticeably faster than the previous generation.

Pro Tips

Expert tips to help you get the most out of Grok-4 and achieve better results.

Use Search Keywords

Include specific hashtags or accounts in your prompt to direct the model's real-time X search.

Switch to Heavy Mode

Activate Grok-4 Heavy for tasks where accuracy is more critical than response speed.

Provide Detailed Personas

Leverage the permissive safety alignment by defining specific, edgy personas for creative writing.

Analyze External Links

Paste live URLs directly into the chat for the model to retrieve and summarize fresh web content.
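
The "Use Search Keywords" tip above can be applied programmatically by assembling the prompt from a topic plus explicit hashtags and accounts. The helper `buildSearchPrompt` and its wording are hypothetical, a sketch of one possible prompt template rather than a format the model requires.

```typescript
// Hypothetical helper: builds a prompt that names the hashtags and accounts to focus on.
function buildSearchPrompt(topic: string, hashtags: string[], accounts: string[]): string {
  // Add the "#" or "@" prefix only when the caller left it off.
  const tags = hashtags.map((t) => (t.startsWith("#") ? t : `#${t}`));
  const handles = accounts.map((a) => (a.startsWith("@") ? a : `@${a}`));

  const parts = [`Summarize recent posts on X about ${topic}.`];
  if (tags.length > 0) parts.push(`Focus on these hashtags: ${tags.join(", ")}.`);
  if (handles.length > 0) parts.push(`Prioritize posts from: ${handles.join(", ")}.`);
  return parts.join(" ");
}

console.log(buildSearchPrompt("Starship launches", ["Starship"], ["SpaceX"]));
// Summarize recent posts on X about Starship launches. Focus on these hashtags: #Starship. Prioritize posts from: @SpaceX.
```

The resulting string can be used as the `content` of the user message in the quick start request.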

Frequently Asked Questions

Find answers to common questions about Grok-4