xai

Grok-3

Grok-3 is xAI's flagship reasoning model, featuring deep logical deduction, a 128K context window, and real-time integration with X for live research and coding.

xAI | Grok | February 17, 2025
Context: 128K tokens
Max Output: 8K tokens
Input Price: $3.00 / 1M tokens
Output Price: $15.00 / 1M tokens
Modality: Text, Image
Capabilities: Vision, Tools, Streaming, Reasoning
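
The listed prices translate into a per-request cost with simple arithmetic. The sketch below assumes the rates shown on this page ($3.00 per 1M input tokens, $15.00 per 1M output tokens). The constant and function names are illustrative and are not part of any xAI SDK.

```typescript
// Rates as listed on this page, in USD per 1M tokens.
const INPUT_USD_PER_MILLION = 3.0;
const OUTPUT_USD_PER_MILLION = 15.0;

// Hypothetical helper: estimates the cost of a single request in USD.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  const inputCost = (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION;
  return inputCost + outputCost;
}

// A 100,000-token prompt with a 2,000-token reply:
// 0.1 * $3.00 + 0.002 * $15.00 = $0.30 + $0.03 = $0.33
console.log(estimateCostUsd(100_000, 2_000).toFixed(2)); // prints "0.33"
```

Because output tokens cost five times as much as input tokens at these rates, long generated answers tend to dominate the bill more than long prompts do.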
Benchmarks
GPQA
84.6%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Grok-3 scored 84.6% on this benchmark.
HLE
36%
HLE: Humanity's Last Exam. Tests a model's ability to demonstrate expert-level reasoning across specialized domains. Evaluates deep understanding of complex topics that require professional-level knowledge. Grok-3 scored 36% on this benchmark.
MMLU
87.5%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Grok-3 scored 87.5% on this benchmark.
MMLU Pro
76.5%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Grok-3 scored 76.5% on this benchmark.
SimpleQA
42%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and the tendency to hallucinate in knowledge retrieval tasks. Grok-3 scored 42% on this benchmark.
IFEval
91.2%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Grok-3 scored 91.2% on this benchmark.
AIME 2025
93.3%
AIME 2025: American Invitational Mathematics Examination. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Grok-3 scored 93.3% on this benchmark.
MATH
94.4%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Grok-3 scored 94.4% on this benchmark.
GSM8k
98.7%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Grok-3 scored 98.7% on this benchmark.
MGSM
92.4%
MGSM: Multilingual Grade School Math. A set of 250 GSM8k problems translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Grok-3 scored 92.4% on this benchmark.
MathVista
71.3%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. Grok-3 scored 71.3% on this benchmark.
SWE-Bench
49%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Grok-3 scored 49% on this benchmark.
HumanEval
94.5%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Grok-3 scored 94.5% on this benchmark.
LiveCodeBench
79.4%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Grok-3 scored 79.4% on this benchmark.
MMMU
78%
MMMU: Massive Multi-discipline Multimodal Understanding. Tests vision-language models on college-level problems across 30 subjects that require both image understanding and expert knowledge. Grok-3 scored 78% on this benchmark.
MMMU Pro
58.5%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. Grok-3 scored 58.5% on this benchmark.
ChartQA
89.2%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. Grok-3 scored 89.2% on this benchmark.
DocVQA
92.4%
DocVQA: Document Visual Question Answering. Tests the ability to extract and reason about information from document images, including forms, reports, and scanned text. Grok-3 scored 92.4% on this benchmark.
Terminal-Bench
52%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Grok-3 scored 52% on this benchmark.
ARC-AGI
12.5%
ARC-AGI: Abstraction and Reasoning Corpus for AGI. Tests fluid intelligence through novel pattern-recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Grok-3 scored 12.5% on this benchmark.

About Grok-3

Learn about Grok-3's capabilities, features, and how it can help you achieve better results.

Frontier Reasoning and Intelligence

Grok-3 was trained on xAI's Colossus supercomputing cluster using over 100,000 NVIDIA H100 GPUs. It is built to excel at complex logic, mathematical deduction, and high-stakes software engineering. Unlike traditional models that prioritize rapid response generation, Grok-3 features a specialized Think mode that spends additional test-time compute verifying its own reasoning steps before delivering a final answer.

Real-Time Knowledge Integration

A core differentiator of Grok-3 is its access to the X platform's real-time data stream. This allows the model to synthesize breaking news, financial shifts, and global trends within seconds, whereas other models rely on knowledge cutoffs or slower web search tools. This real-time awareness, paired with a 128,000-token context window, makes it a useful tool for market researchers and data scientists who need up-to-the-minute insights.
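
To make the 128,000-token window concrete, the following sketch checks whether a prompt is likely to fit while leaving room for the reply. It relies on two assumptions that do not come from this page: that roughly 4 characters correspond to one token for English text, and that the reply shares the same window as the prompt. The 8,000-token reserve approximates the listed 8K max output. All names are illustrative.

```typescript
const CONTEXT_WINDOW_TOKENS = 128_000; // context window as listed on this page
const MAX_OUTPUT_TOKENS = 8_000;       // approximation of the listed 8K max output

// Rough heuristic (assumption): about 4 characters per token for English text.
// A real tokenizer will give different counts, especially for code or other languages.
function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns true when the estimated prompt size plus the reserved reply budget
// stays within the context window.
function fitsInContext(prompt: string, reservedOutput: number = MAX_OUTPUT_TOKENS): boolean {
  return roughTokenCount(prompt) + reservedOutput <= CONTEXT_WINDOW_TOKENS;
}

console.log(fitsInContext("Summarize today's market news."));
```

A check like this is only a pre-flight estimate; the token usage reported by the API response is the authoritative figure.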

Multimodal and Agentic Capabilities

Beyond text and logic, Grok-3 is a powerful multimodal vision model capable of interpreting complex technical diagrams, blueprints, and visual data with frontier-level precision. It supports advanced function calling and tool use, enabling it to act as the cognitive engine for autonomous agents. With a score of 94.5% on HumanEval, it currently stands as one of the most capable coding assistants available, rivaling or exceeding competitors in autonomous debugging and architectural refactoring.
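
As an illustration of function calling, the sketch below defines one tool in the OpenAI-compatible `tools` format and a local dispatcher that executes a tool call returned by the model. It assumes the endpoint accepts this format. The `word_count` tool, the `handlers` table, and `runToolCall` are hypothetical names invented for this example.

```typescript
// Shape of a tool call as returned in an OpenAI-compatible chat completion.
type ToolCall = {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments arrive as a JSON string
};

// Tool definition that would be passed as the `tools` field of the request.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "word_count", // hypothetical tool for illustration
      description: "Count the words in a piece of text.",
      parameters: {
        type: "object",
        properties: { text: { type: "string" } },
        required: ["text"],
      },
    },
  },
];

// Local implementations, keyed by tool name.
const handlers: Record<string, (args: { text: string }) => string> = {
  word_count: (args) => String(args.text.trim().split(/\s+/).filter(Boolean).length),
};

// Executes one tool call and builds the `tool` message to send back to the model.
function runToolCall(call: ToolCall) {
  const handler = handlers[call.function.name];
  if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
  const args = JSON.parse(call.function.arguments);
  return { role: "tool" as const, tool_call_id: call.id, content: handler(args) };
}
```

In an agent loop, the returned message would be appended to the conversation and sent back in a follow-up request so the model can use the tool result.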

Use Cases for Grok-3

Discover the different ways you can use Grok-3 to achieve great results.

Advanced Software Engineering

Solving complex architectural problems and refactoring entire codebases with deep reasoning and 94.5% HumanEval accuracy.

Real-Time Market Intelligence

Leveraging live X data to synthesize breaking financial news and consumer sentiment faster than traditional search engines.

Scientific Data Synthesis

Processing thousands of pages of academic journals in Deep Research mode to identify novel research connections and hypotheses.

Multimodal Document Analysis

Interpreting complex technical diagrams, blueprints, and financial charts using frontier-level vision capabilities.

Competition-Level Tutoring

Breaking down complex Olympiad-level math and physics problems into digestible, verified steps using Think mode.

Agentic Workflow Automation

Acting as a core engine for autonomous agents that require precise function calling and tool use in production environments.

Strengths

Superior Reasoning: Outperforms leading competitors on complex math benchmarks like AIME 2025 (93.3%) and MATH (94.4%).
Integrated Deep Research: Features a unique web-search capability that synthesizes live X data significantly faster than rivals.
Elite Coding Performance: Scores 94.5% on HumanEval, making it a top-tier choice for autonomous software development and debugging.
Transparent Thinking Traces: Allows users to see the model's logic step-by-step, increasing trust and making complex errors easier to debug.

Limitations

High Latency in Thinking Mode: Complex reasoning prompts can take over 60 seconds to generate a verified response in Think mode.
No Native Video or Audio: Lacks the real-time multimodal audio and video processing found in competitors like Gemini 2.0.
Strict Usage Quotas: Message limits for Premium+ subscribers are currently lower than some established competitors during peak hours.
Beta Stability Issues: Users may occasionally encounter server errors or truncated thinking traces during high-traffic periods.
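
Occasional server errors of the kind listed under Limitations are commonly handled on the client with retries and exponential backoff. The helper below is a minimal sketch of that pattern. `withRetry` and its default values are assumptions for illustration, not part of any xAI or OpenAI SDK.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff.
// This sketch retries on any error; production code would usually retry only
// on transient failures such as HTTP 429 or 5xx responses.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait 500 ms, then 1000 ms, then 2000 ms, and so on (with the default base).
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A request would be wrapped as `withRetry(() => client.chat.completions.create({ ... }))`, where `client` stands for an already configured SDK instance. Given the long Think-mode response times noted above, a generous request timeout is also worth considering.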

API Quick Start

The example below uses the OpenAI SDK for JavaScript pointed at the xAI API endpoint, with the model ID grok-3.
import OpenAI from "openai";

// Point the OpenAI client at the xAI endpoint and read the key from the environment.
const xai = new OpenAI({
  apiKey: process.env.XAI_API_KEY,
  baseURL: "https://api.x.ai/v1"
});

// Request a streamed chat completion from Grok-3.
const response = await xai.chat.completions.create({
  model: "grok-3",
  messages: [{ role: "user", content: "Analyze current X trends for AGI." }],
  stream: true
});

// Print each text fragment as soon as it arrives.
for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Install the SDK (npm install openai), set the XAI_API_KEY environment variable, and start making API calls in minutes.
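
When the complete reply is needed as a single string rather than printed incrementally, the streamed chunks can be accumulated. The helper below is a sketch that works on any async iterable of chunks shaped like the ones in the quick start. `StreamChunk` and `collectStream` are illustrative names, not SDK exports.

```typescript
// Minimal structural type covering the fields read from each streamed chunk.
type StreamChunk = {
  choices: Array<{ delta?: { content?: string | null } }>;
};

// Concatenates the text fragments of a streamed completion into one string.
async function collectStream(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    // Some chunks carry no text (for example role or finish markers); skip those.
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  return text;
}
```

With the quick start code, this would be called as `const fullText = await collectStream(response);` in place of the printing loop.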

What People Are Saying About Grok-3

See what the community thinks about Grok-3

"Grok-3's deep research is significantly faster and more accurate than OpenAI's version"
TechEnthusiast, X
"The coding performance is absolutely mental; it fixed a bug I'd been stuck on for hours in seconds"
DevLife, Reddit
"Grok-3 is arguably the most cutting edge reasoning model available today"
DataCamp, YouTube
"The thinking traces look a lot like DeepSeek but the speed is on another level"
AIResearcher, Hacker News
"The vision capabilities on technical blueprints are finally usable for real engineering work"
EngDesign, Reddit
"X integration gives it a huge edge for anyone tracking real-time crypto or stock sentiment"
FinancePro, X

Videos About Grok-3

Watch tutorials, reviews, and discussions about Grok-3

Grok 3 is arguably the most cutting edge reasoning model available today

It had way better quality output than the OpenAI deep search function

The speed of the deep research mode is quite impressive compared to O1

You can see the model really crunching through multiple search results simultaneously

This is a significant jump from Grok-2 in terms of logical consistency

Grok 3 and Grok 3 mini are better than all published reasoning models

Logic leans towards the shove... this is the most human-like reasoning I've ever seen

The internal thinking trace provides a much clearer view of the logic

It doesn't just guess; it checks its work, which is the hallmark of System 2 thinking

The math performance on AIME benchmarks is truly state-of-the-art

On those benchmarks you can see that Grok 3 actually performs quite well across the board

Compared to other competitors, it's pretty promising

The coding performance is the real story here, rivaling the best in the industry

It handles architectural refactoring tasks that previous versions failed on

The integration with the X API makes it uniquely powerful for current events

Pro Tips

Expert tips to help you get the most out of this model and achieve better results.

Toggle Think Mode

Enable Think mode for math or logic tasks so the model spends test-time compute on step-by-step verification.

Utilize X Integration

Use specific queries about breaking news or current events to get data that other LLMs cannot access due to knowledge cutoffs.

Inspect Traces

Review the internal thinking traces to identify exactly where the model is spending its compute and verify its logical path.

Vision for UI

Upload screenshots of UI designs and ask Grok to generate corresponding React or Tailwind code for rapid front-end prototyping.
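
For API use, a screenshot is typically sent as an image part next to the text instruction. The sketch below builds such a message in the OpenAI-compatible content-parts format. It assumes the endpoint and model accept `image_url` parts with a base64 data URL. `buildScreenshotMessage` is a hypothetical helper name.

```typescript
// Content parts in the OpenAI-compatible chat format.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

type UserMessage = { role: "user"; content: ContentPart[] };

// Hypothetical helper: pairs a base64-encoded PNG screenshot with an instruction.
function buildScreenshotMessage(base64Png: string, instruction: string): UserMessage {
  return {
    role: "user",
    content: [
      { type: "image_url", image_url: { url: `data:image/png;base64,${base64Png}` } },
      { type: "text", text: instruction },
    ],
  };
}
```

The resulting object would go into the `messages` array of a chat completion request, for example with the instruction "Generate a React component with Tailwind classes matching this design."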

Related AI Models

Claude 3.7 Sonnet (Anthropic)

Claude 3.7 Sonnet is Anthropic's first hybrid reasoning model, delivering state-of-the-art coding capabilities, a 200k context window, and visible thinking.

200K context
$3.00 / $15.00 per 1M tokens (input / output)

GLM-4.7 (Zhipu)

GLM-4.7 by Zhipu AI is a flagship 358B MoE model featuring a 200K context window, elite 73.8% SWE-bench performance, and native Deep Thinking for agentic...

200K context
$0.60 / $2.20 per 1M tokens (input / output)

Claude Sonnet 4.5 (Anthropic)

Anthropic's Claude Sonnet 4.5 delivers world-leading coding (77.2% SWE-bench) and a 200K context window, optimized for the next generation of autonomous agents.

200K context
$3.00 / $15.00 per 1M tokens (input / output)

Claude Opus 4.5 (Anthropic)

Claude Opus 4.5 is Anthropic's most powerful frontier model, delivering record-breaking 80.9% SWE-bench performance and advanced autonomous agency for coding.

200K context
$5.00 / $25.00 per 1M tokens (input / output)

Grok-4 (xAI)

Grok-4 by xAI is a frontier model featuring a 2M token context window, real-time X platform integration, and world-record reasoning capabilities.

2M context
$3.00 / $15.00 per 1M tokens (input / output)

GPT-5.1 (OpenAI)

GPT-5.1 is OpenAI’s advanced reasoning flagship featuring adaptive thinking, native multimodality, and state-of-the-art performance in math and technical...

400K context
$1.25 / $10.00 per 1M tokens (input / output)

Gemini 3 Flash (Google)

Gemini 3 Flash is Google's high-speed multimodal model featuring a 1M token context window, elite 90.4% GPQA reasoning, and autonomous browser automation tools.

1M context
$0.50 / $3.00 per 1M tokens (input / output)

Gemini 3 Pro (Google)

Google's Gemini 3 Pro is a multimodal powerhouse featuring a 1M token context window, native video processing, and industry-leading reasoning performance.

1M context
$2.00 / $12.00 per 1M tokens (input / output)

Frequently Asked Questions

Find answers to common questions about this model