Grok-3

Grok-3 is xAI's flagship reasoning model, featuring multi-step logical deduction, a 1 million token context window, and real-time integration with X for live research and coding.

xAI | Grok | February 17, 2025
Context: 1.0M tokens
Max Output: 4K tokens
Input Price: $3.00 / 1M tokens
Output Price: $15.00 / 1M tokens
Modality: Text, Image
Capabilities: Vision, Tools, Streaming, Reasoning
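
The input and output prices listed above translate into a per-request cost with simple arithmetic. The following is a minimal sketch, assuming the listed rates apply uniformly with no caching or tiered discounts; `estimateCostUsd` is a hypothetical helper name and not part of any SDK.

```javascript
// Listed rates from this page, in USD per 1M tokens.
const INPUT_USD_PER_MILLION = 3.0;
const OUTPUT_USD_PER_MILLION = 15.0;

// Hypothetical helper: estimates the cost of one request in USD.
function estimateCostUsd(inputTokens, outputTokens) {
  const inputCost = (inputTokens / 1_000_000) * INPUT_USD_PER_MILLION;
  const outputCost = (outputTokens / 1_000_000) * OUTPUT_USD_PER_MILLION;
  return inputCost + outputCost;
}

// Example: a 200,000-token prompt with a 4,000-token reply.
console.log(estimateCostUsd(200_000, 4_000).toFixed(2)); // prints "0.66"
```

Because output tokens cost five times as much as input tokens at these rates, long replies tend to dominate the bill more than long prompts of similar size.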
Benchmarks
GPQA
93.6%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Grok-3 scored 93.6% on this benchmark.
HLE
45%
HLE: Humanity's Last Exam. Tests a model's ability to demonstrate expert-level reasoning across specialized domains. Evaluates deep understanding of complex topics that require professional-level knowledge. Grok-3 scored 45% on this benchmark.
MMLU
92.7%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Grok-3 scored 92.7% on this benchmark.
MMLU Pro
80.6%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Grok-3 scored 80.6% on this benchmark.
IFEval
95%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Grok-3 scored 95% on this benchmark.
AIME 2025
100%
AIME 2025: American Invitational Mathematics Examination. Competition-level mathematics problems from the AIME, an exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Grok-3 scored 100% on this benchmark.
MATH
89.3%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Grok-3 scored 89.3% on this benchmark.
GSM8k
89.3%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Grok-3 scored 89.3% on this benchmark.
MGSM
90%
MGSM: Multilingual Grade School Math. The GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Grok-3 scored 90% on this benchmark.
MathVista
78%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. Grok-3 scored 78% on this benchmark.
SWE-Bench
83.9%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Grok-3 scored 83.9% on this benchmark.
HumanEval
86.5%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Grok-3 scored 86.5% on this benchmark.
LiveCodeBench
66.7%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Grok-3 scored 66.7% on this benchmark.
MMMU
78%
MMMU: Multimodal Understanding. Massive Multi-discipline Multimodal Understanding benchmark testing vision-language models on college-level problems across 30 subjects requiring both image understanding and expert knowledge. Grok-3 scored 78% on this benchmark.
MMMU Pro
65%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. Grok-3 scored 65% on this benchmark.
ChartQA
86%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. Grok-3 scored 86% on this benchmark.
DocVQA
93%
DocVQA: Document Visual Q&A. Document Visual Question Answering benchmark testing the ability to extract and reason about information from document images including forms, reports, and scanned text. Grok-3 scored 93% on this benchmark.
Terminal-Bench
63.5%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Grok-3 scored 63.5% on this benchmark.
ARC-AGI
71.8%
ARC-AGI: Abstraction & Reasoning. Abstraction and Reasoning Corpus for AGI - tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Grok-3 scored 71.8% on this benchmark.

About Grok-3

Learn about Grok-3's capabilities, features, and how it can help you achieve better results.

Frontier Reasoning and Intelligence

Grok-3 is xAI's flagship frontier model, representing a significant leap in computational scale and logic. Trained on the Colossus supercomputer cluster with over 100,000 NVIDIA H100 GPUs, it handles complex mathematical and scientific challenges. The model features a specialized reasoning mode that uses additional computation to verify its own logic before providing a final response.

Real-Time Knowledge Integration

A primary differentiator is its native integration with the X platform. This allows Grok-3 to access breaking news, financial shifts, and global trends with lower latency than models reliant on standard web crawling. Paired with a 1 million token context window, it enables researchers to synthesize massive amounts of up-to-the-minute data.
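
Before sending a large corpus, it can help to check roughly whether it fits in the context window. The sketch below assumes the 1 million token figure quoted on this page and a crude 4-characters-per-token heuristic, which is not Grok's actual tokenizer; `estimateTokens` and `selectDocsThatFit` are hypothetical helper names.

```javascript
const CONTEXT_WINDOW_TOKENS = 1_000_000; // figure quoted on this page
const CHARS_PER_TOKEN = 4; // rough heuristic for English text (assumption)

// Crude token estimate; a real tokenizer will give different counts.
function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Keeps documents in order until the next one would exceed the budget,
// leaving room for the model's reply.
function selectDocsThatFit(docs, reservedForOutput = 4_096) {
  const budget = CONTEXT_WINDOW_TOKENS - reservedForOutput;
  const selected = [];
  let used = 0;
  for (const doc of docs) {
    const cost = estimateTokens(doc);
    if (used + cost > budget) break;
    selected.push(doc);
    used += cost;
  }
  return { selected, used };
}
```

The estimate is deliberately conservative in spirit only; for billing or hard limits, token counts reported by the API itself are the safer source.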

Multimodal and Agentic Capabilities

Beyond text, Grok-3 is a powerful vision model capable of interpreting technical diagrams, blueprints, and visual data. It supports advanced function calling for use in autonomous agents. With a score of 83.9% on SWE-Bench Verified, it is one of the most capable models for resolving real-world software engineering issues.
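
Function calling in an agent loop generally means declaring tools, letting the model request them, and returning the results as messages. The sketch below assumes the OpenAI-compatible `tools` and `tool_calls` message shapes; the `get_stock_price` tool and its stubbed data are invented for illustration.

```javascript
// Tool schema in the OpenAI-compatible "tools" format.
// get_stock_price is a made-up tool for illustration.
const tools = [
  {
    type: "function",
    function: {
      name: "get_stock_price",
      description: "Look up the latest price for a ticker symbol.",
      parameters: {
        type: "object",
        properties: { ticker: { type: "string" } },
        required: ["ticker"],
      },
    },
  },
];

// Local implementations keyed by tool name (stubbed data, not a real feed).
const handlers = new Map([
  ["get_stock_price", ({ ticker }) => ({ ticker, price: 123.45 })],
]);

// Turns the tool calls on an assistant message into "tool" role messages.
function runToolCalls(assistantMessage) {
  return (assistantMessage.tool_calls ?? []).map((call) => {
    const handler = handlers.get(call.function.name);
    const args = JSON.parse(call.function.arguments);
    const result = handler ? handler(args) : { error: "unknown tool" };
    return {
      role: "tool",
      tool_call_id: call.id,
      content: JSON.stringify(result),
    };
  });
}
```

In a full loop, `tools` would be passed alongside `messages` in the chat completion request, and the messages returned by `runToolCalls` would be appended to the conversation before asking the model to continue.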

Use Cases

Discover the different ways you can use Grok-3 to achieve great results.

Real-time Market Analysis

Uses live X data to analyze financial sentiment and breaking news for investors.

PhD-level Science Research

Solves graduate-level STEM problems and analyzes dense literature with reasoning modes.

Competitive Software Engineering

Generates production-grade code and resolves GitHub issues with high accuracy.

Complex Mathematical Proofs

Utilizes test-time compute to solve olympiad-level math requiring multi-step deduction.

Technical Document Interpretation

Analyzes blueprints and technical manuals through its multimodal vision system.

Autonomous Agent Logic

Serves as the cognitive core for agents requiring high-fidelity planning and tool use.

Strengths

Olympiad-Level Reasoning: Achieved a perfect 100% score on the AIME 2025 math benchmark using its Deep Thinking mode.
Massive Context Capacity: Offers a 1 million token context window, enabling the ingestion of entire libraries or software projects.
Unrivaled Real-Time Data: Direct integration with X provides the freshest data stream of any AI model currently available.
High Coding Precision: Scored 83.9% on SWE-Bench Verified, outperforming major competitors in resolving complex GitHub issues.

Limitations

High Environmental Footprint: Training required 200,000 GPUs and consumes approximately 150 MW of power, raising sustainability concerns.
Premium API Pricing: At $15 per million output tokens, it is significantly more expensive than smaller frontier alternatives.
Output Token Limits: Responses are generally capped at 4,096 tokens, which may truncate extremely long reports or code files.
Access Restrictions: Full model capabilities and API keys are often restricted to X Premium Plus subscribers or specific regions.
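
The output token limit noted above can cut a long report or code file short. One way to notice this is to check the completion's `finish_reason`, which is `"length"` in OpenAI-compatible responses when the cap was hit. The sketch below assumes that response shape; `wasTruncated` and `buildContinuation` are hypothetical helpers, and the follow-up prompt wording is only an example.

```javascript
// Returns true when any choice stopped because it hit the token cap.
function wasTruncated(completion) {
  return completion.choices.some((choice) => choice.finish_reason === "length");
}

// Builds the message list for a follow-up request that asks the model
// to pick up where the truncated reply stopped.
function buildContinuation(messages, completion) {
  const partial = completion.choices[0].message;
  return [
    ...messages,
    { role: "assistant", content: partial.content },
    { role: "user", content: "Continue exactly where you left off." },
  ];
}
```

A caller would typically loop: request, check `wasTruncated`, and if true send `buildContinuation(...)` as the next request, concatenating the partial replies.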

API Quick Start

xai/grok-3

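OpenAI SDK for Node.js (npm install openai), pointed at xAI's OpenAI-compatible endpoint:

import OpenAI from "openai";

// Reads the key from the XAI_API_KEY environment variable.
const client = new OpenAI({
  apiKey: process.env.XAI_API_KEY,
  baseURL: "https://api.x.ai/v1",
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "grok-3",
    messages: [{ role: "user", content: "Analyze the current market sentiment for Nvidia on X." }],
  });

  // Print the text of the first returned choice.
  console.log(completion.choices[0].message.content);
}

// Surface API or network errors instead of leaving an unhandled rejection.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});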

Install the SDK and start making API calls in minutes.
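
Streaming is listed among the model's capabilities, and with the OpenAI SDK a streamed response arrives as an async iterable of chunks. The following is a small sketch, assuming the OpenAI-compatible chunk shape (`choices[0].delta.content`); `collectStream` is a hypothetical helper, and the commented usage lines assume a configured `client` and a `messages` array.

```javascript
// Concatenates the text deltas from a stream of chat completion chunks,
// calling onDelta for each piece as it arrives.
async function collectStream(stream, onDelta = () => {}) {
  let text = "";
  for await (const chunk of stream) {
    // Some chunks carry no text (for example role-only or empty deltas).
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) {
      onDelta(delta);
      text += delta;
    }
  }
  return text;
}

// Usage with the SDK (not executed here):
// const stream = await client.chat.completions.create({
//   model: "grok-3", messages, stream: true,
// });
// const fullText = await collectStream(stream, (d) => process.stdout.write(d));
```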

Community Feedback

See what the community thinks about Grok-3

"Grok-3 [is] the best AI model for traders and investors due to its real-time sentiment analysis." (Austin Starks, Reddit)

"It managed to solve some hard HVM code completion prompts that Gemini and Sonnet failed. I feel a level of 'quality' that is higher than Sonnet-3.5." (Victor Taelin, Twitter)

"The speed is so damn fast. Reasoning, real-time info, just seems like the fastest flagship model out there right now." (Matthew Berman, YouTube)

"Grok has real-time data access and a willingness to go places other models will not, making it the 'edgy' choice for power users." (Beginning-Willow-801, Reddit)

"Grok-3 performance on GPQA is remarkable. It is definitely competing for the top spot in reasoning." (EpochAIResearch, Twitter)

"The 1M context window actually works. It handled my entire legacy codebase without losing context on the initial prompts." (DevGuru42, Hacker News)

Related Videos

Watch tutorials, reviews, and discussions about Grok-3

Introduction to Grok-3 and its training scale.

The model is built for intelligence and truth-seeking.

Grok 3 reasoning... seems to be beating both the OpenAI o1 and DeepSeek R1 models on scientific benchmarks.

Benchmark performance on MMLU shows it's a top-tier model.

Grok 3 will actually also attempt to solve unsolved problems... while other models will simply state that it's unsolved.

Elon Musk claims this is the most powerful AI to date.

Grok 3 has now claimed the top spot in this blind test, making it the reigning champion in the Chatbot Arena.

The integration with X provides a distinct advantage in recency.

Multimodal capabilities are significantly improved over Grok-2.

The most powerful version of Grok and the latest version will be the web version at grok.com.

Exploring the technical architecture of the Colossus cluster.

Discussion on the massive 100k H100 GPU training run.

Big brain is a feature which is truly unique to Grok 3... it allows users to use multiple reasoning agents to solve complex problems.

The development of Grok 3 was accelerated by xAI's Colossus supercomputer, which utilized 100,000 Nvidia H100 GPUs in Phase 1.

Final thoughts on why Grok-3 is a major step forward for open-weights-style transparency.

Pro Tips

Expert tips to help you get the most out of Grok-3 and achieve better results.

Leverage Deep Search

Use deep search for queries regarding news from the last hour for the most accurate results.

Enable High Reasoning

Specify the reasoning effort as high for math puzzles to trigger self-verification steps.

Utilize the Collections API

Upload sensitive documents to the Collections API to keep your data out of training loops.
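
For the reasoning-effort tip above, the setting would be passed as a field on the chat completion request. The sketch below only builds the request body and makes no network call. It assumes the field is named `reasoning_effort` and accepts `"low"` or `"high"`; whether a given Grok model accepts this field should be confirmed against xAI's documentation. `buildReasoningRequest` is a hypothetical helper.

```javascript
// Assumed allowed values; confirm against xAI's documentation.
const ALLOWED_EFFORTS = ["low", "high"];

// Hypothetical helper: builds a chat completion request body that asks
// for a given reasoning effort.
function buildReasoningRequest(prompt, effort = "high", model = "grok-3") {
  if (!ALLOWED_EFFORTS.includes(effort)) {
    throw new Error(`Unsupported reasoning effort: ${effort}`);
  }
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    reasoning_effort: effort, // assumed parameter name
  };
}

// The returned object could be passed to client.chat.completions.create(...).
const request = buildReasoningRequest("Prove that the sum of two odd integers is even.");
console.log(request.reasoning_effort); // prints "high"
```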

Related AI Models

Gemini 3.1 Flash Live Preview (Google)
Gemini 3.1 Flash Live Preview is Google's ultra-low-latency, audio-to-audio model featuring a 131K context window, high-fidelity multimodal reasoning, and...
131K context | $0.75 / $4.50 per 1M tokens

GPT-5.2 Pro (OpenAI)
GPT-5.2 Pro is OpenAI's 2025 flagship reasoning model featuring Extended Thinking for SOTA performance in mathematics, coding, and expert knowledge work.
400K context | $21.00 / $168.00 per 1M tokens

Gemini 3.1 Pro (Google)
Gemini 3.1 Pro is Google's elite multimodal model featuring the DeepThink reasoning engine, a 1M+ context window, and industry-leading ARC-AGI logic scores.
1M context | $2.00 / $12.00 per 1M tokens

Gemini 3 Pro (Google)
Google's Gemini 3 Pro is a multimodal powerhouse featuring a 1M token context window, native video processing, and industry-leading reasoning performance.
1M context | $2.00 / $12.00 per 1M tokens

Claude Opus 4.6 (Anthropic)
Claude Opus 4.6 is Anthropic's flagship model featuring a 1M token context window, Adaptive Thinking, and world-class coding and reasoning performance.
1M context | $5.00 / $25.00 per 1M tokens

Gemini 3 Flash (Google)
Gemini 3 Flash is Google's high-speed multimodal model featuring a 1M token context window, elite 90.4% GPQA reasoning, and autonomous browser automation tools.
1M context | $0.50 / $3.00 per 1M tokens

Claude Sonnet 4.6 (Anthropic)
Claude Sonnet 4.6 offers frontier performance for coding and computer use with a massive 1M token context window for only $3/1M tokens.
1M context | $3.00 / $15.00 per 1M tokens

Qwen3.5-397B-A17B (Alibaba)
Qwen3.5-397B-A17B is Alibaba's flagship open-weight MoE model. It features native multimodal reasoning, a 1M context window, and a 19x decoding throughput...
1M context | $0.40 / $2.40 per 1M tokens

Frequently Asked Questions

Find answers to common questions about Grok-3