
Kimi K2 Thinking

Kimi K2 Thinking is Moonshot AI's trillion-parameter reasoning model. It outperforms GPT-5 on HLE and can execute up to 300 sequential tool calls autonomously for...

Moonshot AI · Kimi K2 · Released November 6, 2025
Context: 256K tokens
Max Output: 16K tokens
Input Price: $0.60 / 1M tokens
Output Price: $2.50 / 1M tokens
Modality: Text
Capabilities: Tools, Streaming, Reasoning
Benchmarks
GPQA
84.5%
GPQA: Graduate-Level Science Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Kimi K2 Thinking scored 84.5% on this benchmark.
HLE
44.9%
HLE: Humanity's Last Exam. A frontier-difficulty benchmark of expert-written questions spanning a wide range of academic domains, built to remain challenging as models saturate older tests. Kimi K2 Thinking scored 44.9% on this benchmark.
MMLU
89.4%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Kimi K2 Thinking scored 89.4% on this benchmark.
MMLU Pro
84.6%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Kimi K2 Thinking scored 84.6% on this benchmark.
SimpleQA
48%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and reduces hallucinations in knowledge retrieval tasks. Kimi K2 Thinking scored 48% on this benchmark.
IFEval
88.3%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Kimi K2 Thinking scored 88.3% on this benchmark.
AIME 2025
94.5%
AIME 2025: American Invitational Math Exam. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Kimi K2 Thinking scored 94.5% on this benchmark.
MATH
94.1%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Kimi K2 Thinking scored 94.1% on this benchmark.
GSM8k
98.2%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Kimi K2 Thinking scored 98.2% on this benchmark.
MGSM
91.5%
MGSM: Multilingual Grade School Math. The GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Kimi K2 Thinking scored 91.5% on this benchmark.
MathVista
36.8%
MathVista: Mathematical Visual Reasoning. Tests the ability to solve math problems that involve visual elements like charts, graphs, geometry diagrams, and scientific figures. Combines visual understanding with mathematical reasoning. Kimi K2 Thinking scored 36.8% on this benchmark.
SWE-Bench
71.3%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Kimi K2 Thinking scored 71.3% on this benchmark.
HumanEval
99%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Kimi K2 Thinking scored 99% on this benchmark.
LiveCodeBench
83.1%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Kimi K2 Thinking scored 83.1% on this benchmark.
MMMU
65.8%
MMMU: Multimodal Understanding. Massive Multi-discipline Multimodal Understanding benchmark testing vision-language models on college-level problems across 30 subjects requiring both image understanding and expert knowledge. Kimi K2 Thinking scored 65.8% on this benchmark.
MMMU Pro
62.4%
MMMU Pro: MMMU Professional Edition. Enhanced version of MMMU with more challenging questions and stricter evaluation. Tests advanced multimodal reasoning at professional and expert levels. Kimi K2 Thinking scored 62.4% on this benchmark.
ChartQA
86.2%
ChartQA: Chart Question Answering. Tests the ability to understand and reason about information presented in charts and graphs. Requires extracting data, comparing values, and performing calculations from visual data representations. Kimi K2 Thinking scored 86.2% on this benchmark.
DocVQA
94.5%
DocVQA: Document Visual Q&A. Document Visual Question Answering benchmark testing the ability to extract and reason about information from document images including forms, reports, and scanned text. Kimi K2 Thinking scored 94.5% on this benchmark.
Terminal-Bench
47.1%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Kimi K2 Thinking scored 47.1% on this benchmark.
ARC-AGI
12.5%
ARC-AGI: Abstraction & Reasoning. Abstraction and Reasoning Corpus for AGI - tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Kimi K2 Thinking scored 12.5% on this benchmark.

About Kimi K2 Thinking

Learn about Kimi K2 Thinking's capabilities, features, and how it can help you achieve better results.

Trillion-Parameter Mixture of Experts

Kimi K2 Thinking is a trillion-parameter reasoning model built on a Mixture-of-Experts (MoE) architecture. Developed by Moonshot AI and released in late 2025, it activates only 32B parameters per token during inference, balancing massive knowledge capacity with computational efficiency. It is designed specifically as a thinking agent that scales its computation at inference time to solve complex logical problems, reflecting on its own reasoning and correcting mistakes before producing a final answer.

Agentic Tool Use and Planning

The model distinguishes itself through its capability to handle up to 300 sequential tool calls autonomously. While most standard language models struggle with long-horizon planning, K2 Thinking is engineered for agentic workflows such as autonomous web browsing and multi-step software engineering. It natively supports INT4 precision via Quantization-Aware Training, allowing the model to maintain frontier-level performance while running on standard enterprise hardware clusters.
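A sequential tool-call workflow like the one described above can be sketched against Moonshot's OpenAI-compatible endpoint. This is a minimal illustration, not the official agent harness: the `runTool` dispatcher, the `tools` schema argument, and the step cap are assumptions introduced here for clarity.

```javascript
// Sketch of a sequential tool-call loop. `runTool` (your tool dispatcher)
// and the step cap are illustrative assumptions, not part of the Moonshot API.
async function runAgent(client, userPrompt, tools, runTool, maxSteps = 300) {
  const messages = [{ role: 'user', content: userPrompt }];
  for (let step = 0; step < maxSteps; step++) {
    const response = await client.chat.completions.create({
      model: 'kimi-k2-thinking',
      messages,
      tools,
    });
    const msg = response.choices[0].message;
    messages.push(msg);
    // No tool calls means the model has produced its final answer.
    if (!msg.tool_calls || msg.tool_calls.length === 0) {
      return msg.content;
    }
    // Execute each requested tool and feed the result back to the model.
    for (const call of msg.tool_calls) {
      const result = await runTool(call.function.name, JSON.parse(call.function.arguments));
      messages.push({ role: 'tool', tool_call_id: call.id, content: JSON.stringify(result) });
    }
  }
  throw new Error(`Agent did not finish within ${maxSteps} steps`);
}
```

Because the client is injected, the same loop runs unchanged against the real SDK client or a stub during testing.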

Developer and Research Focus

With a 256K token context window, the model is built for deep research and complex technical tasks. It bridges the performance gap between closed-source systems and open-weights models. Its ability to solve PhD-level science questions and competitive math problems makes it a suitable choice for academic research, automated coding assistants, and high-fidelity reasoning applications where logical consistency is the primary requirement.

Kimi K2 Thinking Use Cases

Discover the different ways you can use Kimi K2 Thinking to achieve great results.

Complex Software Engineering

Resolving real GitHub issues and architecting multi-file codebases using iterative self-correction.

Autonomous Research Agents

Executing hundreds of sequential tool calls to gather and synthesize obscure technical data.

Olympiad-Level Mathematics

Solving advanced geometry and algebra problems with deep chain-of-thought verification.

PhD-Level Science Inquiry

Answering expert questions in physics and biology that require multi-step logical deduction.

Interactive Computer Control

Navigating terminal environments and cloud infrastructure to automate DevOps workflows.

Logic-Heavy Creative Writing

Generating long-form content that requires strict adherence to intricate world-building rules.

Strengths

State-of-the-Art Reasoning: Scores 44.9% on HLE with tools, surpassing major closed-source models in expert-level logic.
Exceptional Agentic Depth: Capable of 300 sequential tool calls, enabling truly autonomous web research and browser tasks.
Top-Tier Mathematical Accuracy: Achieves 94.5% on AIME 2025, proving its reliability for high-level mathematical problem solving.
Open-Weights Accessibility: Offers frontier-level intelligence to the developer community for local deployment and fine-tuning.

Limitations

Massive Resource Requirements: Local inference requires at least 245GB of VRAM even with quantization, limiting its use to high-end server clusters.
Inherent Response Latency: The deep thinking process results in significant wait times as the model scales its test-time computation.
Lack of Native Multimodality: This variant cannot process image or video inputs directly, requiring a separate vision model for multimodal tasks.
High Token Overhead: Internal reasoning steps consume a large number of output tokens, which increases API costs for simple queries.

API Quick Start

moonshot/kimi-k2-thinking

Moonshot SDK
import OpenAI from 'openai';

// Moonshot's API is OpenAI-compatible, so the official OpenAI SDK works
// once it is pointed at the Moonshot base URL.
const client = new OpenAI({
  apiKey: process.env.MOONSHOT_API_KEY, // set this in your environment
  baseURL: 'https://api.moonshot.cn/v1',
});

async function main() {
  const response = await client.chat.completions.create({
    model: 'kimi-k2-thinking',
    messages: [{ role: 'user', content: 'Design a system for autonomous code review using 300 tool calls.' }],
  });
  console.log(response.choices[0].message.content);
}

main();

Install the SDK and start making API calls in minutes.
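The Streaming capability listed above works with the same SDK by passing `stream: true` to the create call. Below is a minimal sketch; `collectStream` is a hypothetical helper introduced here, not part of the SDK, and it works over any OpenAI-style chunk iterable.

```javascript
// Illustrative streaming helper (an assumption, not an SDK function):
// consumes OpenAI-style chunks as they arrive and invokes `onToken`
// for each delta instead of waiting for the complete reply.
async function collectStream(stream, onToken = () => {}) {
  let text = '';
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content ?? '';
    onToken(token);
    text += token;
  }
  return text;
}

// With the real client, request a stream and print tokens as they arrive:
// const stream = await client.chat.completions.create({
//   model: 'kimi-k2-thinking',
//   messages: [{ role: 'user', content: 'Summarize MoE routing.' }],
//   stream: true,
// });
// const full = await collectStream(stream, (t) => process.stdout.write(t));
```

Streaming is especially useful for a thinking model, since the deep-reasoning phase can otherwise leave the user staring at a blank screen.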

Community Feedback

See what the community thinks about Kimi K2 Thinking

Kimi K2.5 is the best open model for coding, they really cooked.
npc_gooner
reddit
Moonshot AI just dropped Kimi K2 Thinking. 300 sequential tool calls? That's the future of agentic AI.
@tech_trends
twitter
Kimi released Kimi K2 Thinking, an open-source trillion-parameter reasoning model. This is the real deal.
nekofneko
reddit
The fact that it can handle 300 tool calls sequentially opens up entirely new agent workflows.
AI Explained
youtube
Impressive to see an open-source model hitting these numbers. The test-time scaling approach is clearly paying off.
jsmith23
hackernews
Running this model locally is a challenge, but the reasoning depth is unlike anything else in the open weights space.
LocalLlamaEnthusiast
reddit

Related Videos

Watch tutorials, reviews, and discussions about Kimi K2 Thinking

Kimi K2 Thinking is the best AI model I've ever used.

It is the most agentic independent model ever made. Meaning, it can run for hours by itself.

It is able to think and reflect every single step of the way. So it never gets lost.

The reasoning speed is surprisingly fast despite the trillion parameters.

If you are building agents, this is the architecture you want to look at.

Kimi K2 Thinking... is a thinking upgrade to the Kimi K2 model, which truthfully seems to be very widely regarded.

This is of course an open-source model... coming in at a total size of around 1 trillion parameters.

All benchmark results are reported under int4 precision.

It handles complex math problems with a level of logic that rivals the top proprietary labs.

The installation process for the local weights is fairly straightforward if you have the VRAM.

Kimi K2.5 is the latest open-source model developed by a Chinese company called Moonshot AI.

It is capable of spinning up as many as 100 sub-agents and 1,500 tool calls and running them concurrently.

I would certainly recommend it if you want to make a truly beautiful website.

The internal chain of thought allows it to self-correct code errors before providing the final answer.

Moonshot has really focused on long-horizon planning for this specific release.

More than just prompts

Supercharge your workflow with AI Automation

Automatio combines the power of AI agents, web automation, and smart integrations to help you accomplish more in less time.

AI Agents
Web Automation
Smart Workflows

Pro Tips

Expert tips to help you get the most out of Kimi K2 Thinking and achieve better results.

Enable Thinking Output

Use the special tokens flag in your inference engine to see the model's internal reasoning steps.

Optimize Temperature

Set the sampling temperature to 1.0 and min_p to 0.01 for the most consistent reasoning flow.
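These settings can be expressed as an OpenAI-style request body. Note that `min_p` is not part of the standard OpenAI parameter set; some inference engines (vLLM, for example) accept it as an extra field, so whether it is honored depends on your serving stack.

```javascript
// Sampling settings from the tip above, as a request body sketch.
// `min_p` is engine-specific (assumption: your serving stack accepts it);
// an OpenAI-compatible hosted API may silently ignore or reject it.
const samplingParams = {
  model: 'kimi-k2-thinking',
  temperature: 1.0, // recommended for a consistent reasoning flow
  min_p: 0.01,      // engine-specific; verify support before relying on it
};
```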

Utilize System Prompts

Start conversations with the official Moonshot AI identity prompt to stabilize the model's behavior.

Scale Test-Time Compute

Allow the model to generate more internal tokens for harder problems to increase accuracy.

Testimonials

What Our Users Say

Join thousands of satisfied users who have transformed their workflow

Jonathan Kogan
Co-Founder/CEO, rpatools.io

Automatio is one of the most used for RPA Tools both internally and externally. It saves us countless hours of work and we realized this could do the same for other startups and so we choose Automatio for most of our automation needs.

Mohammed Ibrahim
CEO, qannas.pro

I have used many tools over the past 5 years, Automatio is the Jack of All trades.. !! it could be your scraping bot in the morning and then it becomes your VA by the noon and in the evening it does your automations.. its amazing!

Ben Bressington
CTO, AiChatSolutions

Automatio is fantastic and simple to use to extract data from any website. This allowed me to replace a developer and do tasks myself as they only take a few minutes to setup and forget about it. Automatio is a game changer!

Sarah Chen
Head of Growth, ScaleUp Labs

We've tried dozens of automation tools, but Automatio stands out for its flexibility and ease of use. Our team productivity increased by 40% within the first month of adoption.

David Park
Founder, DataDriven.io

The AI-powered features in Automatio are incredible. It understands context and adapts to changes in websites automatically. No more broken scrapers!

Emily Rodriguez
Marketing Director, GrowthMetrics

Automatio transformed our lead generation process. What used to take our team days now happens automatically in minutes. The ROI is incredible.


Related AI Models


GPT-5.4

OpenAI

GPT-5.4 is OpenAI's frontier model featuring a 1.05M context window and Extreme Reasoning. It excels at autonomous UI interaction and long-form data analysis.

1M context
$2.50/$15.00/1M

GPT-5.2

OpenAI

GPT-5.2 is OpenAI's flagship model for professional tasks, featuring a 400K context window, elite coding, and deep multi-step reasoning capabilities.

400K context
$1.75/$14.00/1M

GLM-5

Zhipu (GLM)

GLM-5 is Zhipu AI's 744B parameter open-weight powerhouse, excelling in long-horizon agentic tasks, coding, and factual accuracy with a 200k context window.

200K context
$1.00/$3.20/1M

Gemini 3.1 Flash-Lite

Google

Gemini 3.1 Flash-Lite is Google's fastest, most cost-efficient model. Features 1M context, native multimodality, and 363 tokens/sec speed for scale.

1M context
$0.25/$1.50/1M

Claude Opus 4.5

Anthropic

Claude Opus 4.5 is Anthropic's most powerful frontier model, delivering record-breaking 80.9% SWE-bench performance and advanced autonomous agency for coding.

200K context
$5.00/$25.00/1M

Grok-4

xAI

Grok-4 by xAI is a frontier model featuring a 2M token context window, real-time X platform integration, and world-record reasoning capabilities.

2M context
$3.00/$15.00/1M

Kimi K2.5

Moonshot

Discover Moonshot AI's Kimi K2.5, a 1T-parameter open-source agentic model featuring native multimodal capabilities, a 262K context window, and SOTA reasoning.

256K context
$0.60/$3.00/1M

GPT-5.3 Codex

OpenAI

GPT-5.3 Codex is OpenAI's 2026 frontier coding agent, featuring a 400K context window, 77.3% Terminal-Bench score, and superior logic for complex software...

400K context
$1.75/$14.00/1M

Frequently Asked Questions

Find answers to common questions about Kimi K2 Thinking