Qwen 3.7 Max

Qwen 3.7 Max is Alibaba’s flagship AI model for deep reasoning and autonomous agent tasks, featuring a 256k context window and top-tier coding performance.

Thinking Model, Coding Assistant, Agentic AI, Alibaba Cloud, MoE Architecture
Alibaba | Qwen3 | May 20, 2026
Context: 256K tokens
Max Output: 66K tokens
Input Price: $1.20 / 1M tokens
Output Price: $6.00 / 1M tokens
Modality: Text
Capabilities: Tools, Streaming, Reasoning
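
The listed prices translate into a per-request cost with simple arithmetic. The sketch below is illustrative only: `estimateCostUsd` and `QWEN_37_MAX_PRICING` are hypothetical names, and it assumes flat per-token pricing with any billed reasoning tokens counted as output tokens, which this page does not confirm.

```javascript
// Prices from the listing above, in USD per 1M tokens.
const QWEN_37_MAX_PRICING = { inputPerMillion: 1.2, outputPerMillion: 6.0 };

// Hypothetical helper: estimate the cost of a single request in USD.
// Assumes flat pricing (no tiers, caching discounts, or minimum charges).
function estimateCostUsd(inputTokens, outputTokens, pricing = QWEN_37_MAX_PRICING) {
  if (inputTokens < 0 || outputTokens < 0) {
    throw new RangeError("token counts must be non-negative");
  }
  const inputCost = (inputTokens / 1_000_000) * pricing.inputPerMillion;
  const outputCost = (outputTokens / 1_000_000) * pricing.outputPerMillion;
  return inputCost + outputCost;
}

// Example: a 200,000-token prompt with a 10,000-token answer.
// 0.2 * $1.20 + 0.01 * $6.00 = $0.30
console.log(estimateCostUsd(200_000, 10_000).toFixed(4)); // prints "0.3000"
```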
Benchmarks
GPQA
92.4%
GPQA: Graduate-Level Google-Proof Q&A. A rigorous benchmark with 448 multiple-choice questions in biology, physics, and chemistry created by domain experts. PhD experts only achieve 65-74% accuracy, while non-experts score just 34% even with unlimited web access (hence 'Google-proof'). Qwen 3.7 Max scored 92.4% on this benchmark.
HLE
38.2%
HLE: Humanity's Last Exam. A benchmark of expert-written questions spanning a wide range of academic subjects, designed to test reasoning at the frontier of human knowledge. Answering them requires deep, professional-level understanding of specialized domains. Qwen 3.7 Max scored 38.2% on this benchmark.
MMLU
92.8%
MMLU: Massive Multitask Language Understanding. A comprehensive benchmark with 16,000 multiple-choice questions across 57 academic subjects including math, philosophy, law, and medicine. Tests broad knowledge and reasoning capabilities. Qwen 3.7 Max scored 92.8% on this benchmark.
MMLU Pro
82%
MMLU Pro: MMLU Professional Edition. An enhanced version of MMLU with 12,032 questions using a harder 10-option multiple choice format. Covers Math, Physics, Chemistry, Law, Engineering, Economics, Health, Psychology, Business, Biology, Philosophy, and Computer Science. Qwen 3.7 Max scored 82% on this benchmark.
SimpleQA
45%
SimpleQA: Factual Accuracy Benchmark. Tests a model's ability to provide accurate, factual responses to straightforward questions. Measures reliability and reduces hallucinations in knowledge retrieval tasks. Qwen 3.7 Max scored 45% on this benchmark.
IFEval
95%
IFEval: Instruction Following Evaluation. Measures how well a model follows specific instructions and constraints. Tests the ability to adhere to formatting rules, length limits, and other explicit requirements. Qwen 3.7 Max scored 95% on this benchmark.
AIME 2025
99.7%
AIME 2025: American Invitational Mathematics Examination. Competition-level mathematics problems from the prestigious AIME exam designed for talented high school students. Tests advanced mathematical problem-solving requiring abstract reasoning, not just pattern matching. Qwen 3.7 Max scored 99.7% on this benchmark.
MATH
94.8%
MATH: Mathematical Problem Solving. A comprehensive math benchmark testing problem-solving across algebra, geometry, calculus, and other mathematical domains. Requires multi-step reasoning and formal mathematical knowledge. Qwen 3.7 Max scored 94.8% on this benchmark.
GSM8k
99.2%
GSM8k: Grade School Math 8K. 8,500 grade school-level math word problems requiring multi-step reasoning. Tests basic arithmetic and logical thinking through real-world scenarios like shopping or time calculations. Qwen 3.7 Max scored 99.2% on this benchmark.
MGSM
98%
MGSM: Multilingual Grade School Math. 250 problems from the GSM8k benchmark translated into 10 languages including Spanish, French, German, Russian, Chinese, and Japanese. Tests mathematical reasoning across different languages. Qwen 3.7 Max scored 98% on this benchmark.
SWE-Bench
60.6%
SWE-Bench: Software Engineering Benchmark. AI models attempt to resolve real GitHub issues in open-source Python projects with human verification. Tests practical software engineering skills on production codebases. Top models went from 4.4% in 2023 to over 70% in 2024. Qwen 3.7 Max scored 60.6% on this benchmark.
HumanEval
94.5%
HumanEval: Python Programming Problems. 164 hand-written programming problems where models must generate correct Python function implementations. Each solution is verified against unit tests. Top models now achieve 90%+ accuracy. Qwen 3.7 Max scored 94.5% on this benchmark.
LiveCodeBench
78.2%
LiveCodeBench: Live Coding Benchmark. Tests coding abilities on continuously updated, real-world programming challenges. Unlike static benchmarks, uses fresh problems to prevent data contamination and measure true coding skills. Qwen 3.7 Max scored 78.2% on this benchmark.
Terminal-Bench
69.7%
Terminal-Bench: Terminal/CLI Tasks. Tests the ability to perform command-line operations, write shell scripts, and navigate terminal environments. Measures practical system administration and development workflow skills. Qwen 3.7 Max scored 69.7% on this benchmark.
ARC-AGI
12.4%
ARC-AGI: Abstraction & Reasoning. Abstraction and Reasoning Corpus for AGI - tests fluid intelligence through novel pattern recognition puzzles. Each task requires discovering the underlying rule from examples, measuring general reasoning ability rather than memorization. Qwen 3.7 Max scored 12.4% on this benchmark.

About Qwen 3.7 Max

Learn about Qwen 3.7 Max's capabilities, features, and how it can help you achieve better results.

High-Order Reasoning Engine

Qwen 3.7 Max is a Mixture-of-Experts (MoE) model with approximately 1.6 trillion parameters, designed as a logic-first engine for high-complexity engineering and research tasks. It integrates a native Always-On Thinking mode, which makes the model verify its logic and plan its steps before generating a response. This design reduces logical drift in long-form outputs and provides a reliable foundation for software architecture work and mathematical proofs.

Architected for Autonomous Agency

This model serves as a specialized base for the next generation of autonomous agents. It focuses on long-horizon task management and complex tool usage. During internal evaluations, the model maintained logical coherence across sessions lasting over 30 hours, managing thousands of sequential tool calls to solve hardware-level engineering problems. The model itself is text-only, optimized for text and code to maintain a high reasoning density; vision or audio can be added through external modules via multi-agent orchestration.
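
To make the idea of sequential tool calls concrete, here is a minimal, offline sketch of a bounded agent loop. It is not Alibaba's agent framework: `runAgent`, the `tools` registry, and `scriptedModel` are hypothetical names, and the scripted function stands in for a real model call so the example runs without network access.

```javascript
// Hypothetical tool registry: tool name -> implementation.
const tools = new Map([
  ["add", ({ a, b }) => a + b],
  ["upper", ({ text }) => text.toUpperCase()],
]);

// Runs a bounded agent loop. `nextAction` stands in for a model call and
// must resolve to either { tool, args } or { final }.
async function runAgent(nextAction, maxSteps = 10) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = await nextAction(history);
    if ("final" in action) {
      return { answer: action.final, steps: history.length };
    }
    const tool = tools.get(action.tool);
    // Report unknown tools back to the model instead of crashing the session.
    const result = tool
      ? tool(action.args)
      : `error: unknown tool "${action.tool}"`;
    history.push({ tool: action.tool, args: action.args, result });
  }
  // The step cap keeps a confused model from looping forever.
  throw new Error(`no final answer after ${maxSteps} steps`);
}

// Scripted stand-in for the model: one tool call, then a final answer.
async function scriptedModel(history) {
  if (history.length === 0) return { tool: "add", args: { a: 2, b: 3 } };
  return { final: `sum is ${history[0].result}` };
}

runAgent(scriptedModel).then((r) => {
  console.log(`${r.answer} after ${r.steps} tool call(s)`);
});
```

In a real deployment the step cap, the tool registry, and the history format would all need to match whatever the serving API actually returns; the loop structure is the only part this sketch is meant to illustrate.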

Efficiency in Large Contexts

With a 256,000-token context window, the model supports large-scale repository analysis and complex document retrieval. It maintains high retrieval accuracy even as the window fills, making it ideal for legal discovery and enterprise-level RAG workflows. The competitive pricing structure allows developers to deploy frontier-level logic at a fraction of the cost of comparable models from Western labs.
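
Before sending a large repository or document set, it helps to check that it fits in the window. The sketch below is a rough budgeting aid under stated assumptions: the 4-characters-per-token ratio is a generic heuristic rather than the model's tokenizer, the reserved output budget is arbitrary, and it assumes input and output share the 256K window, which this page does not state. `estimateTokens` and `packDocuments` are hypothetical helpers.

```javascript
const CONTEXT_WINDOW_TOKENS = 256_000; // from the listing above

// Rough heuristic, NOT the model's tokenizer: about 4 characters per token.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

// Greedily packs documents into the window in the order given,
// leaving `reservedForOutput` tokens free for the model's reply.
function packDocuments(docs, reservedForOutput = 8_000) {
  const budget = CONTEXT_WINDOW_TOKENS - reservedForOutput;
  const included = [];
  const skipped = [];
  let used = 0;
  for (const doc of docs) {
    const tokens = estimateTokens(doc.text);
    if (used + tokens <= budget) {
      included.push(doc.name);
      used += tokens;
    } else {
      skipped.push(doc.name);
    }
  }
  return { included, skipped, used, budget };
}

// Example with synthetic documents of known length.
const plan = packDocuments([
  { name: "core.c", text: "x".repeat(400_000) },   // ~100,000 tokens
  { name: "vendor.c", text: "x".repeat(800_000) }, // ~200,000 tokens
  { name: "README", text: "x".repeat(40) },        // ~10 tokens
]);
console.log(plan.included, plan.skipped);
```

Documents that land in `skipped` are candidates for retrieval or summarization rather than direct inclusion.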

Use Cases

Discover the different ways you can use Qwen 3.7 Max to achieve great results.

Autonomous Kernel Engineering

The model generates and optimizes hardware-specific code kernels for new chips without existing documentation using recursive tool calls.

Enterprise Repo Refactoring

Qwen 3.7 Max analyzes entire legacy software repositories to update frameworks and resolve technical debt while ensuring logic parity.

Long-Horizon Agent Planning

It manages multi-step workflows requiring autonomous decision-making and planning over continuous 30-plus hour sessions.

Scientific Research Verification

Researchers use the model to verify complex mathematical proofs and solve multi-stage scientific queries with high logical accuracy.

Advanced Financial Risk Modeling

The model ingests thousands of pages of financial data to identify anomalies and project ROI with structured reasoning.

Cross-Framework UI Engineering

It builds functional frontend prototypes with integrated state management and complex logic directly from high-level natural language instructions.

Strengths

Elite Reasoning Efficiency: The model delivers 92.4% on GPQA, matching or exceeding the highest-tier reasoning models at a fraction of the cost.
Autonomous Agent Proficiency: With a 69.7% score on Terminal-Bench, it excels at navigating real terminal environments and managing autonomous tool calls.
Massive Scale MoE: The 1.6T parameter Mixture-of-Experts architecture ensures high specialization for diverse tasks without losing general logic.
Instruction Following Accuracy: A 95.0% score on IFEval demonstrates a superior ability to follow complex, multi-constraint formatting and logical instructions.

Limitations

Text-Only Flagship: The Max variant lacks native vision and audio support, requiring a model switch for multimodal workloads.
Aesthetic Design Gap: While logically sound, generated UI and creative assets often lack the visual polish seen in competitors like Claude.
Preview Stability Issues: Early preview versions have shown occasional logic loops in extremely long document extractions compared to stable 3.6 builds.
Regional Context Bias: Documentation and default cultural references can occasionally prioritize Eastern markets, impacting some niche Western creative tasks.

API Quick Start

alibaba/qwen-3.7-max

alibaba SDK
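import OpenAI from "openai";

// OpenAI-compatible client pointed at Alibaba Cloud's DashScope endpoint.
const client = new OpenAI({
  apiKey: process.env.QWEN_API_KEY, // read the key from the environment
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

async function runReasoningTask() {
  const completion = await client.chat.completions.create({
    model: "qwen-3.7-max",
    messages: [
      { role: "system", content: "You are a senior software architect." },
      { role: "user", content: "Analyze this legacy kernel for potential race conditions." },
    ],
    temperature: 0.1, // low temperature for more deterministic analysis
  });
  console.log(completion.choices[0].message.content);
}

// Surface API or network errors instead of leaving the rejection unhandled.
runReasoningTask().catch((error) => {
  console.error("Request failed:", error);
  process.exitCode = 1;
});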

Install the SDK and start making API calls in minutes.
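
Since the listing names Tools as a capability, a tool-calling request might look like the sketch below. It builds the request body and parses tool calls using the OpenAI chat-completions `tools` format; whether this endpoint accepts that format for this model is an assumption, and `read_file`, `buildToolRequest`, and `extractToolCalls` are hypothetical names. The code only constructs and parses plain objects, so it runs without the SDK or network access.

```javascript
// Hypothetical tool definition in the OpenAI chat-completions "tools" format.
const readFileTool = {
  type: "function",
  function: {
    name: "read_file",
    description: "Return the contents of a file in the repository.",
    parameters: {
      type: "object",
      properties: {
        path: { type: "string", description: "Repository-relative path" },
      },
      required: ["path"],
    },
  },
};

// Builds a request body that could be passed to client.chat.completions.create().
function buildToolRequest(userPrompt, tools) {
  return {
    model: "qwen-3.7-max",
    messages: [{ role: "user", content: userPrompt }],
    tools,
  };
}

// Extracts tool calls from an assistant message in chat-completions format,
// where each call carries its arguments as a JSON string.
function extractToolCalls(message) {
  return (message.tool_calls ?? []).map((call) => ({
    id: call.id,
    name: call.function.name,
    args: JSON.parse(call.function.arguments),
  }));
}

// Example with a hand-written assistant message (not real model output).
const sampleMessage = {
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_1",
      type: "function",
      function: { name: "read_file", arguments: '{"path":"kernel/sched.c"}' },
    },
  ],
};
console.log(extractToolCalls(sampleMessage));
```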

Community Feedback

See what the community thinks about Qwen 3.7 Max

"China's new Qwen 3.7 is insane. It built an SEO ROI calculator with four complex inputs in under 5 minutes. Silicon Valley is nervous."
Julian Goldie, YouTube

"Qwen3.7-Max is a 1.6T parameter model. The quality improvement in just one month since 3.6 is the fastest iteration I have ever seen."
AJ, Twitter

"The progress in NL2Repo is the real story. They claim to have matched Claude Opus in repository-level coding."
TeortaxesTex, Twitter

"Qwen is finally moving away from the overthinking loops of 3.5. The 3.7 Max preview is much more decisive while keeping the logic depth."
LocalLLaMA, Reddit

"Qwen 3.7 Max just became the first model to seriously rival, and in some cases beat, Claude Opus 4.6 in technical tasks."
TechInsights, Twitter

"Managed to get QWEN 3.6 27B running locally, but the 3.7 Max cloud performance is on another level for complex reasoning."
DevArchitect, Hacker News

Related Videos

Watch tutorials, reviews, and discussions about Qwen 3.7 Max

The Chain of Thought process is exceptionally fast compared to previous iterations.

This is only the second time I've seen a model correctly implement ammunition impact marks on scenery.

The logic consistency in multi-turn coding debugging is noticeably more stable than the 3.6 preview.

It handles the 256k context window with almost zero needle-in-a-haystack loss.

This model represents the bridge between static completion and true autonomous planning.

The context window is 256K tokens for Max, and importantly, it is text-only.

We are observing quite a bit lower amount of thinking or overthinking compared to 3.5.

The performance in terminal-based environments suggests it can actually manage a server.

Qwen 3.7 Max is significantly cheaper for enterprise workloads that need high-end logic.

It doesn't struggle with the same cultural alignment issues seen in some earlier models.

Qwen 3.7 Max Preview landed at number 13 overall in Text Arena.

Thinking mode means the model breaks problems into smaller steps before answering.

It builds complex calculators in under five minutes with perfect state management.

This is specifically optimized for Agentic AI, meaning it acts rather than just talks.

The pricing is a direct shot at OpenAI's dominance in the developer market.

Pro Tips

Expert tips to help you get the most out of Qwen 3.7 Max and achieve better results.

Enforce Logic Verification

Include 'Verify your thinking steps before providing the final code' in your prompt to reinforce the model's native deliberative reasoning.

Utilize Context Caching

For tasks involving the same massive codebase, use context caching to reduce latency and lower your input token expenditure.

Define Phase Checklists

Provide a numbered checklist for long tasks to ensure the model doesn't omit middle steps during long-horizon generations.

Constrain Design Parameters

When generating UI, provide specific CSS variables for styling to compensate for the model's focus on logic over aesthetics.
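
Several of the tips above (the verification sentence, the phase checklist, and the CSS variables) can be combined in a small prompt builder. This is a sketch of one possible layout, not a format the model requires: `buildTaskPrompt` and its option names are hypothetical.

```javascript
// Hypothetical helper: assembles a prompt from a task, an ordered phase
// checklist, and a set of CSS variables the output should stick to.
function buildTaskPrompt({ task, phases = [], cssVariables = {} }) {
  const lines = [task.trim()];

  if (phases.length > 0) {
    lines.push("", "Complete every phase in order and do not skip any:");
    phases.forEach((phase, i) => lines.push(`${i + 1}. ${phase}`));
  }

  const cssEntries = Object.entries(cssVariables);
  if (cssEntries.length > 0) {
    lines.push("", "Use only these CSS variables for styling:");
    for (const [name, value] of cssEntries) {
      lines.push(`${name}: ${value};`);
    }
  }

  // Verification sentence quoted from the first tip above.
  lines.push("", "Verify your thinking steps before providing the final code.");
  return lines.join("\n");
}

const prompt = buildTaskPrompt({
  task: "Build a login form component.",
  phases: ["Scaffold the markup", "Add client-side validation", "Wire up state"],
  cssVariables: { "--color-primary": "#0a66c2", "--radius": "8px" },
});
console.log(prompt);
```

The resulting string can be sent as the user message in the quick-start request.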

Testimonials

What Our Users Say

Join thousands of satisfied users who have transformed their workflow

Jonathan Kogan
Co-Founder/CEO, rpatools.io

Automatio is one of the most used for RPA Tools both internally and externally. It saves us countless hours of work and we realized this could do the same for other startups and so we choose Automatio for most of our automation needs.

Mohammed Ibrahim
CEO, qannas.pro

I have used many tools over the past 5 years, Automatio is the Jack of All trades.. !! it could be your scraping bot in the morning and then it becomes your VA by the noon and in the evening it does your automations.. its amazing!

Ben Bressington
CTO, AiChatSolutions

Automatio is fantastic and simple to use to extract data from any website. This allowed me to replace a developer and do tasks myself as they only take a few minutes to setup and forget about it. Automatio is a game changer!

Sarah Chen
Head of Growth, ScaleUp Labs

We've tried dozens of automation tools, but Automatio stands out for its flexibility and ease of use. Our team productivity increased by 40% within the first month of adoption.

David Park
Founder, DataDriven.io

The AI-powered features in Automatio are incredible. It understands context and adapts to changes in websites automatically. No more broken scrapers!

Emily Rodriguez
Marketing Director, GrowthMetrics

Automatio transformed our lead generation process. What used to take our team days now happens automatically in minutes. The ROI is incredible.

Related AI Models

Gemini 3 Pro (Google)
Google's Gemini 3 Pro is a multimodal powerhouse featuring a 1M token context window, native video processing, and industry-leading reasoning performance.
Context: 1M tokens
Price (input / output): $2.00 / $12.00 per 1M tokens

GPT-5.2 Pro (OpenAI)
GPT-5.2 Pro is OpenAI's 2025 flagship reasoning model featuring Extended Thinking for SOTA performance in mathematics, coding, and expert knowledge work.
Context: 400K tokens
Price (input / output): $21.00 / $168.00 per 1M tokens

Claude Opus 4.6 (Anthropic)
Claude Opus 4.6 is Anthropic's flagship model featuring a 1M token context window, Adaptive Thinking, and world-class coding and reasoning performance.
Context: 1M tokens
Price (input / output): $5.00 / $25.00 per 1M tokens

GPT-5.5 (OpenAI)
GPT-5.5 is OpenAI's flagship frontier model with a 1M context window and five reasoning effort levels, optimized for autonomous agentic workflows and coding.
Context: 1M tokens
Price (input / output): $5.00 / $30.00 per 1M tokens

Grok-3 (xAI)
Grok-3 is xAI's flagship reasoning model, featuring deep logic deduction, a 128k context window, and real-time integration with X for live research and coding.
Context: 1M tokens
Price (input / output): $3.00 / $15.00 per 1M tokens

Gemini 3.1 Flash Live Preview (Google)
Gemini 3.1 Flash Live Preview is Google's ultra-low-latency, audio-to-audio model featuring a 131K context window, high-fidelity multimodal reasoning, and...
Context: 131K tokens
Price (input / output): $0.75 / $4.50 per 1M tokens

Claude Opus 4.7 (Anthropic)
Claude Opus 4.7 is Anthropic's flagship model with a 1-million-token context, adaptive reasoning, and 3.3x vision resolution for enterprise-scale agents.
Context: 1M tokens
Price (input / output): $5.00 / $25.00 per 1M tokens

Kimi k2.6 (Moonshot)
Kimi k2.6 is Moonshot AI's 1T-parameter MoE model featuring a 256K context window, native video input, and elite performance in autonomous agentic coding.
Context: 256K tokens
Price (input / output): $0.95 / $4.00 per 1M tokens

Frequently Asked Questions

Find answers to common questions about Qwen 3.7 Max