Qwen3.6-35B-A3B: The Small Model That Codes Like a Giant

Qwen3.6-35B-A3B is a breakthrough open-source AI model combining 35B capacity with 3B active parameters. It delivers strong coding, reasoning, and multimodal performance at a fraction of the cost.


Most AI models make you choose: pay for power or settle for efficiency. Alibaba's Qwen team just refused that trade-off. On April 16, 2026, they released Qwen3.6-35B-A3B, a sparse Mixture-of-Experts model that runs on the compute budget of a 3B model while drawing on the learned capacity of a 35B one.

It is open-source, Apache 2.0 licensed, and it is already challenging models twice its effective size on agentic coding benchmarks. This is not an incremental update. It is a signal that the open-source AI race is entering a different gear.


The Architecture Behind the Numbers


The "A3B" in the model name tells you everything. It means only 3 billion parameters activate per forward pass, even though the full weight file holds 35 billion. This is a sparse Mixture-of-Experts design. A router inside the model selects which expert sub-networks handle each token. The rest stay dormant.

The sparsity ratio - 3B active out of 35B total - is roughly 12:1, among the most aggressive in any publicly released model. You get representational depth at a fraction of the inference cost. That math is what drives every benchmark number below.
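The routing idea is easy to see in code. Below is an illustrative top-k router sketch, not Qwen's actual implementation: a learned router scores every expert, only the top-k experts run, and their outputs are blended by softmax weight. The dimensions and expert count here are toy values.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Sparse MoE forward pass for a single token vector x.

    Only the top-k experts (by router score) execute; the rest stay
    dormant. This is how total capacity can dwarf per-token compute.
    """
    logits = router_w @ x                  # one score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over selected experts only
    out = sum(w * experts[i](x) for w, i in zip(weights, top))
    return out, top

# Toy setup: 8 experts, but only 2 ever run per token.
rng = np.random.default_rng(0)
d, n_experts = 4, 8
router_w = rng.normal(size=(n_experts, d))
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n_experts)]

y, active = moe_forward(rng.normal(size=d), router_w, experts, k=2)
```

At a 12:1 sparsity ratio, the equivalent real-scale setting would leave the vast majority of experts dormant on every forward pass.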

The model also introduces thinking preservation, the ability to retain reasoning traces from prior messages, which keeps multi-turn logic stable and helps with iterative agentic tasks. For developers building long-running coding agents, this is not a minor feature. It is a structural improvement.

The native context window sits at 262,144 tokens, extensible up to 1,010,000 with RoPE scaling.

Agentic Coding Benchmark

| Benchmark | Qwen3.5-27B | Gemma4-31B | Qwen3.5-35B-A3B | Gemma4-26B-A4B | Qwen3.6-35B-A3B |
| --- | --- | --- | --- | --- | --- |
| **Coding Agent** | | | | | |
| SWE-bench Verified | 75.0 | 52.0 | 70.0 | 17.4 | 73.4 |
| SWE-bench Multilingual | 69.3 | 51.7 | 60.3 | 17.3 | 67.2 |
| SWE-bench Pro | 51.2 | 35.7 | 44.6 | 13.8 | 49.5 |
| Terminal-Bench 2.0 | 41.6 | 42.9 | 40.5 | 34.2 | 51.5 |
| Claw Eval Avg | 64.3 | 48.5 | 65.4 | 58.8 | 68.7 |
| Claw Eval Pass³ | 46.2 | 25.0 | 51.0 | 28.0 | 50.0 |
| SkillsBench Avg5 | 27.2 | 23.6 | 4.4 | 12.3 | 28.7 |
| QwenClawBench | 52.2 | 41.7 | 47.7 | 38.7 | 52.6 |
| NL2Repo | 27.3 | 15.5 | 20.5 | 11.6 | 29.4 |
| QwenWebBench | 1068 | 1197 | 978 | 1178 | 1397 |
| **General Agent** | | | | | |
| TAU3 Bench | 68.4 | 67.5 | 68.9 | 59.0 | 67.2 |
| VITA Bench | 41.8 | 43.0 | 29.1 | 36.9 | 35.6 |
| DeepPlanning | 22.6 | 24.0 | 22.8 | 16.2 | 25.9 |
| Tool Decathlon | 31.5 | 21.2 | 28.7 | 12.0 | 26.9 |
| MCPMark | 36.3 | 18.1 | 27.0 | 14.2 | 37.0 |
| MCP Atlas | 68.4 | 57.2 | 62.4 | 50.0 | 62.8 |
| WideSearch | 66.4 | 35.2 | 59.1 | 38.3 | 60.1 |
| **Knowledge** | | | | | |
| MMLU-Pro | 86.1 | 85.2 | 85.3 | 82.6 | 85.2 |
| MMLU-Redux | 93.2 | 93.7 | 93.3 | 92.7 | 93.3 |
| SuperGPQA | 65.6 | 65.7 | 63.4 | 61.4 | 64.7 |
| C-Eval | 90.5 | 82.6 | 90.2 | 82.5 | 90.0 |
| **STEM & Reasoning** | | | | | |
| GPQA | 85.5 | 84.3 | 84.2 | 82.3 | 86.0 |
| HLE | 24.3 | 19.5 | 22.4 | 8.7 | 21.4 |
| LiveCodeBench v6 | 80.7 | 80.0 | 74.6 | 77.1 | 80.4 |
| HMMT Feb 25 | 92.0 | 88.7 | 89.0 | 91.7 | 90.7 |
| HMMT Nov 25 | 89.8 | 87.5 | 89.2 | 87.5 | 89.1 |
| HMMT Feb 26 | 84.3 | 77.2 | 78.7 | 79.0 | 83.6 |
| IMOAnswerBench | 79.9 | 74.5 | 76.8 | 74.3 | 78.9 |
| AIME26 | 92.6 | 89.2 | 91.0 | 88.3 | 92.7 |

Agentic coding means more than autocomplete. It means the model can read a real GitHub repository, reason about a bug, write a fix, run it, and verify the output in a multi-step loop.
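That loop is worth making concrete. Here is a hypothetical skeleton of the read-fix-run-verify cycle an agentic coding model drives; `run_tests` and `propose_patch` are stand-ins for a real test harness and the model's patch proposals, and the toy "repository" is just a dictionary.

```python
def run_tests(repo):
    """Stand-in test runner: returns names of failing tests."""
    return [name for name, check in repo["tests"].items()
            if not check(repo["code"])]

def propose_patch(repo, failing):
    """Stand-in for the model proposing a fix for the first failure."""
    patch = repo["known_fixes"].get(failing[0])
    if patch:
        repo["code"].update(patch)

def agent_loop(repo, max_steps=5):
    """Iterate: run tests, patch on failure, stop when everything passes."""
    for step in range(max_steps):
        failing = run_tests(repo)        # run & verify
        if not failing:
            return step                  # all green: done
        propose_patch(repo, failing)     # write a fix, then loop again
    return -1                            # gave up

# Toy repo: one buggy function, one test, one known fix.
repo = {
    "code": {"add": lambda a, b: a - b},   # bug: subtracts instead of adds
    "tests": {"test_add": lambda code: code["add"](2, 3) == 5},
    "known_fixes": {"test_add": {"add": lambda a, b: a + b}},
}
steps = agent_loop(repo)
```

Benchmarks like SWE-bench Verified score exactly this kind of closed loop, which is why they separate agentic models from autocomplete engines.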

SWE-bench Verified is the standard test for this. Qwen3.6-35B-A3B scores 73.4% on SWE-bench Verified, while Gemma4-31B, a dense model that activates all 31 billion of its parameters at every inference step, scores 52.0%. The efficient model won by a wide margin.

The gap holds across every related benchmark:

On SWE-bench Multilingual, it scores 67.2; its predecessor Qwen3.5-35B-A3B scores 60.3. On Terminal-Bench 2.0, which measures real terminal task execution in a sandboxed environment, it scores 51.5 against Gemma4-31B's 42.9.

Tool use is where the gap becomes stark. On MCPMark (MCP tool integration), it scores 37.0% against Gemma4-31B's 18.1%, more than twice as capable at integrating with function calls in agentic loops.

It also outperforms its predecessor Qwen3.5-35B-A3B by a wide margin across nearly every benchmark, a substantial generational leap, not an incremental one.

General Reasoning and Knowledge

Strong coding benchmarks sometimes mask weak foundations. That is not the case here.

On GPQA Diamond (graduate-level science reasoning), it scores 86.0, above the dense Qwen3.5-27B at 85.5. On AIME 2026 (competition mathematics), it scores 92.7, matching the best in its class. On MMLU-Pro (expert-level knowledge across a broad range of disciplines), it scores 85.2, in the same range as models with far more active parameters.

On the Artificial Analysis Intelligence Index, a composite benchmark across reasoning, knowledge, math, and coding, it scores 43, placing it well above average among open-weight models of similar size, where the median sits at 15.

These are not coding-specialist scores. The model generalizes.

Vision and Multimodal Capability

| Benchmark | Qwen3.5-27B | Claude-Sonnet-4.5 | Gemma4-31B | Gemma4-26B-A4B | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B |
| --- | --- | --- | --- | --- | --- | --- |
| **STEM and Puzzle** | | | | | | |
| MMMU | 82.3 | 79.6 | 80.4 | 78.4 | 81.4 | 81.7 |
| MMMU-Pro | 75.0 | 68.4 | 76.9* | 73.8* | 75.1 | 75.3 |
| MathVista (mini) | 87.8 | 79.8 | 79.3 | 79.4 | 86.2 | 86.4 |
| ZEROBench_sub | 36.2 | 26.3 | 26.0 | 26.3 | 34.1 | 34.4 |
| **General VQA** | | | | | | |
| RealWorldQA | 83.7 | 70.3 | 72.3 | 72.2 | 84.1 | 85.3 |
| MMBench EN-DEV-v1.1 | 92.6 | 88.3 | 90.9 | 89.0 | 91.5 | 92.8 |
| SimpleVQA | 56.0 | 57.6 | 52.9 | 52.2 | 58.3 | 58.9 |
| HallusionBench | 70.0 | 59.9 | 67.4 | 66.1 | 67.9 | 69.8 |
| **Text Recognition and Document Understanding** | | | | | | |
| OmniDocBench 1.5 | 88.9 | 85.8 | 80.1 | 74.4 | 89.3 | 89.9 |
| CharXiv (RQ) | 79.5 | 67.2 | 67.9 | 69.0 | 77.5 | 78.0 |
| CC-OCR | 81.0 | 68.1 | 75.7 | 74.5 | 80.7 | 81.9 |
| AI2D_TEST | 92.9 | 87.0 | 89.0 | 88.3 | 92.6 | 92.7 |
| **Spatial Intelligence** | | | | | | |
| RefCOCO (avg) | 90.9 | -- | -- | -- | 89.2 | 92.0 |
| ODInW13 | 41.1 | -- | -- | -- | 42.6 | 50.8 |
| EmbSpatialBench | 84.5 | 71.8 | -- | -- | 83.1 | 84.3 |
| RefSpatialBench | 67.7 | -- | -- | -- | 63.5 | 64.3 |
| **Video Understanding** | | | | | | |
| VideoMME (w/ sub.) | 87.0 | 81.1 | -- | -- | 86.6 | 86.6 |
| VideoMME (w/o sub.) | 82.8 | 75.3 | -- | -- | 82.5 | 82.5 |
| VideoMMMU | 82.3 | 77.6 | 81.6 | 76.0 | 80.4 | 83.7 |
| MLVU | 85.9 | 72.8 | -- | -- | 85.6 | 86.2 |
| MVBench | 74.6 | -- | -- | -- | 74.8 | 74.6 |
| LVBench | 73.6 | -- | -- | -- | 71.4 | 71.4 |

Qwen3.6-35B-A3B is natively multimodal. It processes images, documents, and video alongside text, not as an add-on but as a core capability baked into the architecture.

Alibaba claims that on most vision-language tasks, performance matches Claude Sonnet 4.5, and even surpasses it on spatial intelligence, achieving 92.0 on RefCOCO and 50.8 on ODInW13.

On document understanding, it scores 89.9 on OmniDocBench 1.5. On video reasoning, it scores 83.7 on VideoMMMU, above Gemma4-31B at 81.6. These numbers matter for teams building agents that work with real-world inputs: PDFs, screenshots, engineering diagrams, and visual data.

The model supports both a thinking mode (deliberate, step-by-step chain-of-thought reasoning) and a non-thinking mode for fast, direct responses. Developers control which mode runs per request.
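Per-request mode control can be sketched against an OpenAI-compatible endpoint. The flag name `enable_thinking` and its placement in a vendor-extras field are assumptions drawn from Qwen's conventions, not confirmed by this article; check the provider's documentation for the exact field.

```python
def build_request(prompt, thinking):
    """Build an OpenAI-style chat request, toggling thinking mode.

    NOTE: "enable_thinking" is an assumed vendor-specific flag; the
    standard chat-completions fields (model, messages) carry it alongside.
    """
    return {
        "model": "qwen3.6-flash",
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"enable_thinking": thinking},
    }

# Fast, direct answer vs. deliberate chain-of-thought on the same API.
fast = build_request("Summarize this diff.", thinking=False)
deep = build_request("Find the root cause of this race condition.", thinking=True)
```

The payoff is operational: cheap direct responses for simple turns, full reasoning only where a task needs it.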

How to Access and Deploy It

Qwen3.6-35B-A3B is available through three paths:

Qwen Studio - browser-based chat, no setup required. Good for quick testing.


Alibaba Cloud Model Studio API - available as qwen3.6-flash, compatible with the OpenAI specification and the Anthropic API specification. Priced at $0.38 per million input tokens and $2.25 per million output tokens via Alibaba's API.
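Those prices make budgeting straightforward. A small helper using the listed rates ($0.38 per million input tokens, $2.25 per million output tokens); the example session sizes are illustrative, not measured.

```python
# USD per 1M tokens, from the published qwen3.6-flash pricing.
PRICE_IN, PRICE_OUT = 0.38, 2.25

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost of one session at the listed rates."""
    return (input_tokens * PRICE_IN + output_tokens * PRICE_OUT) / 1_000_000

# Hypothetical agentic session: 400K tokens in, 60K tokens out.
session = cost_usd(400_000, 60_000)   # roughly $0.29
```

Long agentic loops are input-heavy, so the low input rate dominates the bill in practice.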

Open weights on Hugging Face and ModelScope - full weights in BF16. Compatible with Transformers, vLLM, SGLang, and KTransformers. A quantized Q4 GGUF version from Unsloth weighs around 20.9GB - enough to run locally on a high-RAM consumer machine.

The model integrates with OpenClaw, Claude Code, and Qwen Code for development workflows. The preserve_thinking feature is available via the API and is recommended for agentic tasks where reasoning continuity across turns matters.

Conclusion

Under its previous lead, the Qwen team accumulated over 600 million downloads and more than 170,000 derivative models on Hugging Face, surpassing Meta's Llama on that metric. The team has not slowed down.

The gap between open-source and closed proprietary models continues to narrow. Qwen3.6-35B-A3B is evidence of that. A model that scores 73.4% on SWE-bench Verified, runs on 3B active parameters, costs nothing to license, and can be self-hosted on your own infrastructure would have seemed implausible 18 months ago.

For developers, the calculus is simple. If you are building coding agents, document workflows, or multimodal tools - and you want frontier-level performance without frontier-level costs or licensing constraints - this model is worth testing today.

FAQs

Q1. What makes Qwen3.6-35B-A3B different from traditional dense models?

Qwen3.6-35B-A3B uses a sparse Mixture-of-Experts architecture, activating only 3B parameters per inference while leveraging a 35B parameter model, making it highly efficient.

Q2. Is Qwen3.6-35B-A3B suitable for real-world coding agents?

Yes, it performs strongly on agentic coding benchmarks like SWE-bench and supports multi-step reasoning, tool usage, and iterative problem solving.

Q3. Can Qwen3.6-35B-A3B handle multimodal tasks?

Yes, it natively supports text, images, documents, and video, making it suitable for complex real-world AI applications.
