Qwen3.6-35B-A3B: The Small Model That Codes Like a Giant
Qwen3.6-35B-A3B is a breakthrough open-source AI model that combines 35B parameters of total capacity with only 3B active per token. It delivers strong coding, reasoning, and multimodal performance at a fraction of the cost.
Most AI models make you choose: pay for power or settle for efficiency. Alibaba's Qwen team just refused that trade-off. On April 16, 2026, they released Qwen3.6-35B-A3B, a sparse Mixture-of-Experts model that runs on the compute budget of a 3B model while drawing on the learned capacity of a 35B one.
It is open-source, Apache 2.0 licensed, and it is already challenging models twice its effective size on agentic coding benchmarks. This is not an incremental update. It is a signal that the open-source AI race is entering a different gear.
The Architecture Behind the Numbers
The "A3B" in the model name tells you everything. It means only 3 billion parameters activate per forward pass, even though the full weight file holds 35 billion. This is a sparse Mixture-of-Experts design. A router inside the model selects which expert sub-networks handle each token. The rest stay dormant.
The sparsity ratio, 3B active out of 35B total, is roughly 12:1, among the most aggressive in any publicly released model. You get representational depth at a fraction of the inference cost. That math drives every benchmark number below.
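The routing idea can be sketched in a few lines. This is a minimal, illustrative top-k MoE layer, assuming a ReLU MLP per expert; Qwen's actual expert count, activation function, and load-balancing details are not public in this level of detail and will differ.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """One sparse-MoE layer for a single token (illustrative sketch).

    The router scores every expert, but only the top-k actually run,
    which is how a large-capacity model can cost only a small fraction
    of its parameters per forward pass.
    x: (d,) hidden state; gate_w: (d, n_experts);
    experts: list of (w_in, w_out) per-expert MLP weight pairs.
    """
    scores = x @ gate_w                     # router logits, one per expert
    top = np.argsort(scores)[-k:]           # indices of the k chosen experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                            # softmax over chosen experts only
    out = np.zeros_like(x)
    for weight, i in zip(w, top):
        w_in, w_out = experts[i]            # all other experts stay dormant
        out += weight * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out
```

Only `k` of the expert MLPs execute per token; the rest of the weight file sits in memory but contributes no FLOPs for that token.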
The model also introduces thinking preservation, the ability to retain reasoning traces from prior messages, which keeps multi-turn logic stable and helps with iterative agentic tasks. For developers building long-running coding agents, this is not a minor feature. It is a structural improvement.
The native context window sits at 262,144 tokens, extensible up to 1,010,000 with RoPE scaling.
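The arithmetic behind that extension is straightforward. The sketch below shows the simple linear position-interpolation variant of RoPE scaling: positions beyond the native window are divided by a scale factor so they map back into the range the model was trained on. Qwen's actual scaling method may differ (e.g. a YaRN-style scheme), so treat this as a conceptual illustration only.

```python
import math

NATIVE_CTX = 262_144      # native context window (tokens)
EXTENDED_CTX = 1_010_000  # maximum with RoPE scaling

def rope_angles(pos, dim=64, base=10_000.0, scale=1.0):
    """Rotary-embedding angles for one position.

    With position interpolation, `pos` is divided by `scale`, so an
    extended sequence reuses the position range seen during training.
    """
    return [(pos / scale) / base ** (2 * i / dim) for i in range(dim // 2)]

scale = EXTENDED_CTX / NATIVE_CTX  # ≈ 3.85x extension
# The last position of the extended window lands on the same angle the
# model saw at the very end of its native window:
assert math.isclose(rope_angles(EXTENDED_CTX, scale=scale)[0],
                    rope_angles(NATIVE_CTX)[0])
```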
Agentic Coding Benchmark
| Benchmark | Qwen3.5-27B | Gemma4-31B | Qwen3.5-35B-A3B | Gemma4-26B-A4B | Qwen3.6-35B-A3B |
|---|---|---|---|---|---|
| **Coding Agent** | | | | | |
| SWE bench Verified | 75.0 | 52.0 | 70.0 | 17.4 | 73.4 |
| SWE bench Multilingual | 69.3 | 51.7 | 60.3 | 17.3 | 67.2 |
| SWE bench Pro | 51.2 | 35.7 | 44.6 | 13.8 | 49.5 |
| Terminal Bench 2.0 | 41.6 | 42.9 | 40.5 | 34.2 | 51.5 |
| Claw Eval Avg | 64.3 | 48.5 | 65.4 | 58.8 | 68.7 |
| Claw Eval Pass³ | 46.2 | 25.0 | 51.0 | 28.0 | 50.0 |
| SkillsBench Avg5 | 27.2 | 23.6 | 4.4 | 12.3 | 28.7 |
| QwenClawBench | 52.2 | 41.7 | 47.7 | 38.7 | 52.6 |
| NL2Repo | 27.3 | 15.5 | 20.5 | 11.6 | 29.4 |
| QwenWebBench | 1068 | 1197 | 978 | 1178 | 1397 |
| **General Agent** | | | | | |
| TAU3 Bench | 68.4 | 67.5 | 68.9 | 59.0 | 67.2 |
| VITA Bench | 41.8 | 43.0 | 29.1 | 36.9 | 35.6 |
| DeepPlanning | 22.6 | 24.0 | 22.8 | 16.2 | 25.9 |
| Tool Decathlon | 31.5 | 21.2 | 28.7 | 12.0 | 26.9 |
| MCPMark | 36.3 | 18.1 | 27.0 | 14.2 | 37.0 |
| MCP Atlas | 68.4 | 57.2 | 62.4 | 50.0 | 62.8 |
| WideSearch | 66.4 | 35.2 | 59.1 | 38.3 | 60.1 |
| **Knowledge** | | | | | |
| MMLU Pro | 86.1 | 85.2 | 85.3 | 82.6 | 85.2 |
| MMLU Redux | 93.2 | 93.7 | 93.3 | 92.7 | 93.3 |
| SuperGPQA | 65.6 | 65.7 | 63.4 | 61.4 | 64.7 |
| C-Eval | 90.5 | 82.6 | 90.2 | 82.5 | 90.0 |
| **STEM & Reasoning** | | | | | |
| GPQA | 85.5 | 84.3 | 84.2 | 82.3 | 86.0 |
| HLE | 24.3 | 19.5 | 22.4 | 8.7 | 21.4 |
| LiveCodeBench v6 | 80.7 | 80.0 | 74.6 | 77.1 | 80.4 |
| HMMT Feb 25 | 92.0 | 88.7 | 89.0 | 91.7 | 90.7 |
| HMMT Nov 25 | 89.8 | 87.5 | 89.2 | 87.5 | 89.1 |
| HMMT Feb 26 | 84.3 | 77.2 | 78.7 | 79.0 | 83.6 |
| IMOAnswerBench | 79.9 | 74.5 | 76.8 | 74.3 | 78.9 |
| AIME26 | 92.6 | 89.2 | 91.0 | 88.3 | 92.7 |
Agentic coding means more than autocomplete. It means the model can read a real GitHub repository, reason about a bug, write a fix, run it, and verify the output, all in a multi-step loop.
SWE-bench Verified is the standard test for this. Qwen3.6-35B-A3B scores 73.4% on it, while Gemma4-31B, a dense model that activates all 31 billion of its parameters on every inference step, scores 52.0%. The efficient model wins by a wide margin.
The gap holds across every related benchmark:
On SWE-bench Multilingual, it scores 67.2; its predecessor Qwen3.5-35B-A3B scores 60.3. On Terminal-Bench 2.0 (real terminal task execution in a sandboxed environment), it scores 51.5 against Gemma4-31B's 42.9.
Tool use is where the gap becomes stark. On MCPMark (MCP tool integration), it scores 37.0% against Gemma4-31B's 18.1%, more than twice as capable at integrating with function calls in agentic loops.
It also outperforms its predecessor Qwen3.5-35B-A3B by a wide margin across nearly every benchmark, a substantial generational leap, not an incremental one.
General Reasoning and Knowledge
Strong coding benchmarks sometimes mask weak foundations. That is not the case here.
On GPQA Diamond (graduate-level science reasoning), it scores 86.0, above the dense 27B model Qwen3.5-27B at 85.5. On AIME 2026 (competition mathematics), it scores 92.7, matching the best in its class. On MMLU-Pro (expert-level knowledge across a broad range of disciplines), it scores 85.2, in the same range as models with far more active parameters.
On the Artificial Analysis Intelligence Index, a composite benchmark across reasoning, knowledge, math, and coding, it scores 43, placing it well above average among open-weight models of similar size, where the median sits at 15.
These are not coding-specialist scores. The model generalizes.
Vision and Multimodal Capability
| Benchmark | Qwen3.5-27B | Claude-Sonnet-4.5 | Gemma4-31B | Gemma4-26B-A4B | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B |
|---|---|---|---|---|---|---|
| **STEM and Puzzle** | | | | | | |
| MMMU | 82.3 | 79.6 | 80.4 | 78.4 | 81.4 | 81.7 |
| MMMU-Pro | 75.0 | 68.4 | 76.9* | 73.8* | 75.1 | 75.3 |
| Mathvista (mini) | 87.8 | 79.8 | 79.3 | 79.4 | 86.2 | 86.4 |
| ZEROBench_sub | 36.2 | 26.3 | 26.0 | 26.3 | 34.1 | 34.4 |
| **General VQA** | | | | | | |
| RealWorldQA | 83.7 | 70.3 | 72.3 | 72.2 | 84.1 | 85.3 |
| MMBenchEN-DEV-v1.1 | 92.6 | 88.3 | 90.9 | 89.0 | 91.5 | 92.8 |
| SimpleVQA | 56.0 | 57.6 | 52.9 | 52.2 | 58.3 | 58.9 |
| HallusionBench | 70.0 | 59.9 | 67.4 | 66.1 | 67.9 | 69.8 |
| **Text Recognition and Document Understanding** | | | | | | |
| OmniDocBench1.5 | 88.9 | 85.8 | 80.1 | 74.4 | 89.3 | 89.9 |
| CharXiv (RQ) | 79.5 | 67.2 | 67.9 | 69.0 | 77.5 | 78.0 |
| CC-OCR | 81.0 | 68.1 | 75.7 | 74.5 | 80.7 | 81.9 |
| AI2D_TEST | 92.9 | 87.0 | 89.0 | 88.3 | 92.6 | 92.7 |
| **Spatial Intelligence** | | | | | | |
| RefCOCO (avg) | 90.9 | -- | -- | -- | 89.2 | 92.0 |
| ODInW13 | 41.1 | -- | -- | -- | 42.6 | 50.8 |
| EmbSpatialBench | 84.5 | 71.8 | -- | -- | 83.1 | 84.3 |
| RefSpatialBench | 67.7 | -- | -- | -- | 63.5 | 64.3 |
| **Video Understanding** | | | | | | |
| VideoMME (w sub.) | 87.0 | 81.1 | -- | -- | 86.6 | 86.6 |
| VideoMME (w/o sub.) | 82.8 | 75.3 | -- | -- | 82.5 | 82.5 |
| VideoMMMU | 82.3 | 77.6 | 81.6 | 76.0 | 80.4 | 83.7 |
| MLVU | 85.9 | 72.8 | -- | -- | 85.6 | 86.2 |
| MVBench | 74.6 | -- | -- | -- | 74.8 | 74.6 |
| LVBench | 73.6 | -- | -- | -- | 71.4 | 71.4 |
Qwen3.6-35B-A3B is natively multimodal. It processes images, documents, and video alongside text, not as an add-on but as a core capability baked into the architecture.
Alibaba claims that on most vision-language tasks, performance matches Claude Sonnet 4.5, and even surpasses it on spatial intelligence, achieving 92.0 on RefCOCO and 50.8 on ODInW13.
On document understanding, it scores 89.9 on OmniDocBench 1.5. On video reasoning, it scores 83.7 on VideoMMMU, above Gemma4-31B at 81.6. These numbers matter for teams building agents that work with real-world inputs: PDFs, screenshots, engineering diagrams, and visual data.
The model supports both a thinking mode (deliberate, step-by-step chain-of-thought reasoning) and a non-thinking mode for fast, direct responses. Developers control which mode runs per request.
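In an OpenAI-compatible setup, that per-request control is just a field on the chat payload. A minimal sketch, with the caveat that the `enable_thinking` flag name is an assumption based on the convention earlier Qwen releases used; check the Model Studio docs for the exact parameter on qwen3.6-flash.

```python
def build_request(prompt, thinking=True):
    """Chat-completions payload toggling Qwen's thinking mode per request.

    NOTE: `enable_thinking` is a hypothetical flag name borrowed from
    earlier Qwen releases; verify it against the official API docs.
    """
    return {
        "model": "qwen3.6-flash",
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": thinking,  # True: chain-of-thought; False: fast path
    }
```

A latency-sensitive chat UI would send `thinking=False`; a coding agent working through a repository bug would leave it on.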
How to Access and Deploy It
Qwen3.6-35B-A3B is available through three paths:
Qwen Studio - browser-based chat, no setup required. Good for quick testing.
Alibaba Cloud Model Studio API - available as qwen3.6-flash, compatible with the OpenAI specification and the Anthropic API specification. Priced at $0.38 per million input tokens and $2.25 per million output tokens via Alibaba's API.
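At those rates, per-call costs are easy to estimate. A quick sketch using the listed prices (the token counts in the example are illustrative, not measured):

```python
INPUT_PER_M = 0.38    # USD per million input tokens (listed API price)
OUTPUT_PER_M = 2.25   # USD per million output tokens

def request_cost(input_tokens, output_tokens):
    """Estimate the cost of one qwen3.6-flash API call at list prices."""
    return (input_tokens * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# A hypothetical agentic coding turn: ~40k tokens of repo context in,
# ~2k tokens of patch and explanation out.
cost = request_cost(40_000, 2_000)  # ≈ $0.0197, i.e. about 2 cents
```

Even long agent loops with tens of thousands of context tokens per step stay in the cents-per-turn range.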
Open weights on Hugging Face and ModelScope - full weights in BF16. Compatible with Transformers, vLLM, SGLang, and KTransformers. A quantized Q4 GGUF version from Unsloth weighs around 20.9GB, small enough to run locally on a high-RAM consumer machine.
The model integrates with OpenClaw, Claude Code, and Qwen Code for development workflows. The preserve_thinking feature is available via the API and is recommended for agentic tasks where reasoning continuity across turns matters.
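Thinking preservation amounts to carrying the reasoning trace forward in the message history instead of discarding it each turn. A minimal sketch, assuming a `reasoning_content` field on assistant messages (a hypothetical field name; the actual API schema may differ):

```python
def extend_conversation(history, reply, user_followup, preserve_thinking=True):
    """Append an assistant turn plus a user follow-up to an agent conversation.

    With thinking preservation on, the assistant's reasoning trace
    (`reasoning_content`, a hypothetical field name) stays in the history
    so later turns can build on it; otherwise only the visible answer
    is carried forward.
    """
    turn = {"role": "assistant", "content": reply["content"]}
    if preserve_thinking and reply.get("reasoning_content"):
        turn["reasoning_content"] = reply["reasoning_content"]
    return history + [turn, {"role": "user", "content": user_followup}]
```

For a multi-turn coding agent, keeping the trace means turn five still "remembers" why turn two chose a particular fix, at the cost of a larger context per request.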
Conclusion
Under the tenure of Qwen's previous lead, Qwen models accumulated over 600 million downloads and more than 170,000 derivative models on Hugging Face, surpassing Meta's Llama in that metric. The team has not slowed down.
The gap between open-source and closed proprietary models continues to narrow. Qwen3.6-35B-A3B is evidence of that. A model that scores 73.4% on SWE-bench Verified, runs on 3B active parameters, costs nothing to license, and can be self-hosted on your own infrastructure would have seemed implausible 18 months ago.
For developers, the calculus is simple. If you are building coding agents, document workflows, or multimodal tools - and you want frontier-level performance without frontier-level costs or licensing constraints - this model is worth testing today.
FAQs
Q1. What makes Qwen3.6-35B-A3B different from traditional dense models?
Qwen3.6-35B-A3B uses a sparse Mixture-of-Experts architecture, activating only 3B parameters per inference while leveraging a 35B parameter model, making it highly efficient.
Q2. Is Qwen3.6-35B-A3B suitable for real-world coding agents?
Yes, it performs strongly on agentic coding benchmarks like SWE-bench and supports multi-step reasoning, tool usage, and iterative problem solving.
Q3. Can Qwen3.6-35B-A3B handle multimodal tasks?
Yes, it natively supports text, images, documents, and video, making it suitable for complex real-world AI applications.