Qwen3.6-35B-A3B: The Small Model That Codes Like a Giant
Qwen3.6-35B-A3B is a breakthrough open-source AI model that combines 35B parameters of total capacity with only 3B active per token. It delivers strong coding, reasoning, and multimodal performance at a fraction of the cost.
Most AI models make you choose: pay for power or settle for efficiency. Alibaba's Qwen team just refused that trade-off. On April 16, 2026, they released Qwen3.6-35B-A3B, a sparse Mixture-of-Experts model that runs on the compute budget of a 3B model while drawing on the learned capacity of a 35B one.
It is open-source, Apache 2.0 licensed, and it is already challenging models twice its effective size on agentic coding benchmarks. This is not an incremental update. It is a signal that the open-source AI race is entering a different gear.
The Architecture Behind the Numbers
The "A3B" in the model name tells you everything. It means only 3 billion parameters activate per forward pass, even though the full weight file holds 35 billion. This is a sparse Mixture-of-Experts design. A router inside the model selects which expert sub-networks handle each token. The rest stay dormant.
The sparsity ratio, 3B active out of 35B total, is roughly 12:1, among the most aggressive in any publicly released model. You get representational depth at a fraction of the inference cost. That math drives every benchmark number below.
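The routing idea can be sketched in a few lines. This is a minimal, illustrative top-k MoE layer, assuming a ReLU MLP per expert; Qwen's actual expert count, activation function, and load-balancing details are not public in this level of detail and will differ.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """One sparse-MoE layer for a single token (illustrative sketch).

    The router scores every expert, but only the top-k actually run,
    which is how a large-capacity model can cost only a small fraction
    of its parameters per forward pass.
    x: (d,) hidden state; gate_w: (d, n_experts);
    experts: list of (w_in, w_out) per-expert MLP weight pairs.
    """
    scores = x @ gate_w                     # router logits, one per expert
    top = np.argsort(scores)[-k:]           # indices of the k chosen experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                            # softmax over chosen experts only
    out = np.zeros_like(x)
    for weight, i in zip(w, top):
        w_in, w_out = experts[i]            # all other experts stay dormant
        out += weight * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out
```

Only `k` of the expert MLPs execute per token; the rest of the weight file sits in memory but contributes no FLOPs for that token.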
The model also introduces thinking preservation, the ability to retain reasoning traces from prior messages, which keeps multi-turn logic stable and helps with iterative agentic tasks. For developers building long-running coding agents, this is not a minor feature. It is a structural improvement.
The native context window sits at 262,144 tokens, extensible up to 1,010,000 with RoPE scaling.
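The arithmetic behind that extension is straightforward. The sketch below shows the simple linear position-interpolation variant of RoPE scaling: positions beyond the native window are divided by a scale factor so they map back into the range the model was trained on. Qwen's actual scaling method may differ (e.g. a YaRN-style scheme), so treat this as a conceptual illustration only.

```python
import math

NATIVE_CTX = 262_144      # native context window (tokens)
EXTENDED_CTX = 1_010_000  # maximum with RoPE scaling

def rope_angles(pos, dim=64, base=10_000.0, scale=1.0):
    """Rotary-embedding angles for one position.

    With position interpolation, `pos` is divided by `scale`, so an
    extended sequence reuses the position range seen during training.
    """
    return [(pos / scale) / base ** (2 * i / dim) for i in range(dim // 2)]

scale = EXTENDED_CTX / NATIVE_CTX  # ≈ 3.85x extension
# The last position of the extended window lands on the same angle the
# model saw at the very end of its native window:
assert math.isclose(rope_angles(EXTENDED_CTX, scale=scale)[0],
                    rope_angles(NATIVE_CTX)[0])
```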
Agentic Coding Benchmark
| Benchmark | Qwen3.5-27B | Gemma4-31B | Qwen3.5-35B-A3B | Gemma4-26B-A4B | Qwen3.6-35B-A3B |
|---|---|---|---|---|---|
| **Coding Agent** | | | | | |
| SWE bench Verified | 75.0 | 52.0 | 70.0 | 17.4 | 73.4 |
| SWE bench Multilingual | 69.3 | 51.7 | 60.3 | 17.3 | 67.2 |
| SWE bench Pro | 51.2 | 35.7 | 44.6 | 13.8 | 49.5 |
| Terminal Bench 2.0 | 41.6 | 42.9 | 40.5 | 34.2 | 51.5 |
| Claw Eval Avg | 64.3 | 48.5 | 65.4 | 58.8 | 68.7 |
| Claw Eval Pass³ | 46.2 | 25.0 | 51.0 | 28.0 | 50.0 |
| SkillsBench Avg5 | 27.2 | 23.6 | 4.4 | 12.3 | 28.7 |
| QwenClawBench | 52.2 | 41.7 | 47.7 | 38.7 | 52.6 |
| NL2Repo | 27.3 | 15.5 | 20.5 | 11.6 | 29.4 |
| QwenWebBench | 1068 | 1197 | 978 | 1178 | 1397 |
| **General Agent** | | | | | |
| TAU3 Bench | 68.4 | 67.5 | 68.9 | 59.0 | 67.2 |
| VITA Bench | 41.8 | 43.0 | 29.1 | 36.9 | 35.6 |
| DeepPlanning | 22.6 | 24.0 | 22.8 | 16.2 | 25.9 |
| Tool Decathlon | 31.5 | 21.2 | 28.7 | 12.0 | 26.9 |
| MCPMark | 36.3 | 18.1 | 27.0 | 14.2 | 37.0 |
| MCP Atlas | 68.4 | 57.2 | 62.4 | 50.0 | 62.8 |
| WideSearch | 66.4 | 35.2 | 59.1 | 38.3 | 60.1 |
| **Knowledge** | | | | | |
| MMLU Pro | 86.1 | 85.2 | 85.3 | 82.6 | 85.2 |
| MMLU Redux | 93.2 | 93.7 | 93.3 | 92.7 | 93.3 |
| SuperGPQA | 65.6 | 65.7 | 63.4 | 61.4 | 64.7 |
| C-Eval | 90.5 | 82.6 | 90.2 | 82.5 | 90.0 |
| **STEM & Reasoning** | | | | | |
| GPQA | 85.5 | 84.3 | 84.2 | 82.3 | 86.0 |
| HLE | 24.3 | 19.5 | 22.4 | 8.7 | 21.4 |
| LiveCodeBench v6 | 80.7 | 80.0 | 74.6 | 77.1 | 80.4 |
| HMMT Feb 25 | 92.0 | 88.7 | 89.0 | 91.7 | 90.7 |
| HMMT Nov 25 | 89.8 | 87.5 | 89.2 | 87.5 | 89.1 |
| HMMT Feb 26 | 84.3 | 77.2 | 78.7 | 79.0 | 83.6 |
| IMOAnswerBench | 79.9 | 74.5 | 76.8 | 74.3 | 78.9 |
| AIME26 | 92.6 | 89.2 | 91.0 | 88.3 | 92.7 |
Agentic coding means more than autocomplete. It means the model can read a real GitHub repository, reason about a bug, write a fix, run it, and verify the output, all in a multi-step loop.
SWE-bench Verified is the standard test for this. Qwen3.6-35B-A3B scores 73.4% on it, while Gemma4-31B, a dense model that activates all 31 billion of its parameters on every inference step, scores 52.0%. The efficient model wins by a wide margin.
The gap holds across every related benchmark:
On SWE-bench Multilingual, it scores 67.2; its predecessor Qwen3.5-35B-A3B scores 60.3. On Terminal-Bench 2.0 (real terminal task execution in a sandboxed environment), it scores 51.5 against Gemma4-31B's 42.9.
Tool use is where the gap becomes stark. On MCPMark (MCP tool integration), it scores 37.0% against Gemma4-31B's 18.1%, more than twice as capable at integrating with function calls in agentic loops.
It also outperforms its predecessor Qwen3.5-35B-A3B by a wide margin across nearly every benchmark, a substantial generational leap, not an incremental one.
General Reasoning and Knowledge
Strong coding benchmarks sometimes mask weak foundations. That is not the case here.
On GPQA Diamond (graduate-level science reasoning), it scores 86.0, above the dense 27B model Qwen3.5-27B at 85.5. On AIME 2026 (competition mathematics), it scores 92.7, matching the best in its class. On MMLU-Pro (expert-level knowledge across a broad range of disciplines), it scores 85.2, in the same range as models with far more active parameters.
On the Artificial Analysis Intelligence Index, a composite benchmark across reasoning, knowledge, math, and coding, it scores 43, placing it well above average among open-weight models of similar size, where the median sits at 15.
These are not coding-specialist scores. The model generalizes.
Vision and Multimodal Capability
| Benchmark | Qwen3.5-27B | Claude-Sonnet-4.5 | Gemma4-31B | Gemma4-26B-A4B | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B |
|---|---|---|---|---|---|---|
| **STEM and Puzzle** | | | | | | |
| MMMU | 82.3 | 79.6 | 80.4 | 78.4 | 81.4 | 81.7 |
| MMMU-Pro | 75.0 | 68.4 | 76.9* | 73.8* | 75.1 | 75.3 |
| Mathvista (mini) | 87.8 | 79.8 | 79.3 | 79.4 | 86.2 | 86.4 |
| ZEROBench_sub | 36.2 | 26.3 | 26.0 | 26.3 | 34.1 | 34.4 |
| **General VQA** | | | | | | |
| RealWorldQA | 83.7 | 70.3 | 72.3 | 72.2 | 84.1 | 85.3 |
| MMBenchEN-DEV-v1.1 | 92.6 | 88.3 | 90.9 | 89.0 | 91.5 | 92.8 |
| SimpleVQA | 56.0 | 57.6 | 52.9 | 52.2 | 58.3 | 58.9 |
| HallusionBench | 70.0 | 59.9 | 67.4 | 66.1 | 67.9 | 69.8 |
| **Text Recognition and Document Understanding** | | | | | | |
| OmniDocBench1.5 | 88.9 | 85.8 | 80.1 | 74.4 | 89.3 | 89.9 |
| CharXiv (RQ) | 79.5 | 67.2 | 67.9 | 69.0 | 77.5 | 78.0 |
| CC-OCR | 81.0 | 68.1 | 75.7 | 74.5 | 80.7 | 81.9 |
| AI2D_TEST | 92.9 | 87.0 | 89.0 | 88.3 | 92.6 | 92.7 |
| **Spatial Intelligence** | | | | | | |
| RefCOCO (avg) | 90.9 | -- | -- | -- | 89.2 | 92.0 |
| ODInW13 | 41.1 | -- | -- | -- | 42.6 | 50.8 |
| EmbSpatialBench | 84.5 | 71.8 | -- | -- | 83.1 | 84.3 |
| RefSpatialBench | 67.7 | -- | -- | -- | 63.5 | 64.3 |
| **Video Understanding** | | | | | | |
| VideoMME (w sub.) | 87.0 | 81.1 | -- | -- | 86.6 | 86.6 |
| VideoMME (w/o sub.) | 82.8 | 75.3 | -- | -- | 82.5 | 82.5 |
| VideoMMMU | 82.3 | 77.6 | 81.6 | 76.0 | 80.4 | 83.7 |
| MLVU | 85.9 | 72.8 | -- | -- | 85.6 | 86.2 |
| MVBench | 74.6 | -- | -- | -- | 74.8 | 74.6 |
| LVBench | 73.6 | -- | -- | -- | 71.4 | 71.4 |
Qwen3.6-35B-A3B is natively multimodal. It processes images, documents, and video alongside text, not as an add-on but as a core capability baked into the architecture.
Alibaba claims that on most vision-language tasks, performance matches Claude Sonnet 4.5, and even surpasses it on spatial intelligence, achieving 92.0 on RefCOCO and 50.8 on ODInW13.
On document understanding, it scores 89.9 on OmniDocBench 1.5. On video reasoning, it scores 83.7 on VideoMMMU, above Gemma4-31B at 81.6. These numbers matter for teams building agents that work with real-world inputs: PDFs, screenshots, engineering diagrams, and visual data.
The model supports both a thinking mode (deliberate, step-by-step chain-of-thought reasoning) and a non-thinking mode for fast, direct responses. Developers control which mode runs per request.
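In an OpenAI-compatible setup, that per-request control is just a field on the chat payload. A minimal sketch, with the caveat that the `enable_thinking` flag name is an assumption based on the convention earlier Qwen releases used; check the Model Studio docs for the exact parameter on qwen3.6-flash.

```python
def build_request(prompt, thinking=True):
    """Chat-completions payload toggling Qwen's thinking mode per request.

    NOTE: `enable_thinking` is a hypothetical flag name borrowed from
    earlier Qwen releases; verify it against the official API docs.
    """
    return {
        "model": "qwen3.6-flash",
        "messages": [{"role": "user", "content": prompt}],
        "enable_thinking": thinking,  # True: chain-of-thought; False: fast path
    }
```

A latency-sensitive chat UI would send `thinking=False`; a coding agent working through a repository bug would leave it on.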
How to Access and Deploy It
Qwen3.6-35B-A3B is available through three paths:
Qwen Studio - browser-based chat, no setup required. Good for quick testing.
Alibaba Cloud Model Studio API - available as qwen3.6-flash, compatible with the OpenAI specification and the Anthropic API specification. Priced at $0.38 per million input tokens and $2.25 per million output tokens via Alibaba's API.
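At those rates, per-call costs are easy to estimate. A quick sketch using the listed prices (the token counts in the example are illustrative, not measured):

```python
INPUT_PER_M = 0.38    # USD per million input tokens (listed API price)
OUTPUT_PER_M = 2.25   # USD per million output tokens

def request_cost(input_tokens, output_tokens):
    """Estimate the cost of one qwen3.6-flash API call at list prices."""
    return (input_tokens * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# A hypothetical agentic coding turn: ~40k tokens of repo context in,
# ~2k tokens of patch and explanation out.
cost = request_cost(40_000, 2_000)  # ≈ $0.0197, i.e. about 2 cents
```

Even long agent loops with tens of thousands of context tokens per step stay in the cents-per-turn range.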
Open weights on Hugging Face and ModelScope - full weights in BF16. Compatible with Transformers, vLLM, SGLang, and KTransformers. A quantized Q4 GGUF version from Unsloth weighs around 20.9GB, small enough to run locally on a high-RAM consumer machine.
The model integrates with OpenClaw, Claude Code, and Qwen Code for development workflows. The preserve_thinking feature is available via the API and is recommended for agentic tasks where reasoning continuity across turns matters.
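Thinking preservation amounts to carrying the reasoning trace forward in the message history instead of discarding it each turn. A minimal sketch, assuming a `reasoning_content` field on assistant messages (a hypothetical field name; the actual API schema may differ):

```python
def extend_conversation(history, reply, user_followup, preserve_thinking=True):
    """Append an assistant turn plus a user follow-up to an agent conversation.

    With thinking preservation on, the assistant's reasoning trace
    (`reasoning_content`, a hypothetical field name) stays in the history
    so later turns can build on it; otherwise only the visible answer
    is carried forward.
    """
    turn = {"role": "assistant", "content": reply["content"]}
    if preserve_thinking and reply.get("reasoning_content"):
        turn["reasoning_content"] = reply["reasoning_content"]
    return history + [turn, {"role": "user", "content": user_followup}]
```

For a multi-turn coding agent, keeping the trace means turn five still "remembers" why turn two chose a particular fix, at the cost of a larger context per request.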
Conclusion
Under the tenure of Qwen's previous lead, Qwen models accumulated over 600 million downloads and more than 170,000 derivative models on Hugging Face, surpassing Meta's Llama in that metric. The team has not slowed down.
The gap between open-source and closed proprietary models continues to narrow. Qwen3.6-35B-A3B is evidence of that. A model that scores 73.4% on SWE-bench Verified, runs on 3B active parameters, costs nothing to license, and can be self-hosted on your own infrastructure would have seemed implausible 18 months ago.
For developers, the calculus is simple. If you are building coding agents, document workflows, or multimodal tools - and you want frontier-level performance without frontier-level costs or licensing constraints - this model is worth testing today.
FAQs
Q1. What makes Qwen3.6-35B-A3B different from traditional dense models?
Qwen3.6-35B-A3B uses a sparse Mixture-of-Experts architecture, activating only 3B parameters per inference while leveraging a 35B parameter model, making it highly efficient.
Q2. Is Qwen3.6-35B-A3B suitable for real-world coding agents?
Yes, it performs strongly on agentic coding benchmarks like SWE-bench and supports multi-step reasoning, tool usage, and iterative problem solving.
Q3. Can Qwen3.6-35B-A3B handle multimodal tasks?
Yes, it natively supports text, images, documents, and video, making it suitable for complex real-world AI applications.