5 Open-Source Coding LLMs You Can Run Locally in 2025
In 2025, open-source coding LLMs like Qwen3-Coder, Devstral, StarCoder2, Codestral, and Qwen2.5-Coder offer sophisticated multi-language support, agentic task handling, long context windows, and state-of-the-art code generation for local use.

Open-source coding LLMs are democratizing AI-powered development, and local deployment is at the forefront of this revolution. By running models on your own machine, you gain privacy, eliminate API costs, and unlock deep customization. In 2025, running powerful coding AI locally is no longer a dream—it's a practical reality.
This article will introduce five of the best coding models available today. We'll compare their strengths, show you how to run them with Ollama, and provide practical use cases to get you started. This guide is for developers, hobbyists, and teams looking to harness the power of local AI without sacrificing performance.
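To make that concrete, here is a minimal quick-start sketch using Ollama's official Python client. It assumes Ollama is installed, the local server is running, and the model has already been pulled; the qwen2.5-coder:7b tag is only an illustrative choice.

```python
import ollama  # pip install ollama; talks to the local Ollama server (default: localhost:11434)

# Example tag only; swap in any coding model you have pulled with `ollama pull <tag>`
response = ollama.chat(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}],
)
print(response["message"]["content"])
```

Each model below also works with heavier-weight toolchains such as llama.cpp, Hugging Face transformers, or mistral_inference, as the per-model sections show.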
1. Qwen3-Coder-480B-A35B-Instruct: The Agentic Powerhouse
Model Overview
- Developer: Alibaba's Qwen Team
- Parameters: 480B total, 35B activated (Mixture-of-Experts)
- Context Window: 256K native, up to 1M with YaRN
- License: Apache 2.0
- Specialty: Agentic coding and browser automation
Key Features
- Agentic Coding Support: Designed to work with frameworks like Qwen Code and CLINE.
- Long-Context Mastery: Understands repository-scale codebases with its massive context window.
- Multi-Platform Compatibility: Supports most coding agent platforms.
- Non-Thinking Mode: Optimized for direct code generation without reasoning blocks.
Performance Highlights
- Comparable to Claude Sonnet on agentic coding tasks.
- Excels in complex, multi-step software engineering workflows.
- Strong performance on browser-based coding tasks.
- Advanced function calling capabilities for tool use.
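To picture what function calling looks like in practice, the sketch below sends a tool definition to a locally hosted Qwen3-Coder through an OpenAI-compatible endpoint (llama-server and Ollama both expose one). The endpoint URL, model tag, and run_tests tool are illustrative assumptions, not part of the Qwen3-Coder release.

```python
from openai import OpenAI  # pip install openai; used purely as a client for a local server

# Point the client at whatever OpenAI-compatible server hosts the model locally
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool exposed to the model
        "description": "Run the project's test suite and return any failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Directory to test"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-coder",  # use the tag your local server actually exposes
    messages=[{"role": "user", "content": "The tests under ./src are failing. Investigate and propose a fix."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the model's requested tool invocation(s), if any
```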
How to Run the Model
Running a model of this size locally is a challenge. A direct ollama pull is not feasible for most users, since even heavily quantized builds weigh in at roughly 200GB, but you can run quantized GGUF versions with llama.cpp using MoE offloading. This technique keeps the main model layers on the GPU and offloads the "expert" layers to system RAM.
Command to run Qwen3-Coder with MoE offloading:
./llama.cpp/llama-cli \
  --model unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF/UD-Q2_K_XL/Qwen3-Coder-480B-A35B-Instruct-UD-Q2_K_XL-00001-of-00004.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --threads 8 \
  --temp 0.7 \
  -p "<|im_start|>user\nWrite a Python function to find prime numbers up to N.<|im_end|>\n<|im_start|>assistant\n"
Best Use Cases
- Complex software engineering projects.
- Multi-file repository analysis and refactoring.
- Automated testing and debugging workflows.
- Research in agentic AI.
2. Qwen2.5-Coder: The Versatile Champion
Model Overview
- Developer: Alibaba's Qwen Team
- Parameters: 6 sizes available (0.5B, 1.5B, 3B, 7B, 14B, 32B)
- Context Window: 32K-128K tokens
- License: Apache 2.0 (fully permissive)
- Specialty: Balanced performance across all coding tasks
Key Features
- Six Size Options: Scales from tiny models for edge devices to a 32B powerhouse.
- Multi-Language Mastery: Excels across 40+ programming languages.
- Code Repair Excellence: Outstanding debugging and error-fixing capabilities.
- Competitive with GPT-4o: The 32B model rivals closed-source leaders.
Performance Highlights
- HumanEval: 91.0% (32B model, matching GPT-4o).
- Aider (Code Repair): 73.7% (comparable to GPT-4o).
- McEval: 65.9 across multiple languages.
- LiveCodeBench: 43.4% (competitive with top models).
How to Run the Model
Qwen2.5-Coder-7B-Instruct example usage with Hugging Face transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Write a quick sort algorithm."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the model's reply is decoded
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Best Use Cases
- General-purpose coding assistance.
- Multi-language development projects.
- Code debugging and repair.
- Educational programming assistance.
3. StarCoder2: The Transparent Workhorse
Model Overview
- Developer: BigCode Project
- Parameters: 3 sizes (3B, 7B, 15B)
- Context Window: 16K tokens
- License: Apache 2.0
- Specialty: Transparent training and code completion
Key Features
- Transparent Training: Fully open training process and dataset.
- Multi-Language Coverage: 600+ programming languages.
- Efficient Architecture: Excellent performance-to-size ratio.
- Strong Code Completion: Best-in-class fill-in-the-middle capabilities.
Performance Highlights
- HumanEval FIM: 86.4% (excellent code completion).
- RepoBench: Strong performance on repository-level tasks.
- Size Efficiency: The 15B model matches 33B+ models on many tasks.
- Training Scale: Trained on 4+ trillion tokens for robust understanding.
How to Run the Model
StarCoder2 text generation example with Hugging Face transformers:
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="bigcode/starcoder2-7b",
    torch_dtype="auto",
    device_map="auto"
)

prompt = "def fibonacci(n):"
result = pipe(prompt, max_new_tokens=100)
print(result[0]['generated_text'])
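The same pipeline can exercise StarCoder2's fill-in-the-middle mode. The sketch below assumes the checkpoint uses the StarCoder-family FIM special tokens (<fim_prefix>, <fim_suffix>, <fim_middle>); check the model card for your exact variant before relying on them.

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder2-7b",
                torch_dtype="auto", device_map="auto")

# Fill-in-the-middle: the model is asked to generate the code between prefix and suffix
prefix = "def average(numbers):\n    "
suffix = "\n    return total / len(numbers)\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

middle = pipe(fim_prompt, max_new_tokens=32, return_full_text=False)
print(middle[0]["generated_text"])  # ideally something like: total = sum(numbers)
```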
Best Use Cases
- IDE integration and real-time code completion.
- Educational environments requiring transparency.
- Medium-sized development projects.
- Scenarios where auditability and reproducibility matter.
4. Codestral: The Speed Demon
Model Overview
- Developer: Mistral AI
- Parameters: 22B
- Context Window: 32K tokens
- License: Mistral AI Non-Production License (non-commercial)
- Specialty: Fast, efficient code generation
Key Features
- Lightning-Fast Inference: Optimized for speed and efficiency.
- 80+ Language Fluency: Extensive programming language support.
- Fill-in-the-Middle: Excellent at completing partial code.
- Production-Ready: Designed for real-world development workflows.
Performance Highlights
- HumanEval Python: 86.6% (excellent Python performance).
- Fill-in-the-Middle: 95.3% accuracy.
- Inference Speed: Among the fastest in its class.
- Memory Efficiency: 13GB model size for practical deployment.
How to Run the Model
Codestral example usage with Hugging Face transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Codestral weights are gated on Hugging Face; accept the license on the model page and log in first
model_id = "mistralai/Codestral-22B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prompt for a data cleaning function
prompt = "Write a Python function to clean and preprocess a CSV dataset with missing values and outliers"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
generated_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_code)
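To get a feel for Codestral's speed in an interactive setting, you can also stream tokens from a local Ollama server. This is a minimal sketch assuming Ollama is installed and the codestral model has been pulled (ollama pull codestral, roughly a 13GB download):

```python
import ollama  # pip install ollama; assumes a running local Ollama server with codestral pulled

prompt = "Write a Python function that removes outliers from a list of numbers using the IQR rule."

# stream=True yields partial responses as they are generated, which suits real-time completion UIs
for chunk in ollama.generate(model="codestral", prompt=prompt, stream=True):
    print(chunk["response"], end="", flush=True)
print()
```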
Best Use Cases
- Real-time code completion in IDEs.
- Multi-language development environments.
- Production code generation.
- Scenarios requiring fast response times.
5. Devstral: The Agentic Specialist
Model Overview
- Developer: Mistral AI & All Hands AI
- Parameters: 24B
- Context Window: 128K tokens
- License: Apache 2.0
- Specialty: Software engineering agents and tool use
Key Features
- Agentic Excellence: Designed specifically for coding agents.
- Tool Mastery: Excels at using tools to explore and edit codebases.
- Multi-File Operations: Strong at managing complex project structures.
- Local Deployment: Runs on a single RTX 4090 or a 32GB Mac.
Performance Highlights
- SWE-Bench Verified: 46.8% (best open-source performance).
- Outperforms Larger Models: Beats DeepSeek-V3 and Qwen3 235B on SWE-Bench Verified.
- Tool Usage: Exceptional at file system operations and code navigation.
- Context Management: Efficient handling of large codebases.
How to Run the Model
Install and run the Devstral model with mistral_inference:
# Install the library (shell)
pip install mistral_inference --upgrade

# Download the model (Python)
from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', 'Devstral')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(
    repo_id="mistralai/Devstral-Small-2505",
    allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
    local_dir=mistral_models_path
)

# Run the model (shell)
mistral-chat $HOME/mistral_models/Devstral --instruct --max_tokens 300
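Devstral is also published in the Ollama model library, so the lighter-weight local workflow works too. A minimal sketch with the Ollama Python client; the devstral tag and the prompts are illustrative:

```python
import ollama  # pip install ollama; requires a running local Ollama server

ollama.pull("devstral")  # one-time download of the quantized weights (roughly 14GB)

response = ollama.chat(
    model="devstral",
    messages=[
        {"role": "system", "content": "You are a software engineering agent working inside a Git repository."},
        {"role": "user", "content": "Outline the steps to rename a module across a multi-file Python project."},
    ],
)
print(response["message"]["content"])
```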
Best Use Cases
- Software engineering agents.
- Automated code refactoring.
- Multi-file project management.
- Development workflow automation.
Let's compare these models
Performance Comparison
Model | Parameters | Context | HumanEval | SWE-Bench | Best For |
---|---|---|---|---|---|
Qwen3-Coder-480B | 480B (35B active) | 256K-1M | ~90% | N/A | Agentic coding |
Qwen2.5-Coder-32B | 32B | 128K | 91.0% | N/A | All-around performance |
StarCoder2-15B | 15B | 16K | 86.4% | N/A | Code completion |
Codestral-22B | 22B | 32K | 86.6% | N/A | Speed & efficiency |
Devstral-24B | 24B | 128K | N/A | 46.8% | Software agents |
Resource Requirements
Model | RAM/VRAM Needed | Disk Space | Inference Speed | Best Hardware |
---|---|---|---|---|
Qwen3-Coder-480B | 64GB+ | ~200GB | Slow | High-end server |
Qwen2.5-Coder-32B | 32GB+ | ~20GB | Medium | RTX 4090 / M2 Max |
StarCoder2-15B | 16GB+ | ~9GB | Fast | RTX 3090 / M1 Pro |
Codestral-22B | 16GB+ | ~13GB | Very Fast | RTX 4080 / M2 |
Devstral-24B | 24GB+ | ~14GB | Medium | RTX 4090 / 32GB Mac |
Conclusion
The era of local coding AI has arrived. Whether you need a powerful agent for complex projects or a speedy assistant for real-time completion, there is an open-source model you can run on your own hardware. By starting with a model that fits your needs and resources, you can unlock a new level of productivity, privacy, and customization in your development workflow. The best way to begin is to download Ollama, pull a model, and start experimenting today.
FAQs
What is Qwen3-Coder?
Qwen3-Coder is a high-performance, open-source AI coding model supporting agentic workflows, 256k+ context, and over 100 programming languages, ideal for local and enterprise deployment.
How does Codestral benefit local coding projects?
Codestral delivers fast, high-quality code generation and fill-in-the-middle completion across 80+ programming languages, with hardware-friendly inference that suits privacy-focused local development.
What makes StarCoder2 stand out?
StarCoder2 excels at code completion, bug fixing, and documentation, and is designed for low-latency operation and compatibility with popular IDEs and toolchains.
Is Devstral suited for agentic coding tasks?
Yes. Devstral is built for agentic workflows: multi-step code planning, tool use, automated code review, and multi-file edits across languages such as Python, C++, and JavaScript.
Can these coding LLMs run on consumer hardware in 2025?
Mostly, yes. Qwen2.5-Coder, StarCoder2, Codestral, and Devstral all have efficient quantized variants that run on modern desktops, laptops, and single-GPU workstations. Qwen3-Coder-480B is the exception: even heavily quantized, it needs server-class memory, so it is best reserved for high-end hardware.
