7 Best Small Language Models Under 10B Parameters in 2026
Small language models have quietly become the workhorses of production AI. You don't always need a 200B-parameter giant to summarize a document, write code, or power an agent. The sub-10B class has closed the gap dramatically in 2026, and several models now outperform last year's 30B+ flagships on key benchmarks.
This guide ranks the seven best open-weight models under 10B parameters as of 2026, compares them across standard benchmarks, and shows exactly how to call each one through the Hugging Face Inference API and LangChain.
Quick Comparison Table
| # | Model | Params | MMLU-Pro | HumanEval | GSM8K | GPQA | ARC-C | Context | License |
|---|---|---|---|---|---|---|---|---|---|
| 1 | IBM Granite 4.1 8B | 8B | 56.0% | 87.2% | 92.5% | 42.0% | — | 131K | Apache 2.0 |
| 2 | Qwen3.5-9B | 9B | 82.5% | 76.0% | 91.2% | 81.7% | 79.4% | 128K | Apache 2.0 |
| 3 | Gemma 4 E4B | ~4.5B | 78.1% | 71.3% | 89.2% | 68.4% | 83.7% | 128K | Apache 2.0 |
| 4 | Qwen3-8B | 8B | 79.6% | 76.0% | 89.8% | 65.4% | 77.2% | 128K | Apache 2.0 |
| 5 | DeepSeek-R1-Distill-Qwen-7B | 7B | 72.8% | 55.1% | 92.8% | 62.1% | 74.3% | 128K | MIT |
| 6 | Phi-4-mini | 3.8B | 52.8% | 74.4% | 88.6% | 52.8% | 83.7% | 128K | MIT |
| 7 | Llama 3.1 8B Instruct | 8B | 73.0% | 72.6% | 86.4% | 49.0% | 75.1% | 128K | Llama 3.1 Community |
A quick way to read this table: Granite 4.1 8B and Qwen3.5-9B trade the top spot depending on whether you care more about coding or general reasoning. Gemma 4 E4B and Phi-4-mini punch well above their parameter count on edge devices.
DeepSeek-R1-Distill is your pick for math and chain-of-thought. Llama 3.1 8B remains the safest "general purpose" default thanks to its enormous fine-tune ecosystem.
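One way to make the table easier to query is to load a few of its columns into a small Python structure and look up the leader per benchmark. The sketch below copies the MMLU-Pro, HumanEval, and GSM8K figures from the table above; the `SCORES` dictionary and `best_model` helper are illustrative names, not part of any library.

```python
# Benchmark scores (percent) copied from the comparison table above.
SCORES = {
    "IBM Granite 4.1 8B": {"MMLU-Pro": 56.0, "HumanEval": 87.2, "GSM8K": 92.5},
    "Qwen3.5-9B": {"MMLU-Pro": 82.5, "HumanEval": 76.0, "GSM8K": 91.2},
    "Gemma 4 E4B": {"MMLU-Pro": 78.1, "HumanEval": 71.3, "GSM8K": 89.2},
    "Qwen3-8B": {"MMLU-Pro": 79.6, "HumanEval": 76.0, "GSM8K": 89.8},
    "DeepSeek-R1-Distill-Qwen-7B": {"MMLU-Pro": 72.8, "HumanEval": 55.1, "GSM8K": 92.8},
    "Phi-4-mini": {"MMLU-Pro": 52.8, "HumanEval": 74.4, "GSM8K": 88.6},
    "Llama 3.1 8B Instruct": {"MMLU-Pro": 73.0, "HumanEval": 72.6, "GSM8K": 86.4},
}


def best_model(benchmark):
    """Return the name of the model with the highest score on one benchmark."""
    return max(SCORES, key=lambda name: SCORES[name][benchmark])


for benchmark in ("MMLU-Pro", "HumanEval", "GSM8K"):
    winner = best_model(benchmark)
    print(f"{benchmark}: {winner} ({SCORES[winner][benchmark]}%)")
```

With these figures the leaders are Qwen3.5-9B on MMLU-Pro, Granite 4.1 8B on HumanEval, and DeepSeek-R1-Distill-Qwen-7B on GSM8K, which matches the reading of the table given above.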
1. IBM Granite 4.1 8B
A dense 8B model released in late April 2026, Granite 4.1 8B beats IBM's previous 32B Mixture-of-Experts model on most production benchmarks. Its HumanEval score of 87.2% is the highest in this list, making it a strong pick for code generation, RAG, and enterprise tool-calling workflows. It supports 12 languages and a 131K context window (extendable to 512K).
Best for: Code generation, enterprise RAG, tool calling
License: Apache 2.0
Using Granite 4.1 8B via Hugging Face Inference API
import requests
API_URL = "https://api-inference.huggingface.co/models/ibm-granite/granite-4.1-8b-instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({
    "inputs": "Write a Python function to check if a string is a palindrome.",
    "parameters": {"max_new_tokens": 200, "temperature": 0.3}
})
print(output)
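The raw JSON printed above is not always generated text. As an assumption based on the classic text-generation response shape (a list of objects with a `generated_text` key on success, or an object with an `error` key on failure), a small helper can normalize both cases. `extract_text` is a hypothetical name, and the exact shape may differ by model or provider, so treat this as a sketch.

```python
def extract_text(output):
    """Return generated text from an Inference API response, or raise on error.

    Assumes success looks like [{"generated_text": "..."}] and failure looks
    like {"error": "..."}; adjust if your endpoint returns something else.
    """
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"Inference API error: {output['error']}")
    if (
        isinstance(output, list)
        and output
        and isinstance(output[0], dict)
        and "generated_text" in output[0]
    ):
        return output[0]["generated_text"]
    raise ValueError(f"Unexpected response shape: {output!r}")


# Example with a hand-written response (no network call involved).
sample = [{"generated_text": "def is_palindrome(s): return s == s[::-1]"}]
print(extract_text(sample))
```

Calling `extract_text(output)` instead of printing `output` directly turns a silent error payload into an exception that is easier to notice.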
Using Granite 4.1 8B with LangChain
from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = HuggingFaceEndpoint(
    repo_id="ibm-granite/granite-4.1-8b-instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=256,
    temperature=0.3,
)
prompt = PromptTemplate.from_template("Write a Python function to {task}")
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"task": "check if a string is a palindrome"})
print(result)
2. Qwen3.5-9B
Alibaba's Qwen3.5-9B is the strongest all-around reasoner in this size class, scoring 82.5% on MMLU-Pro and 81.7% on GPQA Diamond, numbers that would have placed it in the top tier of all open models a year ago.
It supports dual "thinking" and "non-thinking" modes, so you can switch between fast responses and deep chain-of-thought reasoning depending on the task.
Best for: General reasoning, multilingual tasks, science QA
License: Apache 2.0
Using Qwen3.5-9B via Hugging Face Inference API
import requests
API_URL = "https://api-inference.huggingface.co/models/Qwen/Qwen3.5-9B-Instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({
    "inputs": "Explain the difference between supervised and unsupervised learning in two sentences.",
    "parameters": {"max_new_tokens": 150}
})
print(output)
Using Qwen3.5-9B with LangChain
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.messages import HumanMessage
llm = HuggingFaceEndpoint(
    repo_id="Qwen/Qwen3.5-9B-Instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=300,
)
chat = ChatHuggingFace(llm=llm)
response = chat.invoke([HumanMessage(content="Explain quantum entanglement simply.")])
print(response.content)
Tip: To enable "thinking mode" with Qwen models, prepend /think to your prompt, or /no_think for fast responses without chain-of-thought.
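The tip above amounts to a small string transformation. A minimal sketch, assuming the `/think` and `/no_think` switches behave as described in the tip; `with_thinking_mode` is a made-up helper name.

```python
def with_thinking_mode(prompt, think=True):
    """Prefix a prompt with the /think or /no_think switch described above."""
    switch = "/think" if think else "/no_think"
    return f"{switch} {prompt}"


# Deep reasoning for a proof, fast answer for a lookup-style question.
print(with_thinking_mode("Prove that the sum of two even numbers is even."))
print(with_thinking_mode("What is the capital of France?", think=False))
```

The returned string can be passed as the `inputs` value or as the `HumanMessage` content in the examples above.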
3. Gemma 4 E4B
Google's Gemma 4 E4B is built for agents and tool calling on constrained hardware. It runs at just 5GB RAM at 4-bit quantization, supports a 128K context window, and accepts native audio input alongside text. Its 83.7% ARC-Challenge score ties Phi-4-mini for the best scientific reasoning at this size.
Best for: Agents, tool calling, on-device/edge deployment, audio input
License: Apache 2.0
Using Gemma 4 E4B via Hugging Face Inference API
import requests
API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-e4b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({
    "inputs": "What's the weather like in a city near the equator in December?",
    "parameters": {"max_new_tokens": 150}
})
print(output)
Using Gemma 4 E4B with LangChain
from langchain_huggingface import HuggingFaceEndpoint
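# initialize_agent and AgentType are deprecated legacy APIs and may be
# unavailable in newer LangChain releases
from langchain.agents import initialize_agent, AgentType, Tool
llm = HuggingFaceEndpoint(
    repo_id="google/gemma-4-e4b-it",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=256,
)
def get_word_length(word: str) -> int:
    # Tool function: the agent passes the word in as a plain string
    return len(word)
tools = [
    Tool(
        name="WordLength",
        func=get_word_length,
        description="Returns the length of a word"
    )
]
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
# invoke() replaces the deprecated run(); the final answer is under "output"
result = agent.invoke({"input": "How many letters are in the word 'multimodal'?"})
print(result["output"])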
4. Qwen3-8B
The previous-generation Qwen3-8B remains relevant in mid-2026 thanks to its strength in code generation: its 76.0% HumanEval score matches Qwen3.5-9B and trails only Granite 4.1 8B in this list. It supports 29+ languages and the same thinking/non-thinking dual mode as its successor.
Best for: Code generation, multilingual applications
License: Apache 2.0
Using Qwen3-8B via Hugging Face Inference API
import requests
API_URL = "https://api-inference.huggingface.co/models/Qwen/Qwen3-8B"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({
    "inputs": "Write a SQL query to find the top 5 customers by total order value.",
    "parameters": {"max_new_tokens": 200}
})
print(output)
Using Qwen3-8B with LangChain
from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import ChatPromptTemplate
llm = HuggingFaceEndpoint(
    repo_id="Qwen/Qwen3-8B",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=300,
)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert SQL developer."),
    ("user", "{question}")
])
chain = prompt | llm
result = chain.invoke({"question": "Write a query to find top 5 customers by order value"})
print(result)
5. DeepSeek-R1-Distill-Qwen-7B
This model distills the chain-of-thought reasoning from the full DeepSeek-R1 into a 7B Qwen2.5 base. The result is the best math performance at this size, with a 92.8% GSM8K score. If your application involves step-by-step logical reasoning, this is the model to pick.
Best for: Math, chain-of-thought reasoning, logic puzzles
License: MIT
Using DeepSeek-R1-Distill-Qwen-7B via Hugging Face Inference API
import requests
API_URL = "https://api-inference.huggingface.co/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({
    "inputs": "If a train travels 60 km in 45 minutes, what is its speed in km/h? Show your reasoning.",
    "parameters": {"max_new_tokens": 300}
})
print(output)
Using DeepSeek-R1-Distill-Qwen-7B with LangChain
from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
llm = HuggingFaceEndpoint(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=400,
    temperature=0.1,
)
prompt = PromptTemplate.from_template(
    "Solve the following problem step by step:\n{problem}"
)
# LLMChain is deprecated; the | operator composes the same pipeline
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"problem": "A car depreciates 15% per year. If it's worth $20,000 now, what will it be worth in 3 years?"})
print(result)
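When evaluating a reasoning model, it helps to know the expected answers in advance. The snippet below computes reference values for the two example prompts in this section (the train-speed question and the depreciation question) so the model's output can be checked against them. The function names are illustrative.

```python
def speed_kmh(distance_km, minutes):
    """Average speed in km/h for a distance covered in a number of minutes."""
    return distance_km / (minutes / 60)


def depreciated_value(initial, annual_rate, years):
    """Value after compounding a fixed annual depreciation rate."""
    return initial * (1 - annual_rate) ** years


print(speed_kmh(60, 45))                             # train question
print(round(depreciated_value(20_000, 0.15, 3), 2))  # car question
```

The reference answers are 80 km/h for the train and about $12,282.50 for the car (20,000 × 0.85³).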
6. Phi-4-mini
Microsoft's Phi-4-mini packs surprising performance into 3.8B parameters. It runs on roughly 3GB of VRAM, making it the best choice for resource-constrained devices like laptops with integrated GPUs or older hardware. Its ARC-Challenge score of 83.7% ties Gemma 4 E4B for the highest in this list.
Best for: Edge deployment, low-resource environments, CPU inference
License: MIT
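The VRAM figure can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below covers weights only. The KV cache, activations, and runtime overhead come on top, which is one plausible reason the roughly 3GB figure quoted above is higher than the 4-bit weight size. The 4-bit setting is an assumption, since the article does not state the quantization behind that number.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PHI_4_MINI_PARAMS = 3.8e9  # parameter count from the comparison table

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(PHI_4_MINI_PARAMS, bits):.1f} GB")
```

For 3.8B parameters this gives about 7.6 GB at 16-bit, 3.8 GB at 8-bit, and 1.9 GB at 4-bit.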
Using Phi-4-mini via Hugging Face Inference API
import requests
API_URL = "https://api-inference.huggingface.co/models/microsoft/Phi-4-mini-instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({
    "inputs": "Summarize the importance of edge AI in three bullet points.",
    "parameters": {"max_new_tokens": 150}
})
print(output)
Using Phi-4-mini with LangChain
from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = HuggingFaceEndpoint(
    repo_id="microsoft/Phi-4-mini-instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=200,
)
prompt = PromptTemplate.from_template("Summarize the following topic in 3 bullet points: {topic}")
chain = prompt | llm | StrOutputParser()
print(chain.invoke({"topic": "the importance of edge AI"}))
7. Llama 3.1 8B Instruct
Meta's Llama 3.1 8B remains the safest general-purpose default in 2026. Its balance across MMLU (73.0%) and HumanEval (72.6%) places it within striking distance of models twice its size, and no other model family matches its ecosystem of community fine-tunes covering everything from legal review to TypeScript generation.
Best for: General-purpose chat, fine-tuning base model, broad ecosystem support
License: Llama 3.1 Community License
Using Llama 3.1 8B via Hugging Face Inference API
import requests
API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B-Instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
output = query({
    "inputs": "Draft a short, friendly email reminding a client about an upcoming meeting.",
    "parameters": {"max_new_tokens": 200}
})
print(output)
Using Llama 3.1 8B with LangChain
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.messages import SystemMessage, HumanMessage
llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=256,
)
chat = ChatHuggingFace(llm=llm)
messages = [
    SystemMessage(content="You are a helpful assistant that writes professional emails."),
    HumanMessage(content="Draft a reminder email about a meeting tomorrow at 3 PM.")
]
response = chat.invoke(messages)
print(response.content)
Setup Notes for All Examples
- Get a Hugging Face token: Sign up at huggingface.co, go to Settings → Access Tokens, and create a read token.
- Install dependencies:
pip install langchain langchain-huggingface requests
- Some models require a "Pro" Inference Endpoint or local hosting for larger context windows or higher throughput. For production workloads, consider deploying via Hugging Face Inference Endpoints or running locally with Ollama/vLLM, then pointing LangChain at your local server using ChatOpenAI with a custom base_url.
- Rate limits: The free Hugging Face Inference API has rate limits and may queue requests for less popular models. For consistent latency, use a dedicated Inference Endpoint.
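For the local-hosting route mentioned in the notes above, both vLLM and Ollama expose an OpenAI-compatible HTTP API, so the main thing that changes is the base URL. The sketch below assumes the default ports (8000 for vLLM, 11434 for Ollama) and a locally served model name. `local_chat_kwargs` is a hypothetical helper, and the `ChatOpenAI` usage is shown in comments because it requires the separate `langchain-openai` package and a running server.

```python
# Default OpenAI-compatible endpoints; change these if your server uses other ports.
LOCAL_BASE_URLS = {
    "vllm": "http://localhost:8000/v1",
    "ollama": "http://localhost:11434/v1",
}


def local_chat_kwargs(backend, model):
    """Build keyword arguments for an OpenAI-compatible chat client."""
    if backend not in LOCAL_BASE_URLS:
        raise ValueError(f"Unknown backend: {backend!r}")
    return {
        "base_url": LOCAL_BASE_URLS[backend],
        # Local servers usually ignore the key, but clients expect a non-empty value.
        "api_key": "not-needed",
        "model": model,
    }


kwargs = local_chat_kwargs("vllm", "meta-llama/Llama-3.1-8B-Instruct")
print(kwargs["base_url"])

# With a server running and langchain-openai installed (assumed setup):
# from langchain_openai import ChatOpenAI
# chat = ChatOpenAI(**kwargs)
# print(chat.invoke("Say hello in one sentence.").content)
```

Keeping the backend choice in one mapping means the rest of a LangChain pipeline does not need to know whether it is talking to vLLM or Ollama.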
Which Model Should You Pick?
- Need the best coding output → Granite 4.1 8B or Qwen3-8B
- Need the strongest general reasoning → Qwen3.5-9B
- Building an agent with tool calling on limited hardware → Gemma 4 E4B
- Math-heavy or logic-heavy application → DeepSeek-R1-Distill-Qwen-7B
- Running on a laptop or CPU-only machine → Phi-4-mini
- Want the broadest community support and fine-tunes → Llama 3.1 8B Instruct
All seven models are open-weight, run comfortably on a single consumer GPU, and are accessible through the same Hugging Face + LangChain pattern shown above. Just swap the repo_id and adjust your prompt.
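Since only the repo_id changes between examples, the choice can be driven by a lookup table. The sketch below reuses the repo IDs and the API URL pattern from the examples in this article; the `REPO_IDS` mapping, its use-case keys, and `api_url_for` are illustrative names.

```python
# Repo IDs as used in the examples above, keyed by an illustrative use-case label.
REPO_IDS = {
    "coding": "ibm-granite/granite-4.1-8b-instruct",
    "reasoning": "Qwen/Qwen3.5-9B-Instruct",
    "agents": "google/gemma-4-e4b-it",
    "math": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "edge": "microsoft/Phi-4-mini-instruct",
    "general": "meta-llama/Llama-3.1-8B-Instruct",
}

API_BASE = "https://api-inference.huggingface.co/models/"


def api_url_for(use_case):
    """Return the Inference API URL for the model recommended for a use case."""
    return API_BASE + REPO_IDS[use_case]


print(api_url_for("math"))
```

The same `REPO_IDS[use_case]` value can be passed as `repo_id` to `HuggingFaceEndpoint` in the LangChain examples.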
FAQs
Q1. Which open-weight model under 10B parameters is best for coding tasks?
IBM Granite 4.1 8B is one of the strongest choices for coding, achieving the highest HumanEval score among the models listed and excelling in code generation and tool-calling workflows.
Q2. What is the best small model for mathematical reasoning?
DeepSeek-R1-Distill-Qwen-7B stands out for mathematical and logical reasoning, achieving one of the highest GSM8K scores in the sub-10B category.
Q3. Which sub-10B model is best for running on low-resource devices?
Phi-4-mini is optimized for resource-constrained environments, requiring relatively low VRAM while maintaining strong reasoning and benchmark performance.