7 Best Small Language Models Under 10B Parameters in 2026

Small language models have quietly become the workhorses of production AI. You don't always need a 200B-parameter giant to summarize a document, write code, or power an agent. The sub-10B class has closed the gap dramatically in 2026, and several models now outperform last year's 30B+ flagships on key benchmarks.

This guide ranks the seven best open-weight models under 10B parameters as of 2026, compares them across standard benchmarks, and shows exactly how to call each one through the Hugging Face Inference API and LangChain.

Quick Comparison Table

#	Model	Params	MMLU-Pro	HumanEval	GSM8K	GPQA	ARC-C	Context	License
1	IBM Granite 4.1 8B	8B	56.0%	87.2%	92.5%	42.0%	—	131K	Apache 2.0
2	Qwen3.5-9B	9B	82.5%	76.0%	91.2%	81.7%	79.4%	128K	Apache 2.0
3	Gemma 4 E4B	~4.5B	78.1%	71.3%	89.2%	68.4%	83.7%	128K	Apache 2.0
4	Qwen3-8B	8B	79.6%	76.0%	89.8%	65.4%	77.2%	128K	Apache 2.0
5	DeepSeek-R1-Distill-Qwen-7B	7B	72.8%	55.1%	92.8%	62.1%	74.3%	128K	MIT
6	Phi-4-mini	3.8B	52.8%	74.4%	88.6%	52.8%	83.7%	128K	MIT
7	Llama 3.1 8B Instruct	8B	73.0%	72.6%	86.4%	49.0%	75.1%	128K	Llama 3.1 Community

A quick way to read this table: Granite 4.1 8B and Qwen3.5-9B trade the top spot depending on whether you care more about coding or general reasoning. Gemma 4 E4B and Phi-4-mini punch well above their parameter count on edge devices.

DeepSeek-R1-Distill is your pick for math and chain-of-thought. Llama 3.1 8B Instruct remains the safest "general purpose" default thanks to its enormous fine-tune ecosystem.
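To double-check a reading like this one, the table can be loaded into a small dictionary and queried per benchmark. The sketch below copies the scores from the table as-is; the SCORES layout and the leaders helper are illustrative names, not part of any library.

```python
# Scores copied from the comparison table above (percentages).
# None marks a benchmark the table does not report for that model.
SCORES = {
    "IBM Granite 4.1 8B": {"MMLU-Pro": 56.0, "HumanEval": 87.2, "GSM8K": 92.5, "GPQA": 42.0, "ARC-C": None},
    "Qwen3.5-9B": {"MMLU-Pro": 82.5, "HumanEval": 76.0, "GSM8K": 91.2, "GPQA": 81.7, "ARC-C": 79.4},
    "Gemma 4 E4B": {"MMLU-Pro": 78.1, "HumanEval": 71.3, "GSM8K": 89.2, "GPQA": 68.4, "ARC-C": 83.7},
    "Qwen3-8B": {"MMLU-Pro": 79.6, "HumanEval": 76.0, "GSM8K": 89.8, "GPQA": 65.4, "ARC-C": 77.2},
    "DeepSeek-R1-Distill-Qwen-7B": {"MMLU-Pro": 72.8, "HumanEval": 55.1, "GSM8K": 92.8, "GPQA": 62.1, "ARC-C": 74.3},
    "Phi-4-mini": {"MMLU-Pro": 52.8, "HumanEval": 74.4, "GSM8K": 88.6, "GPQA": 52.8, "ARC-C": 83.7},
    "Llama 3.1 8B Instruct": {"MMLU-Pro": 73.0, "HumanEval": 72.6, "GSM8K": 86.4, "GPQA": 49.0, "ARC-C": 75.1},
}

def leaders(benchmark):
    """Return (best_score, [models tied at that score]) for one benchmark."""
    reported = {m: s[benchmark] for m, s in SCORES.items() if s[benchmark] is not None}
    best = max(reported.values())
    return best, [m for m, v in reported.items() if v == best]

for bench in ["MMLU-Pro", "HumanEval", "GSM8K", "GPQA", "ARC-C"]:
    best, models = leaders(bench)
    print(f"{bench}: {best}% -> {', '.join(models)}")
```

Returning a list rather than a single name keeps ties visible, which matters for ARC-C, where two models share the top score.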

1. IBM Granite 4.1 8B

A dense 8B model released in late April 2026, Granite 4.1 8B beats IBM's previous 32B Mixture-of-Experts model on most production benchmarks. Its HumanEval score of 87.2% is the highest in this list, making it a strong pick for code generation, RAG, and enterprise tool-calling workflows. It supports 12 languages and a 131K context window (extendable to 512K).

Best for: Code generation, enterprise RAG, tool calling
License: Apache 2.0

Using Granite 4.1 8B via Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/ibm-granite/granite-4.1-8b-instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Write a Python function to check if a string is a palindrome.",
    "parameters": {"max_new_tokens": 200, "temperature": 0.3}
})
print(output)
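The raw JSON printed above usually needs unwrapping before use. The sketch below assumes the text-generation response shape of the Hugging Face Inference API: a list containing a dict with a "generated_text" key on success, or a dict with an "error" key on failure. The extract_text helper is a hypothetical name, and the sample payload is hand-written rather than real API output.

```python
def extract_text(output):
    """Pull generated text out of a decoded Inference API response.

    Assumes a list of dicts with a "generated_text" key on success,
    or a dict with an "error" key on failure.
    """
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"Inference API error: {output['error']}")
    if (
        isinstance(output, list)
        and output
        and isinstance(output[0], dict)
        and "generated_text" in output[0]
    ):
        return output[0]["generated_text"]
    raise ValueError(f"Unexpected response shape: {output!r}")

# Hand-written sample payload, not real API output.
sample = [{"generated_text": "def is_palindrome(s): return s == s[::-1]"}]
print(extract_text(sample))
```

Raising on the error shape instead of returning it keeps a failed request from being mistaken for model output further down the pipeline.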

Using Granite 4.1 8B with LangChain

from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = HuggingFaceEndpoint(
    repo_id="ibm-granite/granite-4.1-8b-instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=256,
    temperature=0.3,
)

prompt = PromptTemplate.from_template("Write a Python function to {task}")
chain = prompt | llm | StrOutputParser()

result = chain.invoke({"task": "check if a string is a palindrome"})
print(result)

2. Qwen3.5-9B

Alibaba's Qwen3.5-9B is the strongest all-around reasoner in this size class, scoring 82.5% on MMLU-Pro and 81.7% on GPQA Diamond, numbers that would have placed it in the top tier of all open models a year ago.

It supports dual "thinking" and "non-thinking" modes, so you can switch between fast responses and deep chain-of-thought reasoning depending on the task.

Best for: General reasoning, multilingual tasks, science QA
License: Apache 2.0

Using Qwen3.5-9B via Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/Qwen/Qwen3.5-9B-Instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Explain the difference between supervised and unsupervised learning in two sentences.",
    "parameters": {"max_new_tokens": 150}
})
print(output)

Using Qwen3.5-9B with LangChain

from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.messages import HumanMessage

llm = HuggingFaceEndpoint(
    repo_id="Qwen/Qwen3.5-9B-Instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=300,
)

chat = ChatHuggingFace(llm=llm)
response = chat.invoke([HumanMessage(content="Explain quantum entanglement simply.")])
print(response.content)

Tip: To enable "thinking mode" with Qwen models, prepend /think to your prompt, or /no_think for fast responses without chain-of-thought.
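A tiny helper makes the switch in this tip explicit at the call site. This is only a sketch of the prefixing described above; with_thinking_mode is a hypothetical name, and whether a given deployment honors the switch depends on the model and its chat template.

```python
def with_thinking_mode(prompt, thinking):
    """Prefix a prompt with the /think or /no_think switch described in the tip."""
    switch = "/think" if thinking else "/no_think"
    return f"{switch} {prompt}"

# Deep reasoning for a hard question, fast path for a simple one.
print(with_thinking_mode("Prove that the square root of 2 is irrational.", thinking=True))
print(with_thinking_mode("What is the capital of France?", thinking=False))
```

The returned string can be passed as the message content in either of the Qwen examples above.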

3. Gemma 4 E4B

Google's Gemma 4 E4B is built for agents and tool calling on constrained hardware. It runs in about 5GB of RAM at 4-bit quantization, supports a 128K context window, and accepts native audio input alongside text. Its 83.7% ARC-Challenge score ties Phi-4-mini for the highest in this list.

Best for: Agents, tool calling, on-device/edge deployment, audio input
License: Apache 2.0

Using Gemma 4 E4B via Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/google/gemma-4-e4b-it"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "What's the weather like in a city near the equator in December?",
    "parameters": {"max_new_tokens": 150}
})
print(output)

Using Gemma 4 E4B with LangChain

from langchain_huggingface import HuggingFaceEndpoint
from langchain.agents import initialize_agent, AgentType, Tool

llm = HuggingFaceEndpoint(
    repo_id="google/gemma-4-e4b-it",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=256,
)

def get_word_length(word: str) -> int:
    return len(word)

tools = [
    Tool(
        name="WordLength",
        func=get_word_length,
        description="Returns the length of a word"
    )
]

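# initialize_agent is LangChain's legacy agent constructor; newer releases deprecate it
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
# invoke() replaces the deprecated run(); the final answer is under the "output" key
result = agent.invoke({"input": "How many letters are in the word 'multimodal'?"})
print(result["output"])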
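The zero-shot ReAct agent type used above has the model emit plain-text lines such as "Action:" and "Action Input:", which the framework parses and dispatches to the matching tool. The sketch below is a simplified, standard-library illustration of one such step, not LangChain's actual parser. The react_step helper, the TOOLS registry, and the sample model outputs are all hypothetical.

```python
import re

def get_word_length(word: str) -> int:
    return len(word)

# Tool registry keyed by the name the model is told to use.
TOOLS = {"WordLength": get_word_length}

# Matches an "Action: <tool>" line followed by an "Action Input: <value>" line.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)

def react_step(llm_text):
    """Handle one model turn: return ("final", answer) or ("observation", result)."""
    if "Final Answer:" in llm_text:
        return "final", llm_text.split("Final Answer:", 1)[1].strip()
    match = ACTION_RE.search(llm_text)
    if match is None:
        raise ValueError("Could not parse an action from the model output")
    tool_name = match.group(1).strip()
    # Models often wrap the input in quotes; strip them before calling the tool.
    tool_input = match.group(2).strip().strip("'\"")
    return "observation", TOOLS[tool_name](tool_input)

# Hand-written model outputs for illustration.
turn_1 = "Thought: I should count the letters.\nAction: WordLength\nAction Input: 'multimodal'"
turn_2 = "Thought: I now know the answer.\nFinal Answer: 10 letters"
print(react_step(turn_1))
print(react_step(turn_2))
```

In a full agent loop the observation would be appended to the prompt and the model called again, repeating until it produces a final answer.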

4. Qwen3-8B

The previous-generation Qwen3-8B remains relevant in mid-2026 thanks to its strength in code generation: its 76.0% HumanEval score matches Qwen3.5-9B and trails only Granite 4.1 8B in this list. It supports 29+ languages and the same thinking/non-thinking dual mode as its successor.

Best for: Code generation, multilingual applications
License: Apache 2.0

Using Qwen3-8B via Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/Qwen/Qwen3-8B"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Write a SQL query to find the top 5 customers by total order value.",
    "parameters": {"max_new_tokens": 200}
})
print(output)

Using Qwen3-8B with LangChain

from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = HuggingFaceEndpoint(
    repo_id="Qwen/Qwen3-8B",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=300,
)

# Wrap the endpoint in a chat model so the system/user roles go through the model's chat template
chat = ChatHuggingFace(llm=llm)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert SQL developer."),
    ("user", "{question}")
])

chain = prompt | chat | StrOutputParser()
result = chain.invoke({"question": "Write a query to find top 5 customers by order value"})
print(result)

5. DeepSeek-R1-Distill-Qwen-7B

This model distills the chain-of-thought reasoning of the full DeepSeek-R1 into a 7B Qwen2.5-Math base. The result is the best GSM8K score in this list at 92.8%. If your application involves step-by-step logical reasoning, this is the model to pick.

Best for: Math, chain-of-thought reasoning, logic puzzles
License: MIT

Using DeepSeek-R1-Distill-Qwen-7B via Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "If a train travels 60 km in 45 minutes, what is its speed in km/h? Show your reasoning.",
    "parameters": {"max_new_tokens": 300}
})
print(output)

Using DeepSeek-R1-Distill-Qwen-7B with LangChain

from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = HuggingFaceEndpoint(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=400,
    temperature=0.1,
)

prompt = PromptTemplate.from_template(
    "Solve the following problem step by step:\n{problem}"
)

# Compose with the pipe operator, as in the other examples, instead of the deprecated LLMChain
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"problem": "A car depreciates 15% per year. If it's worth $20,000 now, what will it be worth in 3 years?"})
print(result)
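When testing a reasoning model, it helps to know the expected answers in advance. The two prompts used in this section can be checked with plain arithmetic, as in this short sketch (the variable names are illustrative):

```python
# Train: 60 km in 45 minutes. Convert minutes to hours, then divide.
speed_kmh = 60 / (45 / 60)

# Car: 15% depreciation per year compounds, so multiply by 0.85 once per year.
value_after_3_years = 20_000 * (1 - 0.15) ** 3

print(f"Train speed: {speed_kmh:.1f} km/h")                      # Train speed: 80.0 km/h
print(f"Car value after 3 years: ${value_after_3_years:,.2f}")   # Car value after 3 years: $12,282.50
```

A common failure mode on the second prompt is subtracting a flat 45% instead of compounding, which gives $11,000 rather than $12,282.50.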

6. Phi-4-mini

Microsoft's Phi-4-mini packs surprising performance into 3.8B parameters. It runs on roughly 3GB of VRAM, making it the best choice for resource-constrained devices like laptops with integrated GPUs or older hardware. Its ARC-Challenge score of 83.7% ties Gemma 4 E4B for the highest in this list.

Best for: Edge deployment, low-resource environments, CPU inference
License: MIT
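Memory figures like the one above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight divided by eight. The sketch below covers the weights only; KV cache, activations, and runtime overhead come on top and are not modeled. The weight_memory_gb helper is a hypothetical name.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Phi-4-mini's 3.8B parameters at common precisions.
for bits in (16, 8, 4):
    print(f"Phi-4-mini (3.8B) at {bits}-bit: ~{weight_memory_gb(3.8, bits):.1f} GB")
```

Under these assumptions, a figure of roughly 3GB would be consistent with 4-bit weights (about 1.9 GB) plus cache and runtime overhead, though the exact breakdown depends on the runtime and context length.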

Using Phi-4-mini via Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/microsoft/Phi-4-mini-instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Summarize the importance of edge AI in three bullet points.",
    "parameters": {"max_new_tokens": 150}
})
print(output)

Using Phi-4-mini with LangChain

from langchain_huggingface import HuggingFaceEndpoint
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = HuggingFaceEndpoint(
    repo_id="microsoft/Phi-4-mini-instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=200,
)

prompt = PromptTemplate.from_template("Summarize the following topic in 3 bullet points: {topic}")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"topic": "the importance of edge AI"}))

7. Llama 3.1 8B Instruct

Meta's Llama 3.1 8B remains the safest general-purpose default in 2026. Its balance across MMLU-Pro (73.0%) and HumanEval (72.6%) places it within striking distance of models twice its size, and no other model family matches its ecosystem of community fine-tunes covering everything from legal review to TypeScript generation.

Best for: General-purpose chat, fine-tuning base model, broad ecosystem support
License: Llama 3.1 Community License

Using Llama 3.1 8B via Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-3.1-8B-Instruct"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Draft a short, friendly email reminding a client about an upcoming meeting.",
    "parameters": {"max_new_tokens": 200}
})
print(output)

Using Llama 3.1 8B with LangChain

from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace
from langchain_core.messages import SystemMessage, HumanMessage

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    huggingfacehub_api_token="YOUR_HF_TOKEN",
    max_new_tokens=256,
)

chat = ChatHuggingFace(llm=llm)

messages = [
    SystemMessage(content="You are a helpful assistant that writes professional emails."),
    HumanMessage(content="Draft a reminder email about a meeting tomorrow at 3 PM.")
]

response = chat.invoke(messages)
print(response.content)

Setup Notes for All Examples

Get a Hugging Face token: Sign up at huggingface.co, go to Settings → Access Tokens, and create a read token.
Install dependencies:

   pip install langchain langchain-huggingface requests

Some models require a "Pro" Inference Endpoint or local hosting for larger context windows or higher throughput. For production workloads, consider deploying via Hugging Face Inference Endpoints or running locally with Ollama/vLLM, then pointing LangChain at your local server using ChatOpenAI with a custom base_url.
Rate limits: The free Hugging Face Inference API has rate limits and may queue requests for less popular models. For consistent latency, use a dedicated Inference Endpoint.
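If you stay on the free tier, a retry wrapper with exponential backoff is one common way to ride out rate limits and queued requests. The sketch below is generic and uses only the standard library; call_with_backoff and flaky_call are hypothetical names, and the failure is simulated rather than a real API response.

```python
import time

def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in for a rate-limited call (fails twice, then succeeds).
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

# Pass a no-op sleep so the demo finishes instantly.
print(call_with_backoff(flaky_call, sleep=lambda seconds: None))
```

In real use, fn would be a zero-argument function that wraps one of the request or chain calls from the examples above and raises when the response signals an error.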

Which Model Should You Pick?

Need the best coding output → Granite 4.1 8B or Qwen3-8B
Need the strongest general reasoning → Qwen3.5-9B
Building an agent with tool calling on limited hardware → Gemma 4 E4B
Math-heavy or logic-heavy application → DeepSeek-R1-Distill-Qwen-7B
Running on a laptop or CPU-only machine → Phi-4-mini
Want the broadest community support and fine-tunes → Llama 3.1 8B Instruct

All seven models are open-weight, run comfortably on a single consumer GPU, and are accessible through the same Hugging Face + LangChain pattern shown above: swap the repo_id and adjust your prompt.
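Because only the repo_id changes between examples, the decision guide above can be captured as a simple lookup. In this sketch the use-case keys are hypothetical labels chosen for illustration, while the repo IDs are the ones used in this guide's examples.

```python
# Use-case keys are hypothetical labels; repo IDs are the ones used in this guide.
REPO_BY_USE_CASE = {
    "coding": "ibm-granite/granite-4.1-8b-instruct",
    "reasoning": "Qwen/Qwen3.5-9B-Instruct",
    "agents": "google/gemma-4-e4b-it",
    "math": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "edge": "microsoft/Phi-4-mini-instruct",
    "general": "meta-llama/Llama-3.1-8B-Instruct",
}

def pick_repo_id(use_case):
    """Map a use-case label to the repo_id recommended in the guide."""
    try:
        return REPO_BY_USE_CASE[use_case]
    except KeyError:
        options = ", ".join(sorted(REPO_BY_USE_CASE))
        raise ValueError(f"Unknown use case {use_case!r}; choose one of: {options}") from None

print(pick_repo_id("math"))
```

The returned string can be passed directly as the repo_id argument of HuggingFaceEndpoint in any of the LangChain examples.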

FAQs

Q1. Which open-weight model under 10B parameters is best for coding tasks?

IBM Granite 4.1 8B is one of the strongest choices for coding, achieving the highest HumanEval score among the models listed and excelling in code generation and tool-calling workflows.

Q2. What is the best small model for mathematical reasoning?

DeepSeek-R1-Distill-Qwen-7B stands out for mathematical and logical reasoning, achieving one of the highest GSM8K scores in the sub-10B category.

Q3. Which sub-10B model is best for running on low-resource devices?

Phi-4-mini is optimized for resource-constrained environments, requiring relatively low VRAM while maintaining strong reasoning and benchmark performance.