Phi-4-Reasoning: Building Smarter AI Agents with 14B Parameters

I build AI agents. My goal is to create assistants that can tackle complex problems, understand tricky instructions, and plan several steps ahead.

For a while, though, I hit a wall. The open-weight language models I used, even the good ones, would often get confused by logic puzzles, misinterpret what I needed, or just give up on tasks that required deep thinking.

I needed something better – a model that could really reason and, importantly, show me how it was thinking.

Then, I discovered Microsoft's Phi-4-reasoning models. The buzz was about a 14-billion parameter model that could outperform much larger ones on reasoning tasks.

This sounded like exactly what my agents needed. So, I dived in to see if it could solve my reasoning bottleneck.

What's New and Special About Phi-4-reasoning?

Phi-4-reasoning isn't just another large language model. It starts with the solid foundation of Microsoft's Phi-4 (a 14B parameter, dense decoder-only Transformer model) and then undergoes specialized training to become a reasoning powerhouse.

Here’s what makes it stand out for me:

Two Flavors for Reasoning: Phi-4-reasoning and Phi-4-reasoning-plus

Microsoft offers two enhanced versions focused on reasoning:

Phi-4-reasoning (SFT): This is the base reasoning model. Microsoft fine-tuned the original Phi-4 model using Supervised Fine-Tuning (SFT).

They used a carefully curated dataset packed with Chain-of-Thought (CoT) traces. CoT means the model learns to break down problems into step-by-step reasoning before giving an answer.

This dataset included over 1.4 million prompts with high-quality answers, many generated using OpenAI's o3-mini model, focusing on math, science, and coding.

They also included data for safety and alignment.

Phi-4-reasoning-plus (RL Enhanced): This version takes the already capable Phi-4-reasoning and makes it even better using Reinforcement Learning (RL).

Specifically, they used RL on about 6,000 high-quality math problems with verifiable solutions.

This RL training encourages the model to use more "thinking time" during inference: it generates about 1.5 times more tokens in its reasoning process than the SFT version.

The result? Higher accuracy, especially on complex math problems.

Key Features:

I've compiled a list of features that make these models shine:

Advanced Reasoning Capabilities:
- Both models are designed to excel at complex reasoning tasks, including math, coding, algorithmic problem-solving, and planning.
- They often outperform much larger models (such as DeepSeek-R1-Distill-Llama-70B, and on specific benchmarks like AIME 2025 sometimes even the full 671B parameter DeepSeek-R1) despite their relatively small 14B parameter size.
Specialized Fine-Tuning Data:
- The SFT for Phi-4-reasoning used a rich dataset (16B tokens, ~8.3B unique) focused on high-quality CoT traces, synthetic prompts, and filtered public data covering STEM, coding, and safety. This targeted training is key to its reasoning prowess.
Structured Output (Reasoning + Solution):
- A standout feature is how the models structure their responses. They output a reasoning chain-of-thought block, often marked with <think>...</think> tags, followed by a final summarization or solution block (e.g., <solution>...</solution>). This is invaluable because:
  - Transparency: I can see the model's step-by-step thought process.
  - Debuggability: It's easier to identify where reasoning might go wrong.
  - Agent Integration: My agent can parse these blocks to understand the logic and use intermediate steps for further actions.
Increased Context Length for Reasoning:
- The base Phi-4 model had a 16K token context length. For Phi-4-reasoning, Microsoft increased this to 32,000 tokens by modifying the RoPE base frequency and continuing training.
  
  This longer context allows the model to handle more complex, multi-part prompts and maintain coherence over longer reasoning chains.
  
  Microsoft also reported promising experimental results at 64K tokens.
Architecture (Based on Phi-4):
- Both reasoning models inherit the 14B parameter, dense decoder-only Transformer architecture from Phi-4.
- The Phi-4 base model was trained on a blend of synthetic datasets, filtered public domain websites, and academic/Q&A datasets, focusing on high quality and advanced reasoning from the start.
Efficiency:
- Achieving such strong reasoning performance with "only" 14B parameters is a significant step in making powerful AI more accessible and runnable on less extreme hardware.
Safety Post-Training:
- The base Phi-4 (and by extension, these fine-tuned versions) underwent robust safety post-training using diverse datasets to ensure precise instruction adherence and safety measures.
Availability and License:
- Both Phi-4-reasoning and Phi-4-reasoning-plus are available on Hugging Face and Azure AI Foundry.
- They are released under a permissive MIT License, allowing for both noncommercial and commercial use.
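
The context-length item above mentions modifying the RoPE base frequency. As a rough illustration of why a larger base helps with longer inputs, the sketch below computes the rotation wavelengths of rotary position embeddings for two base values. The head dimension and both base values are assumptions picked for illustration, not Phi-4's actual configuration.

```python
import math

def rope_wavelengths(base: float, head_dim: int) -> list[float]:
    """Wavelength (in token positions) of each rotary frequency pair.

    RoPE rotates pair i by the angle pos * base**(-2i/head_dim), so one
    full rotation takes 2*pi * base**(2i/head_dim) positions.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Hypothetical values, chosen only to show the trend.
HEAD_DIM = 128
small_base = rope_wavelengths(10_000.0, HEAD_DIM)
large_base = rope_wavelengths(500_000.0, HEAD_DIM)

# The slowest-rotating pair limits how far apart two positions can be
# before their relative rotation angle wraps around.
print(f"longest wavelength, base=10k:  {small_base[-1]:,.0f} positions")
print(f"longest wavelength, base=500k: {large_base[-1]:,.0f} positions")
```

Raising the base stretches the slow-rotating pairs, which is one reason continued training at the new base can extend the usable context.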

These features convinced me that Phi-4-reasoning was worth integrating into my agent development workflow.

Running Phi-4-reasoning Locally

I decided to run Phi-4-reasoning locally using the Hugging Face transformers library. This gives me good control and allows easy integration into my Python-based agents.

(Ensure you have a machine with a suitable GPU and enough VRAM. A 14B parameter model in bfloat16 needs about 28GB for the weights alone (14 billion parameters × 2 bytes each), so you'd typically want a GPU with 32GB or more of VRAM for comfortable loading and inference. device_map="auto" will try its best to use what you have, potentially offloading to CPU, which will be slower.)
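
The VRAM figure above comes from simple arithmetic on the parameter count. Here is a minimal sketch of that estimate; the helper name weight_memory_gb is my own, and the numbers cover the weights only (the KV cache, activations, and quantization overhead come on top).

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 14e9  # 14B parameters, as stated above

for label, nbytes in [("bfloat16/float16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{label:>17}: ~{weight_memory_gb(N_PARAMS, nbytes):.0f} GB")
# prints ~28 GB, ~14 GB and ~7 GB respectively
```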

Step 1: Install Necessary Libraries

First, I made sure I had the required Python packages installed. In my terminal, I ran:

pip install transformers accelerate torch einops bitsandbytes

transformers: For loading models and tokenizers from Hugging Face.
accelerate: Helps with efficient model loading across devices (CPU/GPU).
torch: The PyTorch deep learning framework.
einops: Often a dependency for Transformer architectures.
bitsandbytes: For quantization if you decide to load in 8-bit or 4-bit to save memory (though for full performance, bfloat16 or float16 is better if you have the VRAM).
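
If VRAM is tight, a 4-bit load through bitsandbytes might look like the sketch below. The function name load_phi4_quantized and the NF4 settings are my own illustrative choices rather than anything Microsoft recommends, and I have not measured how much quantization affects reasoning quality.

```python
def load_phi4_quantized(model_id: str = "microsoft/Phi-4-reasoning"):
    """Load the model in 4-bit NF4 via bitsandbytes (sketch; needs a CUDA GPU)."""
    # Imports live inside the function so the heavy libraries are only
    # required when the model is actually loaded.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4-bit
        bnb_4bit_quant_type="nf4",              # assumed quantization type
        bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quant_config,
    )
    return tokenizer, model
```

Calling tokenizer, model = load_phi4_quantized() would then stand in for the full-precision load in the script that follows.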

Step 2: Python Script to Load and Run the Model

I created a Python script (let's call it run_phi4_reasoning.py) to load the model and tokenizer, and then to interact with it.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Choose the model ID:
# For the Supervised Fine-Tuned version:
model_id = "microsoft/Phi-4-reasoning"
# For the Reinforcement Learning enhanced version (generally better for complex math/reasoning):
# model_id = "microsoft/Phi-4-reasoning-plus" # Check Hugging Face for exact availability

print(f"Loading tokenizer for {model_id}...")
# trust_remote_code=True is often necessary for models with custom code/architecture.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
print("Tokenizer loaded.")

print(f"Loading model {model_id}...")
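# device_map="auto" places the model on your GPU(s) if available, offloading to CPU otherwise.
# torch.bfloat16 suits modern GPUs; torch_dtype="auto" would instead use the dtype stored in the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16, # Use torch.float16 if bfloat16 is not supported
    trust_remote_code=True,
    # To potentially save memory at the cost of speed/precision, pass a bitsandbytes
    # config (requires: from transformers import BitsAndBytesConfig):
    # quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    # quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)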
print("Model loaded.")
print(f"Model is currently on device: {model.device}")

def get_phi4_response(system_prompt_content, user_prompt_content, max_new_tokens=1500):
    """
    Generates a response from the Phi-4-reasoning model using a system and user prompt.
    """
    # Phi-4-reasoning models perform best with a chat format.
    # The system prompt guides the model on how to structure its reasoning and solution.
    messages = [
        {"role": "system", "content": system_prompt_content},
        {"role": "user", "content": user_prompt_content}
    ]

    # The tokenizer's apply_chat_template correctly formats the input for the model,
    # including special tokens like <|im_start|> and <|im_end|> if used by this model family.
    # add_generation_prompt=True is crucial to signal the model to generate the assistant's reply.
    print("\nApplying chat template...")
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device) # Move input tensors to the same device as the model
    print("Chat template applied. Generating response...")

    # Generate the response from the model
    # You can experiment with generation parameters like temperature, top_p, num_beams etc.
    outputs = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,
        # Example generation parameters (optional):
        # do_sample=True,
        # temperature=0.5,
        # top_p=0.95,
        # num_return_sequences=1,
    )
    print("Response generated. Decoding...")

    # Decode the generated tokens into text.
    # We only want to decode the newly generated tokens, not the input prompt tokens.
    response_tokens = outputs[0][inputs.shape[-1]:]
    response_text = tokenizer.decode(response_tokens, skip_special_tokens=True)
    
    return response_text

# This is the recommended system prompt structure to encourage Chain-of-Thought.
# (Adapted from Microsoft's documentation/examples for similar models)
system_prompt_for_reasoning = """Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> {Thought section} </think> <solution> {Solution section} </solution>. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion. Now, try to solve the following question through the above guidelines:"""

# My actual question for the model
user_question_to_solve = "I have 3 red boxes, 4 blue boxes, and 5 green boxes. Each red box contains 2 apples. Each blue box contains 3 oranges. Each green box contains 1 apple and 1 orange. How many total apples and total oranges do I have?"

print("\n--- System Prompt Being Used ---")
print(system_prompt_for_reasoning)
print("\n--- User Question ---")
print(user_question_to_solve)

# Get and print the model's response
full_model_response = get_phi4_response(system_prompt_for_reasoning, user_question_to_solve)

print("\n\n--- Full Model Response ---")
print(full_model_response)
print("---------------------------")

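# Simple parsing of the <think> and <solution> blocks
try:
    think_content = full_model_response.split("<think>")[1].split("</think>")[0].strip()
    after_think = full_model_response.split("</think>", 1)[1]
    if "<solution>" in after_think:
        solution_content = after_think.split("<solution>")[1].split("</solution>")[0].strip()
    else:
        # Fall back to everything after </think> if the model omits the <solution> tags.
        solution_content = after_think.strip()
    print("\n--- Parsed Thinking Process ---")
    print(think_content)
    print("\n--- Parsed Solution ---")
    print(solution_content)
except IndexError:
    print("\nCould not parse the <think> block from the response. The model might not have followed the format strictly, or generation stopped before </think>.")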

Step 3: Run the Python Script

I saved the code above as run_phi4_reasoning.py and then ran it from my terminal:

python run_phi4_reasoning.py

The first time I run it, it downloads the model weights, which can take some time depending on my internet speed. After that, it loads the model into memory (GPU if available) and starts generating the response.

Seeing Phi-4-reasoning in Action

The key to getting good results from Phi-4-reasoning is to use the system prompt that guides it to produce the <think>...</think> and <solution>...</solution> blocks.

My Test Prompt:
(Using the system_prompt_for_reasoning defined in the Python script above)

User Question:

I have 3 red boxes, 4 blue boxes, and 5 green boxes. Each red box contains 2 apples. Each blue box contains 3 oranges. Each green box contains 1 apple and 1 orange. How many total apples and total oranges do I have?
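
Before looking at what the model says, it helps to have a ground-truth answer to compare against. A few lines of Python compute the totals directly from the numbers in the question; the dictionary layout is just one convenient way to write it down.

```python
# Box counts and per-box contents, copied from the question above.
boxes = {
    "red":   {"count": 3, "apples": 2, "oranges": 0},
    "blue":  {"count": 4, "apples": 0, "oranges": 3},
    "green": {"count": 5, "apples": 1, "oranges": 1},
}

total_apples = sum(b["count"] * b["apples"] for b in boxes.values())
total_oranges = sum(b["count"] * b["oranges"] for b in boxes.values())

print(total_apples, total_oranges)  # prints: 11 17
```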

Expected Output Structure from Phi-4-reasoning (Illustrative Content):

<think>

Understand the Goal: I need to calculate the total number of apples and oranges.
Break Down by Box Color - Apples:
- Red Boxes: 3 red boxes × 2 apples/red box = 6 apples.
- Blue Boxes: 4 blue boxes × 0 apples/blue box = 0 apples.
- Green Boxes: 5 green boxes × 1 apple/green box = 5 apples.
Calculate Total Apples:
- Total apples = Apples from red boxes + blue boxes + green boxes
- Total apples = 6 + 0 + 5 = 11 apples.
Break Down by Box Color - Oranges:
- Red Boxes: 3 red boxes × 0 oranges/red box = 0 oranges.
- Blue Boxes: 4 blue boxes × 3 oranges/blue box = 12 oranges.
- Green Boxes: 5 green boxes × 1 orange/green box = 5 oranges.
Calculate Total Oranges:
- Total oranges = Oranges from red boxes + blue boxes + green boxes
- Total oranges = 0 + 12 + 5 = 17 oranges.
Final Check: Review calculations to ensure accuracy. All steps seem correct.

</think>

<solution>

To find the total number of apples and oranges, let's calculate them step by step:

Apples:

Apples from red boxes: There are 3 red boxes, and each contains 2 apples. So, 3 × 2 = 6 apples.
Apples from blue boxes: Blue boxes only contain oranges, so 0 apples from blue boxes.
Apples from green boxes: There are 5 green boxes, and each contains 1 apple. So, 5 × 1 = 5 apples.

Total apples = 6 (from red) + 0 (from blue) + 5 (from green) = 11 apples.

Oranges:

Oranges from red boxes: Red boxes only contain apples, so 0 oranges from red boxes.
Oranges from blue boxes: There are 4 blue boxes, and each contains 3 oranges. So, 4 × 3 = 12 oranges.
Oranges from green boxes: There are 5 green boxes, and each contains 1 orange. So, 5 × 1 = 5 oranges.

Total oranges = 0 (from red) + 12 (from blue) + 5 (from green) = 17 oranges.

Therefore, you have a total of 11 apples and 17 oranges.

</solution>

This structured output is incredibly useful. My agent can parse the <think> block to understand the logic and the <solution> block for the final answer.
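
For agent use, I find a slightly more defensive parser than plain string splitting worthwhile. The sketch below is one way to do it; the names ParsedResponse and parse_reasoning_response are hypothetical, and the fallback behavior (treating everything after </think> as the answer when no <solution> tags appear) is my own assumption about what is most useful.

```python
import re
from dataclasses import dataclass

@dataclass
class ParsedResponse:
    thinking: str  # text inside <think>...</think>, or "" if absent
    solution: str  # text inside <solution>...</solution>, or the fallback

def parse_reasoning_response(text: str) -> ParsedResponse:
    """Split a model response into its reasoning and solution parts."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    solution = re.search(r"<solution>(.*?)</solution>", text, flags=re.DOTALL)

    thinking = think.group(1).strip() if think else ""
    if solution:
        answer = solution.group(1).strip()
    elif think:
        # No <solution> tags: treat everything after </think> as the answer.
        answer = text[think.end():].strip()
    else:
        # No tags at all: keep the raw text so nothing is lost.
        answer = text.strip()
    return ParsedResponse(thinking=thinking, solution=answer)

example = "<think>3 * 2 = 6</think>\n<solution>6 apples</solution>"
parsed = parse_reasoning_response(example)
print(parsed.thinking)  # prints: 3 * 2 = 6
print(parsed.solution)  # prints: 6 apples
```

Returning a small dataclass instead of raising on malformed output lets the agent decide for itself whether an empty thinking field should trigger a retry.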

Why Phi-4-reasoning is a Leap Forward for My AI Agents

After integrating Phi-4-reasoning, my agents became noticeably "smarter":

Explicit Reasoning is Powerful: Seeing the step-by-step thinking in the <think> block allows my agent to verify the logic or even learn from it.
Improved Planning: For complex tasks, my agent can now ask Phi-4-reasoning to outline a plan (its CoT essentially is a plan), which the agent can then execute.
Tackling Difficult Problems: Problems that previously stumped my agents, especially those requiring multiple logical steps or careful interpretation, are now handled much more effectively.
Efficient for its Capabilities: Getting this level of reasoning from a 14B parameter model means I don't always need to rely on much larger, more resource-intensive APIs or models. This makes local deployment more feasible for sophisticated tasks.
Flexibility with MIT License: The MIT license gives me the freedom to use and adapt the model for various projects, including commercial ones.
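
As a concrete example of an agent verifying the logic, here is a minimal sketch that re-checks simple integer equations found in the reasoning text. The function name check_arithmetic is hypothetical, and the pattern only handles addition and multiplication of whole numbers, so treat it as a starting point rather than a general verifier.

```python
import re

# Matches spans such as "3 × 2 = 6" or "6 + 0 + 5 = 11" (integers only).
EQUATION = re.compile(r"(\d+(?:\s*[×x*+]\s*\d+)+)\s*=\s*(\d+)")

def _evaluate(expr: str) -> int:
    """Evaluate an integer expression containing only + and *."""
    total = 0
    for term in expr.split("+"):
        product = 1
        for factor in term.split("*"):
            product *= int(factor)
        total += product
    return total

def check_arithmetic(reasoning: str) -> list[tuple[str, bool]]:
    """Return (equation, is_correct) for each simple equation found."""
    results = []
    for lhs, claimed in EQUATION.findall(reasoning):
        # Normalize multiplication signs and drop whitespace before evaluating.
        normalized = re.sub(r"\s+", "", re.sub(r"[×x]", "*", lhs))
        results.append((f"{lhs} = {claimed}", _evaluate(normalized) == int(claimed)))
    return results

sample = "Total apples = 6 + 0 + 5 = 11 apples. So, 4 × 3 = 13 oranges."
for equation, ok in check_arithmetic(sample):
    print(equation, "OK" if ok else "MISMATCH")
```

In the sample, the second equation is deliberately wrong, so the agent would flag it and could ask the model to retry.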

Conclusion: Powering Up My Agents with Specialized Reasoning

Microsoft's Phi-4-reasoning (and its 'plus' variant) directly addressed the reasoning bottleneck I faced when building intelligent AI agents.

Its specialized fine-tuning on Chain-of-Thought data, the highly useful structured output, and its impressive reasoning capabilities packed into an efficient 14B parameter size make it an outstanding choice.

For any developer looking to build AI agents or applications that require robust, explainable reasoning without necessarily needing a colossal model, I strongly recommend exploring Phi-4-reasoning.

Setting it up locally with Hugging Face Transformers is straightforward, and the improvement in your agent's problem-solving abilities can be quite significant. My agents are certainly benefiting from this upgrade in "thinking" power!

FAQs

Q1: What is Phi-4-Reasoning?
A: Phi-4-Reasoning is a 14-billion parameter model designed to excel in complex reasoning tasks, trained through supervised fine-tuning on chain-of-thought data. Its Plus variant adds reinforcement learning on top.

Q2: How does Phi-4-Reasoning differ from Phi-4?
A: While Phi-4 focuses on data quality during training, Phi-4-Reasoning builds upon it with supervised fine-tuning on curated prompts and reasoning traces. Outcome-based reinforcement learning is applied only in the Plus variant.

Q3: What tasks does Phi-4-Reasoning excel in?
A: It performs strongly in math and scientific reasoning, coding, algorithmic problem-solving, planning, and spatial understanding.

Q4: How does Phi-4-Reasoning compare to larger models?
A: Despite its smaller size, Phi-4-Reasoning outperforms larger models like DeepSeek-R1-Distill-Llama-70B in various reasoning benchmarks.

Q5: What is Phi-4-Reasoning-Plus?
A: It's an enhanced variant of Phi-4-Reasoning, further improved through outcome-based reinforcement learning, generating longer reasoning traces for better performance.