DeepSeek R1: What Is It Actually Distilling?
There has been a lot of buzz lately about DeepSeek. It’s gotten to the point where it feels awkward not to take a closer look if you want to stay up-to-date with AI developments.
After skimming the DeepSeek R1 paper, what really caught my attention was its distillation technique, which breathes new life into small models, giving them stronger reasoning abilities. In my view, this advancement is significant for future on-device AI. Because it’s open source, many organizations around the globe are attempting to replicate its results, and we can expect even better distilled models in the near future.
In today’s AI landscape, larger models often deliver stronger reasoning, especially for math, code, and science-related tasks. However, these massive models are extremely expensive to run—beyond the reach of most individuals (and even many mid-sized to large companies).
But if we can use a sufficiently powerful language model (like the R1 discussed here) to generate high-quality reasoning datasets, and then apply distillation to teach smaller models to mimic that reasoning, we can drastically reduce computational costs while still boosting performance for specific tasks.
Below is a closer look at the process of distilling smaller models:
DeepSeek R1: Reinforcement Learning and Multi-Stage Optimization
DeepSeek R1 is a reasoning model developed by the DeepSeek-AI team. Its development involved two main training stages:
DeepSeek-R1-Zero (Pure RL Training)
- Uses pure reinforcement learning without any supervised fine-tuning (SFT), driven by simple rule-based rewards (a toy sketch follows below); the model learns to reason on its own.
- Excels at math, coding, and scientific reasoning.
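According to the paper, R1-Zero's RL signal comes from simple rule-based rewards (an accuracy check plus a format check) rather than a learned reward model. Below is a toy Python sketch of that idea; the exact rules and weights are my own illustrative assumptions, not values from the paper.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward in the spirit of DeepSeek-R1-Zero.

    Combines a format reward (is the reasoning wrapped in <think>...</think>
    and the result in <answer>...</answer>?) with an accuracy reward (does
    the extracted answer match the reference?). The weights are illustrative.
    """
    reward = 0.0

    # Format reward: encourage the expected <think>/<answer> structure.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.5

    # Accuracy reward: compare the extracted final answer to the reference.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

# A well-formatted, correct completion earns the full reward.
sample = "<think>60 km/h x 2.5 h = 150 km</think><answer>150 km</answer>"
print(rule_based_reward(sample, "150 km"))  # 1.5
```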
DeepSeek R1 (Cold Start + RL)
- Starts with high-quality “cold-start” data for SFT to improve readability and stability.
- Then applies reinforcement learning to further strengthen its reasoning abilities.
The resulting DeepSeek R1 model not only handles math proofs, coding challenges, and complex reasoning tasks, but also clearly lays out its thought process as a chain of thought (Chain-of-Thought, CoT).
What Is Distillation?
In AI, “distillation” is a technique that transfers a large model’s capabilities to a smaller model. By having the smaller model mimic the output of the larger one (see the short sketch after this list), distillation:
- Preserves the powerful reasoning of the large model
- Greatly reduces computational costs
- Enhances the smaller model’s performance on targeted tasks
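In DeepSeek's case this mimicry is sequence-level ("hard-label") distillation: the student is trained with ordinary next-token cross-entropy on text the teacher generated, rather than matching the teacher's full probability distribution. A minimal PyTorch sketch of that objective (shapes and variable names are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_token_ids: torch.Tensor) -> torch.Tensor:
    """Sequence-level ("hard-label") distillation loss.

    The student is trained to reproduce the token sequence the teacher
    generated: plain next-token cross-entropy where the teacher's output
    plays the role of the label.

    student_logits:    (batch, seq_len, vocab_size) from the student model
    teacher_token_ids: (batch, seq_len) tokens sampled from the teacher
    """
    # Shift so position t predicts the teacher's token at position t+1.
    logits = student_logits[:, :-1, :].reshape(-1, student_logits.size(-1))
    targets = teacher_token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for real model outputs.
logits = torch.randn(2, 8, 100)               # batch=2, seq_len=8, vocab=100
teacher_ids = torch.randint(0, 100, (2, 8))
print(distillation_loss(logits, teacher_ids))
```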
How Does Distillation Work?
Distillation primarily relies on high-quality supervised fine-tuning (SFT). The process involves these main steps:
Step 1: Generating High-Quality Data
The research team used DeepSeek R1 to produce around 600,000 reasoning samples, and they also gathered an additional 200,000 non-reasoning samples, for a total of 800,000. These data cover:
- Mathematical reasoning
- Coding challenges
- Scientific and logical reasoning
- General Q&A
These samples include not just the final answer but also the step-by-step reasoning. For example:
Teacher Model (DeepSeek R1) Output Sample:

```
<think>
Step 1: Use the distance formula:
Distance = Speed × Time
Step 2: Plug in known values:
Distance = 60 km/h × 2.5 h
Step 3: Calculate the result:
Distance = 150 km
</think>
<answer>150 km</answer>
```
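Assuming the dataset keeps this <think>/<answer> layout, each teacher completion can be converted into a prompt/response record for supervised fine-tuning. The helper below is purely illustrative (the paper describes filtering out malformed or low-quality outputs, but the exact logic here is my own):

```python
import json
import re
from typing import Optional

def to_sft_record(prompt: str, completion: str) -> Optional[dict]:
    """Turn one teacher completion into a prompt/response pair for SFT.

    Keeps the full chain of thought plus the final answer as the training
    target, and drops samples that don't follow the expected format
    (a simple stand-in for the quality filtering described in the paper).
    """
    has_think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    has_answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not (has_think and has_answer):
        return None  # malformed sample: filter it out
    return {"prompt": prompt, "response": completion.strip()}

record = to_sft_record(
    "A car travels at 60 km/h for 2.5 hours. How far does it go?",
    "<think>Distance = Speed x Time = 60 km/h x 2.5 h = 150 km</think>"
    "<answer>150 km</answer>",
)
print(json.dumps(record, indent=2))
```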
Step 2: Fine-Tuning Smaller Models
Next, the team selected the Qwen and Llama series as their student models, and used supervised fine-tuning to teach them how DeepSeek R1 reasons:
- Qwen2.5-7B
- Qwen2.5-14B
- Qwen2.5-32B
- Llama-3.1-8B
- Llama-3.3-70B
Through SFT, these smaller models gradually learned (a minimal fine-tuning sketch follows this list):
- How to break down complex problems
- How to organize continuous thinking (CoT)
- How to produce clear, understandable answers
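Below is a minimal fine-tuning sketch using PyTorch and Hugging Face transformers. The checkpoint name, hyperparameters, and the single toy record are placeholders of mine, not DeepSeek's actual configuration; the point is simply that distillation here is standard supervised fine-tuning on teacher-generated text.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder student checkpoint; DeepSeek's students were Qwen and Llama models.
model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# Teacher-generated records; in practice this would be the ~800k-sample set.
records = [
    {"prompt": "A car travels at 60 km/h for 2.5 hours. How far does it go?",
     "response": "<think>Distance = 60 km/h x 2.5 h = 150 km</think>"
                 "<answer>150 km</answer>"},
]

def collate(batch):
    texts = [r["prompt"] + "\n" + r["response"] + tokenizer.eos_token for r in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(records, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in loader:                  # one pass over the data
    loss = model(**batch).loss        # cross-entropy on the teacher's tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

A real run would also mask the prompt tokens out of the loss and train for a few epochs over the full dataset, but the core mechanism is exactly this: ordinary SFT where the labels happen to be DeepSeek R1's own reasoning traces.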
Step 3: Testing and Refinement
After training, the team evaluated the distilled models on standard benchmarks. The results showed that even the 7B and 14B students, once distilled, could match or surpass much larger non-distilled models such as QwQ-32B-Preview on math and coding tasks. (A short sketch of how the pass@1 and cons@64 metrics are computed follows the table.)
Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating)
---|---|---|---|---|---|---
OpenAI-o1-mini | 63.6% | 80.0% | 90.0% | 60.0% | 53.8% | 1820
QwQ-32B-Preview | 50.0% | 60.0% | 90.6% | 54.5% | 41.9% | 1316
DeepSeek-R1-Distill-Qwen-7B | 55.5% | 83.3% | 92.8% | 49.1% | 37.6% | 1189
DeepSeek-R1-Distill-Qwen-14B | 69.7% | 80.0% | 93.9% | 59.1% | 53.1% | 1481
DeepSeek-R1-Distill-Qwen-32B | 72.6% | 83.3% | 94.3% | 62.1% | 57.2% | 1691
DeepSeek-R1-Distill-Llama-8B | 50.4% | 80.0% | 89.1% | 49.0% | 39.6% | 1205
DeepSeek-R1-Distill-Llama-70B | 70.0% | 86.7% | 94.5% | 65.2% | 57.5% | 1633
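For reference, pass@1 is the average fraction of sampled answers that are correct, while cons@64 takes a majority vote over 64 samples per problem and scores the voted answer. Here is a small sketch of how those two numbers could be computed (the data layout is my own assumption):

```python
from collections import Counter

def pass_at_1(samples_per_problem, references):
    """pass@1: average fraction of sampled answers that are correct."""
    scores = [sum(s == ref for s in samples) / len(samples)
              for samples, ref in zip(samples_per_problem, references)]
    return sum(scores) / len(scores)

def cons_at_k(samples_per_problem, references):
    """cons@k: majority-vote the k samples for each problem, then score."""
    correct = 0
    for samples, ref in zip(samples_per_problem, references):
        majority, _ = Counter(samples).most_common(1)[0]
        correct += (majority == ref)
    return correct / len(references)

# Toy example with 4 samples per problem instead of 64.
samples = [["150 km", "150 km", "140 km", "150 km"], ["42", "41", "42", "42"]]
refs = ["150 km", "42"]
print(pass_at_1(samples, refs))  # 0.75
print(cons_at_k(samples, refs))  # 1.0
```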
Why Is DeepSeek’s Distillation So Effective?
Three main reasons stand out:
- It teaches smaller models how to reason, not just memorize answers.
- It leverages Chain-of-Thought (CoT) to strengthen reasoning, allowing smaller models to learn how to adapt and generalize.
- It uses high-quality data generated by DeepSeek R1, ensuring that smaller models only learn the best insights.
Conclusion
By performing SFT on high-quality data generated by DeepSeek R1, smaller Qwen and Llama models have learned the powerful reasoning techniques of DeepSeek R1. Even without additional reinforcement learning, these distilled models can approach (or sometimes match) the performance of something like OpenAI’s o1-mini on math and coding tasks.
This approach makes smaller models stronger and more efficient, suggesting that the day is coming when everyone can deploy their own AI. We may soon even have smartphones capable of running AI models that can solve real-world problems on the fly.