DeepSeek R1: What Is It Actually Distilling?
There has been a lot of buzz lately about DeepSeek. It’s gotten to the point where it feels awkward not to take a closer look if you want to stay up-to-date with AI developments.
After skimming the DeepSeek R1 paper, what really caught my attention was its distillation technique, which breathes new life into small models, giving them stronger reasoning abilities. In my view, this advancement is significant for future on-device AI. Because it’s open source, many organizations around the globe are attempting to replicate its results, and we can expect even better distilled models in the near future.
In today’s AI landscape, larger models often deliver stronger reasoning, especially for math, code, and science-related tasks. However, these massive models are extremely expensive to run—beyond the reach of most individuals (and even many mid-sized to large companies).
But if we can use a sufficiently powerful language model (like the R1 discussed here) to generate high-quality reasoning datasets, and then apply distillation to teach smaller models to mimic that reasoning, we can drastically reduce computational costs while still boosting performance for specific tasks.
Below is a closer look at the process of distilling smaller models:
DeepSeek R1: Reinforcement Learning and Multi-Stage Optimization
DeepSeek R1 is a reasoning model developed by the DeepSeek-AI team. Its development involved two main training stages:
DeepSeek-R1-Zero (Pure RL Training)
- Uses pure reinforcement learning without any supervised fine-tuning (SFT), driven by simple rule-based rewards (a toy sketch follows below); the model learns to reason on its own.
- Excels at math, coding, and scientific reasoning.
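According to the paper, R1-Zero's RL signal comes from simple rule-based rewards (an accuracy check plus a format check) rather than a learned reward model. Below is a toy Python sketch of that idea; the exact rules and weights are my own illustrative assumptions, not values from the paper.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward in the spirit of DeepSeek-R1-Zero.

    Combines a format reward (is the reasoning wrapped in <think>...</think>
    and the result in <answer>...</answer>?) with an accuracy reward (does
    the extracted answer match the reference?). The weights are illustrative.
    """
    reward = 0.0

    # Format reward: encourage the expected <think>/<answer> structure.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.5

    # Accuracy reward: compare the extracted final answer to the reference.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

# A well-formatted, correct completion earns the full reward.
sample = "<think>60 km/h x 2.5 h = 150 km</think><answer>150 km</answer>"
print(rule_based_reward(sample, "150 km"))  # 1.5
```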
DeepSeek R1 (Cold Start + RL)
- Starts with high-quality “cold-start” data for SFT to improve readability and stability.
- Then applies reinforcement learning to further strengthen its reasoning abilities.
The resulting DeepSeek R1 model not only handles math proofs, coding challenges, and complex reasoning tasks, but also clearly lays out its thought process as a chain of thought (Chain-of-Thought, CoT).
What Is Distillation?
In AI, “distillation” is a technique that transfers a large model’s capabilities to a smaller model. By having the smaller model mimic the output of the larger one (see the short sketch after this list), distillation:
- Preserves the powerful reasoning of the large model
- Greatly reduces computational costs
- Enhances the smaller model’s performance on targeted tasks
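In DeepSeek's case this mimicry is sequence-level ("hard-label") distillation: the student is trained with ordinary next-token cross-entropy on text the teacher generated, rather than matching the teacher's full probability distribution. A minimal PyTorch sketch of that objective (shapes and variable names are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_token_ids: torch.Tensor) -> torch.Tensor:
    """Sequence-level ("hard-label") distillation loss.

    The student is trained to reproduce the token sequence the teacher
    generated: plain next-token cross-entropy where the teacher's output
    plays the role of the label.

    student_logits:    (batch, seq_len, vocab_size) from the student model
    teacher_token_ids: (batch, seq_len) tokens sampled from the teacher
    """
    # Shift so position t predicts the teacher's token at position t+1.
    logits = student_logits[:, :-1, :].reshape(-1, student_logits.size(-1))
    targets = teacher_token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for real model outputs.
logits = torch.randn(2, 8, 100)               # batch=2, seq_len=8, vocab=100
teacher_ids = torch.randint(0, 100, (2, 8))
print(distillation_loss(logits, teacher_ids))
```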
How Does Distillation Work?
Distillation primarily relies on high-quality supervised fine-tuning (SFT). The process involves these main steps:
Step 1: Generating High-Quality Data
The research team used DeepSeek R1 to produce around 600,000 reasoning samples, and they also gathered an additional 200,000 non-reasoning samples, for a total of 800,000. These data cover:
- Mathematical reasoning
- Coding challenges
- Scientific and logical reasoning
- General Q&A
These samples include not just the final answer but also the step-by-step reasoning. For example:
Teacher Model (DeepSeek R1) Output Sample:

```
<think>
Step 1: Use the distance formula:
Distance = Speed × Time
Step 2: Plug in known values:
Distance = 60 km/h × 2.5 h
Step 3: Calculate the result:
Distance = 150 km
</think>
<answer>150 km</answer>
```
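Assuming the dataset keeps this <think>/<answer> layout, each teacher completion can be converted into a prompt/response record for supervised fine-tuning. The helper below is purely illustrative (the paper describes filtering out malformed or low-quality outputs, but the exact logic here is my own):

```python
import json
import re
from typing import Optional

def to_sft_record(prompt: str, completion: str) -> Optional[dict]:
    """Turn one teacher completion into a prompt/response pair for SFT.

    Keeps the full chain of thought plus the final answer as the training
    target, and drops samples that don't follow the expected format
    (a simple stand-in for the quality filtering described in the paper).
    """
    has_think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    has_answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not (has_think and has_answer):
        return None  # malformed sample: filter it out
    return {"prompt": prompt, "response": completion.strip()}

record = to_sft_record(
    "A car travels at 60 km/h for 2.5 hours. How far does it go?",
    "<think>Distance = Speed x Time = 60 km/h x 2.5 h = 150 km</think>"
    "<answer>150 km</answer>",
)
print(json.dumps(record, indent=2))
```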
Step 2: Fine-Tuning Smaller Models
Next, the team selected the Qwen and Llama series as their student models, and used supervised fine-tuning to teach them how DeepSeek R1 reasons:
- Qwen2.5-7B
- Qwen2.5-14B
- Qwen2.5-32B
- Llama-3.1-8B
- Llama-3.3-70B
Through SFT, these smaller models gradually learned (a minimal fine-tuning sketch follows this list):
- How to break down complex problems
- How to organize continuous thinking (CoT)
- How to produce clear, understandable answers
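Below is a minimal fine-tuning sketch using PyTorch and Hugging Face transformers. The checkpoint name, hyperparameters, and the single toy record are placeholders of mine, not DeepSeek's actual configuration; the point is simply that distillation here is standard supervised fine-tuning on teacher-generated text.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder student checkpoint; DeepSeek's students were Qwen and Llama models.
model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# Teacher-generated records; in practice this would be the ~800k-sample set.
records = [
    {"prompt": "A car travels at 60 km/h for 2.5 hours. How far does it go?",
     "response": "<think>Distance = 60 km/h x 2.5 h = 150 km</think>"
                 "<answer>150 km</answer>"},
]

def collate(batch):
    texts = [r["prompt"] + "\n" + r["response"] + tokenizer.eos_token for r in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(records, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in loader:                  # one pass over the data
    loss = model(**batch).loss        # cross-entropy on the teacher's tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

A real run would also mask the prompt tokens out of the loss and train for a few epochs over the full dataset, but the core mechanism is exactly this: ordinary SFT where the labels happen to be DeepSeek R1's own reasoning traces.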
Step 3: Testing and Refinement
After training, the team evaluated the distilled models on standard benchmarks. The results showed that even the 7B and 14B students, once distilled, could match or surpass much larger non-distilled models such as QwQ-32B-Preview on math and coding tasks. (A short sketch of how the pass@1 and cons@64 metrics are computed follows the table.)
Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating)
---|---|---|---|---|---|---
OpenAI-o1-mini | 63.6% | 80.0% | 90.0% | 60.0% | 53.8% | 1820
QwQ-32B-Preview | 50.0% | 60.0% | 90.6% | 54.5% | 41.9% | 1316
DeepSeek-R1-Distill-Qwen-7B | 55.5% | 83.3% | 92.8% | 49.1% | 37.6% | 1189
DeepSeek-R1-Distill-Qwen-14B | 69.7% | 80.0% | 93.9% | 59.1% | 53.1% | 1481
DeepSeek-R1-Distill-Qwen-32B | 72.6% | 83.3% | 94.3% | 62.1% | 57.2% | 1691
DeepSeek-R1-Distill-Llama-8B | 50.4% | 80.0% | 89.1% | 49.0% | 39.6% | 1205
DeepSeek-R1-Distill-Llama-70B | 70.0% | 86.7% | 94.5% | 65.2% | 57.5% | 1633
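For reference, pass@1 is the average fraction of sampled answers that are correct, while cons@64 takes a majority vote over 64 samples per problem and scores the voted answer. Here is a small sketch of how those two numbers could be computed (the data layout is my own assumption):

```python
from collections import Counter

def pass_at_1(samples_per_problem, references):
    """pass@1: average fraction of sampled answers that are correct."""
    scores = [sum(s == ref for s in samples) / len(samples)
              for samples, ref in zip(samples_per_problem, references)]
    return sum(scores) / len(scores)

def cons_at_k(samples_per_problem, references):
    """cons@k: majority-vote the k samples for each problem, then score."""
    correct = 0
    for samples, ref in zip(samples_per_problem, references):
        majority, _ = Counter(samples).most_common(1)[0]
        correct += (majority == ref)
    return correct / len(references)

# Toy example with 4 samples per problem instead of 64.
samples = [["150 km", "150 km", "140 km", "150 km"], ["42", "41", "42", "42"]]
refs = ["150 km", "42"]
print(pass_at_1(samples, refs))  # 0.75
print(cons_at_k(samples, refs))  # 1.0
```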
Why Is DeepSeek’s Distillation So Effective?
Three main reasons stand out:
- It teaches smaller models how to reason, not just memorize answers.
- It leverages Chain-of-Thought (CoT) to strengthen reasoning, allowing smaller models to learn how to adapt and generalize.
- It uses high-quality data generated by DeepSeek R1, ensuring that smaller models only learn the best insights.
Conclusion
By performing SFT on high-quality data generated by DeepSeek R1, smaller Qwen and Llama models have learned the powerful reasoning techniques of DeepSeek R1. Even without additional reinforcement learning, these distilled models can approach (or sometimes match) the performance of something like OpenAI’s o1-mini on math and coding tasks.
This approach makes smaller models stronger and more efficient, suggesting that the day is coming when everyone can deploy their own AI. We may soon even have smartphones capable of running AI models that can solve real-world problems on the fly.