
Deploy DeepSeek R1 7B Using vLLM

🔥 Super-fast deployment of DeepSeek-R1 7B! vLLM + Open-WebUI gets it done in one click! 🚀

1. Tutorial Introduction

DeepSeek-R1 is an efficient, lightweight language model released by DeepSeek in 2025. It supports tasks such as text generation, dialogue, translation, and summarization. Through knowledge distillation, it balances high performance with low compute requirements, making it well suited for rapid deployment and practical use.

⚡  Why choose vLLM deployment?

  • 🚀 Ultra-fast inference: PagedAttention + FlashInfer make your LLM fly!
  • 💾 Smart memory management: Handle long texts efficiently and cut GPU memory usage!
  • 🎯 Optimized kernels: Support for GPTQ, AWQ, INT4/8 and other quantization schemes maximizes performance!
  • 🌍 Compatible with the OpenAI API: Seamless migration, get started right away (see the sketch after this list)!
  • 🔥 Broad hardware support: NVIDIA, AMD, Intel, TPU…run wherever you want!
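
As a quick illustration of the OpenAI-compatible endpoint mentioned above, here is a minimal Python sketch that queries a locally served model with the official `openai` client. The `base_url`, `api_key`, and model name are assumptions; adjust them to match your own deployment.

```python
from openai import OpenAI

# Point the official OpenAI client at the local vLLM server
# (base_url, api_key, and model name below are assumptions).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    messages=[{"role": "user", "content": "Briefly introduce vLLM."}],
    temperature=0.6,
)
print(resp.choices[0].message.content)
```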

💡 Open-WebUI makes interaction easier!

  • 🌟 Web-based management, ready to use!
  • 🎨 Intuitive interface, low barrier to deployment!
  • 🔗 Multi-model support, one-stop experience!

This tutorial uses the DeepSeek-R1-Distill-Qwen-7B model as a demonstration; the compute resource is a single RTX 4090.

2. Operation Steps

1. After starting the container, click the API address to open the web interface. (If "Bad Gateway" is displayed, the model is still initializing; since the model is large, please wait about 2 minutes and try again.)

2. Once the page loads, you can start a conversation with the model.

Account: admin@123.com

Password: 123456

Notice:
1. This tutorial supports "online search". With this feature turned on, inference slows down, which is normal.
2. Backend vLLM inference logs can be viewed at /home/vllm.log (see the sketch below for one way to follow them).
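
If you want to follow that log from a script or notebook, here is a minimal Python sketch that acts as a stand-in for `tail -f`; the path comes from the notice above.

```python
import time

def follow(path="/home/vllm.log"):
    """Continuously print new lines appended to the vLLM log."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                print(line, end="")
            else:
                time.sleep(0.5)  # wait for new output

follow()
```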

Deploy DeepSeek-R1 based on vLLM
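
The container launches the vLLM backend for you, but for reference, here is a minimal sketch of serving the same model through vLLM's offline Python API. The memory and context settings are illustrative assumptions, not the tutorial's exact configuration.

```python
from vllm import LLM, SamplingParams

# Load DeepSeek-R1-Distill-Qwen-7B on a single GPU
# (gpu_memory_utilization and max_model_len are illustrative assumptions).
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    gpu_memory_utilization=0.90,
    max_model_len=8192,
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["What is knowledge distillation?"], params)
print(outputs[0].outputs[0].text)
```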

Common conversation settings

1. Temperature

  • Controls the randomness of the output, typically in the range 0.0-2.0.
  • Low value (e.g. 0.1): More deterministic, biased toward common words.
  • High value (e.g. 1.5): More random; potentially more creative but erratic output.

2. Top-k Sampling

  • Sample only from the k highest-probability tokens, excluding low-probability ones.
  • Small k (e.g. 10): More deterministic, less randomness.
  • Large k (e.g. 50): More diverse, more novel.

3. Top-p Sampling (Nucleus Sampling)

  • Select the smallest token set whose cumulative probability reaches p; unlike top-k, the number of candidates is not fixed.
  • Low value (e.g. 0.3): More deterministic, less randomness.
  • High value (e.g. 0.9): More diverse, improved fluency.

4. Repetition Penalty

  • Controls how much the text repeats, usually between 1.0 and 2.0.
  • High value (e.g. 1.5): Reduces repetition, improves readability.
  • Low value (e.g. 1.0): No penalty; the model may repeat words and sentences.

5. Max Tokens (maximum generation length)

  • Limits the maximum number of tokens the model generates, preventing overly long output.
  • Typical range: 50-4096 (depending on the model).
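
To make the five settings above concrete, here is how they map onto vLLM's `SamplingParams`; the values are illustrative, not recommendations.

```python
from vllm import SamplingParams

params = SamplingParams(
    temperature=0.7,         # randomness of the output (0.0-2.0)
    top_k=40,                # sample only from the 40 most likely tokens
    top_p=0.9,               # nucleus sampling: cumulative probability cutoff
    repetition_penalty=1.2,  # values > 1.0 discourage repeated tokens
    max_tokens=1024,         # cap on the number of generated tokens
)
```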

Exchange and discussion

🖌️ If you come across a high-quality project, feel free to leave us a message to recommend it! We have also set up a tutorial exchange group; scan the QR code and note [SD Tutorial] to join the group, discuss technical issues, and share your results↓