Quick Start

This guide walks through a practical example of deploying a large language model with vLLM on HyperAI. We will deploy DeepSeek-R1-Distill-Qwen-1.5B, a lightweight model based on Qwen.

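Model overview

DeepSeek-R1-Distill-Qwen-1.5B is a lightweight Chinese-English bilingual chat model:

1.5B parameters, deployable on a single GPU
Minimum GPU memory: 3 GB
Recommended GPU memory: 4 GB or more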
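
The 3 GB minimum is roughly what the weights alone occupy. As a rough sketch, assuming the weights are stored in a 16-bit format (2 bytes per parameter, which is an assumption and not stated above) and ignoring the KV cache and runtime overhead, the arithmetic looks like this. The helper name `estimate_weight_gb` is hypothetical.

```shell
#!/bin/bash
# Rough weight-memory estimate.
# Assumption: 16-bit weights, i.e. 2 bytes per parameter.
estimate_weight_gb() {
    local params=$1           # number of parameters
    local bytes_per_param=$2  # 2 for FP16/BF16
    # Whole gigabytes (10^9 bytes), rounded down
    echo $(( params * bytes_per_param / 1000000000 ))
}

estimate_weight_gb 1500000000 2
```

The recommended 4 GB leaves some headroom on top of this figure for the KV cache and activations.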

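Develop and test in a model training

Create a new model training

Select the RTX 5090 compute resource
Select the vLLM 0.16.0-2204-gpu image
In the data binding section, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /hyperai/input/input0

Prepare the startup script `start.sh`

After the container starts, create the start.sh script below. It detects the number of GPUs and chooses the service port according to the runtime environment (model training or model deployment).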

start.sh

#!/bin/bash

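# Count the available GPUs (nvidia-smi prints one line per GPU)
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)

# Choose the port: model deployments expose port 80 by default,
# while model training exposes port 8080
PORT=8080
if [ -n "$OPENBAYES_SERVING_PRODUCTION" ]; then
    PORT=80
fi

# Start the vLLM service.
# Note: the data binding step above mounts the model at /hyperai/input/input0.
# If /openbayes/input/input0 does not exist in your container, use that path instead.
echo "Starting vLLM service..."
vllm serve /openbayes/input/input0 \
    --served-model-name DeepSeek-R1-Distill-Qwen-1.5B \
    --disable-log-requests \
    --trust-remote-code \
    --host 0.0.0.0 --port "$PORT" \
    --gpu-memory-utilization 0.98 \
    --max-model-len 8192 --enable-prefix-caching \
    --tensor-parallel-size "$GPU_COUNT"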

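Test the service in the container

Run the following command to start the vLLM service:

bash start.sh

Once the service is up, test inference with the following curl command: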

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {
            "role": "user",
            "content": "Please explain what a large language model is."
        }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Open a new terminal in JupyterLab and run the curl command above to test the service.
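
Loading the model takes some time, so a request sent too early may be refused. One way to handle this is a small retry loop, sketched below. The `retry` helper is hypothetical, and the commented example assumes the service listens on port 8080 and answers on /v1/models, as elsewhere on this page.

```shell
#!/bin/bash
# retry MAX_TRIES DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds, at most MAX_TRIES times.
retry() {
    local max_tries=$1 delay=$2
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Sleep only if another attempt will follow
        if [ "$attempt" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (not executed here): wait up to about two minutes for the local service.
# retry 60 2 curl -sf -o /dev/null http://localhost:8080/v1/models
```

The function returns 0 as soon as the command succeeds and 1 if every attempt fails, so it can be chained with `&&` in front of the actual test request.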

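Tip

When testing in the model training environment, the service uses port 8080. In a model deployment, start.sh switches to port 80 automatically. HyperAI's model deployment service requires every deployment to serve traffic on port 80.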
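
The port rule can be isolated into a small function so the logic can be checked without starting vLLM. This is a sketch that assumes, as start.sh does, that the platform sets OPENBAYES_SERVING_PRODUCTION only in model deployments. The helper name `select_port` is hypothetical.

```shell
#!/bin/bash
# Prints the port to listen on.
# $1: value of OPENBAYES_SERVING_PRODUCTION (empty in model training)
select_port() {
    if [ -n "$1" ]; then
        echo 80      # model deployment
    else
        echo 8080    # model training
    fi
}

select_port "${OPENBAYES_SERVING_PRODUCTION:-}"
```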

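Deploy the model service

After development and testing, there are two ways to turn the model into a production-ready deployment:

Option 1: One-click deployment (recommended)

HyperAI provides a "one-click deployment" feature that converts a model training directly into a model deployment, with no need to repeat the configuration.

Use one-click deployment

On the model training details page, open the drop-down menu next to the "Start" button in the upper-right corner and select "Create model deployment".
Review the deployment configuration (the system inherits the training container's configuration automatically).
Click "Confirm deployment"; the system creates the corresponding model deployment.

Confirm the deployment configuration

The system inherits the following settings automatically:

Compute resource
Base image
Workspace data
Data bindings

You can adjust these settings on the confirmation page as needed.

Deployment succeeded

After you submit, the system creates the model deployment and starts the service. Once it succeeds, you are taken to the deployment details page, where you can verify the interface with the online testing tool.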

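Option 2: Create a model deployment manually

For more control over the deployment environment, or to create a new model deployment from scratch, follow these steps:

Configure compute, image, and data bindings

Select the RTX 5090 compute resource
Select the vLLM 0.16.0-2204-gpu image
In the data binding section, select the DeepSeek-R1-Distill-Qwen-1.5B model and bind it to /hyperai/input/input0
Bind the training container's workspace to /hyperai/home

Start the deployment

Click the "Deploy" button and wait for the model deployment status to change to "Running".

Click the running model deployment version to view its details and runtime logs.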

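Online testing

The model deployment details page includes an online testing tool. It lets you compose and send HTTP requests in the browser to verify the model interface quickly, without a local command line or third-party tools.

Main features:

Choose the request method (GET, POST, and so on)
Enter the interface path and parameters
Customize request headers and the request body (JSON and other formats are supported)
View the response body and response headers in real time
Stream the output to see the model generate its answer incrementally

GET request example

Used to fetch model information or run a health check. Select the GET method, enter the interface path (for example /v1/models), and click "Send" to view the model list or status.

POST request example

Used to chat with the large language model. Select the POST method, enter /v1/chat/completions as the path, put the conversation in the request body (as shown below), and click "Send" to run inference. The "model" field must match the name the service was started with (--served-model-name in start.sh). The "tools" and "tool_choice" fields only work if the vLLM server was started with tool calling enabled (the --enable-auto-tool-choice and --tool-call-parser options in vLLM); the start.sh on this page does not set them, so remove these two fields if the request is rejected.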

{
  "model": "DeepSeek-R1-Distill-Qwen-1.5B",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the weather like in Beijing?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

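Streaming example

Used to see the model stream its output. Add the "stream": true field to the POST request body; after sending the request, the model's output appears incrementally as it is generated. This suits scenarios where results are consumed while they are still being produced.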

{
  "model": "DeepSeek-R1-Distill-Qwen-1.5B",
  "stream": true,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the weather like in Beijing?"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g., San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use"
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
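
With "stream": true, OpenAI-compatible servers such as vLLM return the response as server-sent events: each chunk arrives on a line beginning with `data: `, and the stream ends with `data: [DONE]`. The sketch below shows one way to consume such a stream on the command line. The function name `sse_payloads` is hypothetical, and the example feeds it a canned stream instead of a live server.

```shell
#!/bin/bash
# Reads a server-sent-event stream on stdin and prints the JSON payload of
# each "data:" line, stopping at the terminating "data: [DONE]" marker.
sse_payloads() {
    local line
    while IFS= read -r line; do
        line=${line%$'\r'}    # tolerate CRLF line endings
        case $line in
            "data: [DONE]") break ;;
            "data: "*) printf '%s\n' "${line#data: }" ;;
        esac                  # blank separator lines and other fields are skipped
    done
}

# Canned stream for illustration
printf 'data: {"id":"1"}\n\ndata: {"id":"2"}\n\ndata: [DONE]\n' | sse_payloads
```

Against a running service, the same function could sit at the end of a pipeline such as `curl -sN ... | sse_payloads`, where `-N` turns off curl's output buffering.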

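Command-line testing

To test the interface with a command-line tool such as curl, proceed as follows:

Copy the service URL that HyperAI generates from the model deployment page, then run the following command to check that the model is available:

curl -X POST http://<model-deployment-url>/v1/chat/completions \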
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
        {
            "role": "user",
            "content": "Hello, please introduce yourself."
        }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
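
Hand-written JSON inside single quotes breaks as soon as the prompt contains a double quote or a backslash. A minimal sketch of building the request body in a function instead follows. The helper names `json_escape` and `chat_payload` are hypothetical, the escaping only covers single-line text, and the temperature is fixed at the 0.7 used above.

```shell
#!/bin/bash
# Escapes backslashes and double quotes so a single-line string
# can be embedded in a JSON string literal.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# chat_payload MODEL PROMPT [MAX_TOKENS]
# Prints a chat completion request body containing one user message.
chat_payload() {
    local model=$1 prompt=$2 max_tokens=${3:-100}
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "temperature": 0.7, "max_tokens": %d}\n' \
        "$(json_escape "$model")" "$(json_escape "$prompt")" "$max_tokens"
}

chat_payload "DeepSeek-R1-Distill-Qwen-1.5B" 'Say "hello"'
```

The output can be passed to curl with `-d "$(chat_payload ...)"` in place of the inline JSON body.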

Next steps

Learn more about managing model deployments
See the official vLLM documentation

