HyperAI

PyTorch 2.0 in Practice: Speeding Up HuggingFace and TIMM Models

With a single line, torch.compile(), PyTorch 2.0 can speed up model training by 30%-200%. This tutorial shows how to reproduce that speedup yourself.

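torch.compile() makes it easy to try different compiler backends to make PyTorch code run faster. It is a drop-in replacement for torch.jit.script() that works directly on an nn.Module, with no changes to the source code.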
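As a minimal sketch of the drop-in behavior described above, the snippet below compiles a small nn.Module and checks that it returns the same output as the uncompiled one. TinyMLP is a hypothetical model invented for this illustration. The "eager" backend is chosen here only as an assumption for portability: it exercises graph capture without generating kernels, so it should run without a GPU or a C++ toolchain. Swap in backend="inductor" to get the generated kernels this article is about.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Hypothetical two-layer model, used only for illustration."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


torch.manual_seed(0)
module = TinyMLP().eval()

# The module's source is untouched; compilation wraps it from the outside.
# Assumption: "eager" is used for portability, "inductor" is the real target.
opt_module = torch.compile(module, backend="eager")

x = torch.randn(2, 8)
with torch.no_grad():
    expected = module(x)      # plain eager execution
    actual = opt_module(x)    # first call triggers graph capture

outputs_match = torch.allclose(expected, actual, atol=1e-6)
print(outputs_match)
```

Comparing against the uncompiled module is a cheap sanity check worth keeping around whenever you switch backends.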

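In the previous article, we covered how torch.compile supports arbitrary PyTorch code, control flow, and mutation, with some support for dynamic shapes.

Across tests on 163 open-source models, we found that torch.compile() delivers a 30%-200% speedup.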

opt_module = torch.compile(module)

See the full test results at:

https://github.com/pytorch/torchdynamo/issues/681

This tutorial shows how to use torch.compile() to speed up model training.

Requirements and Setup

For GPU (the newer the GPU, the larger the performance gain):

pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117

For CPU:

pip3 install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Optional: Verify the Installation

git clone https://github.com/pytorch/pytorch
cd pytorch/tools/dynamo
python verify_dynamo.py

Optional: Docker Installation

The PyTorch nightly binaries ship with all the required dependencies. You can download the image with:

docker pull ghcr.io/pytorch/pytorch-nightly

For ad hoc experiments, just make sure the container has access to all your GPUs:

docker run --gpus all -it ghcr.io/pytorch/pytorch-nightly:latest /bin/bash

Getting Started

A Simple Example

Let's start with a simple example. Note that the newer your GPU, the more pronounced the speedup.

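import torch
def fn(x, y):
    # .cuda() is a no-op here because the inputs already live on the GPU
    a = torch.sin(x).cuda()
    b = torch.sin(y).cuda()
    return a + b
new_fn = torch.compile(fn, backend="inductor")
input_tensor = torch.randn(10000).to(device="cuda:0")
# fn takes two arguments, so pass the same tensor for both
a = new_fn(input_tensor, input_tensor)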

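This example won't actually run faster, but it is a useful starting point.

torch.sin(), used in this example, and torch.cos() are pointwise ops: they operate element by element on a vector. A better-known pointwise op is torch.relu().

Pointwise ops in eager mode are suboptimal, because each operator has to read a tensor from memory, make some changes, and then write those changes back.

The single most important optimization in PyTorch 2.0 is fusion.

With fusion, 2 reads and 2 writes in this example become 1 read and 1 write. This is crucial for newer GPUs, where the bottleneck is memory bandwidth (how quickly data can be sent to the GPU) rather than compute (how quickly the GPU can perform floating-point operations).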
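The read and write counts above can be turned into a back-of-the-envelope estimate of memory traffic. The sketch below is only an arithmetic model, not a measurement: it assumes fp32 elements (4 bytes each), a chain of two pointwise ops applied to one tensor, and ignores caches entirely. The function name pointwise_traffic_bytes is made up for this illustration.

```python
def pointwise_traffic_bytes(num_elements: int, num_ops: int,
                            bytes_per_element: int = 4,
                            fused: bool = False) -> int:
    """Bytes moved to and from memory by a chain of pointwise ops on one tensor.

    Unfused: every op reads its input and writes its output.
    Fused: the chain reads the input once and writes the final result once.
    """
    tensor_bytes = num_elements * bytes_per_element
    passes = 1 if fused else num_ops
    return passes * 2 * tensor_bytes  # one read plus one write per pass


n = 10_000  # same length as the tensor in the simple example
eager_bytes = pointwise_traffic_bytes(n, num_ops=2)
fused_bytes = pointwise_traffic_bytes(n, num_ops=2, fused=True)

print(eager_bytes, fused_bytes, eager_bytes / fused_bytes)
# prints: 160000 80000 2.0
```

For a memory-bound kernel, halving the bytes moved is roughly what sets the upper bound on the speedup from fusing these two ops.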

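The second most important optimization in PyTorch 2.0 is CUDA graphs.

CUDA graphs help eliminate the overhead of launching individual kernels from a Python program.

torch.compile() supports many different backends. The most notable is Inductor, which generates Triton kernels.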

https://github.com/openai/triton

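These kernels are written in Python, yet they outperform the vast majority of handwritten CUDA kernels. Assuming the example above is saved as trig.py, you can inspect the generated Triton kernel code by running: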

TORCHINDUCTOR_TRACE=1 python trig.py

@pointwise(size_hints=[16384], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
@triton.jit
def kernel(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 10000
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.sin(tmp0)
    tmp2 = tl.sin(tmp1)
    tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

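The code above shows that the two sin operations were indeed fused: both run inside a single Triton kernel (tmp1 and tmp2), and the temporaries are held in registers, which are very fast to access.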
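To make the kernel's indexing easier to follow, here is a plain-Python stand-in that mirrors its structure: each program instance handles one block of XBLOCK elements, and the mask skips lanes that fall past the end of the tensor. This is a sketch for reading purposes only. The block size of 1024 is an assumption (the real value is chosen by the compiler), and run_kernel is a hypothetical name.

```python
import math


def run_kernel(in_buf, out_buf, xnumel, xblock, program_id):
    """Plain-Python stand-in for one program instance of the kernel above."""
    xoffset = program_id * xblock
    for lane in range(xblock):
        xindex = xoffset + lane
        if xindex < xnumel:            # xmask: skip lanes past the end
            tmp0 = in_buf[xindex]      # the only read from memory
            tmp1 = math.sin(tmp0)      # intermediate stays in a local variable
            tmp2 = math.sin(tmp1)
            out_buf[xindex] = tmp2     # the only write to memory


xnumel = 10_000
xblock = 1024  # assumed block size for illustration
num_programs = math.ceil(xnumel / xblock)

data = [i / xnumel for i in range(xnumel)]
result = [0.0] * xnumel
for pid in range(num_programs):
    run_kernel(data, result, xnumel, xblock, pid)

print(num_programs)
# prints: 10
```

The last program instance covers indices 9216 to 10239, which is why the mask is needed: only the first 784 of its lanes touch valid memory.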

A Real Model Example

Take resnet18 from PyTorch Hub as an example:

import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
opt_model = torch.compile(model, backend="inductor")
# Call the compiled model, not the original one
opt_model(torch.randn(1,3,64,64))

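If you actually run this, you will find that the first run is slow, because the model is being compiled. Subsequent runs are faster, so it is common practice to warm up the model before benchmarking.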
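A warm-up step can be folded into a small timing helper like the sketch below. It uses only the standard library and works on any callable. The helper name benchmark, its parameters, and the fake_compiled_model stand-in are all assumptions made for illustration. For GPU work you would pass sync=torch.cuda.synchronize, since CUDA kernels run asynchronously and the clock would otherwise be read before the work finishes.

```python
import time


def benchmark(fn, *args, warmup=3, iters=10, sync=None):
    """Return mean seconds per call of fn(*args), excluding warm-up calls.

    sync is an optional zero-argument callable invoked before each clock
    read, for example torch.cuda.synchronize when timing GPU work.
    """
    for _ in range(warmup):
        fn(*args)              # the first call pays any one-time compile cost
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


# Hypothetical stand-in for a compiled model: slow on the first call only.
calls = {"count": 0}


def fake_compiled_model(x):
    calls["count"] += 1
    if calls["count"] == 1:
        time.sleep(0.05)       # pretend this is compilation
    return x * 2


mean_seconds = benchmark(fake_compiled_model, 21, warmup=3, iters=10)
print(calls["count"])
# prints: 13
```

Because the slow first call lands in the warm-up loop, the reported mean reflects steady-state speed rather than compilation time.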

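Here we passed the string "inductor" as the compiler name, but it is not the only backend available. Run torch._dynamo.list_backends() in a REPL to see the full list of available backends.

You can also try aot_cudagraphs or nvfuser.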
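Backend names can differ between PyTorch releases, so it may help to check a name against the registered list before passing it to torch.compile(). The sketch below does that with torch._dynamo.list_backends(). The helper resolve_backend and the name "not_a_real_backend" are hypothetical, chosen only to show the fallback path.

```python
import torch
import torch._dynamo

# Names of the backends registered in this PyTorch install.
available = torch._dynamo.list_backends()
print(available)


def resolve_backend(name: str, default: str = "inductor") -> str:
    """Return name if it is a registered backend, otherwise the default."""
    return name if name in available else default


# A made-up name is not registered, so this falls back to "inductor".
backend = resolve_backend("not_a_real_backend")
print(backend)
```

The resolved name can then be passed as torch.compile(model, backend=backend).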

Hugging Face Model Example

The PyTorch community frequently uses pretrained models from transformers or TIMM:

https://github.com/huggingface/transformers

https://github.com/rwightman/pytorch-image-models

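One of the design goals of PyTorch 2.0 is that any compiler stack must work out of the box with the vast majority of models people actually run.

Here we download a pretrained model directly from the HuggingFace hub and optimize it: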

import torch
from transformers import BertTokenizer, BertModel
# Copy pasted from here https://huggingface.co/bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model) # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
output = model(**encoded_input)

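If you remove to(device="cuda:0") from both the model and encoded_input, PyTorch 2.0 will instead generate C++ kernels optimized for running on the CPU.

You can inspect the Triton or C++ kernels for BERT. They are obviously more complex than the trigonometry example above, and inspecting them is optional if you are already familiar with PyTorch.

The same code also works, and still gives good results, when used together with: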

* https://github.com/huggingface/accelerate

* DDP

Similarly, try a TIMM example:

import timm
import torch
model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(64,3,7,7))

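PyTorch's goal is to build a compiler that works with more models and speeds up the vast majority of open-source models. Visit the HuggingFace Hub now and accelerate TIMM models with PyTorch 2.0!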

https://huggingface.co/timm

