PyTorch 1.13 Officially Released: CUDA Upgrade, Integration of Multiple Libraries, M1 Chip Support

Introduction: The PyTorch team recently announced the release of PyTorch 1.13 on the official blog. This article covers the four highlights of the new version in detail.
This article was first published on the WeChat official account PyTorch Developer Community.
According to the official announcement, PyTorch 1.13 includes the stable version of BetterTransformer, no longer supports CUDA 10.2 and 11.3, and has completed the migration to CUDA 11.6 and 11.7. In addition, the Beta features add support for Apple M1 chips and functorch.
Summary of highlights of PyTorch 1.13:
- The BetterTransformer feature set supports fastpath execution of common Transformer models during inference without modifying the model. Improvements also include acceleration of the add+matmul linear algebra kernels for sizes commonly used in Transformer models, and nested Tensors are now enabled by default.
- Old CUDA versions are no longer supported and the latest CUDA versions released by NVIDIA are introduced, which allows PyTorch and the new NVIDIA Open GPU kernel modules to support C++17.
- functorch has moved from a separate package into PyTorch itself: after installing PyTorch, it can be used directly with import functorch, with no separate installation.
- Testing provides native builds for Macs with M1 chips and better PyTorch API support.
Stable Features
1. BetterTransformer API
The BetterTransformer feature set supports fastpath execution of general Transformer models during inference without modifying the model.
As a supplement, PyTorch 1.13 also accelerates the add+matmul linear algebra kernel for the size commonly used in the Transformer model.
To improve the performance of NLP models, BetterTransformer in PyTorch 1.13 enables nested Tensors by default. For compatibility, a mask check is performed to ensure that a contiguous mask can be provided.
The check on src_key_padding_mask in the Transformer Encoder can be disabled by setting mask_check=False; when users can guarantee that only aligned masks are provided, this setting speeds up processing.
Finally, better error messages are provided to simplify the diagnosis of incorrect input and provide better diagnostic methods for fastpath execution errors.
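A minimal sketch (model dimensions, inputs, and the padding mask are illustrative) of how fastpath execution is picked up automatically when a standard nn.TransformerEncoder runs inference:

import torch
import torch.nn as nn

# A standard TransformerEncoder; no model changes are needed for fastpath
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
# mask_check=False skips the src_key_padding_mask alignment check when the
# caller guarantees that only aligned masks are provided (see above)
encoder = nn.TransformerEncoder(layer, num_layers=6, enable_nested_tensor=True,
                                mask_check=False).eval()

src = torch.rand(32, 10, 512)                         # (batch, seq, features)
padding_mask = torch.zeros(32, 10, dtype=torch.bool)  # True marks padded positions
with torch.no_grad():                                 # inference triggers the fastpath
    out = encoder(src, src_key_padding_mask=padding_mask)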
Better Transformer is directly integrated into the PyTorch TorchText library. This makes it easier for TorchText users to take advantage of the speed and efficiency of BetterTransformer.
2. Introducing CUDA 11.6 and 11.7, no longer supporting CUDA 10.2 and 11.3
CUDA 11 is the first CUDA version to support C++17. Dropping support for CUDA 10.2 is an important step in advancing PyTorch support for C++17, and it can also improve PyTorch code by eliminating legacy CUDA 10.2-specific instructions.
Dropping CUDA 11.3 and introducing CUDA 11.7 makes PyTorch more compatible with the NVIDIA Open GPU kernel modules. Another important highlight is support for lazy loading.
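As a hedged sketch (CUDA_MODULE_LOADING is a CUDA 11.7 driver environment variable, not a PyTorch API, and it must be set before the CUDA context is created), lazy loading can be opted into like this:

import os

# Assumption: the variable is read when CUDA is initialized, so it must be
# set before the first CUDA call
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch

x = torch.ones(1, device="cuda")  # kernels are now loaded on first use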
CUDA 11.7 comes with cuDNN 8.5.0, which includes a number of optimizations to accelerate Transformer-based models, reduce the library size by 30%, and make various improvements to the runtime fusion engine.
Beta Features
1. functorch
Similar to Google JAX, functorch is a library in PyTorch that provides composable vmap (vectorization) and autodiff transformations. It supports advanced autodiff use cases that are difficult to express in PyTorch, including:
- Model ensembling
- Efficient computation of Jacobians and Hessians
- Computing per-sample gradients or other per-sample quantities
PyTorch 1.13 has the functorch library built in, so there is no need to install it separately: after installing PyTorch through conda or pip, you can simply import functorch in your program.
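A minimal sketch of the per-sample-gradients use case (the toy loss function and shapes are illustrative): grad and vmap compose as follows.

import torch
from functorch import grad, vmap

def loss(w, x, y):
    # Scalar squared-error loss for a single sample
    return (x @ w - y) ** 2

weights = torch.randn(3)
xs = torch.randn(8, 3)  # a batch of 8 samples
ys = torch.randn(8)

# grad differentiates with respect to the first argument (the weights);
# vmap maps that gradient function over the batch dimension of xs and ys
per_sample_grads = vmap(grad(loss), in_dims=(None, 0, 0))(weights, xs, ys)
print(per_sample_grads.shape)  # torch.Size([8, 3])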
2. Integrated Intel VTune™ Profiler and ITT
PyTorch users who want to analyze per-operator performance on Intel platforms with low-level performance metrics can now visualize the operator-level timeline of a PyTorch script's execution in Intel VTune™ Profiler:
import torch

# model and input are assumed to be defined elsewhere; each iteration is
# labeled as a named ITT range so it shows up on the VTune timeline
with torch.autograd.profiler.emit_itt():
    for i in range(10):
        torch.itt.range_push('step_{}'.format(i))
        model(input)
        torch.itt.range_pop()
3. NNC: Added support for BF16 and Channels last
By adding support for Channels last and BF16 in NNC, TorchScript's graph-mode inference performance on x86 CPUs has been significantly improved.
On Intel Cooper Lake processors, these two optimizations can more than double the performance of vision models.
Performance improvements can be achieved with the existing TorchScript, channels-last, and BF16 autocast APIs, as shown below. These NNC optimizations will also be migrated to TorchInductor, the new PyTorch DL compiler:
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True)
# Convert the model to channels-last
model = model.to(memory_format=torch.channels_last)
model.eval()
data = torch.rand(1, 3, 224, 224)
# Convert the data to channels-last
data = data.to(memory_format=torch.channels_last)
# Enable autocast to run with BF16
with torch.cpu.amp.autocast(), torch.no_grad():
    # Trace the model
    model = torch.jit.trace(model, torch.rand(1, 3, 224, 224))
    model = torch.jit.freeze(model)
    # Run the traced model
    model(data)
4. Add support for Apple devices with M1 chip
Since version 1.12, PyTorch has been committed to providing native builds for Apple's M1 chip. PyTorch 1.13 further improves the relevant APIs.
In PyTorch 1.13, all submodules except torch.distributed are tested on M1 macOS 12.6 instances. The improved tests fix features such as C++ extensions and convolution correctness for certain inputs.
Note: This feature requires macOS 12 or later on an M1 chip and uses native Python (arm64).
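A hedged sketch (the "mps" backend is the Metal-based acceleration path that arrived with M1 support in PyTorch 1.12; shapes are illustrative) of exercising an M1-native build:

import torch

# Fall back to the CPU when the Metal Performance Shaders backend is unavailable
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.rand(8, 3, device=device)
y = (x @ x.T).sum()
print(device, y.item())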
Prototype Features
1. ACL backend support for AWS Graviton
PyTorch 1.13 achieves substantial improvements in CV and NLP inference on aarch64 CPUs through the Arm Compute Library (ACL), enabling ACL backend support for the PyTorch and torch-xla modules. Highlights include:
- Enabling mkldnn+acl as the default backend for the aarch64 torch wheel
- Enabling the mkldnn matmul operator for aarch64 BF16 devices
- Bringing the TensorFlow xla+acl functionality into torch-xla
2. CUDA Sanitizer
Once enabled, the Sanitizer will start analyzing the underlying CUDA operations called by the user's PyTorch code to detect data race errors.
Note: These errors are caused by unsynchronized data access from different CUDA streams.
Similar to Thread Sanitizer, located errors are printed together with the stack trace of the faulty access.
Corrupted data in machine learning applications can easily go unnoticed and the resulting errors are not always obvious, so a tool like CUDA Sanitizer that detects and locates these errors is particularly important.
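A hedged sketch of the kind of race the prototype reports (the sanitizer is enabled through the TORCH_CUDA_SANITIZER=1 environment variable; the tensor shape is illustrative):

# Run with: TORCH_CUDA_SANITIZER=1 python example.py
import torch

a = torch.rand(4, 2, device="cuda")
# Writing to `a` on a side stream without synchronizing with the default
# stream that produced it is an unsynchronized cross-stream access
with torch.cuda.stream(torch.cuda.Stream()):
    torch.mul(a, 5, out=a)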
3. Partial support for Python 3.11
Users can download Linux binaries that support Python 3.11 through pip. However, this feature is only a preview, and features such as Distributed, Profiler, FX, and JIT are not yet fully supported.
Learn the official PyTorch tutorials from 0 to 1
OpenBayes.com has launched several official PyTorch tutorials in Chinese, including but not limited to NLP, CV, and DL examples. Visit the console and search for public resources to run the PyTorch Chinese tutorials, or visit the following link:
https://openbayes.com/console/public/tutorials