PyTorch Official Blog: Detailed Explanation of PyTorch Profiler V1.9

Improvements in Profiler v1.9 focus on the execution steps that are most expensive in terms of runtime and/or memory, and on visualizing the workload distribution between the GPU and CPU.
Profiler v1.9 introduces five major new features:
1. Distributed training view: This helps you understand the time and memory consumed by distributed training jobs. Suppose you have a model to train: when you split the workload across worker nodes to run in parallel, the process can behave like a black box and all kinds of problems can occur. The overall goal is to increase training speed, and this distributed training view helps you diagnose and debug problems within individual nodes.
2. Memory view: With this view, you can better understand memory usage. This tool can show the active memory allocation of the program at different stages of operation, thus helping you avoid Out of Memory errors.
3. GPU utilization visualization: This tool helps you make sure that the GPU is being fully utilized.
4. Cloud storage support: The TensorBoard plugin can now read profiling data from Azure Blob Storage, Amazon S3, and Google Cloud Platform.
5. Jump to source code: This feature supports visualization of stack trace information and allows you to jump directly to the source code. This helps you quickly optimize and iterate your code based on the analysis results.
Portal to the Chinese version of the Colab notebook
Colab content at a glance:
- Prepare the data and model
- Record execution events with the Profiler
- Run the Profiler
- Use TensorBoard to view the results and analyze model performance
- Use the Profiler to improve performance
- Analyze performance with other advanced features
Getting started with PyTorch Profiling
First:
$ pip install torch-tb-profiler
import torch.profiler as profiler
with profiler.profile(XXXX):
Remark: For more information about CUDA and CPU analysis, see Here
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
) as prof:
- profiler.record_function("$NAME"): lets you attach a label to an arbitrary block of code (it can be used as a context manager or decorator) so that it shows up under that name in the trace; see the sketch after this list.
- The profile_memory=True argument of profiler.profile enables analysis of CPU and GPU memory usage.
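Putting these pieces together, the snippet below is a minimal sketch of a profiling run that writes a trace for TensorBoard; the ResNet-18 model, the input shape, and the ./log output directory are illustrative choices rather than part of the original example.
import torch
import torch.profiler
import torchvision.models as models

model = models.resnet18()                      # placeholder model
inputs = torch.randn(4, 3, 224, 224)           # placeholder batch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,  # drop this on a CPU-only machine
    ],
    profile_memory=True,                       # record CPU/GPU memory usage
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
) as prof:
    with torch.profiler.record_function("forward_pass"):  # custom label in the trace
        model(inputs)
With torch-tb-profiler installed, the resulting trace can then be viewed with tensorboard --logdir=./log.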
Visualizing PyTorch model performance
### Distributed Training
Recent advances in deep learning have demonstrated the value of large datasets and large models, which also means that model training requires more computing resources.
Distributed Data Parallel (DDP) and the NVIDIA Collective Communications Library (NCCL) are widely adopted paradigms in PyTorch for accelerating deep learning training.
In this version of PyTorch Profiler, DDP with NCCL backend is now supported.
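As a rough illustration of what gets profiled here, the following is a minimal sketch of profiling one training step of a DDP model with the NCCL backend; it assumes a single node with multiple GPUs launched via torchrun, and the linear model, batch size, and ./log/ddp output directory are illustrative placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.profiler
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a launch such as: torchrun --nproc_per_node=2 profile_ddp.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 1024).to(f"cuda:{local_rank}")   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/ddp"),
) as prof:
    out = ddp_model(torch.randn(32, 1024, device=f"cuda:{local_rank}"))
    out.sum().backward()   # backward triggers the NCCL all-reduce shown in this view

dist.destroy_process_group()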
### Computation/Communication Overview
In the "Compute/Communication Overview" of the distributed training view, users can observe the computation and communication ratio of the "load balancer" nodes between all workers, which is measured according to the granularity.
Load balancer related links: Here
Scenario 1:
If one worker has longer computation and overlap times than other workers, this may indicate a problem with workload balancing or a straggler. Computation is the sum of the GPU kernel times minus the overlap time. Overlap time is the time saved by interleaving communications during computation.
A longer overlap indicates better parallelism between computation and communication. Ideally, computation and communication completely overlap each other. Communication is the total communication time minus the overlap time.
The following example shows how this might look in TensorBoard.
straggler example
Scenario 2:
If the batch size is small (that is, each worker performs only a small amount of computation), or the data to be transmitted is large, the computation-to-communication ratio may also be small; in the Profiler you will see low GPU utilization and long waiting times.
Users can review the code based on this Computation/Communication view and reduce communication by using gradient accumulation or by increasing the batch size. DDP communication time depends on the model size, not on the batch size. Therefore, increasing the batch size lengthens the computation time and increases the computation-to-communication ratio.
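As a sketch of the gradient accumulation idea mentioned above (not the article's own code), the loop below skips the gradient all-reduce on intermediate micro-batches using DDP's no_sync() context manager; ddp_model is assumed to come from a setup like the earlier DDP sketch, and accum_steps, the optimizer, and the placeholder loss are illustrative.
import contextlib
import torch

accum_steps = 4   # illustrative: accumulate gradients over 4 micro-batches
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for step, batch in enumerate(data_loader):            # data_loader: your usual DataLoader
    sync = (step + 1) % accum_steps == 0
    # no_sync() suppresses the gradient all-reduce on intermediate micro-batches.
    ctx = contextlib.nullcontext() if sync else ddp_model.no_sync()
    with ctx:
        loss = ddp_model(batch.cuda()).sum() / accum_steps   # placeholder loss
        loss.backward()
    if sync:
        optimizer.step()       # this step's backward carried the all-reduced gradients
        optimizer.zero_grad()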
### Synchronization/Communication Overview
In the Synchronization/Communication view, users can observe the efficiency of communication. This is computed as the step time minus the computation and communication time. The synchronization time is the portion of the total communication time spent waiting for and synchronizing with other workers. The Synchronization/Communication view includes initialization, data loaders, CPU computation, and so on.
From this view, we can see what proportion of the total communication time is actually used to exchange data, and how much is idle time spent waiting for data from other workers.

For example, if there is inefficient workload balancing or straggler issues, this can be discovered in the Sync/Communication view. This view will show that some workers are waiting longer than others.

From the above table, you can get detailed statistics of all communication operators in each node. Through this table, you can understand which operator types are called, how many times each operator is called, the size of the data transmitted by each operator, etc.
### Memory View
This tool helps you understand the hardware resource consumption of the operators in the model. Understanding time and memory consumption at the operator level can help resolve performance bottlenecks and speed up the model. Given that GPU memory is limited, improving memory usage efficiency helps to:
- Run larger models, which can potentially generalize better on end-level tasks.
- Use larger batch sizes, which can increase training speed.
The Profiler records all memory allocations during the Profiler interval. Select "Device" to see the memory usage details of each operator on the GPU side or the host side.
NOTE: profile_memory=True must be enabled to generate the following memory data.
Related Links: Here
with torch.profiler.profile(
    profile_memory=True,  # note: this will take 1-2 minutes to complete
) as prof:
    ...
Important definitions:
- "Size Increase" shows the sum of all allocated bytes, minus all memory deallocated bytes.
- "Allocation Size" shows the sum of all allocated bytes excluding memory deallocation.
- "Self" means that the allocated memory does not come from any child operator, but is allocated by the operator itself.

### GPU Metrics on a Timeline
This feature allows you to easily debug performance issues when one or more GPUs are underutilized. Ideally, your program should have high GPU utilization (as close to 100% as possible), minimal CPU-to-GPU communication cost, and no wasted power.
Overview: The overview page highlights the results of three important GPU usage indicators (GPU Utilization, Est. SM Efficiency, and Est. Achieved Occupancy) at different levels.
Essentially, each GPU has many SMs, and each SM has many warps, each of which can execute many threads simultaneously (the number of threads per warp depends on the GPU). From a higher-level perspective, the GPU metrics on the timeline help developers get a global view of the whole stack, which is very important.
If GPU utilization is low, this indicates a potential problem with your model. Common causes include:
- Insufficient parallelism in the kernel, i.e. batch size is too small
- Calling small kernels in a loop, so that launch overhead is not amortized
- CPU or I/O bottlenecks lead to insufficient workload and low GPU utilization
In the Overview page, the Performance Suggestions section contains actionable suggestions that can improve GPU utilization. In this example, GPU utilization is low, so the performance suggestion is to increase the batch size. According to the performance suggestion, increasing the batch size from 4 to 32 increases GPU utilization by 60.68%.
GPU Utilization: the fraction of the step interval time during which a GPU engine was executing a workload. The higher the utilization percentage, the better. However, judging performance bottlenecks by GPU utilization alone is not accurate; you cannot use it to tell how many Streaming Multiprocessors (SMs) are in use.
Note that while this metric is useful for detecting idle periods, a high value does not necessarily mean that the GPU is being used very well. For example, a single-threaded kernel running continuously would have a GPU utilization of 100%.
Estimated Streaming Multiprocessor efficiency (Est. SM Efficiency) is a finer-grained metric. It indicates what percentage of the SMs were in use at any point in the trace, reported as the percentage of time during which there was at least one active warp on an SM, including warps that were stalled.
NVIDIA Documentation: Here
Est. SM Efficiency also has limitations. For example, a kernel with only one thread per block cannot fully utilize any SM. SM Efficiency alone does not tell you how busy each SM is, only whether each SM is doing anything at all, which can include stalling while waiting for the result of a memory load.
To keep an SM highly utilized, a sufficient number of ready warps must be available to run whenever a stall occurs.
For diagnosing performance issues, Est. Achieved Occupancy goes one step deeper than Est. SM Efficiency and GPU Utilization. Est. Achieved Occupancy indicates how many warps, on average, are active at the same time per SM. Having enough active warps is usually the key to achieving good throughput. Unlike GPU Utilization and SM Efficiency, making this value as high as possible is not the ultimate goal.
Empirically, good throughput gains can be achieved by raising this metric to 15% or above. However, at some point diminishing returns set in; for example, once the value reaches 30%, further gains become uncertain. This metric reports the average over all warp schedulers for the duration of the kernel execution.
NVIDIA Documentation: Here
In general, the larger the value of Est. Achieved Occupancy, the better.

Details: Resnet50_batchsize4

Details: Resnet50_batchsize32
Kernel view: the kernel view shows "Blocks per SM" and "Est. Achieved Occupancy" for each kernel.
Est. Achieved Occupancy is a useful tool for comparing how well models are performing.

Mean Blocks per SM:
The number of blocks per SM = the number of blocks of the kernel / the number of SMs of the GPU. If this number is less than 1, it means that the GPU multiprocessor is not fully utilized. "Mean Blocks per SM" is the weighted average of all runs of this kernel name, using the duration of each run as the weight.
Mean Est. Achieved Occupancy
The definition of Est. Achieved Occupancy is the same as outlined above. Mean Est. Achieved Occupancy is the weighted average of all runs of this kernel name, using the duration of each run as a weight.
Trace View:
The trace view shows a timeline with the duration of the operators in the model and which system performed each operation. This view can help you identify expensive, long-running operations and whether they are caused by data input or by model training. Currently, the trace view can show GPU Utilization and Est. SM Efficiency on the timeline.

In the above example, the GPU utilization during "ProfilerStep5" in thread 28022 is higher than that during "Optimizer.step". You can zoom in to see the reason.

As can be seen from the above figure, the kernel of the former is longer than that of the latter. The kernel execution time of the latter is too short, resulting in reduced GPU utilization.
Est. SM Efficiency: each kernel has a computed Est. SM Efficiency between 0 and 100%. For example, the kernel below has only 64 blocks, while this GPU has 80 SMs, so its "Est. SM Efficiency" is 64/80, or 0.8.

### Cloud storage support
After running pip install tensorboard, install the extra that matches your cloud provider to read data from it:
$ pip install torch-tb-profiler[blob]   # Azure Blob Storage
$ pip install torch-tb-profiler[gs]     # Google Cloud Storage
$ pip install torch-tb-profiler[s3]     # Amazon S3
For more information, see: Here
### Jump to source code
One of the benefits of integrating TensorBoard and PyTorch Profiler directly into Visual Studio Code (VS Code) is the ability to jump directly to source code (file and line) from the Profiler stack trace. The VS Code Python extension now supports TensorBoard integration.
Jump to source code is only available when TensorBoard is launched within VS Code. If profiling was run with with_stack=True, the stack trace will appear in the plugin UI. Clicking a stack trace in PyTorch Profiler makes VS Code open the corresponding file and jump directly to the corresponding line for debugging. This allows you to quickly optimize and modify your code based on the analysis results and suggestions.

Jump to source code with Visual Studio Code Plug In UI
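For the stack traces to be recorded in the first place, with_stack=True must be passed when profiling; below is a minimal sketch, reusing the illustrative model, inputs, and ./log output directory from the earlier snippet.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    with_stack=True,   # record file/line stack traces so "jump to source" works
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
) as prof:
    model(inputs)      # placeholder workload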
For more information on how to optimize batch size performance, please see the detailed tutorial: Here
PyTorch Profiler can also be integrated with PyTorch Lightning. Simply launch a Lightning training job with the trainer.profiler=pytorch flag to generate a trace.
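In code, that flag corresponds roughly to the following sketch, assuming PyTorch Lightning is installed and MyLightningModule is your own LightningModule (both names here are illustrative):
import pytorch_lightning as pl

# "pytorch" selects Lightning's built-in wrapper around the PyTorch profiler.
trainer = pl.Trainer(profiler="pytorch", max_epochs=1)
trainer.fit(MyLightningModule())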
Detailed example: Here
Original address: Here