PyTorch Official Blog: Detailed Explanation of PyTorch Profiler V1.9

Improvements in Profiler v1.9 focus on the execution steps that are most expensive in terms of runtime and/or memory, and on visualizing the workload distribution between the GPU and CPU.
Profiler v1.9 introduces five major new features:
1. Distributed training view: This helps you understand the time and memory consumed by distributed training jobs. Suppose you have a model to train: when you split the workload across worker nodes to run in parallel, the process can behave like a black box and all kinds of problems can occur. The overall goal is to increase training speed, and this distributed training view helps you diagnose and debug problems within individual nodes.
2. Memory view: With this view, you can better understand memory usage. This tool can show the active memory allocation of the program at different stages of operation, thus helping you avoid Out of Memory errors.
3. GPU utilization visualization: This tool helps you make sure that the GPU is being fully utilized.
4. Cloud storage support: The TensorBoard plugin can now read profiling data from Azure Blob Storage, Amazon S3, and Google Cloud Platform.
5. Jump to source code: This feature supports visualization of stack trace information and allows you to jump directly to the source code. This helps you quickly optimize and iterate your code based on the analysis results.
Portal to the Chinese version of the Colab notebook
Colab content at a glance:
- Prepare the data and model
- Record execution events with the Profiler
- Run the Profiler
- Use TensorBoard to view the results and analyze model performance
- Use the Profiler to improve performance
- Analyze performance with other advanced features
Getting started with PyTorch Profiling
First:
$ pip install torch-tb-profiler
import torch.profiler as profiler
with profiler.profile(XXXX):
Remark: For more information about CUDA and CPU analysis, see Here
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
) as prof:
- profiler.record_function("$NAME"): lets you attach a label to an arbitrary block of code (it can be used as a context manager or decorator) so that it shows up under that name in the trace; see the sketch after this list.
- The profile_memory=True argument of profiler.profile enables analysis of CPU and GPU memory usage.
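Putting these pieces together, the snippet below is a minimal sketch of a profiling run that writes a trace for TensorBoard; the ResNet-18 model, the input shape, and the ./log output directory are illustrative choices rather than part of the original example.
import torch
import torch.profiler
import torchvision.models as models

model = models.resnet18()                      # placeholder model
inputs = torch.randn(4, 3, 224, 224)           # placeholder batch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,  # drop this on a CPU-only machine
    ],
    profile_memory=True,                       # record CPU/GPU memory usage
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
) as prof:
    with torch.profiler.record_function("forward_pass"):  # custom label in the trace
        model(inputs)
With torch-tb-profiler installed, the resulting trace can then be viewed with tensorboard --logdir=./log.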
Visualizing PyTorch model performance
### Distributed Training
Recent advances in deep learning have demonstrated the value of large datasets and large models, which also means that model training requires more computing resources.
Distributed Data Parallel (DDP) and the NVIDIA Collective Communications Library (NCCL) are widely adopted paradigms in PyTorch for accelerating deep learning training.
In this version of PyTorch Profiler, DDP with NCCL backend is now supported.
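As a rough illustration of what gets profiled here, the following is a minimal sketch of profiling one training step of a DDP model with the NCCL backend; it assumes a single node with multiple GPUs launched via torchrun, and the linear model, batch size, and ./log/ddp output directory are illustrative placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.profiler
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a launch such as: torchrun --nproc_per_node=2 profile_ddp.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 1024).to(f"cuda:{local_rank}")   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log/ddp"),
) as prof:
    out = ddp_model(torch.randn(32, 1024, device=f"cuda:{local_rank}"))
    out.sum().backward()   # backward triggers the NCCL all-reduce shown in this view

dist.destroy_process_group()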
### Computation/Communication Overview
In the "Compute/Communication Overview" of the distributed training view, users can observe the computation and communication ratio of the "load balancer" nodes between all workers, which is measured according to the granularity.
Load balancer related links: Here
Scenario 1:
If one worker has longer computation and overlap times than other workers, this may indicate a problem with workload balancing or a straggler. Computation is the sum of the GPU kernel times minus the overlap time. Overlap time is the time saved by interleaving communications during computation.
A longer overlap indicates better parallelism between computation and communication. Ideally, computation and communication completely overlap each other. Communication is the total communication time minus the overlap time.
The following example shows how this might look in TensorBoard.
straggler example
Scenario 2:
If the batch size is small (that is, each worker performs only a small amount of computation), or the data to be transmitted is large, the computation-to-communication ratio may also be small; in the Profiler you will see low GPU utilization and long waiting times.
Users can review the code based on this Computation/Communication view and reduce communication by using gradient accumulation or by increasing the batch size. DDP communication time depends on the model size, not on the batch size. Therefore, increasing the batch size lengthens the computation time and increases the computation-to-communication ratio.
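As a sketch of the gradient accumulation idea mentioned above (not the article's own code), the loop below skips the gradient all-reduce on intermediate micro-batches using DDP's no_sync() context manager; ddp_model is assumed to come from a setup like the earlier DDP sketch, and accum_steps, the optimizer, and the placeholder loss are illustrative.
import contextlib
import torch

accum_steps = 4   # illustrative: accumulate gradients over 4 micro-batches
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

for step, batch in enumerate(data_loader):            # data_loader: your usual DataLoader
    sync = (step + 1) % accum_steps == 0
    # no_sync() suppresses the gradient all-reduce on intermediate micro-batches.
    ctx = contextlib.nullcontext() if sync else ddp_model.no_sync()
    with ctx:
        loss = ddp_model(batch.cuda()).sum() / accum_steps   # placeholder loss
        loss.backward()
    if sync:
        optimizer.step()       # this step's backward carried the all-reduced gradients
        optimizer.zero_grad()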
### Synchronization/Communication Overview
In the Synchronization/Communication view, users can observe the efficiency of communication. This is computed as the step time minus the computation and communication time. The synchronization time is the portion of the total communication time spent waiting for and synchronizing with other workers. The Synchronization/Communication view includes initialization, data loaders, CPU computation, and so on.
From this view, we can see what proportion of the total communication time is actually used to exchange data, and how much is idle time spent waiting for data from other workers.

For example, if there is inefficient workload balancing or straggler issues, this can be discovered in the Sync/Communication view. This view will show that some workers are waiting longer than others.

From the above table, you can get detailed statistics of all communication operators in each node. Through this table, you can understand which operator types are called, how many times each operator is called, the size of the data transmitted by each operator, etc.
### Memory View
This tool helps you understand the hardware resource consumption of the operators in the model. Understanding time and memory consumption at the operator level can help resolve performance bottlenecks and speed up the model. Given that GPU memory is limited, improving memory usage efficiency helps to:
- Run larger models, which can potentially generalize better on end-level tasks.
- Use larger batch sizes, which can increase training speed.
The Profiler records all memory allocations during the Profiler interval. Select "Device" to see the memory usage details of each operator on the GPU side or the host side.
NOTE: profile_memory=True must be enabled to generate the following memory data.
Related Links: Here
with torch.profiler.profile(
    profile_memory=True,  # note: this will take 1-2 minutes to complete
) as prof:
    ...
Important definitions:
- "Size Increase" shows the sum of all allocated bytes, minus all memory deallocated bytes.
- "Allocation Size" shows the sum of all allocated bytes excluding memory deallocation.
- "Self" means that the allocated memory does not come from any child operator, but is allocated by the operator itself.

### GPU Metrics on a Timeline
This feature allows you to easily debug performance issues when one or more GPUs are underutilized. Ideally, your program should have high GPU utilization (as close to 100% as possible), minimal CPU-to-GPU communication cost, and no wasted power.
Overview: The overview page highlights the results of three important GPU usage indicators (GPU Utilization, Est. SM Efficiency, and Est. Achieved Occupancy) at different levels.
Essentially, each GPU has many SMs, and each SM has many warps, each of which can execute many threads simultaneously (the number of threads per warp depends on the GPU). From a higher-level perspective, the GPU metrics on the timeline help developers get a global view of the whole stack, which is very important.
If GPU utilization is low, this indicates a potential problem with your model. Common causes include:
- Insufficient parallelism in the kernel, i.e. batch size is too small
- Calling small kernels in a loop, so that launch overhead is not amortized
- CPU or I/O bottlenecks lead to insufficient workload and low GPU utilization
In the Overview page, the Performance Suggestions section contains actionable suggestions that can improve GPU utilization. In this example, GPU utilization is low, so the performance suggestion is to increase the batch size. According to the performance suggestion, increasing the batch size from 4 to 32 increases GPU utilization by 60.68%.
GPU Utilization: the fraction of the step interval time during which a GPU engine was executing a workload. The higher the utilization percentage, the better. However, judging performance bottlenecks by GPU utilization alone is not accurate; you cannot use it to tell how many Streaming Multiprocessors (SMs) are in use.
Note that while this metric is useful for detecting idle periods, a high value does not necessarily mean that the GPU is being used very well. For example, a single-threaded kernel running continuously would have a GPU utilization of 100%.
Estimated Streaming Multiprocessor efficiency (Est. SM Efficiency) is a finer-grained metric. It indicates what percentage of the SMs were in use at any point in the trace, reported as the percentage of time during which there was at least one active warp on an SM, including warps that were stalled.
NVIDIA Documentation: Here
Est. SM Efficiency also has limitations. For example, a kernel with only one thread per block cannot fully utilize any SM. SM Efficiency alone does not tell you how busy each SM is, only whether each SM is doing anything at all, which can include stalling while waiting for the result of a memory load.
To keep an SM highly utilized, a sufficient number of ready warps must be available to run whenever a stall occurs.
For diagnosing performance issues, Est. Achieved Occupancy goes one step deeper than Est. SM Efficiency and GPU Utilization. Est. Achieved Occupancy indicates how many warps, on average, are active at the same time per SM. Having enough active warps is usually the key to achieving good throughput. Unlike GPU Utilization and SM Efficiency, making this value as high as possible is not the ultimate goal.
Empirically, good throughput gains can be achieved by raising this metric to 15% or above. However, at some point diminishing returns set in; for example, once the value reaches 30%, further gains become uncertain. This metric reports the average over all warp schedulers for the duration of the kernel execution.
NVIDIA Documentation: Here
In general, the larger the value of Est. Achieved Occupancy, the better.

Details: Resnet50_batchsize4

Details: Resnet50_batchsize32
Kernel view: the kernel view shows "Blocks per SM" and "Est. Achieved Occupancy" for each kernel.
Est. Achieved Occupancy is a useful tool for comparing how well models are performing.

Mean Blocks per SM:
The number of blocks per SM = the number of blocks of the kernel / the number of SMs of the GPU. If this number is less than 1, it means that the GPU multiprocessor is not fully utilized. "Mean Blocks per SM" is the weighted average of all runs of this kernel name, using the duration of each run as the weight.
Mean Est. Achieved Occupancy
The definition of Est. Achieved Occupancy is the same as outlined above. Mean Est. Achieved Occupancy is the weighted average of all runs of this kernel name, using the duration of each run as a weight.
Trace View:
The trace view shows a timeline with the duration of the operators in the model and which system performed each operation. This view can help you identify expensive, long-running operations and whether they are caused by data input or by model training. Currently, the trace view can show GPU Utilization and Est. SM Efficiency on the timeline.

In the above example, the GPU utilization during "ProfilerStep5" in thread 28022 is higher than that during "Optimizer.step". You can zoom in to see the reason.

As can be seen from the above figure, the kernel of the former is longer than that of the latter. The kernel execution time of the latter is too short, resulting in reduced GPU utilization.
Est. SM Efficiency: each kernel has a computed Est. SM Efficiency between 0 and 100%. For example, the kernel below has only 64 blocks, while this GPU has 80 SMs, so its "Est. SM Efficiency" is 64/80, or 0.8.

### Cloud storage support
After running pip install tensorboard, install the extra that matches your cloud provider to read data from it:
$ pip install torch-tb-profiler[blob]   # Azure Blob Storage
$ pip install torch-tb-profiler[gs]     # Google Cloud Storage
$ pip install torch-tb-profiler[s3]     # Amazon S3
For more information, see: Here
### Jump to source code
One of the benefits of integrating TensorBoard and PyTorch Profiler directly into Visual Studio Code (VS Code) is the ability to jump directly to source code (file and line) from the Profiler stack trace. The VS Code Python extension now supports TensorBoard integration.
Jump to source code is only available when TensorBoard is launched within VS Code. If profiling was run with with_stack=True, the stack trace will appear in the plugin UI. Clicking a stack trace in PyTorch Profiler makes VS Code open the corresponding file and jump directly to the corresponding line for debugging. This allows you to quickly optimize and modify your code based on the analysis results and suggestions.

Jump to source code with Visual Studio Code Plug In UI
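For the stack traces to be recorded in the first place, with_stack=True must be passed when profiling; below is a minimal sketch, reusing the illustrative model, inputs, and ./log output directory from the earlier snippet.
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    with_stack=True,   # record file/line stack traces so "jump to source" works
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
) as prof:
    model(inputs)      # placeholder workload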
For more information on how to optimize batch size performance, please see the detailed tutorial: Here
PyTorch Profiler can also be integrated with PyTorch Lightning. Simply launch a Lightning training job with the trainer.profiler=pytorch flag to generate a trace.
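In code, that flag corresponds roughly to the following sketch, assuming PyTorch Lightning is installed and MyLightningModule is your own LightningModule (both names here are illustrative):
import pytorch_lightning as pl

# "pytorch" selects Lightning's built-in wrapper around the PyTorch profiler.
trainer = pl.Trainer(profiler="pytorch", max_epochs=1)
trainer.fit(MyLightningModule())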
Detailed example: Here
Original address: Here