chainer_profutil

This is an UNOFFICIAL Chainer related tool. This tool helps you to find forward, backward and update part from profiling details when you use NVIDIA Visual Profiler (https://developer.nvidia.com/nvidia-visual-profiler / https://docs.nvidia.com/cuda/profiler-users-guide/index.html) or NVIDIA Nsight Systems (https://developer.nvidia.com/nsight-systems / https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.1-x86/01-overview.htm). As a result, you can improve your workload more efficiently.

How to use in case of NVIDIA Visual Profiler

Change your code according to example codes below
Run your code via nvprof (eg. nvprof -o prof.nvvp python main.py ...)
Load prof.nvvp to NVIDIA Visual Profiler nvvp
Enjoy your profiling and accelerating!

To track all child processes when using ChainerMN and/or `MultiprocessParallelUpdater`.

Additional option, --profile-child-processes, makes it to track all child processes. In addition, the profiler adds each process ID into the file name by using %p like prof%p.nvvp.

How to use in case of NVIDIA Nsight Systems

Change your code according to example codes below
Run your code via nsys (eg. nsys profile --trace cuda,cublas,cudnn,nvtx,osrt python3 train.py ...)
Load report1.qdstrm to NVIDIA Nsight Systems GUI
Let's analyze your bottleneck

Simple example.

Adding 2 lines is all you need. First, import a function. Second, call it.

Before.

optimizer = chainer.optimizers.Adam(alpha=0.001)
optimizer.setup(model)

After.

from chainer_profutil import create_marked_profile_optimizer

optimizer = create_marked_profile_optimizer(
    chainer.optimizers.Adam(alpha=0.001), sync=True, sync_level=2)
optimizer.setup(model)

Example for ChainerMN.

When you use ChainerMN's create_multi_node_optimizer(), you need to give an instance returned from create_multi_node_optimizer() to create_marked_profile_optimizer() as follows.

optimizer = create_marked_profile_optimizer(
    chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.01, momentum=0.9),
        comm),
    sync=False)
optimizer.setup(model)

Profiling tips.

Reducing the number of iterations.

A training script usually runs training procedure at multiple epochs. But, the size of nvprof output becomes large if the training runs for long hours.

Therefore, we strongly recommend that you add an additional option to your code corresponding to the number of iterations. If this iteration option is given, then a script stops by the given iteration instaed of epochs. By this change, we can get relatively small profiling output and operate it by NVIDIA Visual Profiler.

Synchronization level.

This tool has 3 synchronization levels and sync/async switch. Each level corresponds to each marker. When you disable synchronize mode (ie., sync=False), then all markers don't synchronize all GPU kernels as below.

When you enable synchronize mode (ie., sync=True), then some markers synchronize corresponding GPU kernels and other markers are asynchronous.

At level 1 (ie., sync_level=1), highest marker only synchronizes at the being and end of 1 iteration.

At level 2 (ie., sync_level=2), forward/backward/update markers synchronize at the begin and end of corresponding kernels.

On level 3 (ie., sync_level=3), all markers synchronize corresponding kernels.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
chainer_profutil		chainer_profutil
docs/imgs		docs/imgs
examples		examples
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

chainer_profutil

How to use in case of NVIDIA Visual Profiler

To track all child processes when using ChainerMN and/or `MultiprocessParallelUpdater`.

How to use in case of NVIDIA Nsight Systems

Simple example.

Before.

After.

Example for ChainerMN.

Profiling tips.

Reducing the number of iterations.

Synchronization level.

About

Releases

Packages

Languages

License

lazykyama/chainer_profutil

Folders and files

Latest commit

History

Repository files navigation

chainer_profutil

How to use in case of NVIDIA Visual Profiler

To track all child processes when using ChainerMN and/or MultiprocessParallelUpdater.

How to use in case of NVIDIA Nsight Systems

Simple example.

Before.

After.

Example for ChainerMN.

Profiling tips.

Reducing the number of iterations.

Synchronization level.

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

To track all child processes when using ChainerMN and/or `MultiprocessParallelUpdater`.

Packages