There are blocking operations in several of the samplers. Looking at DPM-Solver++(2M), for example:
`sigmas` is generally going to be a GPU tensor. If it is on the GPU, then the line `if old_denoised is None or sigmas[i + 1] == 0:` forces a sync to the CPU as part of control flow. That sync blocks PyTorch dispatch until every previously queued operation has completed, so there is a significant gap between steps where the GPU sits idle. The actual impact varies by hardware, but keeping the dispatch queue unblocked is very beneficial: PyTorch can usually queue several steps of inference in advance, and the GPU then executes them completely uninterrupted.
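As a minimal sketch of the problem (a small CPU tensor stands in for the real GPU `sigmas`; the schedule values here are illustrative):

```python
import torch

# Stand-in for the real noise schedule; in the sampler this tensor is on the GPU.
sigmas = torch.linspace(1.0, 0.0, 10)

for i in range(len(sigmas) - 1):
    # Using a tensor in an `if` calls Tensor.__bool__, which reads the value
    # back to the host. With a CUDA tensor this blocks until every queued
    # kernel has finished, stalling dispatch once per sampling step.
    if sigmas[i + 1] == 0:
        break
```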
For this sampler, you could avoid the sync by keeping `sigmas` on the CPU, or by making a CPU copy of it used only for control flow. But that breaks other samplers (Heun and Euler at least), because they use `sigmas` in ways that genuinely do need a GPU tensor.
I think most samplers could have all tensors used for control flow precalculated before the for loop, which would solve the problem. More generally, I believe it would be preferable to keep tensors used as scalars on the CPU where possible, since that usually results in fewer kernel launches than doing the same operation on a GPU tensor.
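A sketch of that restructuring, with hypothetical names (`sigmas_cpu`, `last_step`) purely for illustration:

```python
import torch

sigmas = torch.linspace(1.0, 0.0, 10)  # on the GPU in the real sampler

# Pull everything Python-level control flow needs to the host once, before
# the loop: at most one sync here instead of one per step.
sigmas_cpu = sigmas.tolist()

for i in range(len(sigmas) - 1):
    last_step = sigmas_cpu[i + 1] == 0  # plain float comparison, no GPU sync
    if last_step:
        break
```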