GPU Performance Optimizations #153

fatsmcgee · 2024-10-25T17:10:15Z

This PR introduces the following optimizations. Together they make small-batch training somewhat faster (batch=256 on a T4 is about 15-20% faster) and large-batch training a lot faster:

- Asynchronously load the next batch to the GPU while the forward/backward pass is occurring.
- Set fused=True for AdamW (small but non-trivial speedup).
- In Gated MLP, use torch.mean(..., keepdim=True) instead of broadcasting with torch.mean(...)[:,None,:] (small but non-trivial speedup).
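
For readers who want a concrete picture, here is a minimal sketch of the three changes. It is not the code from this PR: `prefetch_to_device`, the toy `nn.Linear` model, and the tensor shapes are hypothetical stand-ins, and it assumes PyTorch 2.0 or later (where `AdamW` accepts `fused`). The sketch only turns on `fused` when running on CUDA, as a conservative choice. Note that on a single CUDA stream `non_blocking=True` mainly frees the host from waiting on the copy; fully overlapping the copy with kernels would likely need a separate stream, which is left out here.

```python
import torch
from torch import nn


def prefetch_to_device(batches, device):
    """Yield batches on `device`, issuing the copy of batch k+1 before
    batch k is handed to the caller (hypothetical helper, not from the PR)."""
    use_cuda = device.type == "cuda"
    pending = None
    for cpu_batch in batches:
        if use_cuda:
            # Pinned host memory is needed for the copy to be asynchronous.
            cpu_batch = cpu_batch.pin_memory()
        on_device = cpu_batch.to(device, non_blocking=use_cuda)
        if pending is not None:
            yield pending
        pending = on_device
    if pending is not None:
        yield pending


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(3, 1).to(device)

# Only request the fused kernel on CUDA; fall back to the default elsewhere.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, fused=(device.type == "cuda")
)

cpu_batches = [torch.randn(256, 5, 3) for _ in range(4)]
seen = 0
for batch in prefetch_to_device(cpu_batches, device):
    # keepdim=True yields shape (256, 1, 3) directly, the same result as
    # torch.mean(batch, dim=1)[:, None, :] without the extra indexing step.
    pooled = torch.mean(batch, dim=1, keepdim=True)
    loss = model(pooled).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    seen += 1
```

The generator keeps the training loop unchanged: the loop simply iterates over `prefetch_to_device(...)` instead of the raw batches, and every batch it receives is already on the model's device.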

fatsmcgee added 20 commits October 21, 2024 16:42

Experiment with MPS support

a483d88

Have GPUs (and MPS) use float32 to eliminate NaNs

1e385cf

Later pysam works better with GPU image build

acb58f0

Switch to a pytorch base image, primarily so we can have a known CUDA version to avoid future regressions

33c4338

Log cuda availability

32eae1c

Explicitly specify an nvidia driver version which will support CUDA 12.1, add zones just in case, use p4 instead of t4

eb0d3e3

Revert requirement changes, not needed

b547970

Align wdl with GCE driver recommendation, add note in Dockerfile about this

3fdc786

Get rid of debug changes

0ba9b38

Get rid of debug change

bcc8601

Enforce invariant that data is already on device

ee6da6a

When recording embeddings, do inference on the model device

897a987

Move copies together

44e063a

Print out time elapsed for epoch in seconds

dbea93a

Use settings for permutect base training more likely to result in GPU usage

62ba624

Implement lookahead optimization

5d976b7

Make sure to pass in required non_blocking arg

39efe0d

Small optimization: use keepdim instead of broadcasting

ae9926f

Fused AdamW is slightly faster

3d1514a

merge

4abcbc3

fatsmcgee changed the title ~~[In Progress] GPU Lookahead~~ Async GPU Transfer Optimization Oct 27, 2024

fatsmcgee changed the title ~~Async GPU Transfer Optimization~~ GPU Performance Optimizations Oct 27, 2024

davidbenjamin merged commit b41565f into master Oct 28, 2024

davidbenjamin deleted the ebenj/gpu-lookahead branch October 28, 2024 04:10
