
Artifact Model Training Optimizations #155

Merged: 6 commits into master from ebenj/faster-artifact, Nov 4, 2024
Conversation

fatsmcgee (Collaborator) commented Nov 2, 2024

All performance numbers cited below are from an 8-CPU machine (n1-standard-8) with a Tesla T4 GPU.

This PR introduces the following:

  • For data iteration / inference outside of model learning, which includes both model evaluation and the construction of ArtifactDataset, allow a distinct inference batch size (larger is much faster) and use torch inference mode. With a large inference batch size (the default is 8192), inference is much faster: e.g., ArtifactDataset construction drops from 6.5 to 3.5 minutes. See the first sketch after this list.
  • Resolve performance problems with num_workers > 0, which were most likely caused by serialization of ArtifactBatch.original_data in DataLoader workers. Instead of storing original_data, store only the pieces of that data that are actually accessed. See the second sketch after this list.
    • Previously, with a GPU, num_workers=4 was much slower than num_workers=0; now it is slightly faster (an 8 min 20 s epoch versus 8 min 34 s).
    • This also resolves a runtime crash in Linux environments where the open file limit is low (I observed this on a Google Cloud VM). The likely cause is that PyTorch opens shared memory "files" to communicate data between workers, and original_data contained lots of nested Python (not PyTorch/NumPy) data that got serialized independently.
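
A minimal sketch of the first change, assuming stand-in names (model, dataset, and collate below are illustrative, not Permutect's actual classes; only the 8192 default comes from this PR):

```python
import torch
from torch.utils.data import DataLoader

INFERENCE_BATCH_SIZE = 8192  # default added in this PR; training keeps its own, smaller batch size

def run_inference(model: torch.nn.Module, dataset, collate, device: torch.device):
    loader = DataLoader(dataset, batch_size=INFERENCE_BATCH_SIZE, shuffle=False, collate_fn=collate)
    model.eval()
    outputs = []
    # inference_mode() disables autograd tracking entirely (stronger than no_grad),
    # removing per-op bookkeeping during pure forward passes such as evaluation
    # and ArtifactDataset construction.
    with torch.inference_mode():
        for batch in loader:
            outputs.append(model(batch.to(device)).cpu())
    return torch.cat(outputs)
```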
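
And a hedged sketch of the second change; representation and label are hypothetical field names, not the repo's actual attributes:

```python
import torch

class ArtifactBatch:
    def __init__(self, data_items):
        # Before: self.original_data = data_items kept nested Python objects, each of
        # which was pickled separately when DataLoader workers shipped batches through
        # shared memory -- slow, and prone to exhausting a low open-file limit.
        # After: keep only the fields downstream code actually reads, as flat tensors.
        self.representations = torch.stack([item.representation for item in data_items])
        self.labels = torch.tensor([item.label for item in data_items])
```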

@@ -117,6 +120,8 @@ def add_training_params_to_parser(parser):
help='number of epochs for primary training loop')
parser.add_argument('--' + constants.NUM_CALIBRATION_EPOCHS_NAME, type=int, default=0, required=False,
help='number of calibration-only epochs')
parser.add_argument('--' + constants.INFERENCE_BATCH_SIZE_NAME, type=int, default=8192, required=False,
davidbenjamin (Owner) commented:
This will also need to go in the WDL scripts, but I'll handle that in a follow-up PR.
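
For illustration, a hedged sketch of the new flag in isolation; the literal flag name 'inference_batch_size' and the help text are assumptions, with only the type, default, and required settings taken from the diff above:

```python
import argparse

# Assumes constants.INFERENCE_BATCH_SIZE_NAME == 'inference_batch_size'.
parser = argparse.ArgumentParser()
parser.add_argument('--inference_batch_size', type=int, default=8192, required=False,
                    help='batch size for evaluation and ArtifactDataset construction')

args = parser.parse_args([])       # no CLI args: falls back to the default
print(args.inference_batch_size)   # 8192
```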

@davidbenjamin davidbenjamin merged commit ebf9b20 into master Nov 4, 2024
@davidbenjamin davidbenjamin deleted the ebenj/faster-artifact branch November 4, 2024 19:24