
Artifact Model Training Optimizations #155

Merged: 6 commits into master from ebenj/faster-artifact, Nov 4, 2024
Conversation

fatsmcgee (Collaborator) commented Nov 2, 2024

All performance numbers cited below are from an 8-CPU machine (n1-standard-8) with a Tesla T4 GPU.

This PR introduces the following:

  • For data iteration / inference outside of model learning, which includes both model evaluation and the construction of ArtifactDataset, allow a distinct inference batch size (larger is much faster) and use torch inference mode. With a large inference batch size (the default is 8192), inference is much faster: e.g., ArtifactDataset construction drops from 6.5 to 3.5 minutes. See the first sketch after this list.
  • Resolve performance problems with num_workers > 0, which were most likely caused by serialization of ArtifactBatch.original_data in DataLoader workers. Instead of storing original_data, store only the pieces of that data that are actually accessed. See the second sketch after this list.
    • Previously, with a GPU, num_workers=4 was much slower than num_workers=0; now it is slightly faster (an 8 min 20 s epoch versus 8 min 34 s).
    • This also resolves a runtime crash in Linux environments where the open file limit is low (I observed this on a Google Cloud VM). The likely cause is that PyTorch opens shared memory "files" to communicate data between workers, and original_data contained lots of nested Python (not PyTorch/NumPy) data that got serialized independently.
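
A minimal sketch of the first change, assuming stand-in names (model, dataset, and collate below are illustrative, not Permutect's actual classes; only the 8192 default comes from this PR):

```python
import torch
from torch.utils.data import DataLoader

INFERENCE_BATCH_SIZE = 8192  # default added in this PR; training keeps its own, smaller batch size

def run_inference(model: torch.nn.Module, dataset, collate, device: torch.device):
    loader = DataLoader(dataset, batch_size=INFERENCE_BATCH_SIZE, shuffle=False, collate_fn=collate)
    model.eval()
    outputs = []
    # inference_mode() disables autograd tracking entirely (stronger than no_grad),
    # removing per-op bookkeeping during pure forward passes such as evaluation
    # and ArtifactDataset construction.
    with torch.inference_mode():
        for batch in loader:
            outputs.append(model(batch.to(device)).cpu())
    return torch.cat(outputs)
```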
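
And a hedged sketch of the second change; representation and label are hypothetical field names, not the repo's actual attributes:

```python
import torch

class ArtifactBatch:
    def __init__(self, data_items):
        # Before: self.original_data = data_items kept nested Python objects, each of
        # which was pickled separately when DataLoader workers shipped batches through
        # shared memory -- slow, and prone to exhausting a low open-file limit.
        # After: keep only the fields downstream code actually reads, as flat tensors.
        self.representations = torch.stack([item.representation for item in data_items])
        self.labels = torch.tensor([item.label for item in data_items])
```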

@@ -117,6 +120,8 @@ def add_training_params_to_parser(parser):
help='number of epochs for primary training loop')
parser.add_argument('--' + constants.NUM_CALIBRATION_EPOCHS_NAME, type=int, default=0, required=False,
help='number of calibration-only epochs')
parser.add_argument('--' + constants.INFERENCE_BATCH_SIZE_NAME, type=int, default=8192, required=False,
davidbenjamin (Owner) commented:
This will also need to go in the WDL scripts, but I'll handle that in a follow-up PR.
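
For illustration, a hedged sketch of the new flag in isolation; the literal flag name 'inference_batch_size' and the help text are assumptions, with only the type, default, and required settings taken from the diff above:

```python
import argparse

# Assumes constants.INFERENCE_BATCH_SIZE_NAME == 'inference_batch_size'.
parser = argparse.ArgumentParser()
parser.add_argument('--inference_batch_size', type=int, default=8192, required=False,
                    help='batch size for evaluation and ArtifactDataset construction')

args = parser.parse_args([])       # no CLI args: falls back to the default
print(args.inference_batch_size)   # 8192
```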

@davidbenjamin davidbenjamin merged commit ebf9b20 into master Nov 4, 2024
@davidbenjamin davidbenjamin deleted the ebenj/faster-artifact branch November 4, 2024 19:24