Adding AxoNN's 3D tensor parallelism [WIP] #1086

Draft: wants to merge 6 commits into base: main

@siddharth9820 siddharth9820 commented Nov 28, 2023

Steps to run -

Install AxoNN (dependencies - PyTorch and mpi4py) -

  • git clone git@github.com:axonn-ai/axonn.git
  • cd axonn
  • git checkout 45647ea
  • pip install -e .

Preparing a config file to use AxoNN -

First, set "use_axonn_model_parallelism": true.
Then set "depth_model_parallel_size", "row_model_parallel_size", and "column_model_parallel_size" as per the requirements of your model. The product of these three values must equal "model_parallel_size".

You can also set "optimize_axonn_communication": true to enable communication optimizations. These also require you to set an environment variable: export CUDA_DEVICE_MAX_CONNECTIONS=1

At a high level, the matrix multiplications in your model will be sharded over "model_parallel_size" GPUs.
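
The size constraint above can be checked before launching a job. Below is a minimal sketch in Python, assuming the config has already been loaded into a plain dict. `check_axonn_sizes` is a hypothetical helper, not part of AxoNN or GPT-NeoX, and treating an unset size as 1 is an assumption made for illustration.

```python
from math import prod

# Config keys named in the instructions above.
AXONN_SIZE_KEYS = (
    "depth_model_parallel_size",
    "row_model_parallel_size",
    "column_model_parallel_size",
)


def check_axonn_sizes(config: dict) -> int:
    """Return depth * row * column, or raise if it differs from
    "model_parallel_size". Hypothetical helper, for illustration only."""
    # Assumption: a size that is missing from the config counts as 1.
    sizes = [int(config.get(key, 1)) for key in AXONN_SIZE_KEYS]
    if any(size < 1 for size in sizes):
        raise ValueError(f"parallel sizes must be positive, got {sizes}")
    product = prod(sizes)
    expected = int(config["model_parallel_size"])
    if product != expected:
        raise ValueError(
            f"depth * row * column = {product}, "
            f"but model_parallel_size = {expected}"
        )
    return product


# Example: a 2-GPU setup that shards only along the depth dimension.
example = {
    "use_axonn_model_parallelism": True,
    "depth_model_parallel_size": 2,
    "row_model_parallel_size": 1,
    "column_model_parallel_size": 1,
    "model_parallel_size": 2,
}
print(check_axonn_sizes(example))  # prints 2
```

A check like this only validates the arithmetic. It says nothing about which split of depth, row, and column performs best for a given model.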

ToDos -

  • Implement AxoNN support for parallel_output=True (GPT-J residual)
  • Implement AxoNN support for LLaMAMLP
  • Integrate communication optimizations

@Quentin-Anthony
Member

Under testing

@Quentin-Anthony Quentin-Anthony self-assigned this Nov 29, 2023
@siddharth9820
Author

Correctness check on 125M.yml with
use_axonn_model_parallelism:true, column_model_parallel_size=1, row_model_parallel_size=1, depth_model_parallel_size=2, model_parallel_size=2 on 2 GPUs.
Dataset - enwik8

(the loss curve is smoothed over 100 iterations)
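
For readers who want to reproduce a comparable plot, here is a minimal sketch of smoothing a loss curve over a fixed window. The exact smoothing method used for the plot is not stated, so the trailing moving average below is an assumption, and `smooth` is a hypothetical helper.

```python
def smooth(losses, window=100):
    """Trailing moving average: entry i is the mean of the last `window`
    losses up to and including i (fewer values at the very start)."""
    smoothed = []
    running_sum = 0.0
    for i, loss in enumerate(losses):
        running_sum += loss
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= losses[i - window]
        count = min(i + 1, window)
        smoothed.append(running_sum / count)
    return smoothed


# Tiny example with window=2 so the arithmetic is easy to follow.
print(smooth([4.0, 2.0, 3.0, 1.0], window=2))  # prints [4.0, 3.0, 2.5, 2.0]
```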

[image: loss curve]

@siddharth9820
Author

@Quentin-Anthony I have updated the install instructions to install axonn from a fixed commit - 3ebc34c

@Quentin-Anthony
Member

> @Quentin-Anthony I have updated the install instructions to install axonn from a fixed commit - 3ebc34c

Thanks!

@siddharth9820
Author

siddharth9820 commented Jan 18, 2024

@Quentin-Anthony Pushed some communication optimizations and also updated the instructions to install axonn from a newer commit - 45647ea.

To enable these optimizations, you just need to set "optimize_axonn_communication": true in your neo-x config files. They also require you to set an environment variable: export CUDA_DEVICE_MAX_CONNECTIONS=1
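
The pairing of the config flag with the environment variable is easy to miss, so a launch script could guard against it. This is a minimal sketch, assuming the config is available as a dict; `check_comm_env` is a hypothetical helper and not something this PR adds.

```python
import os


def check_comm_env(config: dict, environ=os.environ) -> None:
    """Raise if communication optimizations are requested but
    CUDA_DEVICE_MAX_CONNECTIONS is not set to 1. Hypothetical helper."""
    if not config.get("optimize_axonn_communication", False):
        return  # Optimizations are off, so the variable is not required.
    value = environ.get("CUDA_DEVICE_MAX_CONNECTIONS")
    if value != "1":
        raise RuntimeError(
            "optimize_axonn_communication is true but "
            f"CUDA_DEVICE_MAX_CONNECTIONS={value!r}; "
            "run: export CUDA_DEVICE_MAX_CONNECTIONS=1"
        )


# Passing the environment explicitly keeps the example self-contained.
check_comm_env(
    {"optimize_axonn_communication": True},
    environ={"CUDA_DEVICE_MAX_CONNECTIONS": "1"},
)
```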

Labels: feature request (New feature or request)
4 participants