accelerate multi-node multi-gpu oom #3310

Open
2 of 4 tasks
rastinrastinii opened this issue Dec 22, 2024 · 0 comments


System Info

Node1:
- `Accelerate` version: 1.2.1
- Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
- `accelerate` bash location: /home/mshahsavari/.pyenv/versions/3.11.10_venv/bin/accelerate
- Python version: 3.11.10
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 62.50 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - debug: True
        - num_processes: 4
        - machine_rank: 0
        - num_machines: 2
        - gpu_ids: all
        - main_process_ip: 172.16.22.61
        - main_process_port: 6834
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []


Node2:
- `Accelerate` version: 1.2.1
- Platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
- `accelerate` bash location: /home/mshahsavari/.pyenv/versions/3.11.10_venv/bin/accelerate
- Python version: 3.11.10
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 227.95 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - debug: True
        - num_processes: 2
        - machine_rank: 1
        - num_machines: 2
        - gpu_ids: all
        - main_process_ip: 172.16.22.61
        - main_process_port: 6834
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
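
Note on the two configs above: they differ in `num_processes` (4 on Node1, 2 on Node2). In Accelerate's launcher this value is documented as the total number of processes across all machines, so it would normally be the same on every node. Below is a minimal sketch that diffs the fields expected to match. `SHARED_KEYS` and `config_mismatches` are hypothetical names, and the choice of keys is an assumption, not something Accelerate provides.

```python
# Fields that (as an assumption) should be identical on every node.
SHARED_KEYS = (
    "num_processes",
    "num_machines",
    "main_process_ip",
    "main_process_port",
    "mixed_precision",
)

# Values copied from the two configs reported above.
node1 = {
    "num_processes": 4,
    "machine_rank": 0,
    "num_machines": 2,
    "main_process_ip": "172.16.22.61",
    "main_process_port": 6834,
    "mixed_precision": "fp16",
}
node2 = {
    "num_processes": 2,
    "machine_rank": 1,
    "num_machines": 2,
    "main_process_ip": "172.16.22.61",
    "main_process_port": 6834,
    "mixed_precision": "fp16",
}


def config_mismatches(a, b, keys=SHARED_KEYS):
    """Return {key: (value_in_a, value_in_b)} for shared keys that differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


mismatches = config_mismatches(node1, node2)
print(mismatches)  # {'num_processes': (4, 2)}
```

`machine_rank` is deliberately left out of the comparison, since it is expected to differ between nodes.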

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Run `accelerate launch pippy_example2.py` on 2 nodes. Each node has 2 GPUs, and each GPU has 24 GB of VRAM.
The run fails with an out-of-memory (OOM) error before reaching the `model.eval()` line.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from accelerate import PartialState, prepare_pippy, init_empty_weights, load_checkpoint_and_dispatch
from torch.distributed import init_process_group
import os

def main():

    model_name = "google/gemma-2-27b-it"  # Replace with the correct Hugging Face model ID
    with init_empty_weights():
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

    model.tie_weights()
    print('\n\n###########################################\n###########################################\nload_checkpoint_and_dispatch2\n###########################################\n###########################################\n\n')
    model = load_checkpoint_and_dispatch(model, device_map="auto", checkpoint='/storage/.cache/huggingface/hub/models--google--gemma-2-27b-it/snapshots/aaf20e6b9f4c0fcf043f6fb2a2068419086d77b0')
    # model.tie_weights()
    print('\n\n###########################################\n###########################################\neval\n###########################################\n###########################################\n\n')
    model.eval()

    # Input configs
    # Create example inputs for the model
    print('\n\n###########################################\n###########################################\ntest\n###########################################\n###########################################\n\n')
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    prompts = ("I would like to", "I really like to")  # bs = 2, sending 2 per process
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)

    prompts = ("I would like to", "I really like to", "The weather is pretty")  # bs = 3
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    inputs = inputs.to(0)
    with torch.no_grad():
        output = model(**inputs)

    # The outputs are only on the final process by default
    if PartialState().is_last_process:
        next_token_logits = output[0][:, -1, :]
        next_token = torch.argmax(next_token_logits, dim=-1)
        print(tokenizer.batch_decode(next_token))
    PartialState().destroy_process_group()

if __name__ == "__main__":
    main()
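
Note on the script above: every process started by `accelerate launch` runs `main()` on its own, so each one calls `load_checkpoint_and_dispatch` with `device_map="auto"` separately. To my understanding, `device_map="auto"` only considers the devices visible to the calling process, but that is not verified here. A minimal sketch for logging which process is about to load the model, assuming the launcher exports `RANK`, `LOCAL_RANK` and `WORLD_SIZE` the way PyTorch's distributed launchers do; `describe_process` is a hypothetical helper.

```python
import os


def describe_process(env=None):
    """Summarize this process's position in the job from launcher env vars.

    Falls back to single-process defaults when the variables are not set.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return f"rank {rank}/{world_size} (local rank {local_rank})"


# Example with explicit values; in the real script call describe_process()
# with no argument just before load_checkpoint_and_dispatch.
example = describe_process({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
print(example)  # rank 3/4 (local rank 1)
```

If this line is printed once per process before the OOM, it would suggest that several processes are each trying to place the full checkpoint on the same local GPUs.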

Expected behavior

Load the model distributed across the 4 GPUs on the 2 nodes. I want to run multi-node inference because no single node can hold the whole model.
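
The memory claim can be checked with rough arithmetic. This is a sketch that counts only the float16 weights and assumes about 27e9 parameters, as the model name suggests; activations, the KV cache and CUDA overhead are ignored, so real usage would be higher.

```python
PARAMS = 27e9          # assumption: approximate parameter count from the model name
BYTES_PER_PARAM = 2    # float16, as requested via torch_dtype in the script
GPU_GB = 24            # VRAM per GPU, as reported
GPUS_PER_NODE = 2
NODES = 2

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # weights only
per_node_gb = GPU_GB * GPUS_PER_NODE          # VRAM available on one node
total_gb = per_node_gb * NODES                # VRAM across both nodes

fits_one_node = weights_gb <= per_node_gb
fits_cluster = weights_gb <= total_gb

print(f"weights: {weights_gb:.0f} GB, one node: {per_node_gb} GB, all nodes: {total_gb} GB")
print(f"fits on one node: {fits_one_node}, fits across nodes: {fits_cluster}")
```

Under these assumptions the weights need about 54 GB, which exceeds the 48 GB on a single node but fits within the 96 GB across both nodes.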
