
issue on model.to("cuda") with device_map="auto" #61

Open

UmutAlihan opened this issue May 28, 2024 · 1 comment


UmutAlihan commented May 28, 2024

Hi,

I am getting the error below while trying to load the model on my 2x RTX 3060 GPUs using the device_map="auto" parameter:

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1395, in check_device_map(model, device_map)
   1393 if len(all_model_tensors) > 0:
   1394     non_covered_params = ", ".join(all_model_tensors)
-> 1395     raise ValueError(
   1396         f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
   1397     )

ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight

My code is:

In [2]: from transformers import AutoConfig, AutoModelForCausalLM
   ...:
   ...: model_name = 'togethercomputer/evo-1-8k-base'
   ...: #model_name = "togethercomputer/evo-1-131k-base"
   ...:
   ...: model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
   ...: model_config.use_cache = True
   ...:
   ...: model = AutoModelForCausalLM.from_pretrained(
   ...:     model_name,
   ...:     config=model_config,
   ...:     trust_remote_code=True,
   ...:     revision="1.1_fix",
   ...:     cache_dir="/llms/evo",
   ...:     low_cpu_mem_usage=True,
   ...:     device_map="auto",  ## only updated here from the repo code, so that it distributes the weights across multiple GPUs
   ...: )

What could be the root cause here, and what are possible approaches to a solution?

Any help is much appreciated. Thanks

Here is the full stderr output:

Loading checkpoint shards: 100%|████████████████| 3/3 [00:03<00:00, 1.11s/it]
Some weights of StripedHyenaModelForCausalLM were not initialized from the model checkpoint at togethercomputer/evo-1-8k-base and are newly initialized: ['backbone.unembed.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 9
      6 model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
      7 model_config.use_cache = True
----> 9 model = AutoModelForCausalLM.from_pretrained(
     10     model_name,
     11     config=model_config,
     12     trust_remote_code=True,
     13     revision="1.1_fix",
     14     cache_dir="/media/raid/llms/evo",
     15     low_cpu_mem_usage=True,
     16     device_map="auto"
     17 )

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:558, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    556     else:
    557         cls.register(config.__class__, model_class, exist_ok=True)
--> 558     return model_class.from_pretrained(
    559         pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    560     )
    561 elif type(config) in cls._model_mapping.keys():
    562     model_class = _get_model_class(config, cls._model_mapping)

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/transformers/modeling_utils.py:3820, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3818     device_map_kwargs["force_hooks"] = True
   3819 if not is_fsdp_enabled() and not is_deepspeed_zero3_enabled():
-> 3820     dispatch_model(model, **device_map_kwargs)
   3822 if hf_quantizer is not None:
   3823     hf_quantizer.postprocess_model(model)

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/big_modeling.py:351, in dispatch_model(model, device_map, main_device, state_dict, offload_dir, offload_index, offload_buffers, skip_keys, preload_module_classes, force_hooks)
    317 """
    318 Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
    319 the CPU or even the disk.
    (...)
    348 single device.
    349 """
    350 # Error early if the device map is incomplete.
--> 351 check_device_map(model, device_map)
    353 # for backward compatibility
    354 is_bnb_quantized = (
    355     getattr(model, "is_quantized", False) or getattr(model, "is_loaded_in_8bit", False)
    356 ) and getattr(model, "quantization_method", "bitsandbytes") == "bitsandbytes"

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1419, in check_device_map(model, device_map)
   1417 if len(all_model_tensors) > 0:
   1418     non_covered_params = ", ".join(all_model_tensors)
-> 1419     raise ValueError(
   1420         f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
   1421     )

ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight

mbi2gs commented Jun 4, 2024

I ran into the same issue. For some reason the backbone.unembed.weight parameter is not included in the default device map. I got it working with a custom device map, built like the following:

import json

import numpy as np

# DEFAULT_DEVICE_MAP is the path to a JSON dump of the default device map.

def make_new_device_map(num_devices: int, out_map_file: str):
    # Read in the default device map as the basis for the new one
    with open(DEFAULT_DEVICE_MAP, 'r') as indm:
        device_map = json.load(indm)

    # Distribute evenly across as many devices as available.
    # Collect all blocks, including the backbone.unembed block missing from the default map.
    device_modules = {}
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_modules[module] = None
    device_modules['backbone.unembed'] = None
    num_modules = len(device_modules)

    # Assign blocks to devices
    even_split = num_modules / num_devices
    for i, key in enumerate(device_modules.keys()):
        cur_device_idx = int(np.floor(i / even_split))
        device_modules[key] = cur_device_idx

    # Assign individual layers to devices (all layers within a block share the same device)
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_map[layer_name] = device_modules[module]
    device_map['backbone.unembed.weight'] = device_modules['backbone.unembed']

    with open(out_map_file, 'w') as outdm:
        json.dump(device_map, outdm)

Then supply the new JSON device map to the load_checkpoint_and_dispatch() function, as sketched below.
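For reference, a minimal sketch of that last step might look like the following. The output file name and the checkpoint directory are placeholders, and make_new_device_map() is the helper defined above; adapt the paths to your setup.

import json

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-8k-base'
model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")

# Build the patched device map with the helper above, then read it back as a dict
# ("evo_device_map.json" is a placeholder path)
make_new_device_map(num_devices=torch.cuda.device_count(), out_map_file="evo_device_map.json")
with open("evo_device_map.json", 'r') as indm:
    device_map = json.load(indm)

# Instantiate the model structure without allocating real weights, then dispatch
# the checkpoint shards onto devices according to the patched map
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(model_config, trust_remote_code=True)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/llms/evo",  # placeholder: directory containing the downloaded shards
    device_map=device_map,
)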
