I am getting the error below while trying to load the model across my 2x RTX 3060 GPUs with the device_map="auto" parameter:
File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1395, in check_device_map(model, device_map)
1393 if len(all_model_tensors) > 0:
1394 non_covered_params = ", ".join(all_model_tensors)
-> 1395 raise ValueError(
1396 f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
1397 )
ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight
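For context, `check_device_map` just verifies that every parameter in the model is covered by some key in the device map (either the parameter itself or one of its parent modules); any leftover name raises this ValueError. A minimal sketch of that coverage logic, with illustrative names rather than the real StripedHyena layout:

```python
# Sketch of the device-map coverage check: a parameter is "covered" if the
# map assigns a device to the parameter itself or to any parent module.
def uncovered_params(param_names, device_map):
    leftover = []
    for name in param_names:
        prefixes = [name] + [name.rsplit(".", i)[0] for i in range(1, name.count(".") + 1)]
        if not any(p in device_map for p in prefixes):
            leftover.append(name)
    return leftover

params = ["backbone.blocks.0.mlp.weight", "backbone.unembed.weight"]
device_map = {"backbone.blocks": 0}  # nothing covers backbone.unembed.weight
print(uncovered_params(params, device_map))  # ['backbone.unembed.weight']
```

So the error means the automatically inferred map simply has no entry whose prefix matches `backbone.unembed.weight`.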
my code is:
In [2]: from transformers import AutoConfig, AutoModelForCausalLM
   ...:
   ...: model_name = 'togethercomputer/evo-1-8k-base'
   ...: # model_name = "togethercomputer/evo-1-131k-base"
   ...:
   ...: model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
   ...: model_config.use_cache = True
   ...:
   ...: model = AutoModelForCausalLM.from_pretrained(
   ...:     model_name,
   ...:     config=model_config,
   ...:     trust_remote_code=True,
   ...:     revision="1.1_fix",
   ...:     cache_dir="/llms/evo",
   ...:     low_cpu_mem_usage=True,
   ...:     device_map="auto",  # only updated here from repo code, so that it distributes the weights to multiple GPUs
   ...: )
What could be the root cause here, and what are possible approaches to solve it?
Any help is much appreciated. Thanks!
Here you can check out the whole stderr output:
Loading checkpoint shards: 100%|████████████████| 3/3 [00:03<00:00, 1.11s/it]
Some weights of StripedHyenaModelForCausalLM were not initialized from the model checkpoint at togethercomputer/evo-1-8k-base and are newly initialized: ['backbone.unembed.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 9
6 model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
7 model_config.use_cache = True
----> 9 model = AutoModelForCausalLM.from_pretrained(
10 model_name,
11 config=model_config,
12 trust_remote_code=True,
13 revision="1.1_fix",
14 cache_dir="/media/raid/llms/evo",
15 low_cpu_mem_usage=True,
16 device_map="auto"
17 )
File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/transformers/modeling_utils.py:3820, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3818 device_map_kwargs["force_hooks"] = True
3819 if not is_fsdp_enabled() and not is_deepspeed_zero3_enabled():
-> 3820 dispatch_model(model, **device_map_kwargs)
3822 if hf_quantizer is not None:
3823 hf_quantizer.postprocess_model(model)
File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/big_modeling.py:351, in dispatch_model(model, device_map, main_device, state_dict, offload_dir, offload_index, offload_buffers, skip_keys, preload_module_classes, force_hooks)
317 """
318 Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
319 the CPU or even the disk.
(...)
348 single device.
349 """
350 # Error early if the device map is incomplete.
--> 351 check_device_map(model, device_map)
353 # for backward compatibility
354 is_bnb_quantized = (
355 getattr(model, "is_quantized", False) or getattr(model, "is_loaded_in_8bit", False)
356 ) and getattr(model, "quantization_method", "bitsandbytes") == "bitsandbytes"
File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1419, in check_device_map(model, device_map)
1417 if len(all_model_tensors) > 0:
1418 non_covered_params = ", ".join(all_model_tensors)
-> 1419 raise ValueError(
1420 f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
1421 )
ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight
I ran into the same issue. For some reason the backbone.unembed.weight parameter is not included in the default device map. I got it working with a custom device map like the following:

import json
import numpy as np

def make_new_device_map(num_devices: int, out_map_file: str):
    # Read in the default device map as the basis for the new one
    # (DEFAULT_DEVICE_MAP is the path to the model's JSON device map)
    with open(DEFAULT_DEVICE_MAP, 'r') as indm:
        device_map = json.load(indm)
    # Collect all blocks (first three name components) so they can be
    # distributed evenly across as many devices as available
    device_modules = {}
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_modules[module] = None
    # The unembed block is missing from the default map, so add it explicitly
    device_modules['backbone.unembed'] = None
    num_modules = len(device_modules)
    # Assign blocks to devices in contiguous, evenly sized chunks
    even_split = num_modules / num_devices
    for i, key in enumerate(device_modules.keys()):
        device_modules[key] = int(np.floor(i / even_split))
    # Assign individual layers to devices (all layers within a block share the same device)
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_map[layer_name] = device_modules[module]
    device_map['backbone.unembed.weight'] = device_modules['backbone.unembed']
    with open(out_map_file, 'w') as outdm:
        json.dump(device_map, outdm)
And then you supply the new json device map to the load_checkpoint_and_dispatch() function.
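To illustrate, here is how the even-split assignment above behaves on a toy device map. The block names are made up for the example; only the splitting arithmetic matches the function above:

```python
import math

# Spread 5 hypothetical blocks over 2 devices with the same even-split
# arithmetic as make_new_device_map(); the unembed block gets a device too.
blocks = ['backbone.blocks.0', 'backbone.blocks.1', 'backbone.blocks.2',
          'backbone.blocks.3', 'backbone.unembed']
num_devices = 2
even_split = len(blocks) / num_devices  # 2.5 blocks per device
assignment = {b: math.floor(i / even_split) for i, b in enumerate(blocks)}
print(assignment)
# {'backbone.blocks.0': 0, 'backbone.blocks.1': 0, 'backbone.blocks.2': 0,
#  'backbone.blocks.3': 1, 'backbone.unembed': 1}
```

Because `backbone.unembed` is added to the block list before the split, it always ends up with a concrete device, which is exactly what the coverage check in accelerate requires.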