Hi, I have a problem when I run readme_example.py to infer Mixtral-8x7B on an A100 GPU: the generated text comes out garbled. The full output is as follows:
/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Do not detect pre-installed ops, use JIT mode
[WARNING] FlashAttention is not available in the current environment. Using default attention.
Using /data/xxx/mirror/.cache/torch_extensions/py39_cu124 as PyTorch extensions root...
Emitting ninja build file /data/xxx/mirror/.cache/torch_extensions/py39_cu124/prefetch/build.ninja...
Building extension module prefetch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module prefetch...
Time to load prefetch op: 2.545353889465332 seconds
SPDLOG_LEVEL : (null)
2024-12-20 10:34:40.267 INFO Create ArcherAioThread for thread: , 0
2024-12-20 10:34:40.268 INFO Loading index file from , /home/xxx/moe-infinity/archer_index
2024-12-20 10:34:40.268 INFO Index file size , 995
2024-12-20 10:34:40.269 INFO Device count , 1
2024-12-20 10:34:40.269 INFO Enabled peer access for all devices
Loading model from offload_path ...
Model create: 76%|████████████████████████████████████████████████████████▌ | 760/994 [00:00<00:00, 2359.42it/s]MixtralConfig {
"_name_or_path": "/data/model_and_dataset/Mixtral-8x7B-Instruct-v0.1",
"architectures": [
"MixtralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mixtral",
"num_attention_heads": 32,
"num_experts_per_tok": 2,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"num_local_experts": 8,
"output_router_logits": false,
"rms_norm_eps": 1e-05,
"rope_theta": 1000000.0,
"router_aux_loss_coef": 0.02,
"router_jitter_noise": 0.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.47.1",
"use_cache": true,
"vocab_size": 32000
}
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/transformers/generation/utils.py:2134: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('cpu') before running .generate().
warnings.warn(
Model create: 94%|█████████████████████████████████████████████████████████████████████▏ | 930/994 [00:21<00:00, 2359.42it/s]translate English to German: How old are you?;
;inchct-- REYetetetetctet
ArcherTaskPool destructor
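For context, the warnings near the end say that no attention_mask or pad_token_id was passed to generate() and that input_ids ended up on cuda while the model was still on cpu, which may be related to the garbled output. Below is a minimal sketch of how those generate() warnings are usually addressed with plain Hugging Face Transformers. It is NOT the moe-infinity API: the model/tokenizer loading, `device_map="auto"`, and `max_new_tokens` are assumptions for illustration; only the model path is taken from the config dump above.

```python
# Minimal sketch, NOT the moe-infinity API: plain Hugging Face Transformers,
# shown only to illustrate how the three generate() warnings above are
# commonly silenced. All names below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/data/model_and_dataset/Mixtral-8x7B-Instruct-v0.1"  # path from the config dump
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires accelerate; the full model will not fit on a single A100 without offloading
)

prompt = "translate English to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")  # returns input_ids AND attention_mask

# Move the inputs to the same device as the model's input embeddings,
# so generate() does not warn about a cuda/cpu mismatch.
device = model.get_input_embeddings().weight.device
inputs = {k: v.to(device) for k, v in inputs.items()}

outputs = model.generate(
    **inputs,                             # passes attention_mask explicitly
    pad_token_id=tokenizer.eos_token_id,  # Mixtral has no pad token; reuse EOS explicitly
    max_new_tokens=32,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With MoE-Infinity the model object comes from the library's own loader rather than AutoModelForCausalLM, so only the last three steps are the relevant part here: tokenize with `return_tensors="pt"` (which also produces the attention mask), move the tensors to whatever device the model expects its inputs on, and pass `pad_token_id` explicitly.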