Fail to run readme_example.py #33

Open
pegga1225 opened this issue Dec 20, 2024 · 0 comments
Hi, I have a problem when I run readme_example.py to infer Mixtral-8x7B on an A100 GPU; the script I run is included below the log for reference. The error message is as follows:

/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Do not detect pre-installed ops, use JIT mode
[WARNING] FlashAttention is not available in the current environment. Using default attention.
Using /data/xxx/mirror/.cache/torch_extensions/py39_cu124 as PyTorch extensions root...
Emitting ninja build file /data/xxx/mirror/.cache/torch_extensions/py39_cu124/prefetch/build.ninja...
Building extension module prefetch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module prefetch...
Time to load prefetch op: 2.545353889465332 seconds
SPDLOG_LEVEL : (null)
2024-12-20 10:34:40.267 INFO Create ArcherAioThread for thread: , 0
2024-12-20 10:34:40.268 INFO Loading index file from , /home/xxx/moe-infinity/archer_index
2024-12-20 10:34:40.268 INFO Index file size , 995
2024-12-20 10:34:40.269 INFO Device count , 1
2024-12-20 10:34:40.269 INFO Enabled peer access for all devices
Loading model from offload_path ...
Model create: 76%|████████████████████████████████████████████████████████▌ | 760/994 [00:00<00:00, 2359.42it/s]
MixtralConfig {
"_name_or_path": "/data/model_and_dataset/Mixtral-8x7B-Instruct-v0.1",
"architectures": [
"MixtralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mixtral",
"num_attention_heads": 32,
"num_experts_per_tok": 2,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"num_local_experts": 8,
"output_router_logits": false,
"rms_norm_eps": 1e-05,
"rope_theta": 1000000.0,
"router_aux_loss_coef": 0.02,
"router_jitter_noise": 0.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.47.1",
"use_cache": true,
"vocab_size": 32000
}

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
/home/xxx/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/transformers/generation/utils.py:2134: UserWarning: You are calling `.generate()` with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example `input_ids = input_ids.to('cpu')` before running `.generate()`.
warnings.warn(
Model create: 94%|█████████████████████████████████████████████████████████████████████▏ | 930/994 [00:21<00:00, 2359.42it/s]
translate English to German: How old are you?;
;inchct-- REYetetetetctet

ArcherTaskPool destructor
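
For reference, my readme_example.py is roughly the snippet below, adapted from the MoE-Infinity README for my local checkpoint; the `MoE` config keys and their values are just what I use on this machine and may differ in other setups.

```python
import os
from transformers import AutoTokenizer
from moe_infinity import MoE

# Local copy of the checkpoint (matches "_name_or_path" in the config dump above).
checkpoint = "/data/model_and_dataset/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Offloading config, roughly as in the README; offload_path matches the
# archer_index location shown in the log above.
config = {
    "offload_path": os.path.join(os.path.expanduser("~"), "moe-infinity"),
    "device_memory_ratio": 0.75,
}
model = MoE(checkpoint, config)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda:0")

output_ids = model.generate(input_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```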
