[BUG] Can't use lighteval to evaluate the nanotron #395
Hi @alexchen4ai, thanks for the issue! Could you provide your config.yaml file?
Thanks for the reply. This is the config of my current checkpoint:
checkpoints:
  checkpoint_interval: 100000
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_final_state: false
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder:
      - /dataset1
      dataset_weights:
      - 1
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage 1
  start_training_step: 1
- data:
    dataset:
      dataset_folder:
      - /dataset2
      dataset_weights:
      - 1
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage 2
  start_training_step: 1797727
general:
  benchmark_csv_path: null
  consumed_train_samples: 6400000
  ignore_sanity_checks: true
  project: llama3-tiny-training
  run: tiny_llama_debug
  seed: 42
  step: 100000
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 1
    hidden_act: silu
    hidden_size: 576
    initializer_range: 0.02
    intermediate_size: 1536
    is_llama_config: true
    max_position_embeddings: 2048
    num_attention_heads: 8
    num_hidden_layers: 30
    num_key_value_heads: 4
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_interleaved: false
    rope_scaling: null
    rope_theta: 100000
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 128256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0008
    lr_decay_starting_step: null
    lr_decay_steps: 4497000
    lr_decay_style: cosine
    lr_warmup_steps: 3000
    lr_warmup_style: linear
    min_decay_lr: 8.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  expert_parallel_size: 1
  pp: 2
  pp_engine: 1f1b
  recompute_layer: false
  tp: 2
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
  tp_recompute_allgather: true
profiler: null
s3_upload: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: meta-llama/Llama-3.2-1B
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 10
  micro_batch_size: 32
  sequence_length: 2048
  train_steps: 4500000
  val_check_interval: 10000
For the config of the lighteval:
batch_size: 8
generation: null
logging:
  output_dir: "outputs"
  save_details: false
  push_results_to_hub: false
  push_details_to_hub: false
  push_results_to_tensorboard: false
  public_run: false
  results_org: null
  tensorboard_metric_prefix: "eval"
parallelism:
  dp: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: false
  tp_mode: ALL_REDUCE
tasks:
  dataset_loading_processes: 8
  max_samples: 10
  multichoice_continuations_start_space: null
  num_fewshot_seeds: null
  tasks: leaderboard|hellaswag|0|0
Hi @alexchen4ai, I would suggest installing nanotron from source.
Describe the bug
lighteval nanotron --checkpoint_config_path ../nexatron/examples/tiny_llama3/checkpoints/100000/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/flash_attn/ops/triton/layer_norm.py:984: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/flash_attn/ops/triton/layer_norm.py:1043: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
WARNING:lighteval.logging.hierarchical_logger:main: (0, '../nexatron/examples/tiny_llama3_nanoset/checkpoints/100000/config.yaml'), (1, 'examples/nanotron/lighteval_config_override_template.yaml'), (2, '/data/.cache/huggingface'), {
WARNING:lighteval.logging.hierarchical_logger: Load nanotron config {
skip_unused_config_keys set
Skip_null_keys set
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.005991]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:00.006073]
Traceback (most recent call last):
  File "/opt/anaconda3/envs/lighteval/bin/lighteval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/data/alex_dev/lighteval/src/lighteval/main.py", line 67, in cli_evaluate
    main_nanotron(args.checkpoint_config_path, args.lighteval_config_path, args.cache_dir)
  File "/data/alex_dev/lighteval/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/alex_dev/lighteval/src/lighteval/main_nanotron.py", line 57, in main
    model_config = get_config_from_file(
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/data/alex_dev/lighteval/src/nanotron/src/nanotron/config/config.py", line 403, in get_config_from_file
    config = get_config_from_dict(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/data/alex_dev/lighteval/src/nanotron/src/nanotron/config/config.py", line 364, in get_config_from_dict
    return from_dict(
           ^^^^^^^^^^
  File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 64, in from_dict
    value = _build_value(type_=field_type, data=field_data, config=config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 99, in _build_value
    data = from_dict(data_class=type_, data=data, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/lighteval/lib/python3.11/site-packages/dacite/core.py", line 58, in from_dict
    raise UnexpectedDataError(keys=extra_fields)
dacite.exceptions.UnexpectedDataError: can not match "tp_recompute_allgather", "recompute_layer" to any data class field
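The last line of the traceback says that the checkpoint config contains keys which the config dataclasses of the installed nanotron do not declare. This most likely means the installed nanotron is older than the one that wrote the checkpoint. The sketch below shows the same kind of check using only the standard library. `ParallelismArgsSketch` and `unexpected_keys` are hypothetical names for illustration, not nanotron's or dacite's actual classes or functions, and the field list is an assumption about what an older dataclass might look like.

```python
from dataclasses import dataclass, fields


@dataclass
class ParallelismArgsSketch:
    """Hypothetical stand-in for an older parallelism dataclass (NOT nanotron's real class)."""

    dp: int
    pp: int
    tp: int
    pp_engine: str = "1f1b"
    tp_mode: str = "ALL_REDUCE"
    tp_linear_async_communication: bool = False
    expert_parallel_size: int = 1


def unexpected_keys(data: dict, data_class: type) -> list[str]:
    """Return the keys of `data` that are not fields of `data_class`, sorted."""
    known = {f.name for f in fields(data_class)}
    return sorted(key for key in data if key not in known)


# The `parallelism` section of the checkpoint config posted in this thread.
checkpoint_parallelism = {
    "dp": 2,
    "expert_parallel_size": 1,
    "pp": 2,
    "pp_engine": "1f1b",
    "recompute_layer": False,
    "tp": 2,
    "tp_linear_async_communication": True,
    "tp_mode": "REDUCE_SCATTER",
    "tp_recompute_allgather": True,
}

# Keys a strict loader would reject because the dataclass does not declare them.
print(unexpected_keys(checkpoint_parallelism, ParallelismArgsSketch))
# → ['recompute_layer', 'tp_recompute_allgather']
```

Running a check like this against the real dataclass of the installed package would show whether the mismatch is limited to these two keys.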
To Reproduce
lighteval nanotron --checkpoint_config_path ../nanotron/examples/tiny_llama3/checkpoints/100000/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
Version info
I installed the latest nanotron and lighteval with pip install lighteval[nanotron].
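Since "latest" can differ between a pip install and a source checkout, it may help to report the exact installed versions. A minimal sketch using the standard library's `importlib.metadata` follows. The helper name `installed_versions` is made up for this example, and the distribution names are assumed to match the package names used in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names are an assumption; adjust them if the packages are published under other names.
    for name, ver in installed_versions(["lighteval", "nanotron", "dacite"]).items():
        print(f"{name}: {ver or 'not installed'}")
```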