[draft] equivalent mixtral mlperf data pipeline #1157

ZhiyuLi-goog · 2025-01-10T08:51:30Z

Description

Add the MLPerf 5.0 MoE data pipeline.

We reuse the same MLPerf dataset as in the gpt3-175b submission:

The train dataset stays the same.
Add a text eval dataset pipeline: in gpt3 the eval dataset is pretokenized (see this PR), whereas the MoE data pipeline uses a different tokenizer, so the text eval dataset has to be tokenized on the fly.
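
As a rough illustration of what tokenizing the text eval dataset on the fly means, here is a minimal sketch. It is not the code in this PR: `toy_tokenize`, `tokenize_eval_on_the_fly`, `PAD_ID`, and the field names `inputs`, `targets`, and `targets_weights` are hypothetical names chosen for the example, and the toy tokenizer only stands in for the real MoE tokenizer.

```python
from typing import Callable, Dict, Iterable, Iterator, List

PAD_ID = 0  # assumed padding id for this sketch, not taken from the PR


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id (>= 1) per whitespace-separated word."""
    return [1 + (sum(map(ord, word)) % 1000) for word in text.split()]


def tokenize_eval_on_the_fly(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    max_len: int,
) -> Iterator[Dict[str, List[int]]]:
    """Lazily turn raw text into padded next-token-prediction examples."""
    for text in texts:
        # Keep at most max_len + 1 tokens so that inputs and targets each fit in max_len.
        ids = tokenize(text)[: max_len + 1]
        inputs, targets = ids[:-1], ids[1:]
        n_pad = max_len - len(targets)
        yield {
            "inputs": inputs + [PAD_ID] * n_pad,
            "targets": targets + [PAD_ID] * n_pad,
            # Weight 1 marks a real target, weight 0 marks padding.
            "targets_weights": [1] * len(targets) + [0] * n_pad,
        }
```

Because the function is a generator, each text is tokenized only when the eval loop asks for it, so swapping the tokenizer does not require regenerating a pretokenized dataset.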

Matched total_weights in the eval dataset, i.e. the total number of next-token predictions; see the notebook. Please remove this notebook before merging the code.
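
For readers who want to reproduce this kind of check, the sketch below shows one way to count the total weights and compare them with the expected number of next-token predictions. It assumes, without confirmation from this PR, that each real target carries weight 1, padding carries weight 0, and a document of n tokens truncated to `max_len + 1` tokens yields one prediction per token after the first. The helper names are hypothetical.

```python
from typing import Iterable, List


def total_weights(weight_rows: Iterable[List[int]]) -> int:
    """Sum all per-token weights, i.e. count the positions that contribute to the eval loss."""
    return sum(sum(row) for row in weight_rows)


def expected_predictions(token_counts: Iterable[int], max_len: int) -> int:
    """Count predictions when a document of n tokens yields min(n, max_len + 1) - 1 targets."""
    return sum(max(min(n, max_len + 1) - 1, 0) for n in token_counts)


# Two toy documents with 4 and 2 tokens, padded to max_len = 4.
weights = [[1, 1, 1, 0], [1, 0, 0, 0]]
assert total_weights(weights) == expected_predictions([4, 2], max_len=4)
```

If the two numbers disagree, the usual suspects are a different truncation length or padding positions that were left with a non-zero weight.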

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed.

If I am not able to merge it, @suexu1025 @RissyRan could you help me address the comments?

equivalent mixtral mlperf data pipeline

0cedf78

ZhiyuLi-goog requested review from gobbleturk, khatwanimohit, bvandermoon, vipannalla and RissyRan as code owners January 10, 2025 08:51

ZhiyuLi-goog assigned suexu1025 and RissyRan Jan 10, 2025

ZhiyuLi-goog added 2 commits January 11, 2025 00:20

eval before training

7e31905

update

443aaf8