
Dataset Sanity Check bug report and seeking help for training data preparation #6

EigenTom opened this issue Dec 22, 2024 · 1 comment


@EigenTom

Thank you for your diligence and hard work on this project.

I am working on replicating the training procedure of CEPE with minimal training data. During the data preparation process, I encountered some questions regarding:

  1. downloading the training data (C4), and
  2. preparing the downloaded training data.

What I have done:

  1. I followed the guidance in the C4 download guide and successfully downloaded and obtained .jsonl files, as shown below:

[screenshot: listing of the downloaded .jsonl files]

with their content shown in the following screenshot:

[screenshot: contents of a .jsonl file]

  2. I placed all processed .jsonl files under ./data/redpajama/c4-rp/*.jsonl in the cloned CEPE repository, which I suspect may not be the correct file path, although I couldn't find any clear suggestion in the README.md in ./data about how to organize the downloaded and preprocessed datasets from different domains.

  3. I successfully ran get_all_jsonl.py and obtained the txt file shown below:

[screenshot: generated all_jsonl.txt]
  4. I tried to directly run bash run_tokenize.sh, where I encountered several issues:
    i. The code will only process one .jsonl file. In the tokenize_files.py it calls, line 74 suggests this code is prepared for slurm clusters only.

    ii. After I modified line 78 to allow the variable file_names to include all .jsonl files for preprocessing, I found the preprocessed files were saved as .npy files, but they cannot be correctly read by sanity_check.py:

[screenshot: error raised by sanity_check.py]

    iii. I looked into the code responsible for saving and reading the .npy files. It shows that in tokenize_files.py, line 70, np.save() is called, but in sanity_check.py, line 37, pickle.load() is called; a minimal reproduction is sketched after this list.
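
For reference, here is a minimal, self-contained reproduction of the mismatch (the file name example.npy is just illustrative):

```python
import pickle
import numpy as np

# Write a small array the way tokenize_files.py does (np.save).
np.save("example.npy", np.array([1, 2, 3]))

# Reading it back the way sanity_check.py does (pickle.load) fails,
# because a .npy header is not a valid pickle stream.
try:
    with open("example.npy", "rb") as f:
        pickle.load(f)
except Exception as e:  # typically pickle.UnpicklingError
    print(f"pickle.load failed: {type(e).__name__}: {e}")

# Reading with np.load() works as expected.
print(np.load("example.npy"))  # [1 2 3]
```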

The error stops me from processing the dataset further for training. I am wondering what causes the .npy files to appear corrupted, and what the best practice is for preprocessing the dataset.

Many thanks!

@howard-yen
Collaborator

Hello, thank you for your interest in our work, and I apologize for the late reply due to the recent holidays.

> how to organize the downloaded and preprocessed dataset from different domains

it doesn't really matter where you put these files, as long as all_jsonl.txt contains the path to each file
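
For illustration, a manifest like that could be generated with something like the sketch below. This is a hypothetical stand-in, not the repo's get_all_jsonl.py, and the data directory mirrors the layout from the report above:

```python
from pathlib import Path

# Hypothetical sketch: collect every .jsonl under the data root and
# write one path per line, which is all all_jsonl.txt needs to contain.
data_dir = Path("data/redpajama")  # assumed data root; any layout works
jsonl_paths = sorted(data_dir.rglob("*.jsonl"))

with open("all_jsonl.txt", "w") as f:
    f.writelines(f"{p}\n" for p in jsonl_paths)
```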

> this code is prepared for slurm clusters only

Yes, the tokenization script was prepared for slurm clusters, but your changes sound correct.
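
For anyone making the same change, here is a minimal sketch of the non-slurm variant, assuming the original script selects one file per slurm array task; tokenize_file is a hypothetical stand-in for the per-file logic in tokenize_files.py:

```python
def tokenize_file(path: str) -> None:
    # hypothetical stand-in for the per-file logic in tokenize_files.py
    print(f"tokenizing {path}")

with open("all_jsonl.txt") as f:
    file_names = [line.strip() for line in f if line.strip()]

# On a slurm cluster, each array task would process a single file, e.g.
#     file_names = [file_names[int(os.environ["SLURM_ARRAY_TASK_ID"])]]
# Without slurm, simply loop over every file in the manifest.
for file_name in file_names:
    tokenize_file(file_name)
```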

> pickle.load() was called

this is a bug, thanks for catching it! It should have been just np.load:

```python
with open(file, "rb") as f:
    data = pickle.load(f)
for d in data:
    num_chunks += len(d) // chunk_size
```

should be something like

```python
data = np.load(file)
# data[0] stores the number of documents, followed by one token count
# per document, so each document contributes length // chunk_size chunks
for x in data[1:data[0] + 1]:
    num_chunks += x // chunk_size
```
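
To sanity-check the fix on a toy file (the layout here, a leading document count followed by per-document token counts, is my reading of the snippet above, and chunk_size is an arbitrary illustrative value):

```python
import numpy as np

chunk_size = 256  # illustrative; use whatever run_tokenize.sh configures

# Toy array in the assumed layout: data[0] = 3 documents, followed by
# the three per-document token counts.
np.save("tiny.npy", np.array([3, 700, 300, 512]))

data = np.load("tiny.npy")
num_chunks = sum(int(x) // chunk_size for x in data[1 : data[0] + 1])
print(num_chunks)  # 700//256 + 300//256 + 512//256 = 2 + 1 + 2 = 5
```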

Once you have completed this step and the sanity check passes, you should be ready to run the sampling step.

Please let me know if you run into any other problems!
