
Dataset Sanity Check bug report and seeking help for training data preparation #6

EigenTom opened this issue Dec 22, 2024 · 1 comment


@EigenTom

Thank you for your diligence and hard work on this project.

I am working on replicating the training procedure of CEPE with minimal training data. During the data preparation process, I encountered some questions regarding:

  1. downloading the training data (C4), and
  2. preparing the downloaded training data.

What I have done:

  1. I followed the guidance in the C4 download guide and successfully downloaded and obtained .jsonl files, as shown below:

[screenshot: listing of the downloaded .jsonl files]

with their content shown in the following screenshot:

[screenshot: contents of a .jsonl file]

  2. I placed all processed .jsonl files under ./data/redpajama/c4-rp/*.jsonl in the cloned CEPE repository, which I suspect may not be the correct file path, although I couldn't find any clear suggestion in the README.md in ./data about how to organize the downloaded and preprocessed datasets from different domains.

  3. I successfully ran get_all_jsonl.py and obtained the txt file shown below:

[screenshot: generated all_jsonl.txt]
  4. I tried to directly run bash run_tokenize.sh, where I encountered several issues:
    i. The code will only process one .jsonl file. In the tokenize_files.py it calls, line 74 suggests this code is prepared for slurm clusters only.

    ii. After I modified line 78 to allow the variable file_names to include all .jsonl files for preprocessing, I found the preprocessed files were saved as .npy files, but they cannot be correctly read by sanity_check.py:

[screenshot: error raised by sanity_check.py]

    iii. I looked into the code responsible for saving and reading the .npy files. It shows that in tokenize_files.py, line 70, np.save() is called, but in sanity_check.py, line 37, pickle.load() is called; a minimal reproduction is sketched after this list.
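
For reference, here is a minimal, self-contained reproduction of the mismatch (the file name example.npy is just illustrative):

```python
import pickle
import numpy as np

# Write a small array the way tokenize_files.py does (np.save).
np.save("example.npy", np.array([1, 2, 3]))

# Reading it back the way sanity_check.py does (pickle.load) fails,
# because a .npy header is not a valid pickle stream.
try:
    with open("example.npy", "rb") as f:
        pickle.load(f)
except Exception as e:  # typically pickle.UnpicklingError
    print(f"pickle.load failed: {type(e).__name__}: {e}")

# Reading with np.load() works as expected.
print(np.load("example.npy"))  # [1 2 3]
```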

The error stops me from processing the dataset further for training. I am wondering what causes the .npy files to appear corrupted, and what the best practice is for preprocessing the dataset.

Many thanks!

@howard-yen
Collaborator

Hello, thank you for your interest in our work, and I apologize for the late reply due to the recent holidays.

> how to organize the downloaded and preprocessed dataset from different domains

it doesn't really matter where you put these files, as long as all_jsonl.txt contains the path to each file
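
For illustration, a manifest like that could be generated with something like the sketch below. This is a hypothetical stand-in, not the repo's get_all_jsonl.py, and the data directory mirrors the layout from the report above:

```python
from pathlib import Path

# Hypothetical sketch: collect every .jsonl under the data root and
# write one path per line, which is all all_jsonl.txt needs to contain.
data_dir = Path("data/redpajama")  # assumed data root; any layout works
jsonl_paths = sorted(data_dir.rglob("*.jsonl"))

with open("all_jsonl.txt", "w") as f:
    f.writelines(f"{p}\n" for p in jsonl_paths)
```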

> this code is prepared for slurm clusters only

Yes, the tokenization script was prepared for slurm clusters, but your changes sound correct.
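
For anyone making the same change, here is a minimal sketch of the non-slurm variant, assuming the original script selects one file per slurm array task; tokenize_file is a hypothetical stand-in for the per-file logic in tokenize_files.py:

```python
def tokenize_file(path: str) -> None:
    # hypothetical stand-in for the per-file logic in tokenize_files.py
    print(f"tokenizing {path}")

with open("all_jsonl.txt") as f:
    file_names = [line.strip() for line in f if line.strip()]

# On a slurm cluster, each array task would process a single file, e.g.
#     file_names = [file_names[int(os.environ["SLURM_ARRAY_TASK_ID"])]]
# Without slurm, simply loop over every file in the manifest.
for file_name in file_names:
    tokenize_file(file_name)
```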

> pickle.load() was called

this is a bug, thanks for catching it! It should have been just np.load:

```python
with open(file, "rb") as f:
    data = pickle.load(f)
for d in data:
    num_chunks += len(d) // chunk_size
```

should be something like

```python
data = np.load(file)
# data[0] stores the number of documents, followed by one token count
# per document, so each document contributes length // chunk_size chunks
for x in data[1:data[0] + 1]:
    num_chunks += x // chunk_size
```
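
To sanity-check the fix on a toy file (the layout here, a leading document count followed by per-document token counts, is my reading of the snippet above, and chunk_size is an arbitrary illustrative value):

```python
import numpy as np

chunk_size = 256  # illustrative; use whatever run_tokenize.sh configures

# Toy array in the assumed layout: data[0] = 3 documents, followed by
# the three per-document token counts.
np.save("tiny.npy", np.array([3, 700, 300, 512]))

data = np.load("tiny.npy")
num_chunks = sum(int(x) // chunk_size for x in data[1 : data[0] + 1])
print(num_chunks)  # 700//256 + 300//256 + 512//256 = 2 + 1 + 2 = 5
```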

Once you have completed this step and the sanity check passes, you should be ready to run the sampling step.

Please let me know if you run into any other problems!
