
tokenization on-the-fly for long documents #106

Open
dangxuanhong opened this issue Jul 31, 2024 · 2 comments
dangxuanhong commented Jul 31, 2024

As we may have to deal with very long documents, up to millions of characters/tokens, the dataloader may need to be tested and revised as needed so that it can tokenize these long documents on the fly.

One approach to consider is splitting a long document into chunks, as sketched below.
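For illustration only, here is a minimal chunking sketch, assuming a Hugging Face tokenizer; `chunk_document` and the `max_tokens` value are placeholder names, not the project's actual dataloader API:

```python
# Minimal sketch (assumed Hugging Face tokenizer; names and sizes are placeholders,
# not the project's dataloader API): split one very long document into fixed-size
# token chunks so that no single sample has to be tokenized and held all at once.
from transformers import AutoTokenizer

def chunk_document(text: str, tokenizer, max_tokens: int = 4096):
    """Yield lists of token ids, each at most `max_tokens` long."""
    # For multi-million-token documents you may prefer to tokenize piecewise
    # (e.g. paragraph by paragraph) rather than calling the tokenizer once.
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    for start in range(0, len(token_ids), max_tokens):
        yield token_ids[start:start + max_tokens]

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
    long_doc = "lorem ipsum dolor sit amet " * 200_000
    for i, chunk in enumerate(chunk_document(long_doc, tok)):
        print(f"chunk {i}: {len(chunk)} tokens")
```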

thinkahead (Collaborator) commented Aug 7, 2024

The problem is not with long documents; I tried splitting the long documents into chunks.

Removing the SamplingDataSet that is used in multi-dataset handling allows us to bypass the problem.

The SamplingDataSet gives more heterogeneity than iterating through one entire file after another, and we do want document mixing between datasets. Although the SamplingDataSet shouldn't open every file, only one from each dataset, it seems to be opening all parquet files, causing the node to go out of memory.
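For reference, a minimal sketch of the intended one-open-file-per-dataset behavior, assuming pyarrow; `iter_dataset` and `sample_datasets` are illustrative names, not the repo's actual SamplingDataSet:

```python
# Minimal sketch of the intended one-open-file-per-dataset behavior
# (assumes pyarrow; not the repo's actual SamplingDataSet implementation).
import glob
import random
import pyarrow.parquet as pq

def iter_dataset(path_glob: str, column: str = "text"):
    """Stream documents from one dataset, opening one parquet file at a time."""
    for path in sorted(glob.glob(path_glob)):
        pf = pq.ParquetFile(path)  # only this file is open at this point
        for batch in pf.iter_batches(columns=[column]):
            yield from batch.to_pydict()[column]

def sample_datasets(dataset_globs, weights, seed=0):
    """Randomly interleave documents from several datasets according to weights."""
    rng = random.Random(seed)
    iters = [iter_dataset(g) for g in dataset_globs]
    weights = list(weights)
    while iters:
        i = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            # This dataset is exhausted; drop it and keep sampling from the rest.
            del iters[i]
            del weights[i]

# Example usage with two hypothetical dataset directories mixed 70/30:
# for doc in sample_datasets(["dataA/*.parquet", "dataB/*.parquet"], [0.7, 0.3]):
#     ...
```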

daviswer (Collaborator) commented
Checking on the status of this: the memory consumption ended up being related to how the legal-file detection was working, IIRC?
