
tokenization on-the-fly for long documents #106

Open
dangxuanhong opened this issue Jul 31, 2024 · 2 comments
dangxuanhong commented Jul 31, 2024

As we may have to deal with very long documents, up to millions of characters/tokens, the dataloader may need to be tested and revised as needed so that it can tokenize these long documents on the fly.

One approach to consider is splitting a long document into chunks, as sketched below.
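For illustration only, here is a minimal chunking sketch, assuming a Hugging Face tokenizer; `chunk_document` and the `max_tokens` value are placeholder names, not the project's actual dataloader API:

```python
# Minimal sketch (assumed Hugging Face tokenizer; names and sizes are placeholders,
# not the project's dataloader API): split one very long document into fixed-size
# token chunks so that no single sample has to be tokenized and held all at once.
from transformers import AutoTokenizer

def chunk_document(text: str, tokenizer, max_tokens: int = 4096):
    """Yield lists of token ids, each at most `max_tokens` long."""
    # For multi-million-token documents you may prefer to tokenize piecewise
    # (e.g. paragraph by paragraph) rather than calling the tokenizer once.
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    for start in range(0, len(token_ids), max_tokens):
        yield token_ids[start:start + max_tokens]

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
    long_doc = "lorem ipsum dolor sit amet " * 200_000
    for i, chunk in enumerate(chunk_document(long_doc, tok)):
        print(f"chunk {i}: {len(chunk)} tokens")
```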

thinkahead (Collaborator) commented Aug 7, 2024

The problem is not with long documents; I tried splitting the long documents into chunks.

Removing the SamplingDataSet that is used in multi-dataset handling allows us to bypass the problem.

The SamplingDataSet gives more heterogeneity than iterating through one entire file after another, and we do want document mixing between datasets. Although the SamplingDataSet shouldn't open every file, only one from each dataset, it seems to be opening all parquet files, causing the node to go out of memory.
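For reference, a minimal sketch of the intended one-open-file-per-dataset behavior, assuming pyarrow; `iter_dataset` and `sample_datasets` are illustrative names, not the repo's actual SamplingDataSet:

```python
# Minimal sketch of the intended one-open-file-per-dataset behavior
# (assumes pyarrow; not the repo's actual SamplingDataSet implementation).
import glob
import random
import pyarrow.parquet as pq

def iter_dataset(path_glob: str, column: str = "text"):
    """Stream documents from one dataset, opening one parquet file at a time."""
    for path in sorted(glob.glob(path_glob)):
        pf = pq.ParquetFile(path)  # only this file is open at this point
        for batch in pf.iter_batches(columns=[column]):
            yield from batch.to_pydict()[column]

def sample_datasets(dataset_globs, weights, seed=0):
    """Randomly interleave documents from several datasets according to weights."""
    rng = random.Random(seed)
    iters = [iter_dataset(g) for g in dataset_globs]
    weights = list(weights)
    while iters:
        i = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            # This dataset is exhausted; drop it and keep sampling from the rest.
            del iters[i]
            del weights[i]

# Example usage with two hypothetical dataset directories mixed 70/30:
# for doc in sample_datasets(["dataA/*.parquet", "dataB/*.parquet"], [0.7, 0.3]):
#     ...
```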

daviswer (Collaborator) commented
Checking on the status of this: the memory consumption ended up being related to how the legal-file detection was working, IIRC?
