As we may have to deal with very long documents, up to millions of characters/tokens, the dataloader may need to be tested and revised as needed when it tokenizes these long documents on the fly.
One approach to consider is splitting a long document into chunks.
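A minimal sketch of that chunking idea, assuming a HuggingFace-style tokenizer that returns `input_ids`; the function name and the `max_tokens`/`stride` parameters are illustrative, not part of the existing dataloader:

```python
# Sketch only: tokenize one long document on the fly and yield fixed-size
# token chunks (with optional overlap) instead of a single huge sequence.
# Assumes a HuggingFace-style tokenizer; names here are hypothetical.
from typing import Iterator, List


def chunk_long_document(text: str, tokenizer, max_tokens: int = 2048,
                        stride: int = 0) -> Iterator[List[int]]:
    """Yield token-id chunks of at most `max_tokens`, overlapping by `stride`."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - stride if max_tokens > stride else max_tokens
    for start in range(0, len(ids), step):
        chunk = ids[start:start + max_tokens]
        if chunk:
            yield chunk
```

Overlapping chunks (a nonzero `stride`) would keep some context across chunk boundaries, at the cost of duplicating a few tokens.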
Removing the SamplingDataSet that is used for multi-dataset handling allows us to bypass the problem.
The SamplingDataSet provides more heterogeneity than iterating through one entire file before moving to the next, and we do want documents to be mixed between datasets. However, although the SamplingDataSet should only open one file from each dataset rather than every file, it appears to be opening all of the parquet files, which causes the node to go out of memory.
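For reference, a rough sketch of the lazy behaviour we would expect: keep at most one parquet file open per dataset and interleave documents according to sampling weights. This uses `pyarrow.parquet` directly; the function names are hypothetical and not the actual SamplingDataSet API:

```python
# Sketch only: stream documents lazily so that, per dataset, at most one
# parquet file is open at a time, rather than opening every file up front.
import random
from typing import Iterator, List

import pyarrow.parquet as pq


def iter_dataset(files: List[str]) -> Iterator[dict]:
    """Stream documents file by file; only one file handle is open per dataset."""
    for path in files:
        pf = pq.ParquetFile(path)          # a single open file handle
        for rg in range(pf.num_row_groups):
            table = pf.read_row_group(rg)  # load one row group, not the whole file
            yield from table.to_pylist()


def sample_across_datasets(datasets: List[List[str]],
                           weights: List[float]) -> Iterator[dict]:
    """Interleave documents from several datasets according to sampling weights."""
    iters = [iter_dataset(files) for files in datasets]
    weights = list(weights)
    while iters:
        i = random.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[i])
        except StopIteration:              # this dataset is exhausted; drop it
            del iters[i]
            del weights[i]
```

Reading one row group at a time bounds memory by the largest row group rather than by the total size of all parquet files across datasets.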