-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add .collate
for .map(Collater)
#67
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Just left two minor comments. I think you should rebase it though. It seems you also have the doc changes included in this PR.
opt_overrides = *std::move(maybe_opt_overrides); | ||
|
||
map_fn f = collater(opts, std::move(opt_overrides)); | ||
element_mapper mapper{f, std::nullopt}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need an element_mapper
here? Since we are explicitly passing nullopt
, it feels superfluous. It should be possible to pass f
directly to map()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/fairseq2/data/data_pipeline.py
Outdated
The pipeline state can be persisted to the disk, allowing it to be resumed later. | ||
It is a Python Iterable, but it also contains the iterator states. | ||
Calling `iter` a second time while the first iterator is still being used | ||
will segfault or worse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the or worse
sounds a bit ambiguous to put in a doc. Can you elaborate ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I just wrote "will cause Undefined Beahvior", but I find that a bit unclear. "or worse" here means data corruption if you have two threads writing and reading from shared buffer concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure when exactly you observe this kind of behavior. Considering that the iterator is always called under GIL, from data pipeline's point of view, they are just a sequence of next() calls. It shouldn't cause any race condition or segfault. If you have a use case that we can reproduce, please share it. It should be treated as bug and we should fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't able to reproduce segfault anymore, but it is still a footgun to call __iter__
twice on the same object.
See eg the following test:
def test_two_iterators_interfere_with_each_others() -> None:
dataloader = (
fairseq2.data.text.read_text(FILE, rtrim=True)
.map(lambda line: torch.tensor([c for c in bytes(line)]))
.bucket_by_length([(10, 10), (5, 20), (1, 100)])
.prefetch(5)
.and_return()
)
it1 = iter(dataloader)
it2 = iter(dataloader)
l1 = list(itertools.islice(it1, 10))
l2 = list(itertools.islice(it2, 10))
# it1 and it2 are reading from the same dataloader,
# so they interfere with each others.
assert l1 != FILE_LINES[:10]
assert l2 != FILE_LINES[:10]
assert l1 != l2
can we merge this ? I've addressed all the feedback. |
As disscussed in #64 I propose to add a shortcut
.collate
to replace.map(Collater)
.This simplifies the API for the user, because they don't have to learn about the
Collater
class, and it's more discoverable because the.collate
function will be documented next to the other data pipeline ops.