Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add .collate for .map(Collater) #67

Merged
merged 4 commits into from
Oct 10, 2023
Merged

add .collate for .map(Collater) #67

merged 4 commits into from
Oct 10, 2023

Conversation

gwenzek
Copy link
Contributor

@gwenzek gwenzek commented Sep 26, 2023

As disscussed in #64 I propose to add a shortcut .collate to replace .map(Collater).
This simplifies the API for the user, because they don't have to learn about the Collater class, and it's more discoverable because the .collate function will be documented next to the other data pipeline ops.

@gwenzek gwenzek requested a review from cbalioglu as a code owner September 26, 2023 09:27
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 26, 2023
Base automatically changed from doc to main September 26, 2023 21:49
Copy link
Contributor

@cbalioglu cbalioglu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Just left two minor comments. I think you should rebase it though. It seems you also have the doc changes included in this PR.

opt_overrides = *std::move(maybe_opt_overrides);

map_fn f = collater(opts, std::move(opt_overrides));
element_mapper mapper{f, std::nullopt};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need an element_mapper here? Since we are explicitly passing nullopt, it feels superfluous. It should be possible to pass f directly to map().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

src/fairseq2/data/data_pipeline.py Show resolved Hide resolved
The pipeline state can be persisted to the disk, allowing it to be resumed later.
It is a Python Iterable, but it also contains the iterator states.
Calling `iter` a second time while the first iterator is still being used
will segfault or worse.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the or worse sounds a bit ambiguous to put in a doc. Can you elaborate ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I just wrote "will cause Undefined Beahvior", but I find that a bit unclear. "or worse" here means data corruption if you have two threads writing and reading from shared buffer concurrently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure when exactly you observe this kind of behavior. Considering that the iterator is always called under GIL, from data pipeline's point of view, they are just a sequence of next() calls. It shouldn't cause any race condition or segfault. If you have a use case that we can reproduce, please share it. It should be treated as bug and we should fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to reproduce segfault anymore, but it is still a footgun to call __iter__ twice on the same object.
See eg the following test:

def test_two_iterators_interfere_with_each_others() -> None:
    dataloader = (
        fairseq2.data.text.read_text(FILE, rtrim=True)
        .map(lambda line: torch.tensor([c for c in bytes(line)]))
        .bucket_by_length([(10, 10), (5, 20), (1, 100)])
        .prefetch(5)
        .and_return()
    )
    it1 = iter(dataloader)
    it2 = iter(dataloader)
    l1 = list(itertools.islice(it1, 10))
    l2 = list(itertools.islice(it2, 10))

    # it1 and it2 are reading from the same dataloader,
    # so they interfere with each others.
    assert l1 != FILE_LINES[:10]
    assert l2 != FILE_LINES[:10]
    assert l1 != l2

@gwenzek
Copy link
Contributor Author

gwenzek commented Oct 10, 2023

can we merge this ? I've addressed all the feedback.

@cbalioglu cbalioglu merged commit 8deaba4 into main Oct 10, 2023
18 of 19 checks passed
@cbalioglu cbalioglu deleted the collate branch October 10, 2023 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants