add `.collate` for `.map(Collater)` #67

gwenzek · 2023-09-26T09:27:24Z

As discussed in #64, I propose adding a shortcut, .collate, to replace .map(Collater).
This simplifies the API for the user, because they don't have to learn about the Collater class, and it's more discoverable because the .collate function will be documented next to the other data pipeline ops.
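To illustrate the shape of the proposal, here is a minimal, self-contained sketch of a builder whose `collate` method simply forwards to `map` with a collater. `ToyCollater` and `ToyPipelineBuilder` are hypothetical stand-ins written for this example; they are not fairseq2's classes, and the real `Collater` likely accepts different options.

```python
from typing import Any, Callable, Iterable, List


class ToyCollater:
    """Hypothetical collater: pads lists of ints in a batch to equal length."""

    def __init__(self, pad_value: int = 0) -> None:
        self.pad_value = pad_value

    def __call__(self, batch: List[List[int]]) -> List[List[int]]:
        width = max(len(seq) for seq in batch)
        return [seq + [self.pad_value] * (width - len(seq)) for seq in batch]


class ToyPipelineBuilder:
    """Hypothetical builder that applies a chain of ops to each element."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source
        self._ops: List[Callable[[Any], Any]] = []

    def map(self, fn: Callable[[Any], Any]) -> "ToyPipelineBuilder":
        self._ops.append(fn)
        return self

    def collate(self, pad_value: int = 0) -> "ToyPipelineBuilder":
        # Shortcut: equivalent to .map(ToyCollater(pad_value)), so callers
        # never need to import or construct the collater themselves.
        return self.map(ToyCollater(pad_value))

    def and_return(self) -> List[Any]:
        results = []
        for item in self._source:
            for op in self._ops:
                item = op(item)
            results.append(item)
        return results


batches = [[[1, 2, 3], [4]], [[5], [6, 7]]]
explicit = ToyPipelineBuilder(batches).map(ToyCollater(pad_value=0)).and_return()
shortcut = ToyPipelineBuilder(batches).collate(pad_value=0).and_return()
```

In this toy version both spellings produce the same padded batches; the shortcut only saves the user from learning about the collater class.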

cbalioglu

Looks good to me. Just left two minor comments. I think you should rebase it though. It seems you also have the doc changes included in this PR.

cbalioglu · 2023-09-26T21:55:00Z

fairseq2n/python/src/fairseq2n/bindings/data/data_pipeline.cc

+                    opt_overrides = *std::move(maybe_opt_overrides);
+
+                map_fn f = collater(opts, std::move(opt_overrides));
+                element_mapper mapper{f, std::nullopt};


Do we need an element_mapper here? Since we are explicitly passing nullopt, it feels superfluous. It should be possible to pass f directly to map().

najielhachem · 2023-10-03T09:59:17Z

src/fairseq2/data/data_pipeline.py

+        The pipeline state can be persisted to the disk, allowing it to be resumed later.
+        It is a Python Iterable, but it also contains the iterator states.
+        Calling `iter` a second time while the first iterator is still being used
+        will segfault or worse.


The "or worse" sounds a bit ambiguous for documentation. Can you elaborate?

Initially I just wrote "will cause Undefined Behavior", but I find that a bit unclear. "Or worse" here means data corruption if two threads write to and read from a shared buffer concurrently.

I am not sure when exactly you observe this kind of behavior. Since the iterator is always called under the GIL, from the data pipeline's point of view these are just a sequence of next() calls. It shouldn't cause any race condition or segfault. If you have a use case that we can reproduce, please share it. It should be treated as a bug and we should fix it.

I wasn't able to reproduce the segfault anymore, but it is still a footgun to call __iter__ twice on the same object.
See e.g. the following test:

    def test_two_iterators_interfere_with_each_others() -> None:
        dataloader = (
            fairseq2.data.text.read_text(FILE, rtrim=True)
            .map(lambda line: torch.tensor([c for c in bytes(line)]))
            .bucket_by_length([(10, 10), (5, 20), (1, 100)])
            .prefetch(5)
            .and_return()
        )
        it1 = iter(dataloader)
        it2 = iter(dataloader)
        l1 = list(itertools.islice(it1, 10))
        l2 = list(itertools.islice(it2, 10))
        # it1 and it2 are reading from the same dataloader,
        # so they interfere with each others.
        assert l1 != FILE_LINES[:10]
        assert l2 != FILE_LINES[:10]
        assert l1 != l2
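The general hazard can be reproduced without fairseq2. The sketch below uses a hypothetical `SharedStatePipeline` class, invented for this example, whose read position lives on the iterable itself. It is only an analogy: the real data pipeline also involves prefetching and bucketing, and its exact behavior when `iter` is called twice may differ.

```python
import itertools
from typing import Iterator, List


class SharedStatePipeline:
    """Hypothetical iterable that owns its read position.

    Because __iter__ returns self, every iterator obtained from the same
    object shares one position.
    """

    def __init__(self, items: List[int]) -> None:
        self._items = items
        self._pos = 0

    def __iter__(self) -> Iterator[int]:
        return self

    def __next__(self) -> int:
        if self._pos >= len(self._items):
            raise StopIteration
        value = self._items[self._pos]
        self._pos += 1
        return value


pipeline = SharedStatePipeline(list(range(10)))
it1 = iter(pipeline)
it2 = iter(pipeline)

# Both names refer to the same underlying state, so the second reader
# continues where the first one stopped instead of starting over.
first = list(itertools.islice(it1, 3))
second = list(itertools.islice(it2, 3))
```

Here `second` does not start from the beginning of the data, which is the kind of surprise the docstring is trying to warn about.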

gwenzek · 2023-10-10T08:11:09Z

Can we merge this? I've addressed all the feedback.

gwenzek requested a review from cbalioglu as a code owner September 26, 2023 09:27

facebook-github-bot added the CLA Signed label Sep 26, 2023

Base automatically changed from doc to main September 26, 2023 21:49

cbalioglu reviewed Sep 26, 2023

najielhachem reviewed Oct 3, 2023

gwenzek force-pushed the collate branch from 72f2586 to 787d4f5 Compare October 3, 2023 16:29

gwenzek added 4 commits October 10, 2023 10:10

add .collate for .map(Collater)

4b71687

nit

51f4d2f

hide test case

b4fb6d1

less spooky comment

9df6ebd

gwenzek force-pushed the collate branch from cf98f79 to 9df6ebd Compare October 10, 2023 08:10

cbalioglu approved these changes Oct 10, 2023

cbalioglu merged commit 8deaba4 into main Oct 10, 2023
18 of 19 checks passed

cbalioglu deleted the collate branch October 10, 2023 12:12

gwenzek commented Sep 26, 2023

cbalioglu left a comment

cbalioglu Sep 26, 2023

gwenzek Oct 10, 2023

najielhachem Oct 3, 2023

gwenzek Oct 3, 2023

cbalioglu Oct 3, 2023

gwenzek Oct 10, 2023

gwenzek commented Oct 10, 2023
