Revise datasets #955

Merged 1 commit on Jan 5, 2025
4 changes: 2 additions & 2 deletions doc/source/basics/assets.rst
@@ -37,7 +37,7 @@ How to Customize Your Assets

* Add the ``name``, ``dataset_family``, and ``data`` fields, which allows fairseq2 to find the corresponding dataset loader

-   * For more detailed information about ``dataset_family``, please refer to :doc:`Dataset Loaders </reference/api/fairseq2.datasets/loader>`
+   * For more detailed information about ``dataset_family``, please refer to :doc:`Dataset Loaders </reference/api/fairseq2.datasets/index>`

.. code-block:: yaml

@@ -105,4 +105,4 @@ In fairseq2, a model card is accessed via :py:class:`fairseq2.assets.AssetCard`.
See Also
--------

- - :doc:`Dataset Loaders </reference/api/fairseq2.datasets/loader>`
+ - :doc:`Datasets </reference/api/fairseq2.datasets/index>`
60 changes: 53 additions & 7 deletions doc/source/reference/api/fairseq2.datasets/index.rst
@@ -4,13 +4,59 @@ fairseq2.datasets

.. module:: fairseq2.datasets

- This module contains dataset loaders.
+ ===============
+ Dataset Loaders
+ ===============

- .. autoclasstree:: fairseq2.datasets
-     :full:
-     :zoom:
+ The dataset loader system in fairseq2 provides a flexible and extensible way to load different types of datasets.
+ The system uses the concept of dataset families to organize and manage different dataset formats.

- .. toctree::
-     :maxdepth: 1
+ Dataset Family
+ --------------

-     loader
+ A dataset family represents a specific format or structure of data that requires specialized loading logic.
+ Each dataset is associated with a family through the ``dataset_family`` field in its asset card.
+
+ Built-in Dataset Families
+ ^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ fairseq2 includes several built-in dataset families:
+
+ - ``generic_text``: For plain text datasets
+ - ``generic_parallel_text``: For parallel text/translation datasets
+ - ``generic_asr``: For automatic speech recognition datasets
+ - ``generic_speech``: For speech-only datasets
+ - ``generic_instruction``: For instruction-tuning datasets
+ - ``generic_preference_optimization``: For preference optimization datasets
+
+ Example Asset Card
+ ^^^^^^^^^^^^^^^^^^
+
+ .. code-block:: yaml
+
+     name: librispeech_asr
+     dataset_family: generic_asr
+     tokenizer: "https://example.com/tokenizer.model"
+     tokenizer_family: char_tokenizer
+
+ Usage Examples
+ --------------
+
+ Loading a Dataset Using Family
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ .. code-block:: python
+
+     from fairseq2.datasets import load_text_dataset
+
+     # Load using dataset name (will look up asset card)
+     dataset = load_text_dataset("my_text_dataset")
+
+     # Load using explicit asset card
+     card = AssetCard(name="custom_dataset", dataset_family="generic_text")
+     dataset = load_text_dataset(card)
+
+ See Also
+ --------
+
+ - :doc:`Text Dataset </reference/api/fairseq2.data/text/index>`
110 changes: 0 additions & 110 deletions doc/source/reference/api/fairseq2.datasets/loader.rst

This file was deleted.
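
Note on the dataset family concept documented in index.rst above: the idea that a card's ``dataset_family`` field selects the loading logic can be sketched as a small registry. This is only an illustration under stated assumptions, not fairseq2's implementation. The names `register_family`, `load_from_card`, the dict-based card, and the string "datasets" returned by the stand-in loaders are all hypothetical; only the family names and the `name`, `dataset_family`, and `data` fields come from the diff.

```python
from typing import Callable, Dict

# Hypothetical loader signature: takes a card (a plain dict here) and returns
# a description string standing in for a real dataset object.
DatasetLoaderFn = Callable[[dict], str]

# Hypothetical registry mapping a family name to its loader.
_LOADERS: Dict[str, DatasetLoaderFn] = {}


def register_family(family: str, loader: DatasetLoaderFn) -> None:
    """Associate a dataset family name with a loader function."""
    if family in _LOADERS:
        raise ValueError(f"family '{family}' is already registered")
    _LOADERS[family] = loader


def load_from_card(card: dict) -> str:
    """Pick the loader based on the card's `dataset_family` field."""
    family = card.get("dataset_family")
    if family is None:
        raise ValueError(f"card '{card.get('name')}' has no `dataset_family` field")
    loader = _LOADERS.get(family)
    if loader is None:
        raise LookupError(f"no loader registered for family '{family}'")
    return loader(card)


# Stand-in loaders for two of the family names listed in the documentation.
register_family("generic_text", lambda card: f"text dataset from {card['data']}")
register_family("generic_asr", lambda card: f"ASR dataset from {card['data']}")

card = {"name": "my_text_dataset", "dataset_family": "generic_text", "data": "/path/to/data"}
result = load_from_card(card)
print(result)
```

The point of the indirection is that adding a new format only requires registering another family; callers keep passing cards and never name a loader directly.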

1 change: 0 additions & 1 deletion src/fairseq2/__init__.py
@@ -12,7 +12,6 @@

# isort: split

- import fairseq2.datasets
import fairseq2.models

# isort: split
3 changes: 2 additions & 1 deletion src/fairseq2/chatbots/llama.py
@@ -14,7 +14,8 @@

from fairseq2.chatbots.chatbot import AbstractChatbot, Chatbot, ChatDialog, ChatMessage
from fairseq2.chatbots.handler import ChatbotHandler
- from fairseq2.data.text import LLaMA3Tokenizer, TextTokenEncoder, TextTokenizer
+ from fairseq2.data.text import TextTokenEncoder, TextTokenizer
+ from fairseq2.data.text.tokenizers.llama import LLaMA3Tokenizer
from fairseq2.generation import SequenceGenerator
from fairseq2.nn.utils.module import infer_device

84 changes: 5 additions & 79 deletions src/fairseq2/data/text/__init__.py
@@ -11,82 +11,8 @@
from fairseq2.data.text.converters import StrToTensorConverter as StrToTensorConverter
from fairseq2.data.text.text_reader import LineEnding as LineEnding
from fairseq2.data.text.text_reader import read_text as read_text
- from fairseq2.data.text.tokenizers.char_tokenizer import (
-     CHAR_TOKENIZER_FAMILY as CHAR_TOKENIZER_FAMILY,
- )
- from fairseq2.data.text.tokenizers.handler import (
-     StandardTextTokenizerHandler as StandardTextTokenizerHandler,
- )
- from fairseq2.data.text.tokenizers.handler import (
-     TextTokenizerHandler as TextTokenizerHandler,
- )
- from fairseq2.data.text.tokenizers.handler import (
-     TextTokenizerLoader as TextTokenizerLoader,
- )
- from fairseq2.data.text.tokenizers.handler import (
-     TextTokenizerNotFoundError as TextTokenizerNotFoundError,
- )
- from fairseq2.data.text.tokenizers.handler import (
-     get_text_tokenizer_family as get_text_tokenizer_family,
- )
- from fairseq2.data.text.tokenizers.llama import (
-     LLAMA_TOKENIZER_FAMILY as LLAMA_TOKENIZER_FAMILY,
- )
- from fairseq2.data.text.tokenizers.llama import LLaMA3Tokenizer as LLaMA3Tokenizer
- from fairseq2.data.text.tokenizers.mistral import (
-     MISTRAL_TOKENIZER_FAMILY as MISTRAL_TOKENIZER_FAMILY,
- )
- from fairseq2.data.text.tokenizers.nllb import (
-     NLLB_TOKENIZER_FAMILY as NLLB_TOKENIZER_FAMILY,
- )
- from fairseq2.data.text.tokenizers.nllb import NllbTokenizer as NllbTokenizer
- from fairseq2.data.text.tokenizers.ref import (
-     resolve_text_tokenizer_reference as resolve_text_tokenizer_reference,
- )
- from fairseq2.data.text.tokenizers.s2t_transformer import (
-     S2T_TRANSFORMER_TOKENIZER_FAMILY as S2T_TRANSFORMER_TOKENIZER_FAMILY,
- )
- from fairseq2.data.text.tokenizers.s2t_transformer import (
-     S2TTransformerTokenizer as S2TTransformerTokenizer,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     BasicSentencePieceTokenizer as BasicSentencePieceTokenizer,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     RawSentencePieceTokenizer as RawSentencePieceTokenizer,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     SentencePieceDecoder as SentencePieceDecoder,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     SentencePieceEncoder as SentencePieceEncoder,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     SentencePieceModel as SentencePieceModel,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     SentencePieceTokenizer as SentencePieceTokenizer,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     load_basic_sentencepiece as load_basic_sentencepiece,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     load_raw_sentencepiece as load_raw_sentencepiece,
- )
- from fairseq2.data.text.tokenizers.sentencepiece import (
-     vocab_info_from_sentencepiece as vocab_info_from_sentencepiece,
- )
- from fairseq2.data.text.tokenizers.static import (
-     load_text_tokenizer as load_text_tokenizer,
- )
- from fairseq2.data.text.tokenizers.tiktoken import TiktokenDecoder as TiktokenDecoder
- from fairseq2.data.text.tokenizers.tiktoken import TiktokenEncoder as TiktokenEncoder
- from fairseq2.data.text.tokenizers.tiktoken import (
-     TiktokenTokenizer as TiktokenTokenizer,
- )
- from fairseq2.data.text.tokenizers.tokenizer import (
-     AbstractTextTokenizer as AbstractTextTokenizer,
- )
- from fairseq2.data.text.tokenizers.tokenizer import TextTokenDecoder as TextTokenDecoder
- from fairseq2.data.text.tokenizers.tokenizer import TextTokenEncoder as TextTokenEncoder
- from fairseq2.data.text.tokenizers.tokenizer import TextTokenizer as TextTokenizer
+ from fairseq2.data.text.tokenizers import AbstractTextTokenizer as AbstractTextTokenizer
+ from fairseq2.data.text.tokenizers import TextTokenDecoder as TextTokenDecoder
+ from fairseq2.data.text.tokenizers import TextTokenEncoder as TextTokenEncoder
+ from fairseq2.data.text.tokenizers import TextTokenizer as TextTokenizer
+ from fairseq2.data.text.tokenizers import load_text_tokenizer as load_text_tokenizer
35 changes: 35 additions & 0 deletions src/fairseq2/data/text/tokenizers/__init__.py
@@ -0,0 +1,35 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ from __future__ import annotations
+
+ from fairseq2.data.text.tokenizers.handler import (
+     StandardTextTokenizerHandler as StandardTextTokenizerHandler,
+ )
+ from fairseq2.data.text.tokenizers.handler import (
+     TextTokenizerHandler as TextTokenizerHandler,
+ )
+ from fairseq2.data.text.tokenizers.handler import (
+     TextTokenizerLoader as TextTokenizerLoader,
+ )
+ from fairseq2.data.text.tokenizers.handler import (
+     TextTokenizerNotFoundError as TextTokenizerNotFoundError,
+ )
+ from fairseq2.data.text.tokenizers.handler import (
+     get_text_tokenizer_family as get_text_tokenizer_family,
+ )
+ from fairseq2.data.text.tokenizers.ref import (
+     resolve_text_tokenizer_reference as resolve_text_tokenizer_reference,
+ )
+ from fairseq2.data.text.tokenizers.static import (
+     load_text_tokenizer as load_text_tokenizer,
+ )
+ from fairseq2.data.text.tokenizers.tokenizer import (
+     AbstractTextTokenizer as AbstractTextTokenizer,
+ )
+ from fairseq2.data.text.tokenizers.tokenizer import TextTokenDecoder as TextTokenDecoder
+ from fairseq2.data.text.tokenizers.tokenizer import TextTokenEncoder as TextTokenEncoder
+ from fairseq2.data.text.tokenizers.tokenizer import TextTokenizer as TextTokenizer
5 changes: 4 additions & 1 deletion src/fairseq2/data/text/tokenizers/llama.py
@@ -91,9 +91,12 @@ def create_encoder(
              case "prompt_response":
                  prefix_tokens = []
                  suffix_tokens = [self._eos_token]
+             case "as_is":
+                 prefix_tokens = []
+                 suffix_tokens = []
              case _:
                  raise ValueError(
-                     f"`mode` must be 'default' or 'prompt', but is '{mode}' instead."
+                     f"`mode` must be one of the following values, but is '{mode}' instead: default, prompt, prompt_response, as_is"
                  )

          return TiktokenEncoder(
2 changes: 1 addition & 1 deletion src/fairseq2/data/text/tokenizers/sentencepiece.py
@@ -218,7 +218,7 @@ def create_encoder(
                  suffix_tokens = ["</s>"]
              case _:
                  raise ValueError(
-                     f"`mode` must be 'default' or 'prompt', but is '{mode}' instead."
+                     f"`mode` must be one of the following values, but is '{mode}' instead: default, prompt, prompt_response"
                  )

          return SentencePieceEncoder(
21 changes: 8 additions & 13 deletions src/fairseq2/datasets/__init__.py
@@ -11,17 +11,12 @@
from fairseq2.datasets.batching import StaticBatching as StaticBatching
from fairseq2.datasets.data_reader import DataPipelineReader as DataPipelineReader
from fairseq2.datasets.data_reader import DataReader as DataReader
from fairseq2.datasets.data_reader import SyncMode as SyncMode
+ from fairseq2.datasets.error import DataReadError as DataReadError
+ from fairseq2.datasets.error import DatasetError as DatasetError
- from fairseq2.datasets.loader import AbstractDatasetLoader as AbstractDatasetLoader
- from fairseq2.datasets.loader import DatasetLoader as DatasetLoader
- from fairseq2.datasets.loader import DelegatingDatasetLoader as DelegatingDatasetLoader
- from fairseq2.datasets.loader import get_dataset_family as get_dataset_family
- from fairseq2.datasets.loader import is_dataset_card as is_dataset_card
-
- # isort: split
-
- import fairseq2.datasets.asr
- import fairseq2.datasets.instruction
- import fairseq2.datasets.parallel_text
- import fairseq2.datasets.speech
- import fairseq2.datasets.text
+ from fairseq2.datasets.handler import DatasetHandler as DatasetHandler
+ from fairseq2.datasets.handler import DatasetLoader as DatasetLoader
+ from fairseq2.datasets.handler import DatasetNotFoundError as DatasetNotFoundError
+ from fairseq2.datasets.handler import StandardDatasetHandler as StandardDatasetHandler
+ from fairseq2.datasets.handler import get_dataset_family as get_dataset_family
+ from fairseq2.datasets.static import load_dataset as load_dataset