diff --git a/.doctrees/developer_notes/text_splitter.doctree b/.doctrees/developer_notes/text_splitter.doctree
index 8abb69fd..f30a86d7 100644
Binary files a/.doctrees/developer_notes/text_splitter.doctree and b/.doctrees/developer_notes/text_splitter.doctree differ
diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle
index 13199f33..5ece64e4 100644
Binary files a/.doctrees/environment.pickle and b/.doctrees/environment.pickle differ
diff --git a/_sources/developer_notes/text_splitter.rst.txt b/_sources/developer_notes/text_splitter.rst.txt
index 75cd177e..e0f7da55 100644
--- a/_sources/developer_notes/text_splitter.rst.txt
+++ b/_sources/developer_notes/text_splitter.rst.txt
@@ -1,11 +1,11 @@
 Text Splitter
-------------------
+======================

 .. .. admonition:: Author
 ..    :class: highlight

 ..    `Xiaoyi Gu `_

-In this tutorial, we will learn:
+In this tutorial, we will discuss:

 #. TextSplitter Overview

@@ -13,41 +13,51 @@ In this tutorial, we will learn:
 #. How to use it

+#. Chunking Tips
+
+#. Integration with Other Document Types and Customization Tips
+
 TextSplitter Overview
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------
 LLMs' context window is limited, and performance often drops with very long and noisy input. Shorter content is more manageable and fits memory constraints.
-The goal of the text splitter is to chunk large data into smaller ones, potentially improving embedding and retrieving.
+The goal of the ``TextSplitter`` is to chunk large data into smaller pieces, potentially improving embedding and retrieval.

 The ``TextSplitter`` is designed to efficiently process and chunk **plain text**. It leverages configurable separators to facilitate the splitting of :obj:`document object ` into smaller manageable document chunks.

 How does it work
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------
 ``TextSplitter`` first utilizes ``split_by`` to specify the text-splitting criterion and breaks the long text into smaller texts.
-Then we create a sliding window with length= ``chunk_size``. It moves at step= ``chunk_size`` - ``chunk_overlap``.
+Then we create a sliding window with ``length = chunk_size``. It moves at ``step = chunk_size - chunk_overlap``.
 The texts inside each window will get merged into a smaller chunk. The generated chunks from the split text will be returned.
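To make the split-then-merge windowing above concrete, here is a minimal, self-contained sketch of the logic. This is an illustration only, not LightRAG's actual implementation; the real ``TextSplitter`` may handle separator re-attachment and edge cases differently:

.. code-block:: python

    def sliding_window_split(text: str, separator: str, chunk_size: int, chunk_overlap: int) -> list:
        # Type 1 splitting: break the text on the separator, keeping it
        # attached so the chunks can reconstruct the original text.
        units = text.split(separator)
        units = [u + separator for u in units[:-1]] + units[-1:]
        step = chunk_size - chunk_overlap  # how far the window advances each move
        chunks = []
        for start in range(0, len(units), step):
            chunks.append("".join(units[start:start + chunk_size]))
            if start + chunk_size >= len(units):  # last window reached the end
                break
        return chunks

    print(sliding_window_split(
        "Hello, this is lightrag. Please implement your splitter here.",
        separator=" ", chunk_size=5, chunk_overlap=2,
    ))
    # ['Hello, this is lightrag. Please ',
    #  'lightrag. Please implement your splitter ',
    #  'your splitter here.']

With ``chunk_size=5`` and ``chunk_overlap=2``, each window repeats 2 units from the previous one and advances by ``5 - 2 = 3`` units, matching the description above.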
-**Splitting Types**
-
+Splitting Types
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ``TextSplitter`` supports 2 types of splitting.
+Here are illustrative examples; you will see the real output of ``TextSplitter`` in the usage section.

-* **Type 1:** Specify the exact text splitting point such as space<" "> and periods<".">. It is intuitive, for example, split_by "word":
+* **Type 1:** Specify the exact text splitting point, such as space <" "> or period <".">. E.g., if you set ``split_by = "word"``, you will get:

 ::

     "Hello, world!" -> ["Hello, ", "world!"]

-* **Type 2:** Use :class:`tokenizer `. It works as:
+* **Type 2:** Use :class:`core.tokenizer.Tokenizer`. It works as:

 ::

-    "Hello, world!" -> ['Hello', ',', ' world', '!']
+    "Hello, world!" -> ["Hello", ",", " world", "!"]
+
+Tokenization aligns with how models see text in the form of tokens (`Reference `_).
+Consider using tokenization when your embedding model works better on tokens, or when you will feed the data chunks to LLMs that are sensitive to token limits.
+
+.. note::
+    The tokenizer reflects the actual number of tokens the models consume and helps developers control budgets.
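To see concretely which tokens a model receives, you can count them yourself with OpenAI's ``tiktoken`` package (assumed installed here; ``cl100k_base`` is the default tokenization model mentioned in the note further below):

.. code-block:: python

    import tiktoken

    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode("Hello, world!")

    # Decode each token id individually to see the split points.
    print([encoding.decode([t]) for t in tokens])
    # ['Hello', ',', ' world', '!']  -> punctuation counts as tokens
    print(len(tokens))  # 4, the number of tokens the model actually consumes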
-**Definitions**
+Definitions
+^^^^^^^^^^^^^^^^^^^^^^^^^^^

 * **split_by** specifies the split rule, i.e. the smallest unit during splitting. We support ``"word"``, ``"sentence"``, ``"page"``, ``"passage"``, and ``"token"``. The splitter utilizes the corresponding separator from the ``SEPARATORS`` dictionary.
 For Type 1 splitting, we apply Python's ``str.split()`` to break the text.
@@ -55,14 +65,15 @@ For Type 1 splitting, we apply ``Python str.split()`` to break the text.
 * **SEPARATORS**: Maps ``split_by`` criteria to their exact text separators, e.g., spaces <" "> for "word" or periods <"."> for "sentence".

 .. note::
-    For option ``token``, its separator is "" because we directly split by a tokenizer, instead of text point.
+    For option ``token``, its separator is "" because we directly split by a tokenizer, instead of a specific text point.

-* **chunk_size** is the the maximum number of units in each chunk.
+* **chunk_size** is the maximum number of units in each chunk. To figure out which ``chunk_size`` works best for you, first preprocess your raw data, select a range of ``chunk_size`` values, and then run an evaluation on your use case with a set of queries.

 * **chunk_overlap** is the number of units that adjacent chunks share. Including context at the borders prevents sudden meaning shifts in the text between sentences/contexts, especially in sentiment analysis.

 Here are examples of how ``split_by`` and ``chunk_size`` work with ``chunk_overlap``.
-Document Text:
+
+Input Document Text:

 ::
@@ -90,7 +101,7 @@ Document Text:
    - 2
    - "Hello, this is l", "is lightrag.", "trag. Please implement your", "implement your splitter here."

-When splitting by ``word`` with ``chunk_size`` = 5 and ``chunk_overlap`` = 2,
+When splitting by ``word`` with ``chunk_size = 5`` and ``chunk_overlap = 2``,
 each chunk will repeat 2 words from the previous chunk. These 2 words are set by ``chunk_overlap``.
 This means each chunk differs from its predecessor by ``5-2=3`` words (split units).
@@ -99,12 +110,15 @@ For example, the tokenizer transforms ``lightrag`` to ['l', 'igh', 'trag']. So t
 .. note::
     ``chunk_overlap`` should always be smaller than ``chunk_size``; otherwise the window won't move and the splitting gets stuck.
-    When ``split_by`` = ``token``, the punctuation is considered as a token.
+    Our default tokenization model is ``cl100k_base``. If you use tokenization (``split_by = "token"``), punctuation marks are also counted as tokens.

 How to use it
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+-----------------------------
 You only need to specify the arguments and input your documents as follows:

+Split by word
+^^^^^^^^^^^^^^^^^^
+
 .. code-block:: python

     from lightrag.components.data_process.text_splitter import TextSplitter
@@ -134,13 +148,79 @@ What you need is to specify the arguments and input your documents this way:
     # Document(id=ca0af45b-4f88-49b5-97db-163da9868ea4, text='text. Even more text to ', meta_data=None, vector=[], parent_doc_id=doc1, order=1, score=None)
     # Document(id=e7b617b2-3927-4248-afce-ec0fc247ac8b, text='to illustrate.', meta_data=None, vector=[], parent_doc_id=doc1, order=2, score=None)

+Split by token
+^^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+    from lightrag.components.data_process.text_splitter import TextSplitter
+    from lightrag.core.types import Document
+    import tiktoken  # tokenizer backend used by split_by="token"
+
+    # Configure the splitter settings
+    text_splitter = TextSplitter(
+        split_by="token",
+        chunk_size=5,
+        chunk_overlap=0
+    )
+
+    doc = Document(
+        text="Example text. More example text. Even more text to illustrate.",
+        id="doc1"
+    )
+
+    splitted_docs = text_splitter.call(documents=[doc])
+
+    for doc in splitted_docs:
+        print(doc)
+
+    # Output:
+    # Document(id=27cec433-b400-4f11-8871-e386e774d150, text='Example text. More example', meta_data=None, vector=[], parent_doc_id=doc1, order=0, score=None)
+    # Document(id=8905dc5f-8be5-4ca4-88b1-2ae492258b53, text=' text. Even more text', meta_data=None, vector=[], parent_doc_id=doc1, order=1, score=None)
+    # Document(id=ba8e1e23-82fb-4aa8-bfc5-e22084984bb9, text=' to illustrate.', meta_data=None, vector=[], parent_doc_id=doc1, order=2, score=None)
+
+Chunking Tips
+-----------------------------
+Choosing the proper chunking strategy involves several key factors:
+
+- **Content Type**: Adapt your chunking approach to match the specific type of content, such as articles, books, social media posts, or genetic sequences.
+- **Embedding Model**: Select a chunking method that aligns with your embedding model's training to optimize performance. For example, sentence-based splitting pairs well with `sentence-transformer `_ models, while token-based splitting is ideal for OpenAI's `text-embedding-ada-002 `_.
+- **Query Dynamics**: The length and complexity of queries should influence your chunking strategy. Larger chunks may be better for shorter queries that lack detailed specifications and need broad context, whereas longer, more specific queries may achieve higher accuracy with finer granularity.
+- **Application of Results**: The application, whether it be semantic search, question answering, or summarization, dictates the appropriate chunking method, especially considering the limitations of context windows in large language models (LLMs).
+- **System Integration**: Efficient chunking aligns with system capabilities. For example, `Full-Text Search:` use larger chunks so algorithms can explore broader contexts effectively, such as searching books based on extensive excerpts or chapters. `Granular Search Systems:` employ smaller chunks to precisely retrieve information relevant to user queries. For example, if a user asks, "How do I reset my password?", the system can retrieve the specific sentence or paragraph addressing that action directly.
+
+
+Chunking Strategies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Fixed-Size Chunking
+""""""""""""""""""""""""""
+
+- Ideal for content requiring uniform chunk sizes, like genetic sequences or standardized data entries. This method, which splits text into equal-sized word blocks, is simple and efficient but may compromise semantic coherence and risks breaking important contextual links.
+
+Content-Aware Chunking
+""""""""""""""""""""""""""
+
+- **Split by Sentence**: Suited to texts that need a deep understanding of complete sentences, such as academic articles or medical reports. This method maintains grammatical integrity and contextual flow.
+- **Split by Passage**: Useful for maintaining the structure and coherence of large documents. Supports detailed tasks like question answering and summarization by focusing on specific text sections.
+- **Split by Page**: Effective for large documents where each page contains distinct information, such as legal or academic texts, facilitating precise navigation and information extraction.
+
+Token-Based Splitting
+""""""""""""""""""""""""""
+
+- Beneficial for scenarios where embedding models have strict token limitations. This method divides text based on token count, optimizing compatibility with LLMs like GPT, though it may slow down processing due to model complexities.
+
+Upcoming Splitting Features
+""""""""""""""""""""""""""""""""
+
+- **Semantic Splitting**: Focuses on grouping texts by meaning rather than structure, enhancing the relevance for thematic searches or advanced contextual retrieval tasks.
+
 Integration with Other Document Types
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+----------------------------------------------------------
 This functionality is ideal for segmenting texts into sentences, words, pages, or passages, which can then be processed further for NLP applications.
-For **PDFs**, developers will need to extract the text before using the splitter. Libraries like ``PyPDF2`` or ``PDFMiner`` can be utilized for this purpose.
+For `PDFs`, developers will need to extract the text before using the splitter. Libraries like ``PyPDF2`` or ``PDFMiner`` can be utilized for this purpose.
 ``LightRAG``'s future implementations will introduce splitters for ``JSON``, ``HTML``, ``markdown``, and ``code``.
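For example, a PDF can be pre-processed into plain text and then chunked like any other document. The following is a hedged sketch, assuming ``PyPDF2`` 3.x's ``PdfReader`` API and a hypothetical input file ``report.pdf``; it is not an official LightRAG integration:

.. code-block:: python

    from PyPDF2 import PdfReader

    from lightrag.components.data_process.text_splitter import TextSplitter
    from lightrag.core.types import Document

    # Extract plain text page by page; extract_text() may return None
    # for image-only pages, so fall back to an empty string.
    reader = PdfReader("report.pdf")  # hypothetical input file
    pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Chunk the extracted text like any other plain text.
    splitter = TextSplitter(split_by="sentence", chunk_size=3, chunk_overlap=1)
    chunks = splitter.call(documents=[Document(text=pdf_text, id="report_pdf")])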
 Customization Tips
-~~~~~~~~~~~~~~~~~~~~~
-You can also customize the ``SEPARATORS``. For example, by defining ``SEPARATORS`` = {"question": "?"} and setting ``split_by`` = "question", the document will be split at each ``?``, ideal for processing text structured
-as a series of questions. If you need to customize :class:`tokenizer `, please check `Reference `_.
+-----------------------------
+You can also customize the ``SEPARATORS``. For example, by defining ``SEPARATORS = {"question": "?"}`` and setting ``split_by = "question"``, the document will be split at each ``?``, which is ideal for processing text structured
+as a series of questions. If you need to customize :class:`core.tokenizer.Tokenizer`, please check `Reference `_.
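As an illustration of the question-splitting customization above, the sketch below assumes the module-level ``SEPARATORS`` dictionary in ``lightrag.components.data_process.text_splitter`` can be extended at runtime before constructing the splitter; the exact extension hook may differ, so treat this as a sketch rather than the library's confirmed API:

.. code-block:: python

    from lightrag.components.data_process import text_splitter
    from lightrag.core.types import Document

    # Assumption: the SEPARATORS dict is mutable and consulted at call time.
    text_splitter.SEPARATORS["question"] = "?"

    splitter = text_splitter.TextSplitter(split_by="question", chunk_size=1, chunk_overlap=0)
    doc = Document(text="How do I reset my password? Where can I find my settings?", id="faq")
    for chunk in splitter.call(documents=[doc]):
        print(chunk.text)  # one question per chunk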
diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt
index 519db7a7..64cb9026 100644
--- a/_sources/index.rst.txt
+++ b/_sources/index.rst.txt
@@ -235,4 +235,4 @@ We are building a library that unites the two worlds, forming a healthy LLM appl
 .. :caption: For Contributors
 .. :hidden:

-.. contributor/index
+.. contributor/index
\ No newline at end of file