Text Splitter
======================
.. .. admonition:: Author
.. :class: highlight
.. `Xiaoyi Gu <https://github.com/Alleria1809>`_
In this tutorial, we will discuss:

#. TextSplitter Overview

#. How does it work

#. How to use it

#. Chunking Tips

#. Integration with Other Document Types and Customization Tips

TextSplitter Overview
-----------------------------
LLMs’ context windows are limited, and performance often drops with very long, noisy input.
Shorter content is more manageable and fits within memory constraints.
The goal of the ``TextSplitter`` is to chunk large data into smaller pieces, potentially improving embedding and retrieval.

The ``TextSplitter`` is designed to efficiently process and chunk **plain text**.
It leverages configurable separators to split a :obj:`document object <core.types.Document>` into smaller, manageable document chunks.

How does it work
-----------------------------
``TextSplitter`` first utilizes ``split_by`` to specify the text-splitting criterion and breaks the long text into smaller texts.
Then we create a sliding window of length ``chunk_size`` that moves at step ``chunk_size - chunk_overlap``.
The texts inside each window are merged into a smaller chunk, and the chunks generated from the split text are returned.
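
As a rough illustration of this sliding window, here is a minimal sketch; the real ``TextSplitter`` also handles separators and edge cases, so treat it as an illustration rather than the library's implementation.

.. code-block:: python

    # A minimal sketch of the sliding-window merge described above.
    def merge_windows(units: list[str], chunk_size: int, chunk_overlap: int) -> list[str]:
        step = chunk_size - chunk_overlap  # how far the window moves each time
        return ["".join(units[i : i + chunk_size]) for i in range(0, len(units), step)]

    # Units produced by split_by="word" keep their trailing separator.
    units = ["Hello, ", "this ", "is ", "lightrag. ", "Please ", "implement ", "your ", "splitter ", "here."]
    print(merge_windows(units, chunk_size=5, chunk_overlap=2))
    # ['Hello, this is lightrag. Please ', 'lightrag. Please implement your splitter ', 'your splitter here.']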


Splitting Types
^^^^^^^^^^^^^^^^^^^^^^^^^^^
``TextSplitter`` supports two types of splitting.
Below are simplified examples; you will see the real output of ``TextSplitter`` in the usage section.

* **Type 1:** Specify the exact text splitting point, such as a space (``" "``) or a period (``"."``). For example, if you set ``split_by = "word"``, you will get:

::

"Hello, world!" -> ["Hello, " ,"world!"]

* **Type 2:** Use the :class:`core.tokenizer.Tokenizer`. It works as follows:

::

"Hello, world!" -> ['Hello', ',', ' world', '!']
"Hello, world!" -> ["Hello", ",", " world", "!"]

Tokenization aligns with how models see text in the form of tokens (`Reference <https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb>`_).
Consider using tokenization when your embedding model works better with tokens, or when you will feed the chunks to LLMs that are sensitive to token limits.

.. note::
    The tokenizer reflects the real number of tokens the models take in and helps developers control budgets.
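
To see exactly what Type 2 splitting operates on, here is a small sketch using ``tiktoken`` with the ``cl100k_base`` encoding (the default tokenization model mentioned later in this tutorial):

.. code-block:: python

    import tiktoken

    # Encode the text, then decode each token id individually to see the
    # token pieces the model actually works with.
    encoding = tiktoken.get_encoding("cl100k_base")
    token_ids = encoding.encode("Hello, world!")
    print([encoding.decode([tid]) for tid in token_ids])
    # ['Hello', ',', ' world', '!']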

Definitions
^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **split_by** specifies the split rule, i.e., the smallest unit during splitting. We support ``"word"``, ``"sentence"``, ``"page"``, ``"passage"``, and ``"token"``. The splitter utilizes the corresponding separator from the ``SEPARATORS`` dictionary.
  For Type 1 splitting, we apply Python's ``str.split()`` to break the text.

* **SEPARATORS**: maps ``split_by`` criteria to their exact text separators, e.g., a space (``" "``) for ``"word"`` or a period (``"."``) for ``"sentence"``.

.. note::
    For the ``"token"`` option, the separator is ``""`` because we split directly with a tokenizer rather than at a specific text point.

* **chunk_size** is the maximum number of units in each chunk. To figure out which ``chunk_size`` works best for you, first preprocess your raw data, select a range of candidate ``chunk_size`` values, and then run the evaluation on your use case with a set of representative queries (see the sketch after these definitions).

* **chunk_overlap** is the number of units that adjacent chunks share. Including context at the borders prevents sudden shifts in meaning between chunks, which is especially helpful in tasks like sentiment analysis.
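
As a hedged sketch of that ``chunk_size`` sweep (the candidate sizes, overlap, and evaluation step below are placeholders to adapt to your own pipeline):

.. code-block:: python

    from lightrag.components.data_process.text_splitter import TextSplitter
    from lightrag.core.types import Document

    doc = Document(text="...your preprocessed raw data...", id="doc1")

    for candidate_size in [128, 256, 512]:  # placeholder candidates
        splitter = TextSplitter(split_by="word", chunk_size=candidate_size, chunk_overlap=16)
        chunks = splitter.call(documents=[doc])
        # Embed these chunks, run retrieval over a set of representative
        # queries, and record quality metrics for this candidate size.
        print(candidate_size, len(chunks))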

Here are examples of how ``split_by`` and ``chunk_size`` work together with ``chunk_overlap``.

Input Document Text:

::

    Hello, this is lightrag. Please implement your splitter here.
- 2
- "Hello, this is l", "is lightrag.", "trag. Please implement your", "implement your splitter here."

When splitting by ``word`` with ``chunk_size = 5`` and ``chunk_overlap = 2``,
each chunk repeats 2 words from the previous chunk, as set by ``chunk_overlap``.
This means each chunk advances by ``5 - 2 = 3`` words (split units) relative to its predecessor.

For example, the tokenizer transforms ``lightrag`` into ``['l', 'igh', 'trag']``, so token-based chunks can split a single word across chunk boundaries.

.. note::
    ``chunk_overlap`` should always be smaller than ``chunk_size``, otherwise the window won't move and the splitting gets stuck.
    Our default tokenization model is ``cl100k_base``. If you use tokenization (``split_by = "token"``), punctuation is also counted as tokens.

How to use it
-----------------------------
All you need to do is specify the arguments and pass in your documents, as shown below:

Split by word
^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from lightrag.components.data_process.text_splitter import TextSplitter
    from lightrag.core.types import Document

    # Configure the splitter settings
    text_splitter = TextSplitter(
        split_by="word",
        chunk_size=5,
        chunk_overlap=1
    )

    doc = Document(
        text="Example text. More example text. Even more text to illustrate.",
        id="doc1"
    )

    splitted_docs = text_splitter.call(documents=[doc])

    for doc in splitted_docs:
        print(doc)

    # Output:
    # Document(id=..., text='Example text. More example text. ', meta_data=None, vector=[], parent_doc_id=doc1, order=0, score=None)
    # Document(id=ca0af45b-4f88-49b5-97db-163da9868ea4, text='text. Even more text to ', meta_data=None, vector=[], parent_doc_id=doc1, order=1, score=None)
    # Document(id=e7b617b2-3927-4248-afce-ec0fc247ac8b, text='to illustrate.', meta_data=None, vector=[], parent_doc_id=doc1, order=2, score=None)
Split by token
^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from lightrag.components.data_process.text_splitter import TextSplitter
    from lightrag.core.types import Document
    import tiktoken

    # Configure the splitter settings
    text_splitter = TextSplitter(
        split_by="token",
        chunk_size=5,
        chunk_overlap=0
    )

    doc = Document(
        text="Example text. More example text. Even more text to illustrate.",
        id="doc1"
    )

    splitted_docs = text_splitter.call(documents=[doc])

    for doc in splitted_docs:
        print(doc)

    # Output:
    # Document(id=27cec433-b400-4f11-8871-e386e774d150, text='Example text. More example', meta_data=None, vector=[], parent_doc_id=doc1, order=0, score=None)
    # Document(id=8905dc5f-8be5-4ca4-88b1-2ae492258b53, text=' text. Even more text', meta_data=None, vector=[], parent_doc_id=doc1, order=1, score=None)
    # Document(id=ba8e1e23-82fb-4aa8-bfc5-e22084984bb9, text=' to illustrate.', meta_data=None, vector=[], parent_doc_id=doc1, order=2, score=None)
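
Note that with ``split_by = "token"``, chunk text is reconstructed by decoding tokens, so a chunk may begin with a leading space, as the output above shows, or even split a word across chunks.
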
Chunking Tips
-----------------------------
Choosing the proper chunking strategy involves considering several key factors:

- **Content Type**: Adapt your chunking approach to match the specific type of content, such as articles, books, social media posts, or genetic sequences.
- **Embedding Model**: Select a chunking method that aligns with your embedding model's training to optimize performance. For example, sentence-based splitting pairs well with `sentence-transformer <https://huggingface.co/sentence-transformers>`_ models, while token-based splitting is ideal for OpenAI's `text-embedding-ada-002 <https://openai.com/index/new-and-improved-embedding-model>`_.
- **Query Dynamics**: The length and complexity of queries should influence your chunking strategy. Larger chunks may be better for shorter queries that lack detailed specifications and need broad context, whereas longer, more specific queries might achieve higher accuracy with finer granularity.
- **Application of Results**: The application, whether it be semantic search, question answering, or summarization, dictates the appropriate chunking method, especially considering the limitations of context windows in large language models (LLMs).
- **System Integration**: Efficient chunking aligns with system capabilities. For *full-text search*, use larger chunks so algorithms can explore broader contexts effectively, e.g., searching books based on extensive excerpts or chapters. For *granular search systems*, employ smaller chunks to precisely retrieve information relevant to a user's query, e.g., if a user asks, "How do I reset my password?", the system can retrieve the specific sentence or paragraph addressing that action directly.


Chunking Strategies
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fixed-Size Chunking
""""""""""""""""""""""""""

- Ideal for content requiring uniform chunk sizes like genetic sequences or standardized data entries. This method, which involves splitting text into equal-sized word blocks, is simple and efficient but may compromise semantic coherence and risk breaking important contextual links.

Content-Aware Chunking
""""""""""""""""""""""""""

- **Split by Sentence**: Suited to texts needing a deep understanding of complete sentences, such as academic articles or medical reports. This method maintains grammatical integrity and contextual flow.
- **Split by Passage**: Useful for maintaining the structure and coherence of large documents. Supports detailed tasks like question answering and summarization by focusing on specific text sections.
- **Split by Page**: Effective for large documents where each page contains distinct information, such as legal or academic texts, facilitating precise navigation and information extraction.

Token-Based Splitting
""""""""""""""""""""""""""

- Beneficial for scenarios where embedding models have strict token limitations. This method divides text based on token count, optimizing compatibility with LLMs like GPT, though it may slow down processing due to model complexities.

Upcoming Splitting Features
""""""""""""""""""""""""""""""""

- **Semantic Splitting**: Focuses on grouping texts by meaning rather than structure, enhancing the relevance for thematic searches or advanced contextual retrieval tasks.

Integration with Other Document Types
----------------------------------------------------------
This functionality is ideal for segmenting texts into sentences, words, pages, or passages, which can then be processed further for NLP applications.
For `PDFs`, developers will need to extract the text before using the splitter. Libraries like ``PyPDF2`` or ``PDFMiner`` can be used for this purpose.
``LightRAG``'s future implementations will introduce splitters for ``JSON``, ``HTML``, ``markdown``, and ``code``.
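
As a minimal sketch of that PDF workflow (assuming ``PyPDF2`` is installed and ``example.pdf`` is a local file; the chunking parameters are placeholders):

.. code-block:: python

    from PyPDF2 import PdfReader

    from lightrag.components.data_process.text_splitter import TextSplitter
    from lightrag.core.types import Document

    # Extract plain text from every PDF page before splitting.
    reader = PdfReader("example.pdf")
    full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    text_splitter = TextSplitter(split_by="word", chunk_size=400, chunk_overlap=50)
    pdf_chunks = text_splitter.call(documents=[Document(text=full_text, id="pdf_doc1")])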

Customization Tips
-----------------------------
You can also customize the ``SEPARATORS``. For example, defining ``SEPARATORS = {"question": "?"}`` and setting ``split_by = "question"`` will split the document at each ``?``, which is ideal for processing text structured
as a series of questions. If you need to customize the :class:`core.tokenizer.Tokenizer`, please check the `Reference <https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb>`_.
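
A hedged sketch of that customization (this assumes ``SEPARATORS`` is importable from the splitter module and accepts new keys; check the library source for the exact extension point):

.. code-block:: python

    from lightrag.components.data_process.text_splitter import SEPARATORS, TextSplitter
    from lightrag.core.types import Document

    # Register a custom split point: break the text at each "?".
    SEPARATORS["question"] = "?"

    splitter = TextSplitter(split_by="question", chunk_size=1, chunk_overlap=0)
    doc = Document(text="How do I reset my password? Where can I find my settings?", id="faq1")
    for chunk in splitter.call(documents=[doc]):
        print(chunk.text)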