add returnEmbedding, returnTokenLength, and chunkPrefix options
jparkerweb committed Nov 1, 2024
1 parent 927672b commit 3af2806
Showing 11 changed files with 358 additions and 168 deletions.
2 changes: 2 additions & 0 deletions .cursorignore
@@ -1,7 +1,9 @@
# directories to ignore during indexing
.git/
.github/
node_modules/
models/
example/

# file patterns to ignore during indexing
package-lock.json
58 changes: 58 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,58 @@
# Changelog

All notable changes to this project will be documented in this file.

## [2.0.0] - 2024-11-01
### Added
- Added `returnEmbedding` option to `chunkit` and `cramit` functions to include embeddings in the output.
- Added `returnTokenLength` option to `chunkit` and `cramit` functions to include token length in the output.
- Added `chunkPrefix` option to prefix each chunk with a task instruction (e.g., "search_document: ", "search_query: ").
- Updated README to document new options and add RAG tips for using `chunkPrefix` with embedding models that support task prefixes.

### ⚠️ Breaking Changes
- Returned array of chunks is now an array of objects with `text`, `embedding`, and `tokenLength` properties. Previous versions returned an array of strings.
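
  A minimal migration sketch (assuming `chunks` holds the result of a `chunkit` call):

  ```javascript
  // before (1.x): chunkit resolved to an array of strings
  // const firstChunkText = chunks[0];

  // after (2.0.0): chunkit resolves to an array of objects
  const firstChunkText = chunks[0].text;
  ```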

---

## [1.5.1] - 2024-11-01
### Fixed
- Fixed sentence splitter logic in `cramit` function.

---

## [1.5.0] - 2024-10-11
### Updated
- Replaced sentence splitter with a new algorithm that is more accurate and faster.

---

## [1.4.0] - 2024-09-24
### Added
- Broke the library up into modules for easier maintenance and future updates.

---

## [1.3.0] - 2024-09-09
### Added
- Added a download script to pre-download models for users who want to pre-package them with their application.
- Added model path/cache directory options.

### Updated
- Updated package dependencies.
- Updated example scripts.
- Updated README.

---

## [1.1.0] - 2024-05-09
### Added
- Added dynamic combining of final chunks based on a similarity threshold.

### Updated
- Improved initial chunking algorithm to reduce the number of chunks.

---

## [1.0.0] - 2024-02-29
### Added
- Initial release with basic chunking functionality.
41 changes: 38 additions & 3 deletions README.md
@@ -44,6 +44,17 @@ const myChunks = await chunkit(text, chunkitOptions);
- `onnxEmbeddingModelQuantized`: Boolean (optional, default `true`) - Indicates whether to use a quantized version of the embedding model.
- `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`).
- `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`).
- `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
- `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
- `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
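
A minimal sketch combining the new options (the import path is an assumption; adjust it to however you load this package):

```javascript
import { chunkit } from 'semantic-chunking'; // assumed package name

const text = "Some long document text...";
const myChunks = await chunkit(text, {
    returnEmbedding: true,          // attach an embedding vector to each chunk
    returnTokenLength: true,        // attach a token count to each chunk
    chunkPrefix: "search_document", // applied as "search_document: " before embedding
});
```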

## Output

The output is an array of chunks, each containing the following properties:

- `text`: String - The chunked text.
- `embedding`: Array - The embedding vector (if `returnEmbedding` is `true`).
- `tokenLength`: Integer - The token length (if `returnTokenLength` is `true`).
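
With both flags enabled, each element of the returned array looks roughly like this (values are illustrative):

```javascript
{
    text: "search_document: The quick brown fox jumps over the lazy dog.", // prefix applied via chunkPrefix
    embedding: [0.0123, -0.0456, 0.0789 /* ...one float per model dimension */],
    tokenLength: 17
}
```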

## Workflow

@@ -89,7 +100,7 @@ main();

```

- Look at the `example.js` file in the root of this project for a more complex example of using all the optional parameters.
+ Look at the `example/example-chunkit.js` file for a more complex example of using all the optional parameters.


## Tuning
@@ -164,6 +175,8 @@ The behavior of the `chunkit` function can be finely tuned using several optional parameters.

| Model | Quantized | Link | Size |
| -------------------------------------------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------- |
| nomic-ai/nomic-embed-text-v1.5 | true | [https://huggingface.co/nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | 138 MB |
| nomic-ai/nomic-embed-text-v1.5 | false | [https://huggingface.co/nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | 548 MB |
| Xenova/all-MiniLM-L6-v2 | true | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 23 MB |
| Xenova/all-MiniLM-L6-v2 | false | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 90.4 MB |
| Xenova/paraphrase-multilingual-MiniLM-L12-v2 | true | [https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2) | 118 MB |
@@ -195,7 +208,7 @@ main();

```

- Look at the `example2.js` file in the root of this project for a more complex example of using all the optional parameters.
+ Look at the `example/example-cramit.js` file for a more complex example of using all the optional parameters.

### Tuning

@@ -238,6 +251,28 @@ Fill out the `tools/download-models.list.json` file with a list of models you want to download

---

## 🔍 RAG Tip!

If you are using this library for a RAG application, consider using the `chunkPrefix` option to add a task prefix to each chunk. For embedding models that support task prefixes (such as nomic-embed-text-v1.5), this can improve embedding quality and retrieval accuracy, which in turn reduces the amount of context you need to pass to the LLM.

Chunk your large document like this:
```javascript
const text = await fs.promises.readFile('./large-document.txt', 'utf8');
const myDocumentChunks = await chunkit(text, { chunkPrefix: "search_document" });
```

Get your search queries ready like this (use cramit for a quick large chunk):
```javascript
const mySearchQuery = "What is the capital of France?";
const mySearchQueryChunk = await cramit(mySearchQuery, { chunkPrefix: "search_query" });
```

Now you can use the `myDocumentChunks` and `mySearchQueryChunk` arrays in your RAG application or find the closest match using cosine similarity in memory.
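
A minimal in-memory matching sketch (this assumes the calls above also set `returnEmbedding: true`, so every result carries an `embedding`):

```javascript
// cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// rank document chunks against the query embedding and take the best match
const queryEmbedding = mySearchQueryChunk[0].embedding;
const bestMatch = myDocumentChunks
    .map(chunk => ({ ...chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)[0];

console.log(bestMatch.text);
```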

Happy Chunking!

---

## Appreciation
- If you enjoy this plugin please consider sending me a tip to support my work 😀
+ If you enjoy this library please consider sending me a tip to support my work 😀
### [🍵 tip me here](https://ko-fi.com/jparkerweb)
13 changes: 12 additions & 1 deletion chunkingUtils.js
@@ -97,4 +97,15 @@ export async function optimizeAndRebalanceChunks(combinedChunks, tokenizer, maxT
if (currentChunkText) optimizedChunks.push(currentChunkText);

return optimizedChunks.filter(chunk => chunk);
}
}


// ------------------------------------------------
// -- Helper function to apply prefix to a chunk --
// ------------------------------------------------
export function applyPrefixToChunk(chunkPrefix, chunk) {
if (chunkPrefix && chunkPrefix.trim()) {
return `${chunkPrefix}: ${chunk}`;
}
return chunk;
};
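
A quick usage sketch of the new helper (illustrative values):

```javascript
applyPrefixToChunk("search_document", "Paris is the capital of France.");
// => "search_document: Paris is the capital of France."

applyPrefixToChunk("   ", "Paris is the capital of France.");
// => "Paris is the capital of France." (empty or whitespace-only prefixes are ignored)
```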
