add returnEmbedding, returnTokenLength, and chunkPrefix options
jparkerweb committed Nov 1, 2024
1 parent 927672b commit 3af2806
Showing 11 changed files with 358 additions and 168 deletions.
2 changes: 2 additions & 0 deletions .cursorignore
@@ -1,7 +1,9 @@
# directories to ignore during indexing
.git/
.github/
node_modules/
models/
example/

# file patterns to ignore during indexing
package-lock.json
58 changes: 58 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,58 @@
# Changelog

All notable changes to this project will be documented in this file.

## [2.0.0] - 2024-11-01
### Added
- Added `returnEmbedding` option to `chunkit` and `cramit` functions to include embeddings in the output.
- Added `returnTokenLength` option to `chunkit` and `cramit` functions to include token length in the output.
- Added `chunkPrefix` option to prefix each chunk with a task instruction (e.g., "search_document: ", "search_query: ").
- Updated README to document new options and add RAG tips for using `chunkPrefix` with embedding models that support task prefixes.

### ⚠️ Breaking Changes
- Returned array of chunks is now an array of objects with `text`, `embedding`, and `tokenLength` properties. Previous versions returned an array of strings.
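
  A minimal migration sketch (assuming `chunks` holds the result of a `chunkit` call):

  ```javascript
  // before (1.x): chunkit resolved to an array of strings
  // const firstChunkText = chunks[0];

  // after (2.0.0): chunkit resolves to an array of objects
  const firstChunkText = chunks[0].text;
  ```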

---

## [1.5.1] - 2024-11-01
### Fixed
- Fixed sentence splitter logic in `cramit` function.

---

## [1.5.0] - 2024-10-11
### Updated
- Replaced sentence splitter with a new algorithm that is more accurate and faster.

---

## [1.4.0] - 2024-09-24
### Added
- Broke the library up into modules for easier maintenance and future updates.

---

## [1.3.0] - 2024-09-09
### Added
- Added a download script to pre-download models for users who want to pre-package them with their application.
- Added model path/cache directory options.

### Updated
- Updated package dependencies.
- Updated example scripts.
- Updated README.

---

## [1.1.0] - 2024-05-09
### Added
- Added dynamic combining of final chunks based on a similarity threshold.

### Updated
- Improved initial chunking algorithm to reduce the number of chunks.

---

## [1.0.0] - 2024-02-29
### Added
- Initial release with basic chunking functionality.
41 changes: 38 additions & 3 deletions README.md
@@ -44,6 +44,17 @@ const myChunks = await chunkit(text, chunkitOptions);
- `onnxEmbeddingModelQuantized`: Boolean (optional, default `true`) - Indicates whether to use a quantized version of the embedding model.
- `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`).
- `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`).
- `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
- `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
- `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
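
A minimal sketch combining the new options (the import path is an assumption; adjust it to however you load this package):

```javascript
import { chunkit } from 'semantic-chunking'; // assumed package name

const text = "Some long document text...";
const myChunks = await chunkit(text, {
    returnEmbedding: true,          // attach an embedding vector to each chunk
    returnTokenLength: true,        // attach a token count to each chunk
    chunkPrefix: "search_document", // applied as "search_document: " before embedding
});
```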

## Output

The output is an array of chunks, each containing the following properties:

- `text`: String - The chunked text.
- `embedding`: Array - The embedding vector (if `returnEmbedding` is `true`).
- `tokenLength`: Integer - The token length (if `returnTokenLength` is `true`).
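
With both flags enabled, each element of the returned array looks roughly like this (values are illustrative):

```javascript
{
    text: "search_document: The quick brown fox jumps over the lazy dog.", // prefix applied via chunkPrefix
    embedding: [0.0123, -0.0456, 0.0789 /* ...one float per model dimension */],
    tokenLength: 17
}
```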

## Workflow

@@ -89,7 +100,7 @@ main();

```

- Look at the `example.js` file in the root of this project for a more complex example of using all the optional parameters.
+ Look at the `example/example-chunkit.js` file for a more complex example of using all the optional parameters.


## Tuning
@@ -164,6 +175,8 @@ The behavior of the `chunkit` function can be finely tuned using several optional parameters.

| Model | Quantized | Link | Size |
| -------------------------------------------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------- |
| nomic-ai/nomic-embed-text-v1.5 | true | [https://huggingface.co/nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | 138 MB |
| nomic-ai/nomic-embed-text-v1.5 | false | [https://huggingface.co/nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) | 548 MB |
| Xenova/all-MiniLM-L6-v2 | true | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 23 MB |
| Xenova/all-MiniLM-L6-v2 | false | [https://huggingface.co/Xenova/all-MiniLM-L6-v2](https://huggingface.co/Xenova/all-MiniLM-L6-v2) | 90.4 MB |
| Xenova/paraphrase-multilingual-MiniLM-L12-v2 | true | [https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2) | 118 MB |
@@ -195,7 +208,7 @@ main();

```

- Look at the `example2.js` file in the root of this project for a more complex example of using all the optional parameters.
+ Look at the `example/example-cramit.js` file for a more complex example of using all the optional parameters.

### Tuning

@@ -238,6 +251,28 @@ Fill out the `tools/download-models.list.json` file with a list of models you want to download

---

## 🔍 RAG Tip!

If you are using this library for a RAG application, consider using the `chunkPrefix` option to add a task prefix to each chunk. For embedding models that support task prefixes (such as nomic-embed-text-v1.5), this can improve embedding quality and retrieval accuracy, which in turn reduces the amount of context you need to pass to the LLM.

Chunk your large document like this:
```javascript
const text = await fs.promises.readFile('./large-document.txt', 'utf8');
const myDocumentChunks = await chunkit(text, { chunkPrefix: "search_document" });
```

Get your search queries ready like this (use cramit for a quick large chunk):
```javascript
const mySearchQuery = "What is the capital of France?";
const mySearchQueryChunk = await cramit(mySearchQuery, { chunkPrefix: "search_query" });
```

Now you can use the `myDocumentChunks` and `mySearchQueryChunk` arrays in your RAG application or find the closest match using cosine similarity in memory.
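
A minimal in-memory matching sketch (this assumes the calls above also set `returnEmbedding: true`, so every result carries an `embedding`):

```javascript
// cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// rank document chunks against the query embedding and take the best match
const queryEmbedding = mySearchQueryChunk[0].embedding;
const bestMatch = myDocumentChunks
    .map(chunk => ({ ...chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)[0];

console.log(bestMatch.text);
```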

Happy Chunking!

---

## Appreciation
- If you enjoy this plugin please consider sending me a tip to support my work 😀
+ If you enjoy this library please consider sending me a tip to support my work 😀
### [🍵 tip me here](https://ko-fi.com/jparkerweb)
13 changes: 12 additions & 1 deletion chunkingUtils.js
@@ -97,4 +97,15 @@ export async function optimizeAndRebalanceChunks(combinedChunks, tokenizer, maxT
if (currentChunkText) optimizedChunks.push(currentChunkText);

return optimizedChunks.filter(chunk => chunk);
}
}


// ------------------------------------------------
// -- Helper function to apply prefix to a chunk --
// ------------------------------------------------
export function applyPrefixToChunk(chunkPrefix, chunk) {
if (chunkPrefix && chunkPrefix.trim()) {
return `${chunkPrefix}: ${chunk}`;
}
return chunk;
};
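
A quick usage sketch of the new helper (illustrative values):

```javascript
applyPrefixToChunk("search_document", "Paris is the capital of France.");
// => "search_document: Paris is the capital of France."

applyPrefixToChunk("   ", "Paris is the capital of France.");
// => "Paris is the capital of France." (empty or whitespace-only prefixes are ignored)
```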
