Text Encoder Transform

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Contributors

Description

This transform uses sentence encoder models to create an embedding vector for the text in each row of the input .parquet table.

The embedding vectors generated by the transform are useful for tasks such as sentence similarity and feature extraction, which are also at the core of retrieval-augmented generation (RAG) applications.
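As a rough illustration of the sentence-similarity use case, the sketch below ranks two documents against a query by cosine similarity. The vectors are toy 3-dimensional values standing in for real embeddings, and `cosine_similarity` is a helper written for this example, not part of the transform.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, not output of a real model).
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

# Rank candidate documents from most to least similar to the query.
ranked = sorted(
    {"doc_close": doc_close, "doc_far": doc_far}.items(),
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
```

In a retrieval setting the same ranking step would run over the vectors stored in the output embeddings column.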

Input

| input column name | data type | description |
|---|---|---|
| the one specified in content_column_name configuration | string | the content used in this transform |

Output columns

| output column name | data type | description |
|---|---|---|
| the one specified in output_embeddings_column_name configuration | array[float] | the embeddings vectors of the content |
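The input and output contract can be sketched as follows. This is a minimal illustration that uses plain Python dictionaries instead of a .parquet table; `toy_encode` is a hypothetical stand-in for a sentence encoder and `add_embeddings` is a helper invented for this example, so neither reflects the transform's actual implementation.

```python
from typing import Callable, Dict, List


def toy_encode(text: str) -> List[float]:
    """Hypothetical stand-in for a sentence encoder (NOT a real embedding model)."""
    return [float(len(text)), float(len(text.split()))]


def add_embeddings(
    rows: List[Dict],
    content_column: str,
    output_column: str,
    encode: Callable[[str], List[float]] = toy_encode,
) -> List[Dict]:
    """Return new rows with one extra column holding a list of floats per row."""
    return [{**row, output_column: encode(row[content_column])} for row in rows]


rows = [{"contents": "hello world"}, {"contents": "text encoder"}]
encoded = add_embeddings(rows, "contents", "embeddings")
```

Each output row keeps its original string column and gains one array of floats under the configured output column name.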

Configuration

The transform can be tuned with the following parameters.

| Parameter | Default | Description |
|---|---|---|
| model_name | BAAI/bge-small-en-v1.5 | The HF model to use for encoding the text. |
| content_column_name | contents | Name of the column containing the text to be encoded. |
| output_embeddings_column_name | embeddings | Column name to store the embeddings in the output table. |
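One way to picture how these defaults combine with user-supplied values is the sketch below. The parameter names and default values come from the table above; `DEFAULTS` and `resolve_config` are hypothetical names used only for illustration and are not the transform's API.

```python
# Defaults as documented for this transform.
DEFAULTS = {
    "model_name": "BAAI/bge-small-en-v1.5",
    "content_column_name": "contents",
    "output_embeddings_column_name": "embeddings",
}


def resolve_config(overrides=None):
    """Merge user overrides on top of the defaults (hypothetical helper)."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Reject misspelled parameter names instead of silently ignoring them.
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# Override only the input column; the other parameters keep their defaults.
config = resolve_config({"content_column_name": "text"})
```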

Usage

Launched Command Line Options

When invoking the CLI, each parameter must be set as --text_encoder_<name>, where <name> is one of the parameter names listed under Configuration, e.g. --text_encoder_output_embeddings_column_name=myoutput.
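The naming rule can be illustrated with a short sketch that turns parameter names into prefixed flags. `to_cli_args` is a hypothetical helper written for this example; only the `--text_encoder_` prefix and the parameter names are taken from this document.

```python
CLI_PREFIX = "--text_encoder_"


def to_cli_args(params):
    """Format each parameter as --text_encoder_<name>=<value> (hypothetical helper)."""
    return [f"{CLI_PREFIX}{name}={value}" for name, value in params.items()]


args = to_cli_args(
    {
        "model_name": "BAAI/bge-small-en-v1.5",
        "output_embeddings_column_name": "myoutput",
    }
)
```

The resulting list could then be appended to whatever command launches the transform.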

Code example

Here is a sample notebook

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.

Testing

Testing follows the testing strategy of data-processing-lib.

Currently we have:

TextEncoder Ray Transform

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Summary

This project wraps the text_encoder transform with a Ray runtime.

Configuration and command line Options

Text Encoder configuration and command line options are the same as for the base python transform.

Code example

Here is a sample notebook

Launched Command Line Options

In addition to the options available to the transform as defined here, the Ray launcher options are also available.

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.