
Similarity Transform

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Summary

The similarity transform annotates each input document with potential matches found in a document collection. The annotation is a json object providing the id of the matched document in the collection and the specific sentence deemed "similar" by the transform. The Similarity Transform relies on a running Elasticsearch index. We assume (and provide) a functioning endpoint, but you can spin up your own service (details in ElasticSearch Configuration).

Is this transform for me?

In its current implementation, this transform helps identify whether any input text nearly verbatim reproduces content in a target collection. The main purpose is "Text Attribution", i.e. vetting text against proprietary content to identify data leakages or potential copyright violations.

This is particularly useful in the context of synthetically generated text. One of the many concerns about using LLMs is the inadvertent incorporation of proprietary content that may violate copyright laws. Our assumption is that we are in possession of the data we want to check against: this is the data collection that we want to protect, and we look for verbatim or semi-verbatim reuse of any of its content. To perform search and similarity matching against our reference data, the data first needs to be indexed (how to index your own data).
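As a rough illustration of what indexing could look like, here is a sketch that prepares bulk-index actions for the `elasticsearch` Python client. The index name, document fields, and helper function are hypothetical; the actual indexing procedure is the one provided with the transform.

```python
# Sketch only: prepare bulk-index actions for elasticsearch.helpers.bulk.
# The index name and the "contents" field are illustrative assumptions.

def build_bulk_actions(docs, index_name):
    """Turn (doc_id, text) pairs into bulk-index actions."""
    return [
        {"_index": index_name, "_id": doc_id, "_source": {"contents": text}}
        for doc_id, text in docs
    ]

actions = build_bulk_actions(
    [("123456789", "I bet the company staffs want to have an increase in the wages.")],
    "myPrivateDocumentsIndex",
)

# To actually index (requires a running cluster and the elasticsearch package):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("https://thisIsWhere.MyElasticIsRunning.com",
#                    basic_auth=("myElasticsearchID", "my password"))
# helpers.bulk(es, actions)
```

The actual client calls are left commented out because they need a live cluster and credentials.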

I am curious: how is it implemented? [optional read]

We take the input text to be vetted and generate multiple shingle queries to efficiently find matches against the Elasticsearch index.

For example, take the following text to check for similarity:

Now is the winter of our discontent

We can set the shingle size to 3 with a skip value (how far the shingle window slides) of 1 to get the following shingles:

now is the
is the winter
the winter of
winter of our
of our discontent
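The shingling step above can be sketched as follows. `shingles` is a hypothetical helper written for illustration, not the transform's actual code:

```python
def shingles(text, size=3, skip=1):
    """Slide a window of `size` tokens over the text, advancing by `skip` tokens."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + size])
        for i in range(0, len(tokens) - size + 1, skip)
    ]

print(shingles("Now is the winter of our discontent", size=3, skip=1))
# ['now is the', 'is the winter', 'the winter of', 'winter of our', 'of our discontent']
```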

For this implementation we allow a level of flexibility for matching text in cases where a near-exact match is suitable. In Elasticsearch, this primarily manifests as a value for "slop", which allows words to appear in a different order, as well as slight variations; the allowed variation increases with the slop value. Calculating this value at runtime allows longer shingles to still find matches, even without a 100% exact match.
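To make this concrete, here is a minimal sketch of an Elasticsearch `match_phrase` query body with a runtime-computed slop. The slop formula (roughly one word of slack per four tokens) and the `contents` field name are assumptions for illustration, not the transform's actual heuristic:

```python
def phrase_query(shingle, field="contents"):
    """Build a match_phrase query body with slop scaled to the shingle length."""
    # Assumed heuristic: allow roughly one word of slack per four tokens.
    slop = max(0, len(shingle.split()) // 4)
    return {
        "query": {
            "match_phrase": {
                field: {"query": shingle, "slop": slop}
            }
        }
    }

q = phrase_query("now is the winter of our discontent")
# This body could be passed to an Elasticsearch search call against the index.
```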

Configuration

Input files

This transform supports the input of parquet files that contain a single column, called "contents", where each row is a string that will be searched for in a target document collection. Your contents column may, for example, contain a collection of texts generated by an LLM, or a collection of student essays: any text whose originality you want to verify against your target corpus.

The target corpus (your Elasticsearch index) is specified with configuration parameters. You can index your own document collection using a procedure we provide. To start, you can use our provided Elasticsearch instance, which contains some pre-indexed news articles.
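Any tool that writes a single-column parquet file works for preparing input; here is a sketch with pandas (the file name and sample texts are arbitrary):

```python
import pandas as pd

# A single "contents" column, one text to vet per row.
df = pd.DataFrame(
    {
        "contents": [
            "I bet the company staffs want an increase in the wages",
            "Now is the winter of our discontent",
        ]
    }
)

# Writing requires a parquet engine such as pyarrow:
# df.to_parquet("input.parquet")
```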

Output format

The output table will contain a single additional column:

| output column name | data type | description |
| --- | --- | --- |
| contents | string | the original input text |
| similarity_score | json | the annotation describing in which document a potential match was found and which sentence in that document was the closest match |
Note: similarity_score will soon be renamed similarity_annotation. Within the annotation, the field score will be renamed rank. The current score number shouldn't be taken as an absolute value, but rather as a rank for the returned results: higher-ranking results are more similar to the input text.

Example of single cell contents in the output column:

I bet the company staffs want an increase in the wages

Example of single cell content in the similarity_score column:

  {
      'contents': array(['I bet the company staffs want to have an increase in the wages.'], dtype=object), 
      'id': '123456789', 
      'index': 'myPrivateDocumentsIndex', 
      'score': 29.345
  }
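Downstream code can read the annotation like any other dict. A minimal sketch, with the field names taken from the example above (a plain list stands in for the array in the real output, and `best_match_sentence` is a hypothetical helper):

```python
annotation = {
    "contents": ["I bet the company staffs want to have an increase in the wages."],
    "id": "123456789",
    "index": "myPrivateDocumentsIndex",
    "score": 29.345,
}

def best_match_sentence(ann):
    """Return the closest-matching sentence from an annotation, or None if empty."""
    sentences = ann.get("contents") or []
    return sentences[0] if len(sentences) else None

print(best_match_sentence(annotation))
```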

Initialization

The transform can be initialized with the following parameters.

| Parameter | Default | Description |
| --- | --- | --- |
| similarity_es_endpoint | - | The URL for Elasticsearch |
| similarity_es_userid | - | Elasticsearch user ID |
| similarity_es_pwd | - | Elasticsearch password |
| similarity_es_index | - | The Elasticsearch index to query |
| similarity_shingle_size | 8 | Shingle size for query construction |
| similarity_result_size | 1 | Result size for matched sentences |
| similarity_annotation_column | similarity_score | The column name that will contain the similarity annotations, in json format |

Example

{
      "similarity_es_pwd" :"my password",
      "similarity_es_userid":"myElasticsearchID",
      "similarity_es_endpoint":"https://thisIsWhere.MyElasticIsRunning.com",
      "similarity_es_index" :"myPrivateDocumentsIndex"
}

Running

Launched Command Line Options

The following command line arguments are available in addition to those provided by the python launcher.

  --similarity_es_endpoint SIMILARITY_ES_ENDPOINT
                        The URL for Elasticsearch
  --similarity_es_userid SIMILARITY_ES_USERID
                        Elasticsearch user ID
  --similarity_es_pwd SIMILARITY_ES_PWD
                        Elasticsearch password
  --similarity_es_index SIMILARITY_ES_INDEX
                        The Elasticsearch index to query
  --similarity_shingle_size SIMILARITY_SHINGLE_SIZE
                        Shingle size for query construction (default is 8)
  --similarity_result_size SIMILARITY_RESULT_SIZE
                        result size for matched sentences (default is 1)
  --similarity_annotation_column SIMILARITY_ANNOTATION_COLUMN
                        The column name that will contain the similarity annotations
  --similarity_doc_text_column SIMILARITY_DOC_TEXT_COLUMN
                        The column name that contains the document text

These correspond to the configuration keys described above.


When invoking the CLI, the parameters must be set as --similarity_<name>, e.g. --similarity_es_pwd=pass.

Running the samples

To run the samples, use the following make targets

  • run-cli-sample - runs src/similarity_transform_python.py using command line args
  • run-local-sample - runs src/similarity_local.py
  • run-local-python-sample - runs src/similarity_local_python.py

These targets will activate the virtual environment and set up any configuration needed. Use the -n option of make to see the details of what is done to run the sample.

For example,

make run-local-python-sample
...

Then

ls output

To see results of the transform.

Code example

See the sample notebook for an example.