Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.
- Chad DeLuca ([email protected])
- Anna Lisa Gentile ([email protected])
The similarity transform annotates each input document with potential matches found in a document collection. The annotation consists of a JSON object providing the id of the matched document in the collection and the specific sentence deemed "similar" by the transform. The Similarity Transform relies on a running Elasticsearch index. We assume (and provide) a functioning endpoint, but you can spin up your own service (details in ElasticSearch Configuration).
In its current implementation, this transform helps identify whether any input text nearly-verbatim reproduces content in a target collection. The main purpose is "Text Attribution", i.e. vetting text against proprietary content to identify data leakages or potential copyright violations.
This is particularly useful in the context of synthetically generated text. One of the many concerns about using LLMs is the inadvertent incorporation of proprietary content that may violate copyright laws. Our assumption is that we are in possession of the data we want to check against - this is the data collection that we want to protect, and we look for verbatim or semi-verbatim reuse of any of its content. To perform search and similarity against our reference data, the data needs to be indexed (how to index your own data).
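As a loose illustration of what indexing a reference document involves, the sketch below uses the official Elasticsearch Python client. The provided indexing procedure is authoritative; the field name, document id, and connection details here are assumptions for the example:

```python
from elasticsearch import Elasticsearch

# Endpoint, credentials, and index name taken from the example
# configuration later in this document; substitute your own.
es = Elasticsearch(
    "https://thisIsWhere.MyElasticIsRunning.com",
    basic_auth=("myElasticsearchID", "my password"),
)

# Index one reference document. The "contents" field name is an
# assumption; use whatever field your indexing procedure defines.
es.index(
    index="myPrivateDocumentsIndex",
    id="123456789",
    document={"contents": "Now is the winter of our discontent"},
)
```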
We then take the input text to be vetted, and generate multiple shingle queries to efficiently find matches against the Elasticsearch index.
For example, take the following text to check for similarity:
```
Now is the winter of our discontent
```
We can set the shingle size to 3 with a skip value (how far the shingle window slides) of 1 to get the following shingles:
```
now is the
is the winter
the winter of
winter of our
of our discontent
```
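The shingling step itself can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing, not the transform's actual implementation:

```python
def shingles(text: str, size: int = 3, skip: int = 1) -> list[str]:
    """Generate word shingles of `size` words, sliding the window by `skip`."""
    words = text.lower().split()
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - size + 1, skip)]

print(shingles("Now is the winter of our discontent"))
# ['now is the', 'is the winter', 'the winter of', 'winter of our', 'of our discontent']
```

Each shingle then becomes one query against the index.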
This implementation allows some flexibility for matching text in cases where a near-exact match is suitable. In Elasticsearch, this primarily manifests as a value for "slop", which allows words to appear in a different order as well as with slight variations; the allowed variation increases with the slop value. Calculating this value at runtime allows longer shingles to still find matches, even without a 100% exact match.
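As an illustration of this kind of query, the sketch below issues a `match_phrase` query with a runtime-computed `slop` through the official Elasticsearch Python client. The field name `contents`, the slop heuristic, and the connection details are assumptions for the example, not the transform's exact query logic:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "https://thisIsWhere.MyElasticIsRunning.com",
    basic_auth=("myElasticsearchID", "my password"),
)

shingle = "now is the winter of our discontent"
# Illustrative heuristic: allow roughly one positional move per few
# words; the formula the transform actually uses may differ.
slop = max(1, len(shingle.split()) // 3)

resp = es.search(
    index="myPrivateDocumentsIndex",
    query={"match_phrase": {"contents": {"query": shingle, "slop": slop}}},
    size=1,
)
for hit in resp["hits"]["hits"]:
    print(hit["_index"], hit["_id"], hit["_score"])
```

A `match_phrase` query with `slop > 0` tolerates a bounded number of positional moves between query terms, which is what permits near-verbatim rather than only exact matches.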
This transform supports the input of parquet files that contain a single column, called "contents", where each row is a string that will be searched for in a target document collection. Your contents column may, for example, contain a collection of texts generated by an LLM, or a collection of student essays, i.e. any text whose originality you want to verify against your target corpus.
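For instance, a minimal input file with the required `contents` column could be produced with pyarrow; the file name and texts here are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build an input table with the single required "contents" column.
texts = [
    "I bet the company staffs want an increase in the wages",
    "Now is the winter of our discontent",
]
table = pa.table({"contents": pa.array(texts, type=pa.string())})
pq.write_table(table, "sample_input.parquet")  # illustrative file name
```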
The target corpus (your Elasticsearch index) is specified with configuration parameters. You can index your document collection using a procedure we provide. To start, you can use our provided Elasticsearch instance, which contains some pre-indexed news articles.
The output table will contain a single additional column:
output column name | data type | description |
---|---|---|
contents | string | the original input text |
similarity_score | json | the annotations that describe in which document a potential match was found and which sentence in the document was the closest match |
Example of a single cell in the `contents` column of the output:

```
I bet the company staffs want an increase in the wages
```

Example of the corresponding cell in the `similarity_score` column:

```python
{
    'contents': array(['I bet the company staffs want to have an increase in the wages.'], dtype=object),
    'id': '123456789',
    'index': 'myPrivateDocumentsIndex',
    'score': 29.345
}
```
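To inspect the annotated output programmatically, a sketch along these lines can be used. The output path is illustrative, and depending on how the annotation column is serialized it may arrive as a JSON string or as a struct:

```python
import json
import pyarrow.parquet as pq

# Illustrative path; the transform writes annotated parquet files
# to the configured output folder.
table = pq.read_table("output/sample.parquet")
for text, annotation in zip(table["contents"].to_pylist(),
                            table["similarity_score"].to_pylist()):
    # The annotation may be stored as a JSON string rather than a
    # struct; handle both cases.
    if isinstance(annotation, str):
        annotation = json.loads(annotation)
    print(f"input:  {text}")
    print(f"match:  {annotation['contents'][0]}")
    print(f"doc id: {annotation['id']} "
          f"(index: {annotation['index']}, score: {annotation['score']})")
```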
The transform can be initialized with the following parameters.
Parameter | Default | Description |
---|---|---|
similarity_es_endpoint | - | The URL for Elasticsearch |
similarity_es_userid | - | Elasticsearch user ID |
similarity_es_pwd | - | Elasticsearch password |
similarity_es_index | - | The Elasticsearch index to query |
similarity_shingle_size | 8 | Shingle size for query construction |
similarity_result_size | 1 | Result size for matched sentences |
similarity_annotation_column | similarity_score | The column name that will contain the similarity annotations, in JSON format |
Example:

```json
{
  "similarity_es_pwd": "my password",
  "similarity_es_userid": "myElasticsearchID",
  "similarity_es_endpoint": "https://thisIsWhere.MyElasticIsRunning.com",
  "similarity_es_index": "myPrivateDocumentsIndex"
}
```
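For a programmatic invocation with these parameters, a hedged sketch is shown below, assuming this transform follows the standard data-prep-kit pure-Python launcher pattern. The configuration class name, import paths, and folder locations are assumptions; see src/similarity_local_python.py for the actual sample:

```python
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils

# Assumed class name; check src/similarity_transform_python.py for the real one.
from similarity_transform_python import SimilarityPythonTransformConfiguration

params = {
    # Illustrative local input/output folders.
    "data_local_config": ParamsUtils.convert_to_ast(
        {"input_folder": "test-data/input", "output_folder": "output"}
    ),
    "similarity_es_endpoint": "https://thisIsWhere.MyElasticIsRunning.com",
    "similarity_es_userid": "myElasticsearchID",
    "similarity_es_pwd": "my password",
    "similarity_es_index": "myPrivateDocumentsIndex",
}
sys.argv = ParamsUtils.dict_to_req(d=params)
PythonTransformLauncher(runtime_config=SimilarityPythonTransformConfiguration()).launch()
```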
The following command line arguments are available in addition to the options provided by the python launcher options.
```text
--similarity_es_endpoint SIMILARITY_ES_ENDPOINT
    The URL for Elasticsearch
--similarity_es_userid SIMILARITY_ES_USERID
    Elasticsearch user ID
--similarity_es_pwd SIMILARITY_ES_PWD
    Elasticsearch password
--similarity_es_index SIMILARITY_ES_INDEX
    The Elasticsearch index to query
--similarity_shingle_size SIMILARITY_SHINGLE_SIZE
    Shingle size for query construction (default is 8)
--similarity_result_size SIMILARITY_RESULT_SIZE
    Result size for matched sentences (default is 1)
--similarity_annotation_column SIMILARITY_ANNOTATION_COLUMN
    The column name that will contain the similarity score
--similarity_doc_text_column SIMILARITY_DOC_TEXT_COLUMN
    The column name that contains the document text
```
These correspond to the configuration keys described above.
When invoking the CLI, the parameters must be set as `--similarity_<name>`, e.g. `--similarity_es_pwd=pass`.
To run the samples, use the following `make` targets:

- `run-cli-sample` - runs src/similarity_transform_python.py using command line args
- `run-local-sample` - runs src/similarity_local.py
- `run-local-python-sample` - runs src/similarity_local_python.py
These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.
For example:

```shell
make run-local-python-sample
...
```

Then use:

```shell
ls output
```

to see the results of the transform.
See the sample notebook for an example.