Add unittest for ray text dedup (#540)
chenyushuo authored Jan 23, 2025
1 parent 7ca6ba6 commit ba40e47
Showing 6 changed files with 1,977 additions and 14 deletions.
8 changes: 4 additions & 4 deletions docs/Operators.md
@@ -78,10 +78,10 @@ All the specific operators are listed below, each featured with several capabili
| document_simhash_deduplicator | 🔤Text 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using SimHash. Deduplicator使用SimHash在文档级别对样本进行重复数据删除。 | [code](../data_juicer/ops/deduplicator/document_simhash_deduplicator.py) | [tests](../tests/ops/deduplicator/test_document_simhash_deduplicator.py) |
| image_deduplicator | 🏞Image 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using exact matching of images between documents. Deduplicator使用文档之间的图像精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/image_deduplicator.py) | [tests](../tests/ops/deduplicator/test_image_deduplicator.py) |
| ray_basic_deduplicator | 💻CPU 🔴Alpha | Backend for deduplicator. deduplicator的后端。 | [code](../data_juicer/ops/deduplicator/ray_basic_deduplicator.py) | - |
-| ray_bts_minhash_deduplicator | 🔤Text 💻CPU 🔴Alpha | A distributed implementation of Union-Find with load balancing. 具有负载平衡的Union-Find的分布式实现。 | [code](../data_juicer/ops/deduplicator/ray_bts_minhash_deduplicator.py) | - |
-| ray_document_deduplicator | 🔤Text 💻CPU 🔴Alpha | Deduplicator to deduplicate samples at document-level using exact matching. Deduplicator使用精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/ray_document_deduplicator.py) | - |
-| ray_image_deduplicator | 🏞Image 💻CPU 🔴Alpha | Deduplicator to deduplicate samples at document-level using exact matching of images between documents. Deduplicator使用文档之间的图像精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/ray_image_deduplicator.py) | - |
-| ray_video_deduplicator | 🎬Video 💻CPU 🔴Alpha | Deduplicator to deduplicate samples at document-level using exact matching of videos between documents. Deduplicator使用文档之间的视频精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/ray_video_deduplicator.py) | - |
+| ray_bts_minhash_deduplicator | 🔤Text 💻CPU 🟡Beta | A distributed implementation of Union-Find with load balancing. 具有负载平衡的Union-Find的分布式实现。 | [code](../data_juicer/ops/deduplicator/ray_bts_minhash_deduplicator.py) | [tests](../tests/ops/deduplicator/test_ray_bts_minhash_deduplicator.py) |
+| ray_document_deduplicator | 🔤Text 💻CPU 🟡Beta | Deduplicator to deduplicate samples at document-level using exact matching. Deduplicator使用精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/ray_document_deduplicator.py) | [tests](../tests/ops/deduplicator/test_ray_document_deduplicator.py) |
+| ray_image_deduplicator | 🏞Image 💻CPU 🟡Beta | Deduplicator to deduplicate samples at document-level using exact matching of images between documents. Deduplicator使用文档之间的图像精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/ray_image_deduplicator.py) | [tests](../tests/ops/deduplicator/test_ray_image_deduplicator.py) |
+| ray_video_deduplicator | 🎬Video 💻CPU 🟡Beta | Deduplicator to deduplicate samples at document-level using exact matching of videos between documents. Deduplicator使用文档之间的视频精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/ray_video_deduplicator.py) | [tests](../tests/ops/deduplicator/test_ray_video_deduplicator.py) |
| video_deduplicator | 🎬Video 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using exact matching of videos between documents. Deduplicator使用文档之间的视频精确匹配在文档级别删除重复的样本。 | [code](../data_juicer/ops/deduplicator/video_deduplicator.py) | [tests](../tests/ops/deduplicator/test_video_deduplicator.py) |

## filter <a name="filter"/>
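
The table rows in this hunk describe two ideas: exact-match deduplication at document level, and Union-Find used to group duplicates. The sketch below illustrates both in plain, single-process Python. It is not the Data-Juicer or Ray implementation, and every name in it (`dedup_exact`, `UnionFind`, `dedup_by_pairs`) is hypothetical and chosen only for illustration. It assumes documents are plain strings and that duplicate pairs have already been found by some earlier step (for example, a MinHash comparison).

```python
import hashlib


def dedup_exact(texts):
    """Keep the first occurrence of each distinct text, compared by MD5 digest."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


class UnionFind:
    """Minimal Union-Find (disjoint-set) over the indices 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Attach the larger root to the smaller one, so the root of
            # every set is always its lowest index.
            self.parent[max(ra, rb)] = min(ra, rb)


def dedup_by_pairs(n, duplicate_pairs):
    """Return the indices to keep: the lowest index of each duplicate cluster."""
    uf = UnionFind(n)
    for a, b in duplicate_pairs:
        uf.union(a, b)
    return [i for i in range(n) if uf.find(i) == i]


if __name__ == "__main__":
    print(dedup_exact(["a", "b", "a"]))
    print(dedup_by_pairs(5, [(0, 1), (1, 2), (3, 4)]))
```

Keeping the lowest index per cluster is one possible policy, picked here so that the result is deterministic; a distributed version would also need to partition the Union-Find state across workers, which this sketch does not attempt.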