Replies: 1 comment
-
The tokenizer currently does not parse Chinese. Stop words might work for Chinese, but you would need a custom splitter such as jieba; I have not tried it yet. If you have a working example, please share it here!
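In case it helps, here is a rough, untested sketch of what I had in mind: segment with jieba and filter stopwords outside the library, then hand the plain token lists to bm25s. The zh_stopwords set is only an illustrative placeholder, and this assumes BM25.index() and BM25.retrieve() accept lists of lists of strings in your version of bm25s, so treat it as a starting point rather than a confirmed recipe.

# Untested sketch: pre-segment Chinese text with jieba, drop stopwords
# manually, then pass plain token lists to bm25s.
# Assumption: BM25.index()/retrieve() accept List[List[str]] (check your version).
import re
import bm25s
import jieba

corpus = [
    "黄山为三山五岳中三山之一,五岳归来不看山,黄山归来不看岳。",
    "华山是一个很好旅游景点,人们常常去爬华山。",
]

# Placeholder stopwords; swap in a real Chinese stopword list.
zh_stopwords = {"是", "一个", "很", "好", "的", "了"}

def tokenize_zh(text):
    # jieba.lcut returns a list of words; keep tokens that contain at least
    # one word character (drops punctuation) and are not stopwords.
    return [w for w in jieba.lcut(text)
            if re.search(r"\w", w) and w not in zh_stopwords]

corpus_tokens = [tokenize_zh(doc) for doc in corpus]

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

query_tokens = [tokenize_zh("爬华山")]
results, scores = retriever.retrieve(query_tokens, k=2)
print(results, scores)

The idea is to keep bm25s for indexing and scoring only, and do all Chinese-specific segmentation and stopword removal up front.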
-
import bm25s

text = ["黄山为三山五岳中三山之一,五岳归来不看山,黄山归来不看岳。", "华山是一个很好旅游景点,人们常常去爬华山。"]
tids = bm25s.tokenize(text, stopwords="zh")
print(tids)
------log------
Tokenized(ids=[[0, 1, 2], [3, 4]], vocab={'黄山为三山五岳中三山之一': 0, '五岳归来不看山': 1, '黄山归来不看岳': 2, '华山是一个很好旅游景点': 3, '人们常常去爬华山': 4})
tids = bm25s.tokenize(text, stopwords=["是", "好"])
print(tids)
------log------
Tokenized(ids=[[0, 1, 2], [3, 4]], vocab={'黄山为三山五岳中三山之一': 0, '五岳归来不看山': 1, '黄山归来不看岳': 2, '华山是一个很好旅游景点': 3, '人们常常去爬华山': 4})
Stopwords don't work: as the logs show, each whole clause is kept as a single token in the vocabulary, so individual stopwords such as "是" are never matched or removed.