Malayalam Language support #487
ManyTheFish
started this conversation in
Feedback & Feature Proposal
Replies: 0 comments
Malayalam support
Malayalam is a Dravidian language spoken by 34 million people in India. It has several specificities:
Words are separated by spaces, but Malayalam is an agglutinative language
The current behavior of unicode-segmenter, the default segmenter for "officially unsupported languages", is to split words on spaces, which more or less works.
However, because of the agglutinative morphology of Malayalam, we have to split words into sensible sub-words to increase the chance of finding a relevant document with Meilisearch.
Vowels are combined with consonants
A standalone vowel that follows a consonant is written as a dependent vowel sign attached to that consonant, which is a different Unicode code point from the standalone vowel; for instance, ക + ആ gives കാ. This could prevent prefix search and typo tolerance from finding relevant documents. We should uncombine these characters during a normalization process.
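As a rough illustration of what "uncombining" could look like, the sketch below replaces each dependent vowel sign with the corresponding standalone vowel letter, so that കാ becomes ക followed by ആ. The function name `uncombine_vowels` and the choice of mapping signs to standalone vowels are assumptions made for this example; this is not an existing Meilisearch or charabia API, and the table covers only the most common vowel signs.

```python
# Hypothetical helper: map Malayalam dependent vowel signs to standalone vowels.
# The mapping table is an illustrative assumption, not a complete or official list.
SIGN_TO_VOWEL = {
    "\u0D3E": "\u0D06",  # vowel sign AA -> letter AA
    "\u0D3F": "\u0D07",  # vowel sign I  -> letter I
    "\u0D40": "\u0D08",  # vowel sign II -> letter II
    "\u0D41": "\u0D09",  # vowel sign U  -> letter U
    "\u0D42": "\u0D0A",  # vowel sign UU -> letter UU
    "\u0D46": "\u0D0E",  # vowel sign E  -> letter E
    "\u0D47": "\u0D0F",  # vowel sign EE -> letter EE
    "\u0D48": "\u0D10",  # vowel sign AI -> letter AI
    "\u0D4A": "\u0D12",  # vowel sign O  -> letter O
    "\u0D4B": "\u0D13",  # vowel sign OO -> letter OO
    "\u0D4C": "\u0D14",  # vowel sign AU -> letter AU
}


def uncombine_vowels(text: str) -> str:
    """Replace each dependent vowel sign with its standalone vowel letter."""
    return "".join(SIGN_TO_VOWEL.get(ch, ch) for ch in text)


if __name__ == "__main__":
    combined = "\u0D15\u0D3E"  # KA + vowel sign AA, rendered as one cluster
    print([hex(ord(ch)) for ch in uncombine_vowels(combined)])
```

With this kind of normalization applied at both indexing and query time, a query typed with a standalone vowel and a document written with the combined form would share the same code points.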
Malayalam has diacritics
Diacritics can, to some extent, be considered equivalent to accents in Latin-script languages.
So we should remove these Malayalam diacritical marks during a normalization process.
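A minimal sketch of such a removal step is shown below. The post does not say which marks should be stripped, so the default set used here (candrabindu, anusvara, visarga) is purely an assumption for illustration, and a native speaker may well choose a different set. The name `strip_marks` is hypothetical.

```python
# Hypothetical helper: drop a configurable set of Malayalam diacritical marks.
# ASSUMED_MARKS is an illustrative guess; the right set needs native-speaker input.
ASSUMED_MARKS = frozenset({
    "\u0D01",  # MALAYALAM SIGN CANDRABINDU
    "\u0D02",  # MALAYALAM SIGN ANUSVARA
    "\u0D03",  # MALAYALAM SIGN VISARGA
})


def strip_marks(text: str, marks: frozenset = ASSUMED_MARKS) -> str:
    """Return text with every character found in `marks` removed."""
    return "".join(ch for ch in text if ch not in marks)


if __name__ == "__main__":
    word = "\u0D15\u0D02"  # KA followed by anusvara
    print([hex(ord(ch)) for ch in strip_marks(word)])
```

Keeping the set of marks as a parameter makes it easy to experiment with stricter or looser normalization without touching the function itself.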
Consonants with a virama diacritic can alternatively be written as chillu letters
Chillu letters are an alternative form of some Malayalam consonants carrying a virama diacritic; for example, ൾ is the alternative form of ള് (ള + virama). We should convert chillu letters into common consonants during a normalization process.
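The conversion can be sketched as a simple lookup from each atomic chillu code point to its base consonant followed by the virama (U+0D4D). The function name `expand_chillu` is hypothetical, and mapping chillu RR to ര rather than റ is an assumption; a later diacritic-removal step could then drop the virama if only the bare consonant is wanted.

```python
# Hypothetical helper: expand Malayalam chillu letters to consonant + virama.
VIRAMA = "\u0D4D"

CHILLU_TO_BASE = {
    "\u0D7A": "\u0D23",  # CHILLU NN -> letter NNA
    "\u0D7B": "\u0D28",  # CHILLU N  -> letter NA
    "\u0D7C": "\u0D30",  # CHILLU RR -> letter RA (assumption; RRA is also possible)
    "\u0D7D": "\u0D32",  # CHILLU L  -> letter LA
    "\u0D7E": "\u0D33",  # CHILLU LL -> letter LLA
    "\u0D7F": "\u0D15",  # CHILLU K  -> letter KA
}


def expand_chillu(text: str) -> str:
    """Replace each chillu letter with its base consonant plus virama."""
    return "".join(
        CHILLU_TO_BASE[ch] + VIRAMA if ch in CHILLU_TO_BASE else ch
        for ch in text
    )


if __name__ == "__main__":
    chillu_ll = "\u0D7E"  # the chillu form used as the example in this post
    print([hex(ord(ch)) for ch in expand_chillu(chillu_ll)])
```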
Notes
Contribute!
At Meilisearch, we don't speak or understand all the languages in the world, so we could be wrong in our interpretation of how to support a new language in order to provide a relevant search experience.
However, if you are a native speaker, don't hesitate to contribute to enhancing this experience:
Thanks for your help!