Malayalam Language support #487

ManyTheFish · 2022-06-14T08:51:29Z

ManyTheFish
Jun 14, 2022
Collaborator

Malayalam support

officially supported

Malayalam is a Dravidian language spoken by 34 million people in India. It has several specificities that affect search:

Words are separated by spaces, but the language is agglutinative

The current behavior of unicode-segmenter, the default segmenter for languages that are not officially supported, is to split words on spaces, which only partially works.
Because Malayalam morphology is agglutinative, we have to split words into meaningful sub-words to increase the probability of finding a relevant document with Meilisearch.

I didn't manage to find a Rust library that segments Malayalam. However, I found a repository containing Malayalam dictionaries that could help us segment agglutinated words.
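As a rough illustration of how a dictionary could drive this segmentation, here is a minimal sketch of greedy longest-match splitting. The `segment` function and the toy dictionary are hypothetical and are not part of charabia. The sketch also assumes that sub-words can be cut out verbatim, which ignores the sound changes that can occur at morpheme boundaries in Malayalam.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation of one space-delimited word.
/// When no dictionary entry matches at the current position,
/// a single character is emitted so the loop always makes progress.
fn segment<'a>(word: &'a str, dictionary: &HashSet<&str>) -> Vec<&'a str> {
    // Byte offsets of every char boundary, including the end of the string.
    let bounds: Vec<usize> = word
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(word.len()))
        .collect();

    let mut pieces = Vec::new();
    let mut start = 0; // index into `bounds`
    while start + 1 < bounds.len() {
        // Try the longest candidate first, then shrink it one char at a time.
        let mut end = bounds.len() - 1;
        while end > start + 1 && !dictionary.contains(&word[bounds[start]..bounds[end]]) {
            end -= 1;
        }
        pieces.push(&word[bounds[start]..bounds[end]]);
        start = end;
    }
    pieces
}
```

With a toy dictionary containing `search` and `engine`, `segment("searchengine", &dictionary)` yields `["search", "engine"]`. A real implementation would load the Malayalam dictionaries instead and would probably need a smarter strategy than greedy matching.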

Vowels are combined with consonants

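A vowel that follows a consonant is not written as a standalone vowel letter: it is combined with the consonant and encoded as a different Unicode character. For instance, ക + ആ gives കാ, which is encoded as ക (U+0D15) followed by the vowel sign ാ (U+0D3E) rather than the standalone vowel ആ (U+0D06). This could prevent prefix search and typo tolerance from finding relevant documents.
We should uncombine these characters during a normalization process.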
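One possible reading of "uncombining" is to replace each dependent vowel sign with the standalone vowel it stands for. The sketch below does that with a plain lookup. The function names are hypothetical, and whether this mapping is the right normalization for search is an assumption that a native speaker should confirm.

```rust
/// Maps a Malayalam dependent vowel sign to the matching standalone vowel.
/// Any other character is returned unchanged.
fn uncombine_vowel(c: char) -> char {
    match c {
        '\u{0D3E}' => '\u{0D06}', // vowel sign AA -> letter AA
        '\u{0D3F}' => '\u{0D07}', // vowel sign I  -> letter I
        '\u{0D40}' => '\u{0D08}', // vowel sign II -> letter II
        '\u{0D41}' => '\u{0D09}', // vowel sign U  -> letter U
        '\u{0D42}' => '\u{0D0A}', // vowel sign UU -> letter UU
        '\u{0D43}' => '\u{0D0B}', // vowel sign VOCALIC R -> letter VOCALIC R
        '\u{0D46}' => '\u{0D0E}', // vowel sign E  -> letter E
        '\u{0D47}' => '\u{0D0F}', // vowel sign EE -> letter EE
        '\u{0D48}' => '\u{0D10}', // vowel sign AI -> letter AI
        '\u{0D4A}' => '\u{0D12}', // vowel sign O  -> letter O
        '\u{0D4B}' => '\u{0D13}', // vowel sign OO -> letter OO
        '\u{0D4C}' => '\u{0D14}', // vowel sign AU -> letter AU
        other => other,
    }
}

/// Applies `uncombine_vowel` to every character of `text`.
fn uncombine(text: &str) -> String {
    text.chars().map(uncombine_vowel).collect()
}
```

With this mapping, കാ (U+0D15 U+0D3E) becomes ക followed by ആ (U+0D15 U+0D06), so the vowel is encoded the same way whether or not it follows a consonant.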

Malayalam has diacritics

Diacritics can, to some extent, be considered equivalent to accents in Latin-script languages.
We should therefore remove these Malayalam diacritical marks during a normalization process.
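A minimal sketch of such a removal step could look like the following. The discussion does not list which marks should be stripped, so the set used here (candrabindu, anusvara, visarga, virama) is purely an assumption for illustration, and the function names are hypothetical.

```rust
/// Returns true for the Malayalam marks this sketch treats as removable.
/// ASSUMPTION: this set is illustrative only; which marks can safely be
/// dropped without hurting relevancy needs input from native speakers.
fn is_removable_mark(c: char) -> bool {
    matches!(
        c,
        '\u{0D01}' // sign CANDRABINDU
        | '\u{0D02}' // sign ANUSVARA
        | '\u{0D03}' // sign VISARGA
        | '\u{0D4D}' // sign VIRAMA
    )
}

/// Drops every removable mark and keeps all other characters.
fn remove_diacritics(text: &str) -> String {
    text.chars().filter(|c| !is_removable_mark(*c)).collect()
}
```

For example, ള് (ള + virama) would be reduced to the bare consonant ള.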

Consonants with a virama diacritic can alternatively be written as chillu letters

Chillu letters are an alternative form of some Malayalam consonants carrying a virama diacritic. For example, ൾ is the alternative form of ള് (ള + virama).
We should convert chillu letters into common consonants during a normalization process.
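The conversion could be sketched as a lookup from each chillu letter to its base consonant. In this sketch the chillu is expanded to consonant + virama so that both spellings compare equal. If the virama is stripped along with the other diacritics, both end up as the bare consonant. The function names are hypothetical, and the base chosen for chillu RR is an assumption.

```rust
/// Returns the base consonant of a Malayalam chillu letter, if `c` is one.
fn chillu_base(c: char) -> Option<char> {
    match c {
        '\u{0D7A}' => Some('\u{0D23}'), // chillu NN -> NNA
        '\u{0D7B}' => Some('\u{0D28}'), // chillu N  -> NA
        '\u{0D7C}' => Some('\u{0D30}'), // chillu RR -> RA (assumed base)
        '\u{0D7D}' => Some('\u{0D32}'), // chillu L  -> LA
        '\u{0D7E}' => Some('\u{0D33}'), // chillu LL -> LLA
        '\u{0D7F}' => Some('\u{0D15}'), // chillu K  -> KA
        _ => None,
    }
}

/// Rewrites every chillu letter as its base consonant followed by a virama.
fn expand_chillu(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match chillu_base(c) {
            Some(base) => {
                out.push(base);
                out.push('\u{0D4D}'); // sign VIRAMA
            }
            None => out.push(c),
        }
    }
    out
}
```

With this mapping, ൾ (U+0D7E) is rewritten as ള് (U+0D33 U+0D4D), so a query typed with one spelling can match a document written with the other.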

Notes

Malayalam has ligatures, but they shouldn't impact Meilisearch behavior.

Contribute!

At Meilisearch, we don't speak or understand every language in the world, so we could be wrong in our interpretation of how to support a new language in order to provide a relevant search experience.
However, if you are a native speaker, don't hesitate to help enhance this experience:

⬆️ by upvoting this discussion to help us prioritize language support
💬 by pointing out errors and oversights in this discussion
🧑‍💻 by opening a pull request on charabia, the tokenizer used by Meilisearch

Thanks for your help!
