combine ngram and full-token fields #477

missinglink · 2021-03-10T00:34:34Z

I think this has been discussed in the past but there was no issue open for it.

background

For many fields we have two 'subfields', one which contains the complete token (for search and autocomplete) and one which contains the prefix ngrams (for the final autocomplete token).

In the case of the name field this is implemented as separate fields (eg. name.en and phrase.en); for other fields it's implemented as a 'subfield' (eg. parent.continent and parent.continent.ngram).

At some point I'd love to fix that and make it more consistent, but that's a different issue ;)

proposal

On reflection we can provide both prefix and exact token matching using a single field.

The trick is simply to append an end of text character to each token when indexing, and again at query time when searching for exact matches.
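To make the idea concrete, here is a minimal sketch in plain Python rather than in Elasticsearch. It assumes the marker is the ETX control character (U+0003) and uses hypothetical helper names (`prefix_ngrams`, `build_index`, `prefix_match`, `exact_match`); none of these come from the Pelias codebase.

```python
# Assumption: U+0003 (ETX) stands in for the 'end of text' character.
EOT = "\u0003"

def prefix_ngrams(token, min_gram=1):
    """Return every prefix of token + EOT, so the last gram is the full marked token."""
    marked = token + EOT
    return [marked[:n] for n in range(min_gram, len(marked) + 1)]

def build_index(docs):
    """Toy inverted index: gram -> set of doc ids. A single 'field' holds everything."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for gram in prefix_ngrams(token):
                index.setdefault(gram, set()).add(doc_id)
    return index

def prefix_match(index, term):
    """Autocomplete-style lookup: the bare term matches any token starting with it."""
    return index.get(term.lower(), set())

def exact_match(index, term):
    """Exact lookup: appending EOT means only complete tokens can match."""
    return index.get(term.lower() + EOT, set())

docs = {1: "India", 2: "Indiana", 3: "Indianapolis"}
index = build_index(docs)
print(prefix_match(index, "india"))  # all three documents
print(exact_match(index, "india"))   # only document 1
```

The same set of grams serves both lookups, which is why a separate full-token field would no longer be strictly required.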

the 'work'

There would need to be some changes to the code to support this, much of which could be hidden from the application by using a query-time analyzer which handles adding the 'end of text' character when required.

Some consideration may also need to be given to synonyms, to ensure that they continue to operate as expected.
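As a rough illustration of what such a query-time analyzer might do, the sketch below marks every token as complete for search, and every token except the last for autocomplete. The function name `analyze_query`, the `autocomplete` flag and the choice of U+0003 as the marker are all assumptions for illustration; synonyms are not handled here.

```python
# Assumption: U+0003 (ETX) stands in for the 'end of text' character.
EOT = "\u0003"

def analyze_query(text, autocomplete=False):
    """Hypothetical query-time analysis: add EOT to tokens that must match exactly."""
    tokens = text.lower().split()
    if not tokens:
        return []
    if autocomplete:
        # The final token may still be mid-word, so it is left as a bare prefix.
        return [t + EOT for t in tokens[:-1]] + [tokens[-1]]
    # For search, every token is treated as complete.
    return [t + EOT for t in tokens]

print(analyze_query("New Yor", autocomplete=True))
print(analyze_query("New York"))
```

With this in place the calling code would only choose between the two modes and would not need to know about the marker itself.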

summary

The pros would be that we could simplify the field mapping by removing the per-field duplication currently required to support ngrams. This would in turn clean up the query logic, which would no longer need to be aware of the different names of the ngram fields.

The con is that we would introduce a new convention, and the code would need to be adapted to accommodate it.


orangejulius · 2021-03-15T15:40:29Z

Yeah, this is pretty elegant and would definitely bring a lot of simplicity. I imagine that the size of the indices would go down quite a bit, which would help performance and maintenance as well.

With our current Pelias code, I can't immediately think of anything that would be negatively affected by this. Like you said, we'd have to do a little work to manage the end of text characters, either in an analyzer or in other code, but it shouldn't be too bad.

I can think of one place where this might not work though: spelling correction.

Consider the following query against the search endpoint: /v1/search?text=india.

Since it's the search endpoint, and not autocomplete, we'd want to return only "exact" (not including spelling correction) matches. However, india^ (using a caret in place of the end of text character) is within edit distance 1 of india. The token india would be associated with several partial matches for Indiana, Indianapolis, etc.

So it might still be useful to have a field that is known to have only complete tokens for cases like this.
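The edit-distance concern above can be checked with a small, self-contained Levenshtein function. This is only a sketch of the arithmetic, using a caret as a visible stand-in for the end of text character; it does not reproduce how Elasticsearch evaluates fuzzy queries.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

# 'india^' is the exact-match term; 'india' and 'indian' are prefix grams
# that would also be indexed for Indiana and Indianapolis.
print(levenshtein("india^", "india"))     # 1
print(levenshtein("india^", "indian"))    # 1
print(levenshtein("india^", "indiana^"))  # 2
```

So with a fuzziness of 1, the exact-match term would also reach grams that belong to longer tokens, which is the leak described above.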

missinglink · 2021-03-16T02:09:10Z

Yeah, good point about the spelling correction. I wonder if using a non-printing char would mitigate that. Either way, I'm somewhat reluctant to introduce too much 'magic' which would be difficult for other devs to figure out at a later date.

I'm assuming (and may be wrong here) that using a 'sub field' is far more compact than using a completely separate field. If that's the case, then the additional storage requirement of having an ngrams subfield may be less duplicative than we might imagine. That would make this change less valuable, and possibly 'not worth it' when weighed against the mental cost of introducing the 'magic'.

missinglink added the discuss label Mar 10, 2021
