Allow searching on size of fields #240

majsan · 2024-01-22T08:42:35Z

Allow searching on size of fields, for example, all entries that have more than one word class.

Needs new construction in query language.

nick8325 · 2024-01-22T08:47:56Z

I implemented this! But it won't work until we use Elasticsearch 7+. See: 37a03da

You can write fieldname.length in the query (I just picked an arbitrary syntax). For text fields, though, you sometimes have to write fieldname.raw.length, and I don't understand why.

nick8325 · 2024-01-22T09:10:50Z

Well, I kinda understand but not really... when you have a string field then we create two fields in Elasticsearch, fieldname of type text and fieldname.raw of type keyword:

https://github.com/spraakbanken/karp-backend/blob/main/karp-backend/src/karp/search_infrastructure/repositories/es6_indicies.py?view=plain#L165

Then it seems like the query that we run to get the length only works on keyword fields and not on text fields. But I'm not sure why that is, or what this raw field is supposed to be used for.
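A minimal sketch of what such a length query could look like, assuming it is a Painless script query that reads the field's doc values. This is an assumption for illustration; the actual query in 37a03da may differ, and `length_query` is a hypothetical helper. Doc values are enabled by default on keyword fields but are not available on text fields, which would be consistent with only fieldname.raw working.

```python
def length_query(field: str, length: int) -> dict:
    """Build a query body matching documents where `field` holds `length` values.

    Hypothetical helper: assumes the count is read from doc values in a
    Painless script, so `field` must be a field that has doc values
    (for example a keyword field such as "ortografi.raw").
    """
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            # doc[...] exposes the doc values of the field
                            "source": "doc[params.field].size() == params.length",
                            "params": {"field": field, "length": length},
                        }
                    }
                }
            }
        }
    }


body = length_query("ortografi.raw", 2)
```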

majsan · 2024-01-23T07:11:43Z

When adding an entry into Elasticsearch, fieldname is the name of the field and contains the data. Then we have the analysis step, and for text, we usually have one analysis that lower-cases and tokenizes. Then to allow exact searches on field content, fieldname is indexed once again under the name fieldname.raw, but here no analysis is done (type keyword).
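As an illustration, a mapping of this shape could be written as below. This is a sketch, not a copy of what es6_indicies.py generates; `string_field_mapping` is a hypothetical helper and any analyzer or normalizer settings are left out.

```python
def string_field_mapping() -> dict:
    """Mapping for one string field: analyzed text plus an unanalyzed .raw subfield."""
    return {
        "type": "text",  # tokenized and lower-cased by the analyzer
        "fields": {
            "raw": {"type": "keyword"},  # whole value indexed as a single term
        },
    }


# "ortografi" becomes searchable as ortografi (text) and ortografi.raw (keyword)
mapping = {"properties": {"ortografi": string_field_mapping()}}
```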

I think that fieldname.length works on the stored content of the field (which is the same as we added), while fieldname.raw is a field that only exists in the index.

detfunkarinte · 2024-01-23T09:50:55Z

equals|ortografi|"på" will find på, begripa sig på, tutta på, krya på sig ...
equals|ortografi.raw|"på" will find only på

Both of them seem to be case-insensitive though 🤔

majsan · 2024-03-18T09:06:04Z

@nick8325 after the update to ES8, this works. But maybe the backend can figure out whether .raw is needed, so the user can just write fieldname.
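A rough sketch of how the backend might make that choice, assuming it has a lookup from field name to field type available. `resolve_length_field` and `field_types` are hypothetical names, not existing karp-backend code.

```python
from __future__ import annotations


def resolve_length_field(field: str, field_types: dict[str, str]) -> str:
    """Return the Elasticsearch field to count values on.

    String fields are indexed as text, so the keyword subfield .raw is
    used for them; other field types are used as-is.
    """
    if field_types.get(field) == "string":
        return f"{field}.raw"
    return field


# hypothetical type lookup for two fields
field_types = {"ortografi": "string", "population": "integer"}
raw_field = resolve_length_field("ortografi", field_types)
plain_field = resolve_length_field("population", field_types)
```

With this, a user query on ortografi.length would be run against ortografi.raw, while population.length would be left unchanged.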

majsan · 2024-03-19T12:47:31Z

Another thing: the current solution only works on leaves in the document, not on nodes. For example, in our test corpus places, municipality.length works but _municipality.length does not. Elasticsearch does not recognize _municipality as an actual field; however, equals|_municipality.code.length|2 works (no hits).

Also, if we have an entry:

{
    "betydelse": [
        { "id": "betydelse1", "definition": ["Definition 1"] },
        { "id": "betydelse2", "definition": ["Definition 2"] },
        { "id": "betydelse3", "definition": ["Definition 3", "Definition 4"] }
    ]
}

betydelse.definition.length in Elasticsearch (not our query lang) will return 4.
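To see where the 4 comes from, here is a small sketch that mimics how arrays of objects end up as one list of values per dotted path. The `flatten` function is only an illustration of that behaviour, not how Elasticsearch is implemented.

```python
def flatten(doc: dict, prefix: str = "") -> dict:
    """Collect all leaf values per dotted path, ignoring which object they came from."""
    values: dict = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # merge the values of nested objects into the same paths
                for sub_path, sub_values in flatten(item, path + ".").items():
                    values.setdefault(sub_path, []).extend(sub_values)
            else:
                values.setdefault(path, []).append(item)
    return values


entry = {
    "betydelse": [
        {"id": "betydelse1", "definition": ["Definition 1"]},
        {"id": "betydelse2", "definition": ["Definition 2"]},
        {"id": "betydelse3", "definition": ["Definition 3", "Definition 4"]},
    ]
}
flat = flatten(entry)
definition_count = len(flat["betydelse.definition"])  # 4: 1 + 1 + 2 across all objects
```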

nick8325 · 2024-03-25T10:34:24Z

I suppose we should automatically rewrite _municipality.length into e.g. _municipality.code.length (I guess looking for a field which is required and not a collection).
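A sketch of that rewrite, assuming the resource config marks subfields with "required" and "collection" flags. The config shape, the example subfields and the name `pick_length_subfield` are assumptions made for illustration.

```python
from __future__ import annotations


def pick_length_subfield(field: str, children: dict[str, dict]) -> str:
    """Pick a subfield that occurs exactly once per object, so counting it counts objects."""
    for name, config in children.items():
        if config.get("required") and not config.get("collection"):
            return f"{field}.{name}"
    raise ValueError(f"no required, non-collection subfield found under {field}")


# hypothetical config for the subfields of _municipality
children = {
    "name": {"type": "string", "collection": True},
    "code": {"type": "integer", "required": True},
}
rewritten = pick_length_subfield("_municipality", children) + ".length"
```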

The thing with betydelse.definition.length, that's because right now we need to write betydelse.definition.raw.length, right?

Another thing: Right now there's no indexing for length queries. There's documentation how to do it here. But then, I guess we don't want to add indexes for all possible length fields, only the ones that actually end up being used in queries?

nick8325 · 2024-03-25T10:41:50Z

Oh, another thing: it seems like length queries return the number of values not counting duplicates. I just tried this out with a version of testsalex where lamaull has two SAOLLemman entries:

...,
"SAOLLemman": [
    {"id": "50678", "lemmatyp": "lemma", ...},
    {"id": "1234", "lemmatyp": "lemma", ...}], ...

And now a query with equals|SAOLLemman.id.raw.length|2 finds lamaull, but a query with equals|SAOLLemman.lemmatyp.raw.length|2 doesn't find it (because, ignoring duplicates, there is only one value for that field, namely "lemma"). That's kind of bad, because if the user queries SAOLLemman.length, we need to know to translate that into SAOLLemman.id.length and not SAOLLemman.lemmatyp.length.
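The effect can be reproduced outside Elasticsearch with a small sketch that counts distinct values, assuming the length is computed over a per-document set of values (which is how it appears to behave here). Only the two fields shown in the example above are included; `distinct_count` is a hypothetical helper.

```python
saol_lemman = [
    {"id": "50678", "lemmatyp": "lemma"},
    {"id": "1234", "lemmatyp": "lemma"},
]


def distinct_count(objects: list, key: str) -> int:
    """Count the distinct values of `key` across all objects."""
    return len({obj[key] for obj in objects if key in obj})


id_count = distinct_count(saol_lemman, "id")              # 2: both ids differ
lemmatyp_count = distinct_count(saol_lemman, "lemmatyp")  # 1: "lemma" counted once
```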

majsan · 2024-03-25T14:58:32Z

> The thing with betydelse.definition.length, that's because right now we need to write betydelse.definition.raw.length, right?

I think that this is because everything is flattened in the index, so betydelse.definition has four values in this document. This is related to the fact that it does not count duplicates.

Maybe indexing the length in some way would be preferable, maybe even adding it to the document ourselves, for example, SAOLLemman__length, and maybe ES has some way to delete this from the _source so we never see it after adding the document.
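A sketch of that idea, assuming the counter is added to the entry just before indexing and kept out of the returned document with the mapping's `_source` excludes setting. `add_length_fields` is a hypothetical helper, and the wildcard pattern should be checked against the Elasticsearch version in use.

```python
from __future__ import annotations

import copy


def add_length_fields(entry: dict, collection_fields: list[str]) -> dict:
    """Return a copy of `entry` with a <field>__length counter per listed field."""
    result = copy.deepcopy(entry)
    for field in collection_fields:
        value = entry.get(field, [])
        # a missing field counts as 0, a single non-list value as 1
        result[f"{field}__length"] = len(value) if isinstance(value, list) else 1
    return result


# mapping fragment: index the counter, but leave it out of the stored _source
mapping = {
    "_source": {"excludes": ["*__length"]},
    "properties": {"SAOLLemman__length": {"type": "integer"}},
}

entry = {"SAOLLemman": [{"id": "50678"}, {"id": "1234"}]}
indexed = add_length_fields(entry, ["SAOLLemman"])
```

Since the counter is a plain integer field, duplicates and flattening no longer affect the result, and a length query becomes an ordinary term or range query.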

majsan mentioned this issue Mar 26, 2024

Enable search on cardinality of fields #175

Closed
