Allow searching on size of fields #240

Open
majsan opened this issue Jan 22, 2024 · 9 comments

Comments

@majsan
Member

majsan commented Jan 22, 2024

Allow searching on size of fields, for example, all entries that have more than one word class.

Needs new construction in query language.

@nick8325
Contributor

nick8325 commented Jan 22, 2024

I implemented this! But it won't work until we use Elasticsearch 7+. See: 37a03da

You can write fieldname.length in the query (I just picked an arbitrary syntax). For text fields, though, you sometimes have to write fieldname.raw.length, and I don't understand why.
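For illustration, here is a minimal sketch of what a length search could translate to, assuming a query like equals|fieldname.length|N becomes an Elasticsearch script query. The helper name `length_query` is hypothetical, and the actual translation in 37a03da may look different.

```python
def length_query(field: str, expected: int) -> dict:
    """Build a query matching documents where `field` holds `expected` values.

    Hypothetical helper, not taken from karp-backend.
    """
    return {
        "bool": {
            "filter": {
                "script": {
                    "script": {
                        "lang": "painless",
                        # doc[...] reads doc values, so the field needs them
                        "source": "doc[params.field].size() == params.expected",
                        "params": {"field": field, "expected": expected},
                    }
                }
            }
        }
    }


query = length_query("ortografi.raw", 2)
```

Passing the field name and the expected count as `params` keeps the script source constant, which lets Elasticsearch reuse the compiled script across queries.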

@nick8325
Contributor

Well, I kinda understand but not really... when you have a string field then we create two fields in Elasticsearch, fieldname of type text and fieldname.raw of type keyword:

https://github.com/spraakbanken/karp-backend/blob/main/karp-backend/src/karp/search_infrastructure/repositories/es6_indicies.py?view=plain#L165

Then it seems like the query that we run to get the length only works on keyword fields and not on text fields. But I'm not sure why that is, or what this raw field is supposed to be used for.

@majsan
Member Author

majsan commented Jan 23, 2024

When adding an entry into Elasticsearch, fieldname is the name of the field and contains the data. Then we have the analysis step, and for text, we usually have one analysis that lower-cases and tokenizes. Then to allow exact searches on field content, fieldname is indexed once again under the name fieldname.raw, but here no analysis is done (type keyword).

I think that fieldname.length works on the stored content of the field (which is the same as we added), while fieldname.raw is a field that only exists in the index.
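A minimal sketch of the kind of multi-field mapping described here, written as a Python dict. The helper name `string_field_mapping` is hypothetical and the real mapping is built in es6_indicies.py. One likely reason the length script only works on fieldname.raw is that scripts read doc values, which keyword fields have by default and text fields do not.

```python
def string_field_mapping() -> dict:
    """Mapping for one string field: analyzed text plus an exact-match sub-field.

    Hypothetical helper; the sub-field name `raw` follows the discussion above.
    """
    return {
        "type": "text",  # analyzed (tokenized, lower-cased) for full-text search
        "fields": {
            "raw": {"type": "keyword"},  # not analyzed, used for exact matches
        },
    }


mapping = {"properties": {"ortografi": string_field_mapping()}}
```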

@detfunkarinte
Member

equals|ortografi|"på" will find på, begripa sig på, tutta på, krya på sig ...
equals|ortografi.raw|"på" will find only på

Both of them seem to be case insensitive though 🤔

@majsan
Member Author

majsan commented Mar 18, 2024

@nick8325 after the update to ES8, this works. But maybe the backend can figure out whether to use .raw or not, and just let the user write fieldname.
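A minimal sketch of how the backend could pick the right field, assuming it has a lookup from field name to Elasticsearch type. Both `resolve_length_field` and the `field_types` dict are hypothetical names used for illustration.

```python
def resolve_length_field(field: str, field_types: dict) -> str:
    """Return the Elasticsearch field a length query should run against.

    `field_types` maps field names to Elasticsearch types (hypothetical input).
    Text fields are redirected to their keyword sub-field.
    """
    if field_types.get(field) == "text":
        return f"{field}.raw"
    return field


field_types = {"ortografi": "text", "population": "integer"}
resolved = resolve_length_field("ortografi", field_types)
```

With this in place the user would write ortografi.length and never need to know about the .raw sub-field.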

@majsan
Member Author

majsan commented Mar 19, 2024

Another thing. The current solution only works on leaves in the document, not nodes. For example, in our test corpus places it is possible to do municipality.length, but not _municipality.length. Elasticsearch does not recognize _municipality as an actual field; however, equals|_municipality.code.length|2 works (no hits).

Also, if we have an entry:

{
    "betydelse": [
        { "id": "betydelse1", "definition": ["Definition 1"] },
        { "id": "betydelse2", "definition": ["Definition 2"] },
        { "id": "betydelse3", "definition": ["Definition 3", "Definition 4"] }
    ]
}

betydelse.definition.length in Elasticsearch (not our query lang) will return 4.
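The count of 4 can be reproduced outside Elasticsearch with a small sketch that assumes arrays of objects are flattened into one list of values per dotted path. The `flatten` function is illustrative only and is not how Elasticsearch is implemented.

```python
def flatten(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs, discarding array positions."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from flatten(child, prefix)
    else:
        yield prefix, value


entry = {
    "betydelse": [
        {"id": "betydelse1", "definition": ["Definition 1"]},
        {"id": "betydelse2", "definition": ["Definition 2"]},
        {"id": "betydelse3", "definition": ["Definition 3", "Definition 4"]},
    ]
}

definitions = [v for path, v in flatten(entry) if path == "betydelse.definition"]
print(len(definitions))  # 4
```

The information about which betydelse object each definition belongs to is lost, so a per-object count is not recoverable from the flattened values.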

@nick8325
Contributor

I suppose we should automatically rewrite _municipality.length into e.g. _municipality.code.length (I guess looking for a field which is required and not a collection).
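A minimal sketch of that rewrite, assuming the backend knows for each child field whether it is required and whether it is a collection. The `representative_leaf` function and the shape of the `children` dict are hypothetical.

```python
from typing import Optional


def representative_leaf(node: str, children: dict) -> Optional[str]:
    """Pick a child of `node` that is required and not a collection.

    `children` maps child names to {"required": bool, "collection": bool}
    (a hypothetical schema format). Returns None if no child qualifies.
    """
    for name, spec in children.items():
        if spec.get("required") and not spec.get("collection"):
            return f"{node}.{name}"
    return None


children = {
    "name": {"required": False, "collection": False},
    "code": {"required": True, "collection": False},
}
leaf = representative_leaf("_municipality", children)
print(leaf)  # _municipality.code
```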

The thing with betydelse.definition.length, that's because right now we need to write betydelse.definition.raw.length, right?

Another thing: Right now there's no indexing for length queries. There's documentation on how to do it here. But then, I guess we don't want to add indexes for all possible length fields, only the ones that actually end up being used in queries?

@nick8325
Contributor

Oh, another thing: it seems like length queries return the number of values not counting duplicates. I just tried this out with a version of testsalex where lamaull has two SAOLLemman entries:

...,
"SAOLLemman": [
    {"id": "50678", "lemmatyp": "lemma", ...},
    {"id": "1234", "lemmatyp": "lemma", ...}], ...

And now a query with equals|SAOLLemman.id.raw.length|2 finds lamaull, but a query with equals|SAOLLemman.lemmatyp.raw.length|2 doesn't find it (because there is only one value for that ignoring duplicates, namely "lemma"). That's kind of bad, because if the user queries SAOLLemman.length, we need to know to translate that into SAOLLemman.id.length and not SAOLLemman.lemmatyp.length.
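The behaviour can be imitated with a small sketch that assumes the values seen by the length script are a set of distinct values per field, which is likely how keyword doc values behave. The elided fields from the example are left out here, and `distinct_count` is a hypothetical helper.

```python
saol_lemman = [
    {"id": "50678", "lemmatyp": "lemma"},
    {"id": "1234", "lemmatyp": "lemma"},
]


def distinct_count(objects: list, key: str) -> int:
    """Count distinct values of `key` across a list of objects."""
    return len({obj[key] for obj in objects})


print(distinct_count(saol_lemman, "id"))        # 2
print(distinct_count(saol_lemman, "lemmatyp"))  # 1
```

Only a child field whose values are unique within the entry, such as id, gives a count equal to the number of objects.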

@majsan
Member Author

majsan commented Mar 25, 2024

> The thing with betydelse.definition.length, that's because right now we need to write betydelse.definition.raw.length, right?

I think that this is because everything is flattened in the index. So betydelse.definition has four values in this document. This is related to the fact that it does not count duplicates.

Maybe indexing the length in some way would be preferable, perhaps even adding it to the document ourselves, for example as SAOLLemman__length. Maybe ES has some way to exclude this from the _source so we never see it after adding the document.
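A minimal sketch of this idea, assuming the backend adds the length fields just before indexing. The helper `add_length_fields` is hypothetical. The mapping uses the `_source` `excludes` setting, which keeps a field searchable while leaving it out of the stored `_source`. Whether that fits with how karp-backend updates and reindexes documents is an open question.

```python
def add_length_fields(doc: dict, fields: list) -> dict:
    """Return a copy of `doc` with `<field>__length` added for each named field.

    Lists get their real length, a missing field counts as 0,
    and any other value counts as 1.
    """
    indexed = dict(doc)
    for field in fields:
        value = doc.get(field, [])
        indexed[f"{field}__length"] = len(value) if isinstance(value, list) else 1
    return indexed


# Mapping fragment: index the length as an integer, keep it out of _source.
mapping = {
    "_source": {"excludes": ["*__length"]},
    "properties": {"SAOLLemman__length": {"type": "integer"}},
}

entry = {"SAOLLemman": [{"id": "50678"}, {"id": "1234"}]}
indexed = add_length_fields(entry, ["SAOLLemman"])
print(indexed["SAOLLemman__length"])  # 2
```

Because the length is computed from the original document, it counts objects rather than distinct flattened values, so duplicates no longer affect the result.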


3 participants