-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use buffer for VarBinView views #1121
Conversation
Also implements an enhanced DictArray canonicalize that canonicalizes the values then just does a gather over them.
I'm getting failures currently because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it's (generally?) faster to canonicalize and then take...
But shouldn't the implementations of take have been doing this anyway if it was better? Why is it the caller's responsibility to optimize?
Should this logic live inside the take entry-point fn?
encodings/dict/src/array.rs
Outdated
} | ||
} | ||
|
||
/// Canonicalize a set of codes and values. | ||
fn canonicalize_string(array: DictArray) -> VortexResult<Canonical> { | ||
let values = array.values().into_varbinview()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess into_varbinview()
runs a canonicalize internally? The name of this function doesn't scream quite how expensive it is I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We made #507. Per https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv this should likely be to_varbinview()
but also since this is owned -> owned conversion the docs suggest into_
, however, to_
in Rust is meant to indicate expensive conversions
I think nullable compare was a shortcut, we should respect existing nullability |
@gatesn are you referring to the indices getting canonicalized so we can iterate them? For VarBinView we want take to just take on the views, but b/c Vortex doesn't support |
There's a fair bit of accumulated context on this in #1041 (I just updated with latest). The BLUF is that forcing the canonicalization was definitely the right move for the case of DictArray-for-strings where the values were then FSST encoded. But German strings + this PR fixes that the "right way". My guess is that we can get rid of the primitive values canonicalization, might just need to optimize BitPackedArray take to fix any perf regression (which is the right way to do it) |
|
||
// convert nullable to non-nullable, only safe if there are no nulls present. | ||
(Nullability::Nullable, Nullability::NonNullable) => { | ||
if self.validity() != Validity::AllValid { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
annoyingly, you probably want this check to be on the LogicalValidity
(which would allow the cast to happen even if Validity::Array where all the array values are true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, you're referring to this: https://github.com/spiraldb/vortex/blob/015b06715dcda474be672ec82829274bad9c93e1/vortex-array/src/validity.rs#L165-L164
Going to redo this |
Fixes #1111
Also implements an enhanced
DictArray
canonicalize that canonicalizes the values then gathers them using the codes, accounting for nullability.