feat: use buffer for VarBinView views #1121

Closed
wants to merge 6 commits into from

a10y
Contributor

@a10y a10y commented Oct 23, 2024

Fixes #1111

Also implements an enhanced DictArray canonicalize that canonicalizes the values then gathers them using the codes, accounting for nullability.
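
As a rough illustration of "canonicalize the values, then gather them using the codes", here is a minimal sketch over plain Rust vectors. It is not the Vortex implementation: `gather_dict` is a hypothetical helper, and modeling nulls as `Option` is an assumption made only for this example.

```rust
/// Decode a dictionary-encoded column by gathering `values` with `codes`.
/// Hypothetical helper for illustration; nulls are modeled as `None`.
/// Panics if a code is out of bounds for `values`.
fn gather_dict<T: Clone>(values: &[Option<T>], codes: &[Option<usize>]) -> Vec<Option<T>> {
    codes
        .iter()
        .map(|code| match code {
            // A valid code selects a dictionary entry, which may itself be null.
            Some(idx) => values[*idx].clone(),
            // A null code always produces a null output row.
            None => None,
        })
        .collect()
}
```

An output row is null when either its code is null or the dictionary entry it points to is null, which is the sense in which the gather has to account for nullability.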

a10y added 5 commits October 22, 2024 21:29
Also implements an enhanced DictArray canonicalize that
canonicalizes the values then just does a gather over them.
@a10y
Contributor Author

a10y commented Oct 23, 2024

I'm currently getting failures because compare seems to always return a nullable array. Is that necessary? It feels like the nullability of the compare result should depend on the nullability of the operands.
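
One way to express the rule suggested here, as a sketch only: the result is nullable exactly when at least one operand is nullable. The `Nullability` enum below is a local stand-in defined for this example, and `compare_result_nullability` is a hypothetical function name, not an existing Vortex API.

```rust
/// Local stand-in for a nullability flag (defined here for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Nullability {
    NonNullable,
    Nullable,
}

/// Hypothetical rule: a comparison can only produce nulls if an input can.
fn compare_result_nullability(lhs: Nullability, rhs: Nullability) -> Nullability {
    if lhs == Nullability::Nullable || rhs == Nullability::Nullable {
        Nullability::Nullable
    } else {
        Nullability::NonNullable
    }
}
```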

Contributor

@gatesn gatesn left a comment

It seems like it's (generally?) faster to canonicalize and then take...

But shouldn't the implementations of take have been doing this anyway if it was better? Why is it the caller's responsibility to optimize?

Should this logic live inside the take entry-point fn?

}
}

/// Canonicalize a set of codes and values.
fn canonicalize_string(array: DictArray) -> VortexResult<Canonical> {
    let values = array.values().into_varbinview()?;
Contributor

I guess into_varbinview() runs a canonicalize internally? I don't think the name of this function quite conveys how expensive it is.

Member

We made #507. Per https://rust-lang.github.io/api-guidelines/naming.html#ad-hoc-conversions-follow-as_-to_-into_-conventions-c-conv this should likely be to_varbinview(). Since this is an owned -> owned conversion, the docs suggest into_; however, to_ in Rust is meant to indicate expensive conversions.
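
For reference, the three prefixes from that guideline can be sketched on a toy type. `Encoded` and its methods are hypothetical and exist only to show the convention; note that the guideline lists the cost of `into_` as variable, which is where the tension described above comes from.

```rust
/// Hypothetical wrapper used only to illustrate the naming conventions.
struct Encoded {
    bytes: Vec<u8>,
}

impl Encoded {
    /// `as_`: free, borrowed -> borrowed view of existing data.
    fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }

    /// `to_`: expensive; borrows `self` and does real work (here, UTF-8
    /// validation plus a fresh allocation).
    fn to_text(&self) -> String {
        String::from_utf8_lossy(&self.bytes).into_owned()
    }

    /// `into_`: owned -> owned, consumes `self`. Cheap in this toy case,
    /// but the prefix alone does not promise that.
    fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }
}
```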

@robert3005
Member

I think the nullable compare was a shortcut; we should respect the existing nullability.

@a10y
Contributor Author

a10y commented Oct 23, 2024

@gatesn are you referring to the indices getting canonicalized so we can iterate them?

For VarBinView we want take to operate directly on the views, but because Vortex doesn't support u128 as a primitive type, we could not build a PrimitiveArray of views to take from, so we were just delegating to Arrow. Now we implement the gather inline as part of take for VarBinView.
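
A minimal sketch of what taking directly over the views could look like, assuming the views live in a raw byte buffer of fixed 16-byte entries (the size of a u128). `take_views` and `VIEW_SIZE` are hypothetical names for this example and are not taken from the PR.

```rust
/// Width of one view in bytes (assumed to match a u128).
const VIEW_SIZE: usize = 16;

/// Gather fixed-width views out of a raw byte buffer by row index.
/// Hypothetical helper; panics if an index is out of bounds.
fn take_views(view_bytes: &[u8], indices: &[usize]) -> Vec<u8> {
    let mut out = Vec::with_capacity(indices.len() * VIEW_SIZE);
    for &idx in indices {
        // Each selected row copies exactly one 16-byte view.
        let start = idx * VIEW_SIZE;
        out.extend_from_slice(&view_bytes[start..start + VIEW_SIZE]);
    }
    out
}
```

Only the views are copied in this sketch; any buffers holding the string bytes themselves are left alone, so the cost scales with the number of indices rather than with string length.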

@lwwmanning
Member

> It seems like it's (generally?) faster to canonicalize and then take...
>
> But shouldn't the implementations of take have been doing this anyway if it was better? Why is it the caller's responsibility to optimize?
>
> Should this logic live inside the take entry-point fn?

There's a fair bit of accumulated context on this in #1041 (I just updated it with the latest). The BLUF is that forcing the canonicalization was definitely the right move for the case of DictArray-for-strings where the values were then FSST encoded. But German strings plus this PR fix that the "right way".

My guess is that we can get rid of the canonicalization of primitive values; we might just need to optimize BitPackedArray take to fix any perf regression (which is the right way to do it).


// convert nullable to non-nullable, only safe if there are no nulls present.
(Nullability::Nullable, Nullability::NonNullable) => {
    if self.validity() != Validity::AllValid {
Member

Annoyingly, you probably want this check to be on the LogicalValidity, which would allow the cast to happen even when the validity is a Validity::Array whose values are all true.
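
The difference between the two checks can be sketched with a simplified stand-in enum. This is not the Vortex `Validity` type: only the `AllValid` and `Array` variants are named in the thread, while the other variants, the `Vec<bool>` payload, and the `all_valid` helper are assumptions for illustration.

```rust
/// Simplified stand-in for a validity representation (illustrative only).
#[derive(Debug, Clone, PartialEq)]
enum Validity {
    NonNullable,
    AllValid,
    AllInvalid,
    /// Per-row validity bits; `true` means the row is valid.
    Array(Vec<bool>),
}

impl Validity {
    /// Logical check: does every row hold a valid value, regardless of
    /// how the validity happens to be represented?
    fn all_valid(&self) -> bool {
        match self {
            Validity::NonNullable | Validity::AllValid => true,
            Validity::AllInvalid => false,
            Validity::Array(bits) => bits.iter().all(|&b| b),
        }
    }
}
```

With this model, `Validity::Array(vec![true, true])` is not equal to `Validity::AllValid`, so a structural comparison would reject the cast, while `all_valid()` returns true and would allow it.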

@a10y
Contributor Author

a10y commented Oct 28, 2024

Going to redo this

@a10y a10y closed this Oct 28, 2024
Successfully merging this pull request may close these issues.

DictArray -> VarBinView optimize
4 participants