Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Avro metadata #1843

Draft
wants to merge 22 commits into
base: develop
Choose a base branch
from
Draft

[WIP] Avro metadata #1843

wants to merge 22 commits into from

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Jan 8, 2025

  • derive-macro to generate conversions into/out of Avro and schemas
  • #[derive(ToAvro, FromAvro) for all array metadata
  • TODO: figure out how to handle ScalarValue

@a10y a10y force-pushed the aduffy/proc-macro-avro branch from 54ff7ce to 884c4f1 Compare January 8, 2025 23:03
@a10y
Copy link
Contributor Author

a10y commented Jan 10, 2025

This seems to have reduced the overall file size for the large file by ~40MB, or around 10% overall. If we assume that the metadata previously was ~25% of the overall size of the file, then we cut the metadata by almost half by using Avro. However, the file is still a good ~40% larger than the parquet file, which is problematic!

Need to resurrent the old JSON dump tool I wrote awhile back to collect more detailed info.

@robert3005
Copy link
Member

As an experiment you could remove padding we add to ipc messages and compare the size. That padding is wrong and should be optional. Changes that @gatesn is working on will make it easy to toggle

@gatesn
Copy link
Contributor

gatesn commented Jan 10, 2025

Yeah, I'd like to punt this until at least after we merge the layouts stuff and can then play around with padding, alignment, and compression

@robert3005
Copy link
Member

Regardless of your changes we could know how much space is wasted on padding by doing what I suggested

@a10y
Copy link
Contributor Author

a10y commented Jan 10, 2025

image
uv run python -c "import vortex as vx; vx.io.read_path('A0.small.50_avro.vortex');" > padding.jsonl

According to DuckDB, padding makes up ~29MB of the 480MB or so
image

@a10y
Copy link
Contributor Author

a10y commented Jan 10, 2025

image image

16MB of layout flatbuffers is also substantial

@robert3005
Copy link
Member

We repeat rowcount a lot in the layouts while we only need it sometimes. I know @gatesn will eventually get rid of it but wonder if it would make things substantially smaller

@gatesn
Copy link
Contributor

gatesn commented Jan 10, 2025

Yep, we can do some things to optimise the flat buffers.

Could you also find the byte range, grab the bytes, then just shove them through lz4?

Curious if that gives us a lot of lift given how fast it is

@a10y a10y force-pushed the aduffy/proc-macro-avro branch from 60fd946 to b15cc2e Compare January 13, 2025 00:31
@a10y
Copy link
Contributor Author

a10y commented Jan 13, 2025

Unfortunately, in its current form this makes decompress from file meaningfully slower b/c of the repeated construction of the Avro schema (which allocates).

We should statically construct a schema/lazylock at compile time to avoid the overhead (Schema is not cheaply cloneable)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants