[WIP] Avro metadata #1843
base: develop
54ff7ce to 884c4f1
This seems to have reduced the size of the large file by ~40MB, or around 10% overall. If we assume that the metadata was previously ~25% of the overall file size, then using Avro cut the metadata by almost half. However, the file is still a good ~40% larger than the Parquet file, which is problematic! I need to resurrect the old JSON dump tool I wrote a while back to collect more detailed info.
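For reference, the arithmetic behind those figures can be sketched as below. This is only a back-of-the-envelope check: the inputs are the approximate numbers quoted in the comment, the 25% metadata share is the stated assumption, and `SizeEstimate` and `estimate` are hypothetical names introduced here for illustration.

```rust
/// Rough sizes implied by the approximate figures quoted in the comment.
/// Every field is an estimate, not a measurement.
#[derive(Debug, Clone, Copy)]
pub struct SizeEstimate {
    /// File size before the change, in MB.
    pub original_mb: f64,
    /// Metadata size before the change, in MB.
    pub original_metadata_mb: f64,
    /// Fraction of the metadata that was removed (0.0 to 1.0).
    pub metadata_reduction: f64,
    /// File size after the change, in MB.
    pub new_file_mb: f64,
    /// Parquet file size implied by the "still X% larger" figure, in MB.
    pub implied_parquet_mb: f64,
}

/// Derives the implied sizes from:
/// - `saved_mb`: absolute saving (~40 MB in the comment),
/// - `saved_fraction`: the same saving as a share of the file (~0.10),
/// - `metadata_fraction`: assumed metadata share before the change (~0.25),
/// - `excess_over_parquet`: how much larger the new file is than Parquet (~0.40).
pub fn estimate(
    saved_mb: f64,
    saved_fraction: f64,
    metadata_fraction: f64,
    excess_over_parquet: f64,
) -> SizeEstimate {
    let original_mb = saved_mb / saved_fraction;
    let original_metadata_mb = original_mb * metadata_fraction;
    let metadata_reduction = saved_mb / original_metadata_mb;
    let new_file_mb = original_mb - saved_mb;
    let implied_parquet_mb = new_file_mb / (1.0 + excess_over_parquet);
    SizeEstimate {
        original_mb,
        original_metadata_mb,
        metadata_reduction,
        new_file_mb,
        implied_parquet_mb,
    }
}
```

With the quoted inputs (`estimate(40.0, 0.10, 0.25, 0.40)`), the original file comes out at roughly 400 MB with roughly 100 MB of metadata, so a 40 MB saving is about a 40% cut in metadata, consistent with "almost half".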
As an experiment, you could remove the padding we add to IPC messages and compare the size. That padding is wrong and should be optional. The changes that @gatesn is working on will make it easy to toggle.
Yeah, I'd like to punt this until at least after we merge the layouts stuff and can then play around with padding, alignment, and compression.
Regardless of your changes, we could find out how much space is wasted on padding by doing what I suggested.
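Short of actually removing the padding and re-writing the file, the wasted space can also be estimated from the message lengths alone. The sketch below is a different, cheaper approach than the experiment suggested above, and it assumes each message is padded up to a multiple of a fixed alignment; the real padding rule used by the IPC writer is not stated in this thread, and `padding_for` and `total_padding` are hypothetical helpers.

```rust
/// Bytes of padding needed to round `len` up to the next multiple of `align`.
/// Returns 0 when `len` is already aligned.
pub fn padding_for(len: usize, align: usize) -> usize {
    assert!(align > 0, "alignment must be non-zero");
    (align - len % align) % align
}

/// Total padding across a set of unpadded message lengths,
/// assuming every message is padded to the same alignment.
pub fn total_padding(message_lens: &[usize], align: usize) -> usize {
    message_lens
        .iter()
        .map(|&len| padding_for(len, align))
        .sum()
}
```

For example, with an assumed 8-byte alignment, messages of 13, 16, and 1 bytes would carry 3, 0, and 7 bytes of padding respectively.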
We repeat rowcount a lot in the layouts, even though we only need it sometimes. I know @gatesn will eventually get rid of it, but I wonder whether removing it would make things substantially smaller.
Yep, we can do some things to optimise the flat buffers. Could you also find the byte range, grab the bytes, then just shove them through lz4? Curious if that gives us a lot of lift given how fast it is |
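A minimal sketch of that experiment, assuming the byte range of interest is already known: read the range, run it through a compressor, and compare sizes. The compressor is passed in as a closure so the sketch stays independent of any particular LZ4 crate; `read_range` and `compression_ratio` are hypothetical helpers, not part of this project.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;

/// Reads exactly the bytes in `range` from a seekable reader (e.g. a `File`).
pub fn read_range<R: Read + Seek>(reader: &mut R, range: Range<u64>) -> io::Result<Vec<u8>> {
    let len = range.end.checked_sub(range.start).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "range end precedes start")
    })?;
    let mut buf = vec![0u8; len as usize];
    reader.seek(SeekFrom::Start(range.start))?;
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

/// Compressed size divided by raw size; values below 1.0 mean the
/// compressor saved space. `compress` is any block compressor.
pub fn compression_ratio(raw: &[u8], compress: impl Fn(&[u8]) -> Vec<u8>) -> f64 {
    if raw.is_empty() {
        return 1.0;
    }
    compress(raw).len() as f64 / raw.len() as f64
}
```

In practice the closure would wrap an LZ4 block compressor, for instance `lz4_flex::compress_prepend_size` if that crate were used (an assumption; the thread does not say which implementation is intended).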
60fd946 to b15cc2e
Unfortunately, in its current form this makes decompressing from a file meaningfully slower because of the repeated construction of the Avro schema (which allocates). We should construct the schema once and keep it in a static (e.g. a `LazyLock`) to avoid the overhead, for example via `#[derive(ToAvro, FromAvro)]` for all array metadata.
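A minimal sketch of the "build the schema once" idea, using `std::sync::LazyLock` (stable since Rust 1.80). The `Schema` type here is a hypothetical stand-in for a parsed Avro schema, and the `BUILDS` counter exists only to show that construction happens a single time; note that `LazyLock` initializes on first access at runtime rather than at compile time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Hypothetical stand-in for a parsed Avro schema; a real one would come
/// from the Avro library in use.
pub struct Schema {
    pub json: String,
}

/// Counts schema constructions, for illustration only.
pub static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// The expensive, allocating step that should not run on every decode.
fn build_schema() -> Schema {
    BUILDS.fetch_add(1, Ordering::SeqCst);
    Schema {
        json: String::from(r#"{"type":"record","name":"Metadata","fields":[]}"#),
    }
}

/// Built lazily on first access, then shared for the life of the process.
pub static METADATA_SCHEMA: LazyLock<Schema> = LazyLock::new(build_schema);

/// Returns the cached schema; callers on the decode path use this instead
/// of constructing a fresh schema each time.
pub fn schema() -> &'static Schema {
    &METADATA_SCHEMA
}
```

Calling `schema()` repeatedly returns the same reference, and `build_schema` runs only on the first call, so the per-decode allocation goes away. A derive macro could generate one such static per metadata type.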