
Add the functionality of the Iceberg rewrite_manifests procedure (e.g. in OPTIMIZE) #14821

Open
alexjo2144 opened this issue Oct 28, 2022 · 4 comments · May be fixed by #24678
Labels: enhancement New feature or request

@alexjo2144
Member

alexjo2144 commented Oct 28, 2022

Relates to: #9340
The Spark implementation is documented here.

When using the append operation, manifests are merged automatically once their count reaches the threshold defined by `commit.manifest.min-count-to-merge`, which defaults to 100. However, if write latency is important, a user may want to skip the automatic compaction and run it asynchronously, separate from the writers.

This may be done as a separate procedure, or as part of the OPTIMIZE command.
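For comparison, the Spark side exposes this both as an explicit procedure and through table properties. A rough sketch based on the linked Iceberg docs; the catalog and table names are placeholders:

```sql
-- Explicit manifest rewrite via the documented Spark procedure
CALL spark_catalog.system.rewrite_manifests('db.sample');

-- The automatic merge-on-append behavior is driven by table properties
ALTER TABLE db.sample SET TBLPROPERTIES (
    'commit.manifest.min-count-to-merge' = '100',   -- default threshold
    'commit.manifest-merge.enabled'      = 'false'  -- skip automatic compaction entirely
);
```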

@findepi
Member

findepi commented Oct 28, 2022

The optimize should do this; I'm not yet convinced we need a separate procedure.
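For reference, file compaction is already exposed through ALTER TABLE ... EXECUTE, so a manifest rewrite could presumably hang off the same statement. A sketch of the existing Trino syntax only, not of any new behavior:

```sql
-- Current Trino syntax for compacting data files of an Iceberg table;
-- a manifest rewrite could conceivably run as part of this same command
ALTER TABLE iceberg.db.sample EXECUTE optimize(file_size_threshold => '128MB');
```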

@alexjo2144 alexjo2144 changed the title Add an Iceberg rewrite_manifests procedure Add the functionality of the Iceberg rewrite_manifests procedure Nov 3, 2022
@alexjo2144
Member Author

The optimize should do this; I'm not yet convinced we need a separate procedure.

I'm not sure yet either. Updated the description.

@findepi findepi changed the title Add the functionality of the Iceberg rewrite_manifests procedure Add the functionality of the Iceberg rewrite_manifests procedure (e.g. in OPTIMIZE) Nov 4, 2022
@findinpath
Contributor

When a table’s write pattern doesn’t align with the query pattern, metadata can be rewritten to re-group data files into manifests

Taken from Iceberg Spark Procedures Docs

Here is a relatively lengthy article about Iceberg which includes the reasoning behind using rewrite_manifests:

https://blog.developer.adobe.com/taking-query-optimizations-to-the-next-level-with-iceberg-6c968b83cd6f

A key metric to keep track of is the count of manifests per partition.

The health of the dataset would be tracked based on how many partitions cross a pre-configured threshold of acceptable values for these metrics. The trigger for a manifest rewrite can express the severity of the unhealthiness based on these metrics.

We rewrote the manifests by shuffling data file entries across manifests based on a target manifest size. Here is a plot of one such rewrite with a target manifest size of 8 MB. Notice that any day partition spans a maximum of 4 manifests.

  • Before, a partition used to span up to 300 manifests.

I'm not yet convinced we need a separate procedure

In light of the above arguments, I'm inclined to say that this metadata-related functionality would need its own procedure, instead of squeezing it under OPTIMIZE.
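To get a feel for the metric mentioned above, the manifest list of the current snapshot can already be inspected in Trino through the $manifests metadata table. A sketch only; "sample" is a placeholder table name and the column set may vary by version:

```sql
-- Rough health check: how many manifests the current snapshot carries
-- and how much metadata they hold in total
SELECT
    count(*)                    AS manifest_count,
    sum(length)                 AS total_manifest_bytes,
    sum(added_data_files_count) AS added_data_files
FROM "sample$manifests";
```

Per-partition counts would need the partition summaries as well, but even a table-level count shows when manifests are piling up faster than they are merged.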

@ebyhr ebyhr self-assigned this Dec 2, 2024
@mtofano

mtofano commented Dec 26, 2024

Thank you for looking into this! This is something I am interested in as well.

In my particular use case, my write pattern for back-populating a table does not align with the read and update patterns. Rewriting the manifests is something that I think would increase read performance.

I see that this was assigned to @ebyhr and am curious about the current state / roadmap for this feature.

@ebyhr ebyhr linked a pull request Jan 10, 2025 that will close this issue