-
Notifications
You must be signed in to change notification settings - Fork 591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix import timeouts on increasing datasets by precomputing batch size #5231
base: develop
Are you sure you want to change the base?
Conversation
WalkthroughThe changes in this pull request introduce significant enhancements to the Changes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (4)
fiftyone/core/dataset.py
(0 hunks)fiftyone/core/odm/database.py
(0 hunks)fiftyone/core/utils.py
(2 hunks)tests/unittests/utils_tests.py
(3 hunks)
💤 Files with no reviewable changes (2)
- fiftyone/core/odm/database.py
- fiftyone/core/dataset.py
🔇 Additional comments (1)
tests/unittests/utils_tests.py (1)
85-127
: Unit tests for ContentSizeBatcher
provide comprehensive coverage
The added tests effectively validate the ContentSizeBatcher
under various conditions, ensuring correct batching behavior with different target sizes and batch size constraints. This thorough testing enhances the reliability of the new batching strategy.
for obj in iterable: | ||
try: | ||
curr_batch_content_size += len( | ||
json_util.dumps(self._make_dict(obj)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this extra serialization potentially be very expensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently do this already in the _upsert_samples_batch so we are basically doubling how much we are doing it in that one method. In most other cases (such as bulk_write and insert_documents) its the same total computation since we are just moving it from post computing for each batch to precomputing for the entire collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
though I think it might be possible to refactor _upsert_sample_batch so that we just pass the dict as an iterable to begin with instead of calculating it twice
) | ||
self.curr_batch = 0 | ||
|
||
def _compute_batch_sizes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not ideal because it requires iterating over the entire iterator before any writes start happening.
A large scale add_samples()
call could take 1 hour or more to complete. With this implementation, it will take an hour just to compute safe batch sizes, during which time there will be no progress bar or indication to the user of what's happening, and then will it start adding samples to the dataset for the next hour.
We used dynamic batching with backpressure to try to find a compromise where writes will start immediately while also tuning the batch size to maximize throughput.
Perhaps there's a way to keep the greedy batching but add a safeguard to split unexpectedly large batches into multiple writes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an alternative is we can override the next implementation for just this batch size and build it as we go so instead of doing it upfront we can just iterate and build a list up to the max size and then return that.
We probably want to refactor a bit as well because currently we have two content size bathers: one for backpressure where we don't know the size of the items before batching (think it's just a bunch of ops and not actually looking at the items) and another for if we have the items and can calculate their sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently we have two content size bathers
What are you referring to? We currently have three batchers:
https://github.com/voxel51/fiftyone-teams/blob/318d3abb38b478f557180bb6aed4f25681cc7c64/fiftyone/core/utils.py#L1579-L1606
LatencyDynamicBatcher
: targets a specific latency between calls to generate the next batchContentSizeDynamicBatcher
: targets a specific content size in each batchStaticBatcher
: simple fixed batch size
Latency and content size are correlated heuristics, and both would encounter problems if there are wildly different Sample
sizes in a collection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you're referring to latency and content.
The latency batcher is the default for OSS because it is effectively parameterized by the frame rate on the progress bars you want to see (and the assumption that the database is a big boi and will take as much content in 0.2 seconds as you can throw at it, for example)
The content batcher is the default for Teams when there's an API connection involved because the assumption is that the API gateway has a hard limit on request size, so we want to be certain to not exceed say 1 MB).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The batcher in this PR "ContentSizeBatcher" is separate from ContentSizeDynamicBatcher because it doesn't use backpressure and has a different use case where we can view content size upfront which, keep me honest @swheaton, isn't always the case. Thats what I meant by specifically two "content size batchers" I know there's other batchers as well.
I am proposing a couple things:
- rename both of the content size batchers so that it's clear one is used for computing before writing the batch and the other is for estimating based on the previous batch's results
- make it so this new batcher we have added computes batch size everytime next is called so that we don't have to do it all upfront (this would require overriding the inherited next method in its current state so maybe a refactor would be good)
or if I am wrong about there being two use cases (compute before running content size batch vs compute after running content size batch) consolidate these into one method that just does number 2 above instead of precomputing the entire iterable.
What changes are proposed in this pull request?
When importing a dataset that changes greatly in size between samples with the content size dynamic batcher we were seeing timeouts from mongo because based on the previous batch of samples we would predict future batch sizes which wasn't scaling up fast enough. Since most situations we have the content sizes before doing any writes to the database we can precompute these values and create batch sizes upfront.
How is this patch tested? If it is not, please explain why.
Output:
batch sizes created:
batch sizes created before these changes:
Above is just a small chunk of the batch sizes but the idea is that it ramped from 0 - 3600ish which maintained the 1 mb target size until it reached the 3rd video which was way more densely populated with labels. This resulted in the batch size being 45736747 and often times it timed out.
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
notes for FiftyOne users.
What areas of FiftyOne does this PR affect?
fiftyone
Python library changesSummary by CodeRabbit
Release Notes
New Features
ContentSizeBatcher
for improved batch processing based on content size.Enhancements
Bug Fixes
Documentation