-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dplyr::copy_to() performance with odbc 1.5.0 #862
Comments
Looping in @nathanhaigh—thanks for the issue. Related to #774 and #391. We've been iterating on this default over time and looks like the most recent changes introduce a slowdown in that context (Snowflake, larger data) you've noted. Maybe we could introduce a |
Hey @simonpcouch I see that @nathanhaigh has discovered how he can use the package options to adjust the batch size. Following along his ( excellent ) write-up it seems like he is mostly curious why the change to the default was made. I'll just chime in with motivation - I know you are aware of this since you addressed the issue originally, but posting here for general benefit. I think the pull request came as a result of at least one person complaining that with the previous set of parameters, RAM usage was excessive. It makes sense, when inserting a large data set in a single batch, we populate potentially very big buffers and pass them onto the driver. We should probably have a more sophisticated default than 1024. For example if inserting a skinny table of numerical data, we can probably afford to grab many (all) rows at once. If, on the other hand, we have a wide table with LONG fields, we may need to adjust the number of rows according to some gradient. This type of analysis would in-and-of-itself come as a performance hit, but it sounds like it may be small compared to what some users (that are less memory constrained) may be experiencing. |
I am curious about the change as it seems a very low default - perhaps that says more about the data science space I operate in but I would have thought tables of 10's-100's thousands of rows wouldn't be considered particularly big. Inserting in batches of 1024 rows would then require 10's-100's of insert statements. In the end, 1024 rows may only represent a very small amount of data so a more intelligent approach based on data size (e.g. chunking to xMB) might be a better approach. Perhaps one option might be to determine the number of bytes ( @detule The issue you link to above seems to be more about a memory leak rather than memory usage. In that sense would From a dplyr user perspective, setting the
|
Hi @nathanhaigh The memory leak was a red herring in that issue. The user demonstrated that the excessive RAM usage was caused by trying to simultaneously pass all the data in the table to the driver. Now a few things have happened over the last few years that might warrant a revisit.
At any rate, IMO this sounds like a feature request rather than an issue given that there is a mechanism in place currently that users can use to customize the number of rows per batch for their use case. I was thinking about this some more, and I am just going to make a note here in case one of us picks this up at some point: a more sophisticated batching approach I think needs to take into account both the size of the object as well as the amount of available memory. Someone writing results on a 2GB Pi home server might have different appetite for memory usage than someone crunching arrow data sets on a corporate 4U. Thanks again for the great analysis. |
Like @nathanhaigh, I do wonder about 1024 as a default. Maybe 10k might be more suitable? Or if you're worried about the combination of rows and columns, it might be better to batch based on number of cells? (That's not perfect either, but there's only so much we can do). |
See tidyverse/dplyr#7103
The text was updated successfully, but these errors were encountered: