Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dplyr::copy_to() performance with odbc 1.5.0 #862

Open
DavisVaughan opened this issue Nov 11, 2024 · 5 comments
Open

dplyr::copy_to() performance with odbc 1.5.0 #862

DavisVaughan opened this issue Nov 11, 2024 · 5 comments

Comments

@DavisVaughan
Copy link

See tidyverse/dplyr#7103

@simonpcouch
Copy link
Collaborator

Looping in @nathanhaigh—thanks for the issue. Related to #774 and #391. We've been iterating on this default over time and looks like the most recent changes introduce a slowdown in that context (Snowflake, larger data) you've noted.

Maybe we could introduce a batch_rows() helper with DBMS-specific methods.

@detule
Copy link
Collaborator

detule commented Nov 12, 2024

Hey @simonpcouch

I see that @nathanhaigh has discovered how he can use the package options to adjust the batch size. Following along his ( excellent ) write-up it seems like he is mostly curious why the change to the default was made.

I'll just chime in with motivation - I know you are aware of this since you addressed the issue originally, but posting here for general benefit. I think the pull request came as a result of at least one person complaining that with the previous set of parameters, RAM usage was excessive. It makes sense, when inserting a large data set in a single batch, we populate potentially very big buffers and pass them onto the driver.

We should probably have a more sophisticated default than 1024. For example if inserting a skinny table of numerical data, we can probably afford to grab many (all) rows at once. If, on the other hand, we have a wide table with LONG fields, we may need to adjust the number of rows according to some gradient. This type of analysis would in-and-of-itself come as a performance hit, but it sounds like it may be small compared to what some users (that are less memory constrained) may be experiencing.

@nathanhaigh
Copy link

nathanhaigh commented Nov 13, 2024

I am curious about the change as it seems a very low default - perhaps that says more about the data science space I operate in but I would have thought tables of 10's-100's thousands of rows wouldn't be considered particularly big. Inserting in batches of 1024 rows would then require 10's-100's of insert statements. In the end, 1024 rows may only represent a very small amount of data so a more intelligent approach based on data size (e.g. chunking to xMB) might be a better approach. Perhaps one option might be to determine the number of bytes (object.size(my_table)) used by the table together with the number of rows to figure out an appropriate number for batch_rows.

@detule The issue you link to above seems to be more about a memory leak rather than memory usage. In that sense would batch_rows seems to be more about limiting memory usage, perhaps with the side effect of reducing memory leakage?

From a dplyr user perspective, setting the odbc.batch_rows option was not an obvious solution as it buried a few layers down and it took me some time to discover it as an potential solution to the performance issues I was observing. @simonpcouch - How would the following statement manifest itself to the user?

we could introduce a batch_rows() helper with DBMS-specific methods

@detule
Copy link
Collaborator

detule commented Dec 7, 2024

Hi @nathanhaigh

The memory leak was a red herring in that issue. The user demonstrated that the excessive RAM usage was caused by trying to simultaneously pass all the data in the table to the driver.

Now a few things have happened over the last few years that might warrant a revisit.

  • Moar data;
  • Available RAM continues to grow; and
  • Off-prem data warehouses where round trips come at a much higher premium and the trade off is such that higher RAM usage is preferable given the transaction time of the alternative.

At any rate, IMO this sounds like a feature request rather than an issue given that there is a mechanism in place currently that users can use to customize the number of rows per batch for their use case.

I was thinking about this some more, and I am just going to make a note here in case one of us picks this up at some point: a more sophisticated batching approach I think needs to take into account both the size of the object as well as the amount of available memory. Someone writing results on a 2GB Pi home server might have different appetite for memory usage than someone crunching arrow data sets on a corporate 4U.

Thanks again for the great analysis.

@hadley
Copy link
Member

hadley commented Dec 7, 2024

Like @nathanhaigh, I do wonder about 1024 as a default. Maybe 10k might be more suitable? Or if you're worried about the combination of rows and columns, it might be better to batch based on number of cells? (That's not perfect either, but there's only so much we can do).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants