
Batch improvement #37

Open
wants to merge 2 commits into master

Conversation


@chishankar-work chishankar-work commented Mar 10, 2021

The idea is to emulate the way JDBCIO writes to SQL.

Currently, calling write_record does the .execute() and then the .commit() in sequence, writing and committing each record to disk one at a time. The proposed change allows for more effective batching while better managing connection pooling to CSQL.

  • Removed the session.commit() from write_record, which is called on every element in the _WriteToRelationalDBFn ParDo. Instead, we just call .execute() on each record and then commit them all to disk at once.

  • Instead of building the engine at the start of each bundle, moved self._db = SqlAlchemyDB(self.source_config) to the .setup() method, so it's created only once for the object and handles connection pooling for the sessions that are opened and closed at the start and finish of each bundle.

  • Handled the .commit() logic in the DoFn. In start_bundle, initialize record_counter = 0 and records = []. This lets us accumulate records up to the batch size and ensures the commits don't get too big.

  • In cases where the bundles are small or divide unevenly, leaving a chunk with fewer than 1000 records, we call commit_records directly in finish_bundle() to take care of the remaining elements in the bundle and flush the buffer.

  • Made max_batch_size a configurable value with a default of 1000. The user can change it easily with something like:

relational_db.Write(
    source_config=source_config,
    table_config=table_config,
    max_batch_size=1500
)
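Putting the bullet points together, the batching lifecycle described above can be sketched roughly as follows. This is only a sketch: SqlAlchemyDB and the session are replaced by a stand-in, and any names beyond those mentioned in the PR description are assumptions, not the project's actual implementation.

```python
class FakeSession:
    """Stand-in for a SQLAlchemy session: counts executes and commits."""
    def __init__(self):
        self.executed = 0
        self.commits = 0

    def execute(self, stmt):
        self.executed += 1

    def commit(self):
        self.commits += 1


class WriteToRelationalDBFn:
    """Mimics the DoFn lifecycle Beam applies to _WriteToRelationalDBFn."""
    def __init__(self, max_batch_size=1000):
        self.max_batch_size = max_batch_size

    def setup(self):
        # In the PR, SqlAlchemyDB(source_config) is built once here, so the
        # engine (and its connection pool) outlives individual bundles.
        self.session = FakeSession()

    def start_bundle(self):
        self.record_counter = 0
        self.records = []

    def process(self, element):
        assert isinstance(element, dict)
        self.records.append(element)
        self.record_counter += 1
        self.session.execute(element)  # .execute() per record, no commit yet
        if self.record_counter > self.max_batch_size:
            self.commit_records()

    def finish_bundle(self):
        # Flush whatever partial batch remains in the buffer.
        self.commit_records()

    def commit_records(self):
        if self.record_counter == 0:
            return
        self.session.commit()  # one commit for the whole batch
        self.record_counter = 0
        self.records = []


# 2500 records with max_batch_size=1000: every record is executed, but only
# a handful of commits happen, versus one commit per record before the change.
fn = WriteToRelationalDBFn(max_batch_size=1000)
fn.setup()
fn.start_bundle()
for i in range(2500):
    fn.process({"id": i})
fn.finish_bundle()
print(fn.session.executed, fn.session.commits)  # 2500 executes, 3 commits
```

With the old behavior the same run would have issued 2500 commits; here the two full batches plus the finish_bundle flush produce only three.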

… configurable. Also moved the SqlAlchemyDB to the setup() method for better connection pool handling
@chishankar-work (Author)

GH-36

@mohaseeb (Owner) left a comment

Thanks @chishankar-work for this PR (and sorry for being late to review it). I have just a few minor comments.

P.S. you might want to check #38 if you will be running with SQLAlchemy 1.4.


def process(self, element):
    assert isinstance(element, dict)
    self.records.append(element)
    self.record_counter = self.record_counter + 1
    self._db.write_record(self.table_config, element)
@mohaseeb (Owner):

Shouldn't we remove this call now?

self._db.write_record(self.table_config, element)

if (self.record_counter > self.max_batch_size):
@mohaseeb (Owner):

The parentheses are redundant.

        self.commit_records()

def commit_records(self):
    if self.record_counter() == 0:
@mohaseeb (Owner):

self.record_counter() should be self.record_counter (it's an attribute, not a method).

)

"""

-    def __init__(self, source_config, table_config, *args, **kwargs):
+    def __init__(self, source_config, table_config, max_batch_size=1000, *args, **kwargs):
@mohaseeb (Owner):

Let's add this to the class documentation and describe the new behavior.
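A documentation note along these lines might work (the wording below is only a sketch based on the behavior described in this PR, not the maintainers' text, and the class body is elided):

```python
class Write:
    """Writes each element of a PCollection to a relational database.

    Records are buffered and committed in batches rather than with one
    commit per record. max_batch_size (default 1000) controls how many
    records accumulate before a commit is issued; any remainder is
    flushed when the bundle finishes.
    """
```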

@@ -358,7 +361,6 @@ def write_record(self, session, create_insert_f, record_dict):
        record=record_dict
    )
    session.execute(insert_stmt)
@mohaseeb (Owner) commented Apr 21, 2021:

Not sure about SQLAlchemy's behavior here. If the behavior is sending each stmt directly to the DB, then another potential improvement could be to create a single insert stmt for the whole batch (using, e.g., something like this or this), significantly reducing the number of DB calls.


@mohaseeb Correct. I ended up creating my own batch-insert mod based on your project; the only thing I changed was removing the assumption that record is a dict. SQLAlchemy, when generating an insert statement, supports both a single row (a dict) and multiple rows (a list of dicts).
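The single-statement idea can be sketched with the stdlib's sqlite3 as a stand-in for SQLAlchemy. The batch_insert helper and the table below are hypothetical; in the real project this would go through SQLAlchemy's insert(), which, as noted above, accepts a list of dicts as well as a single dict.

```python
import sqlite3

def batch_insert(conn, table, records):
    """Insert all records (a list of dicts with identical keys) in ONE
    multi-row INSERT statement, instead of one execute() per record."""
    if not records:
        return
    cols = list(records[0])
    # Build "(?, ?), (?, ?), ..." -- one value group per record.
    placeholders = ", ".join(
        "(" + ", ".join("?" for _ in cols) + ")" for _ in records
    )
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES {placeholders}"
    params = [rec[c] for rec in records for c in cols]
    conn.execute(sql, params)  # one DB call for the whole batch
    conn.commit()              # one commit for the whole batch

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE months (name TEXT, num INTEGER)")
batch_insert(conn, "months", [{"name": "Jan", "num": 1},
                              {"name": "Feb", "num": 2}])
print(conn.execute("SELECT COUNT(*) FROM months").fetchone()[0])  # 2
```

The key point is the reviewer's: a batch of N records costs one statement and one commit rather than N of each.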
