DRAFT: MDEV-34705: Storing binlog in InnoDB #3775

knielsen · 2025-01-17T13:13:11Z

Draft pull request for work-in-progress on the MDEV-34705 binlog-in-engine feature.

A new option --binlog-storage-engine=ENGINE moves the binlog implementation into the storage engine, for supporting engines (currently only InnoDB).

InnoDB implements the binlog files as a new type of tablespace, and uses its redo log to make the binlog crash-safe without the overhead and complexity of two-phase commit.

…rite an InnoDB tablespace Signed-off-by: Kristian Nielsen <[email protected]>

… InnoDB tablespace The option --innodb-in-engine now causes InnoDB DML commits to include binlogging in the same mtr. Binlog group commit now skips binlogging to old file-based binlog and passes events to InnoDB instead. Many things unfinished still, like allocating new tablespaces when the first one is filled, writing large event groups out-of-band to not bloat the InnoDB commit record in the redo log and exceed max mtr size, writing DDL and all other events to the InnoDB binlog, skipping the creation of the old-style binlog, reading the new style binlog from InnoDB, etc. etc. Signed-off-by: Kristian Nielsen <[email protected]>

Only works for two tablespace files though. For the third, we need to implement closing the first one, so that the tablespace id can be reused. Signed-off-by: Kristian Nielsen <[email protected]>

Before creating the next binlog tablespace N+2, flush out and close the old binlog tablespace N, so that the new tablespace can re-use the tablespace id without conflict. Signed-off-by: Kristian Nielsen <[email protected]>
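The id reuse described above can be pictured with a small sketch. It assumes, hypothetically, that binlog file N maps to one of two reserved tablespace ids by the parity of N; the constant `kBinlogSpaceIdBase`, the function `binlog_space_id()` and the struct `BinlogSpaces` are illustrative names, not InnoDB's actual identifiers or values.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration only, not InnoDB's real reserved id.
constexpr uint32_t kBinlogSpaceIdBase = 0xFFFFFFF0u;

// Files N and N+2 map to the same tablespace id, so file N must be
// flushed and closed before file N+2 can be created.
constexpr uint32_t binlog_space_id(uint64_t file_no) {
  return kBinlogSpaceIdBase + static_cast<uint32_t>(file_no & 1);
}

struct BinlogSpaces {
  // File number currently occupying each of the two id slots, -1 if free.
  int64_t open_file[2] = {-1, -1};

  // Register creation of file_no. Returns the file number that had to be
  // closed to free its tablespace id, or -1 if the slot was unused.
  int64_t create(uint64_t file_no) {
    int64_t &slot = open_file[file_no & 1];
    // Files are created in sequence, so the slot holds N-2 or is empty.
    assert(slot < 0 || static_cast<uint64_t>(slot) + 2 == file_no);
    int64_t closed = slot;  // a real implementation would flush + close here
    slot = static_cast<int64_t>(file_no);
    return closed;
  }
};
```

With this mapping, creating files 0 and 1 closes nothing, while creating file 2 first closes file 0.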

…plete) Signed-off-by: Kristian Nielsen <[email protected]>

…e to get a hton available Signed-off-by: Kristian Nielsen <[email protected]>

Initial code to read in the binlog dump thread events from InnoDB binlog. Signed-off-by: Kristian Nielsen <[email protected]>

Skip prepare step in InnoDB when it handles the binlog, but re-enable InnoDB fsync at commit. Signed-off-by: Kristian Nielsen <[email protected]>

… engine Signed-off-by: Kristian Nielsen <[email protected]>

…ground thread Signed-off-by: Kristian Nielsen <[email protected]>

…rx cache in-memory buffer Signed-off-by: Kristian Nielsen <[email protected]>

When (re-)starting the server, check for any existing binlog files. Open the last two found (if any), and find the position that was last written before the restart. Continue binlogging from that point rather than creating new binlog files. Signed-off-by: Kristian Nielsen <[email protected]>

Move rpl_gtid and rpl_binlog_state_base into separate rpl_gtid_base.h include that can be used from engines implementing the binlog interface. Signed-off-by: Kristian Nielsen <[email protected]>

Signed-off-by: Kristian Nielsen <[email protected]>

…iterate() Signed-off-by: Kristian Nielsen <[email protected]>

Every N bytes (hardcoded at 64k for now, to become a configurable setting), write the binlog GTID state into the binlog tablespace. This makes it possible to quickly find a given GTID position by binary searching for the prior GTID state in the tablespace and then doing a small linear scan from that point. The full binlog state is dumped at the start of the binlog file; the remaining states dumped are differential states containing only the changed (domain_id, server_id) pairs, to save space if binlog space is large. This commit only implements the writing of the binlog state to the tablespace at regular intervals. The binary search will be implemented in a subsequent commit. Signed-off-by: Kristian Nielsen <[email protected]>
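As a rough illustration of the full versus differential states described above, the sketch below models the GTID state as a map from (domain_id, server_id) to the last seq_no. The types and the functions `state_kind_at()`, `diff_state()` and `apply_diff()` are hypothetical and do not reflect the actual on-disk encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

typedef std::pair<uint32_t, uint32_t> GtidKey;  // (domain_id, server_id)
typedef std::map<GtidKey, uint64_t> GtidState;  // -> last seq_no

enum class StateKind { None, Full, Differential };

// Which kind of state record (if any) belongs at this byte offset.
StateKind state_kind_at(uint64_t offset, uint64_t interval) {
  assert(interval > 0);
  if (offset == 0) return StateKind::Full;
  return offset % interval == 0 ? StateKind::Differential : StateKind::None;
}

// Keep only the pairs that are new or changed since the start of the file.
GtidState diff_state(const GtidState &file_start, const GtidState &current) {
  GtidState diff;
  for (GtidState::const_iterator it = current.begin(); it != current.end(); ++it) {
    GtidState::const_iterator old = file_start.find(it->first);
    if (old == file_start.end() || old->second != it->second)
      diff.insert(*it);
  }
  return diff;
}

// Rebuild the state at a differential record from the full state plus diff.
GtidState apply_diff(GtidState full, const GtidState &diff) {
  for (GtidState::const_iterator it = diff.begin(); it != diff.end(); ++it)
    full[it->first] = it->second;
  return full;
}
```

In this model the differential is taken against the state at the start of the file, so a reader needs only the full state and a single differential record to recover the state at that point.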

Re-write the logic for selecting between reading from buffer pool or file, and how to move to the next file, in a clean way. Handles a bunch of ToDo's, probably fixes a few bugs, and generally makes the code much more robust. Signed-off-by: Kristian Nielsen <[email protected]>

To restore the binlog state, after finding the position in the old binlog to continue from, read the full GTID state saved at the start of the binlog file as well as the most recent differential GTID state written shortly before the starting position. Then construct a binlog reader to read the remaining few events (if any), and update with any GTIDs read to obtain the final restored GTID binlog state. Signed-off-by: Kristian Nielsen <[email protected]>

Signed-off-by: Kristian Nielsen <[email protected]>

To find the target position, we first loop backwards over binlog files, reading the initial GTID state written at the start to find the file to start in. We then binary search on the differential GTID states written every --innodb-binlog-state-interval bytes. This patch makes only minimal changes to the dump thread code in sql_repl.cc to be able to send out binlog data to the client. Some re-factoring/cleanup should be done in a follow-up patch to more cleanly separate the two code paths, avoid a lot of if-statements and make the binlog-in-engine code path free of much of the cruft from the legacy binlog implementation. Signed-off-by: Kristian Nielsen <[email protected]>
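The two-step search above can be sketched as follows, under simplifying assumptions: a single replication domain, so that a GTID state reduces to one seq_no, and in-memory vectors standing in for the states read from the tablespace files. `BinlogFile`, `StartPoint` and `find_start()` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BinlogFile {
  uint64_t start_seq_no;               // from the full state at file start
  std::vector<uint64_t> state_seq_no;  // state k, written at k * interval;
                                       // state_seq_no[0] == start_seq_no
};

struct StartPoint {
  size_t file_no;      // file to start reading in
  size_t state_index;  // last state record at or before the target
  bool found;
};

StartPoint find_start(const std::vector<BinlogFile> &files, uint64_t target) {
  // Step 1: loop backwards over the files, looking only at the initial state.
  for (size_t f = files.size(); f-- > 0;) {
    if (files[f].start_seq_no > target) continue;
    // Step 2: binary search for the last state with seq_no <= target.
    const std::vector<uint64_t> &s = files[f].state_seq_no;
    assert(!s.empty());
    size_t lo = 0, hi = s.size();  // s[lo] <= target; s[hi] > target or hi == size
    while (hi - lo > 1) {
      size_t mid = lo + (hi - lo) / 2;
      if (s[mid] <= target) lo = mid; else hi = mid;
    }
    return {f, lo, true};
  }
  return {0, 0, false};  // target is older than every available file
}
```

From the returned state record, a reader would then scan forward linearly over at most one interval's worth of data to reach the exact position.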

Signed-off-by: Kristian Nielsen <[email protected]>

Only GTID slave connection is supported, at least for now. Signed-off-by: Kristian Nielsen <[email protected]>

We need two flags on the chunk type to fully identify how a record is split into chunks: a "CONT" flag, which marks a continuation chunk (set on all but the first chunk), and a "LAST" flag, which marks the end of a record (set only on the last chunk). Signed-off-by: Kristian Nielsen <[email protected]>
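A minimal sketch of how the two flags would be assigned when a record is split. The bit values `FLAG_CONT` and `FLAG_LAST`, the `Chunk` struct and `split_record()` are assumptions made for illustration; only the flag semantics come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit positions within the chunk type byte.
constexpr uint8_t FLAG_CONT = 0x40;  // set on all but the first chunk
constexpr uint8_t FLAG_LAST = 0x80;  // set only on the last chunk

struct Chunk {
  uint8_t flags;
  size_t offset;  // offset of this chunk within the record
  size_t length;
};

// Split a record of record_len bytes into chunks of at most max_chunk bytes.
std::vector<Chunk> split_record(size_t record_len, size_t max_chunk) {
  assert(record_len > 0 && max_chunk > 0);
  std::vector<Chunk> chunks;
  for (size_t off = 0; off < record_len; off += max_chunk) {
    size_t remaining = record_len - off;
    size_t len = remaining < max_chunk ? remaining : max_chunk;
    uint8_t flags = 0;
    if (off > 0) flags |= FLAG_CONT;
    if (off + len == record_len) flags |= FLAG_LAST;
    chunks.push_back({flags, off, len});
  }
  return chunks;
}
```

The four combinations distinguish a first chunk with more to follow (no flags), a middle chunk (CONT), a final chunk (CONT and LAST), and a record that fits in a single chunk (LAST only). A reader that lands in the middle of a record can see CONT and skip ahead to the next chunk without it.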

Introduce a class/interface chunk_data_base to encapsulate supplying the data to be written as a binlog record. For fsp_binlog_write_cache(), this is an IO_CACHE with two sections (main and gtid) that need to be binlogged in the opposite order. Separate the logic for page writing into a generic fsp_binlog_write_chunk() function which takes a chunk_data_base * as data source. This is in preparation for introducing other kinds of data to be written into the binlog, e.g. out-of-band partial event group data. Signed-off-by: Kristian Nielsen <[email protected]>
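The separation between data source and page writer might look roughly like the sketch below. Only the name `chunk_data_base` comes from the commit message; the `copy_data()` signature, the `chunk_data_two_sections` class and the `write_chunks()` function are assumptions for illustration and are not the actual InnoDB code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumed shape of the interface: the writer pulls data piece by piece.
struct chunk_data_base {
  // Copy up to max_len bytes into dst. Returns (bytes copied, true once the
  // source has been fully consumed).
  virtual std::pair<size_t, bool> copy_data(unsigned char *dst, size_t max_len) = 0;
  virtual ~chunk_data_base() {}
};

// A source with two sections that are emitted gtid first, then main.
struct chunk_data_two_sections : chunk_data_base {
  std::string gtid, main;
  size_t pos = 0;  // bytes handed out so far across both sections
  chunk_data_two_sections(std::string g, std::string m)
      : gtid(std::move(g)), main(std::move(m)) {}
  std::pair<size_t, bool> copy_data(unsigned char *dst, size_t max_len) override {
    const size_t total = gtid.size() + main.size();
    size_t n = 0;
    while (n < max_len && pos < total) {
      char c = pos < gtid.size() ? gtid[pos] : main[pos - gtid.size()];
      dst[n++] = static_cast<unsigned char>(c);
      ++pos;
    }
    return {n, pos == total};
  }
};

// Generic writer: knows about page capacity, not about where data comes from.
std::vector<std::string> write_chunks(chunk_data_base *src, size_t page_payload) {
  assert(page_payload > 0);
  std::vector<std::string> pages;
  std::vector<unsigned char> buf(page_payload);
  for (;;) {
    std::pair<size_t, bool> r = src->copy_data(buf.data(), buf.size());
    if (r.first > 0) pages.emplace_back(buf.begin(), buf.begin() + r.first);
    if (r.second) break;
  }
  return pages;
}
```

A different source, such as one supplying out-of-band event group data, would only need to implement `copy_data()`; the writer stays unchanged.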

…eparate refactoring of end_event Signed-off-by: Kristian Nielsen <[email protected]>

…-band Signed-off-by: Kristian Nielsen <[email protected]>

…rd (untested) Signed-off-by: Kristian Nielsen <[email protected]>

…g_reader Signed-off-by: Kristian Nielsen <[email protected]>

With this commit, the out-of-band binlogging of large event groups in multiple smaller records interleaved with other event groups is now working. Instead of being flushed to disk when it reaches @@binlog_cache_size, the binlog cache is binlogged as an out-of-band record. Then at transaction commit, a commit record is written containing just the GTID and a link to the out-of-band data. To facilitate append-only operation, the binlogged records do not have a "next" pointer. Instead, they are written out as a forest of perfect binary trees, the leftmost leaf of one tree pointing to the root of the previous tree. This structure is used in the binlog reader to efficiently read out the event group data consecutively for the binlog dump thread, needing to maintain only O(log(N)) memory during the reading. As part of this commit, the existing binlog reader code is refactored and greatly improved, with a much cleaner explicit state machine and handling of chunk/page/file boundaries etc. Also fixes some bugs in gtid_search::find_gtid_pos(). Signed-off-by: Kristian Nielsen <[email protected]>
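One way to realise such a forest with append-only writes is sketched below. The layout is an assumption made for illustration: nodes are written in post-order, an inner node stores links to its two children, and a leaf stores a link to the root of the tree written just before it. `OobNode`, `OobWriter` and `read_all()` are hypothetical names, and a vector stands in for the binlog file.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct OobNode {
  long left;   // inner node: left child; leaf: root of previous tree, or -1
  long right;  // inner node: right child; leaf: -1
};

struct OobWriter {
  std::vector<OobNode> nodes;                // stands in for the append-only log
  std::vector<std::pair<long, int> > roots;  // (root index, height), O(log N) entries

  void append() {
    long idx = static_cast<long>(nodes.size());
    size_t n = roots.size();
    if (n >= 2 && roots[n - 1].second == roots[n - 2].second) {
      // Two perfect trees of equal height: join them under a new root.
      OobNode nd = {roots[n - 2].first, roots[n - 1].first};
      int h = roots[n - 1].second + 1;
      roots.pop_back();
      roots.pop_back();
      roots.push_back(std::make_pair(idx, h));
      nodes.push_back(nd);
    } else {
      // New single-node tree; its leaf links back to the previous tree's root.
      OobNode nd = {n ? roots[n - 1].first : -1, -1};
      nodes.push_back(nd);
      roots.push_back(std::make_pair(idx, 0));
    }
  }
};

// Post-order traversal of one tree yields its nodes in the order written.
static void read_tree(const std::vector<OobNode> &nodes, long root,
                      std::vector<long> &out) {
  const OobNode &n = nodes[root];
  if (n.right != -1) {  // inner node; recursion depth is the tree height
    read_tree(nodes, n.left, out);
    read_tree(nodes, n.right, out);
  }
  out.push_back(root);
}

// Starting from the last node written (what a commit record would link to),
// return all node indexes in the order they were written.
std::vector<long> read_all(const std::vector<OobNode> &nodes, long last) {
  assert(last >= 0 && static_cast<size_t>(last) < nodes.size());
  std::vector<long> roots;  // tree roots, newest first
  for (long r = last; r != -1;) {
    roots.push_back(r);
    long leaf = r;
    while (nodes[leaf].right != -1) leaf = nodes[leaf].left;  // leftmost leaf
    r = nodes[leaf].left;  // link to the previous tree's root
  }
  std::vector<long> out;
  for (size_t i = roots.size(); i-- > 0;) read_tree(nodes, roots[i], out);
  return out;
}
```

In this sketch both the writer's root stack and the reader's list of roots plus recursion depth grow only logarithmically with the number of nodes, and every link points to an earlier node, so nothing already written ever needs updating.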

Move high-level binlog code to handler/handler0binlog.cc and low-level code to fsp/fsp0binlog.cc. Signed-off-by: Kristian Nielsen <[email protected]>

When starting to read data from a specific GTID position, the logic for skipping any initial, partial record that we start in the middle of, was incorrect. Signed-off-by: Kristian Nielsen <[email protected]>

…ew files Signed-off-by: Kristian Nielsen <[email protected]>

Add option --binlog-directory, used to place the binlogs outside the data directory (eg. to put them on different disk/file system). Disallow specifying the binlog name in --log-bin when --binlog-storage-engine is used, as the name is then not user configurable. A ToDo (not implemented in this commit) is to use the --binlog-directory value, if given, also for the legacy binlog implementation. Signed-off-by: Kristian Nielsen <[email protected]>

Signed-off-by: Kristian Nielsen <[email protected]>

No DELETE_DOMAIN_ID supported yet, will come in a later commit, after PURGE is implemented. Signed-off-by: Kristian Nielsen <[email protected]>

Signed-off-by: Kristian Nielsen <[email protected]>

…uite mostly pass Signed-off-by: Kristian Nielsen <[email protected]>

CLAassistant · 2025-01-17T13:13:19Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Enable binlog_in_engine as a default suite. Fix embedded and Windows build failures. Use sql_print_(error|warning) over ib::error() and ib::warn(). Use small_vector<> for the innodb_binlog_oob_reader instead of a custom implementation. Signed-off-by: Kristian Nielsen <[email protected]>

Signed-off-by: Kristian Nielsen <[email protected]>

Fix missing WORDS_BIGENDIAN define in ut0compr_int.cc. Fix misaligned read buffer for O_DIRECT. Fix wrong/missing update_binlog_end_pos() in binlog group commit. Fix race where active_binlog_file_no incremented too early. Fix wrong assertion when reader reaches the very start of (active+1). Signed-off-by: Kristian Nielsen <[email protected]>

knielsen added 30 commits January 17, 2025 12:31

MDEV-34705: Binlog in Engine: Very first sketch, able to create and w…

019afb7

…rite an InnoDB tablespace Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Allocate next binlog tablespace as needed.

c2f41ca

Only works for two tablespace files though. For the third, we need to implement closing the first one, so that the tablespace id can be reused. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Start of binlog reader (untested, incom…

2d5590b

…plete) Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Change option to --binlog-storage-engin…

d0bff63

…e to get a hton available Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine

c88c7a9

Initial code to read in the binlog dump thread events from InnoDB binlog. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine

0143421

Skip prepare step in InnoDB when it handles the binlog, but re-enable InnoDB fsync at commit. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Also binlog standalone (eg. DDL) in the…

eece900

… engine Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Pre-allocate binlog tablespaces in back…

8b63488

…ground thread Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Fix losing last part of event group > t…

5f866b1

…rx cache in-memory buffer Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Refactor rpl_gtid.h

1975b78

Move rpl_gtid and rpl_binlog_state_base into separate rpl_gtid_base.h include that can be used from engines implementing the binlog interface. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Refactor nlz() in shared header

b4b873e

Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Add functions for compressed int

08f73b0

Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog in Engine: Refactor to add rpl_binlog_state_base::…

d0e7018

…iterate() Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Fix race that could corrupt last mtr data in a tablespace

df538af

Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Improved error handling when searching GTID position

c62c132

Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Working replication to slave

a623e2e

Only GTID slave connection is supported, at least for now. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: out-of band binlogging, partial untested commit to do a s…

7458fc3

…eparate refactoring of end_event Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: out-of band binlogging, fix trx_cache handling for out-of…

28cbd70

…-band Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: out-of band binlogging, link to oob data from commit reco…

4f4c4f6

…rd (untested) Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Small visibility tweak in handler_binlo…

8524966

…g_reader Signed-off-by: Kristian Nielsen <[email protected]>

knielsen added 8 commits January 17, 2025 13:03

MDEV-34705: Binlog-in-engine: Refactor InnoDB part

c9da3dd

Move high-level binlog code to handler/handler0binlog.cc and low-level code to fsp/fsp0binlog.cc. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Fix incorrect binlog data

31cfb98

When starting to read data from a specific GTID position, the logic for skipping any initial, partial record that we start in the middle of, was incorrect. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Drop old X/X0Y.cc name convention for n…

3c9d248

…ew files Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Implement SHOW BINARY LOGS

ab260b1

Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Implement FLUSH BINARY LOGS

fcd373e

No DELETE_DOMAIN_ID supported yet, will come in a later commit, after PURGE is implemented. Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Implement RESET MASTER

504f4ee

Signed-off-by: Kristian Nielsen <[email protected]>

MDEV-34705: Binlog-in-engine: Misc. small fixes to make normal test s…

0f6a19c

…uite mostly pass Signed-off-by: Kristian Nielsen <[email protected]>

knielsen force-pushed the knielsen_binlog_in_engine branch from 2ef2e33 to aa64f73 Compare January 17, 2025 17:10

MDEV-34705: Binlog-in-engine: Buildbot fixes

18932be

Signed-off-by: Kristian Nielsen <[email protected]>

knielsen force-pushed the knielsen_binlog_in_engine branch from aa64f73 to 18932be Compare January 17, 2025 20:07

cvicentiu added the MariaDB Foundation Pull requests created by MariaDB Foundation label Jan 22, 2025
