DRAFT: MDEV-34705: Storing binlog in InnoDB #3775

Draft: wants to merge 41 commits into base: 11.4


knielsen
Member

Draft pull request for work-in-progress on the MDEV-34705 binlog-in-engine feature.

A new option --binlog-storage-engine=ENGINE moves the binlog implementation into the storage engine, for supporting engines (currently only InnoDB).

InnoDB implements the binlog files as a new type of tablespace, and uses its redo log to make the binlog crash-safe without the overhead and complexity of two-phase commit.

…rite an InnoDB tablespace

Signed-off-by: Kristian Nielsen <[email protected]>
… InnoDB tablespace

The option --innodb-in-engine now causes InnoDB DML commits to include
binlogging in the same mtr. Binlog group commit now skips binlogging to
old file-based binlog and passes events to InnoDB instead.

Many things unfinished still, like allocating new tablespaces when the first
one is filled, writing large event groups out-of-band to not bloat the
InnoDB commit record in the redo log and exceed max mtr size, writing DDL
and all other events to the InnoDB binlog, skipping the creation of the
old-style binlog, reading the new style binlog from InnoDB, etc. etc.

Signed-off-by: Kristian Nielsen <[email protected]>
Only works for two tablespace files though. For the third, we need to
implement closing the first one, so that the tablespace id can be reused.

Signed-off-by: Kristian Nielsen <[email protected]>
Before creating the next binlog tablespace N+2, flush out and close the old
binlog tablespace N, so that the new tablespace can re-use the tablespace
id without conflict.

Signed-off-by: Kristian Nielsen <[email protected]>
…e to get a hton available

Signed-off-by: Kristian Nielsen <[email protected]>
Initial code to read in the binlog dump thread events from InnoDB binlog.

Signed-off-by: Kristian Nielsen <[email protected]>
Skip prepare step in InnoDB when it handles the binlog, but re-enable
InnoDB fsync at commit.

Signed-off-by: Kristian Nielsen <[email protected]>
…rx cache in-memory buffer

Signed-off-by: Kristian Nielsen <[email protected]>
When (re-)starting the server, check for any existing binlog files.
Open the last two found (if any), and find the position that was last
written before the restart. Continue binlogging from that point rather
than creating new binlog files.

Signed-off-by: Kristian Nielsen <[email protected]>
Move rpl_gtid and rpl_binlog_state_base into separate rpl_gtid_base.h include
that can be used from engines implementing the binlog interface.

Signed-off-by: Kristian Nielsen <[email protected]>
Every N bytes (hardcoded at 64k for now, to become a configurable setting),
write the binlog GTID state into the binlog tablespace. This makes it
possible to quickly find a given GTID position by binary search to the prior
GTID state in the tablespace, followed by a small linear scan from that point.

The full binlog state is dumped at the start of the binlog file; remaining
states dumped are differential states containing only the changed
(domain_id, server_id) pairs, to save space if binlog space is large.

This commit only implements the writing of the binlog state to the
tablespace at regular intervals. The binary search to be implemented in a
subsequent commit.

Signed-off-by: Kristian Nielsen <[email protected]>
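
As an illustration of the full-plus-differential state scheme described in the commit above, here is a minimal C++ sketch. It is not code from the patch: the types `GtidKey` and `GtidState` and the functions `diff_state()` and `apply_diff()` are hypothetical, and the binlog state is modelled simply as a map from (domain_id, server_id) to the last sequence number seen.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical model: (domain_id, server_id) -> last seq_no binlogged.
using GtidKey = std::pair<uint32_t, uint32_t>;
using GtidState = std::map<GtidKey, uint64_t>;

// Keep only the entries of `current` that are new or changed relative to
// the full state `base` written at the start of the binlog file.
GtidState diff_state(const GtidState &base, const GtidState &current)
{
  GtidState diff;
  for (const auto &entry : current)
  {
    auto it = base.find(entry.first);
    if (it == base.end() || it->second != entry.second)
      diff.insert(entry);
  }
  return diff;
}

// Rebuild the state at a differential dump from the full state plus the diff.
GtidState apply_diff(GtidState base, const GtidState &diff)
{
  for (const auto &entry : diff)
    base[entry.first] = entry.second;
  return base;
}
```

With many (domain_id, server_id) pairs of which only a few are active, each differential dump stays small, while the full state remains recoverable from the dump at the start of the file.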
Re-write the logic for selecting between reading from buffer pool or file,
and how to move to the next file, in a clean way. Handles a bunch of ToDo's,
probably fixes a few bugs, and generally makes the code much more robust.

Signed-off-by: Kristian Nielsen <[email protected]>
To restore the binlog state, after finding the position in the old binlog to
continue from, read the full gtid state saved at the start of the binlog
file as well as the most recent differential gtid state written shortly
before the starting position. Then construct a binlog reader to read the
remaining few events (if any), and update with any GTIDs read to obtain the
final restored GTID binlog state.

Signed-off-by: Kristian Nielsen <[email protected]>
To find the target position, we first loop backwards over binlog files,
reading the initial GTID state written at the start to find the file to
start in. We then binary search on the differential GTID states written
every --innodb-binlog-state-interval bytes.

This patch makes only minimal changes to the dump thread code in sql_repl.cc
to be able to send out binlog data to the client. Some re-factoring/cleanup
should be done in a follow-up patch to more cleanly separate the two code
paths, avoid a lot of if-statements and make the binlog-in-engine code path
free of much of the cruft from the legacy binlog implementation.

Signed-off-by: Kristian Nielsen <[email protected]>
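
The binary-search step in the commit above could look roughly like the following C++ sketch. It makes a strong simplifying assumption that is not in the patch: there is a single replication domain, so each periodic state dump reduces to one sequence number. The function name `find_scan_start()` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// state_seq[i] is the seq_no recorded in the state written at byte offset
// i * interval within one binlog file (single-domain simplification).
// On success, *scan_offset is the offset of the last state that does not
// exceed target_seq; a short linear scan would start from there.
// Returns false if even the first state is past the target, meaning the
// search has to move to an earlier file.
bool find_scan_start(const std::vector<uint64_t> &state_seq,
                     uint64_t interval, uint64_t target_seq,
                     uint64_t *scan_offset)
{
  size_t lo = 0, hi = state_seq.size();
  // Invariant: states before lo are <= target, states from hi on are > target.
  while (lo < hi)
  {
    size_t mid = lo + (hi - lo) / 2;
    if (state_seq[mid] <= target_seq)
      lo = mid + 1;
    else
      hi = mid;
  }
  if (lo == 0)
    return false;
  *scan_offset = (lo - 1) * interval;
  return true;
}
```

The cost is O(log(file size / interval)) state reads plus a linear scan of at most one interval, which is the trade-off that --innodb-binlog-state-interval controls.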
Only GTID slave connection is supported, at least for now.

Signed-off-by: Kristian Nielsen <[email protected]>
We need two flags on the chunk type to fully identify how a record is split
into chunks. A "CONT" flag which marks a continuation chunk (set on all but
the first chunk). And a "LAST" flag which marks the end of a record (set
only on the last chunk).

Signed-off-by: Kristian Nielsen <[email protected]>
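
To show why two flags are needed, here is a small C++ sketch that assigns "CONT" and "LAST" to the chunks of one record. The bit values and the helper `chunk_flags()` are assumptions made for illustration; the actual encoding in the patch may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit values for the two flags.
constexpr uint8_t CHUNK_CONT = 0x1;  // set on all but the first chunk
constexpr uint8_t CHUNK_LAST = 0x2;  // set only on the last chunk

// Flags for each chunk when a record of record_len bytes is split into
// chunks of at most max_chunk bytes.
std::vector<uint8_t> chunk_flags(size_t record_len, size_t max_chunk)
{
  std::vector<uint8_t> flags;
  if (max_chunk == 0)
    return flags;
  size_t offset = 0;
  do
  {
    size_t remaining = record_len - offset;
    size_t len = remaining < max_chunk ? remaining : max_chunk;
    uint8_t f = 0;
    if (offset > 0)
      f |= CHUNK_CONT;
    if (offset + len == record_len)
      f |= CHUNK_LAST;
    flags.push_back(f);
    offset += len;
  } while (offset < record_len);
  return flags;
}
```

The four combinations are all distinct cases: no flag (first of several chunks), CONT (middle chunk), CONT|LAST (final chunk), and LAST alone (a record that fits in a single chunk). A single flag could not tell these apart.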
Introduce a class/interface chunk_data_base to encapsulate supplying the
data to be written as a binlog record. For fsp_binlog_write_cache(), this is
an IO_CACHE with two sections (main and gtid) that need to be binlogged in
the opposite order.

Separate the logic for page writing into a generic fsp_binlog_write_chunk()
function which takes a chunk_data_base * as data source. This is in
preparation for introducing other kinds of data to be written into the
binlog, e.g. out-of-band partial event group data.

Signed-off-by: Kristian Nielsen <[email protected]>
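
The idea of a data-source interface feeding a generic writer can be sketched as follows. This is an illustration only: `chunk_source`, `two_section_source` and `write_chunks()` are hypothetical names, the `copy_data()` signature is an assumption and not necessarily that of chunk_data_base, and plain strings stand in for the IO_CACHE sections and the binlog pages.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical interface in the spirit of chunk_data_base.
struct chunk_source
{
  virtual ~chunk_source() = default;
  // Copy up to max_len bytes into dst. Returns the number of bytes copied
  // and whether the source is now exhausted.
  virtual std::pair<size_t, bool> copy_data(char *dst, size_t max_len) = 0;
};

// Two sections supplied as (main, gtid) but emitted as gtid first, then main.
class two_section_source : public chunk_source
{
  std::string main_, gtid_;
  size_t pos_ = 0;  // position in the logical stream gtid_ + main_
public:
  two_section_source(std::string main_part, std::string gtid_part)
    : main_(std::move(main_part)), gtid_(std::move(gtid_part)) {}

  std::pair<size_t, bool> copy_data(char *dst, size_t max_len) override
  {
    const size_t total = gtid_.size() + main_.size();
    size_t copied = 0;
    while (copied < max_len && pos_ < total)
    {
      const bool in_gtid = pos_ < gtid_.size();
      const std::string &src = in_gtid ? gtid_ : main_;
      const size_t off = in_gtid ? pos_ : pos_ - gtid_.size();
      const size_t n = std::min(max_len - copied, src.size() - off);
      std::memcpy(dst + copied, src.data() + off, n);
      copied += n;
      pos_ += n;
    }
    return {copied, pos_ == total};
  }
};

// Generic writer: drains any source in pieces of at most `piece` bytes.
std::string write_chunks(chunk_source &src, size_t piece)
{
  std::string out;
  if (piece == 0)
    return out;
  std::string buf(piece, '\0');
  for (;;)
  {
    std::pair<size_t, bool> res = src.copy_data(&buf[0], piece);
    out.append(buf.data(), res.first);
    if (res.second)
      break;
  }
  return out;
}
```

The writer never needs to know how the source is laid out, so a second source type (for example one supplying out-of-band data) can reuse the same page-writing loop unchanged.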
…eparate refactoring of end_event

Signed-off-by: Kristian Nielsen <[email protected]>
With this commit, the out-of-band binlogging of large event groups in
multiple smaller records interleaved with other event groups is now working.

Instead of flushing the binlog cache to disk when it reaches
@@binlog_cache_size, the cache is binlogged as an out-of-band
record. Then at transaction commit, a commit record is written containing
just the GTID and a link to the out-of-band data.

To facilitate append-only operation, the binlogged records do not have a
"next" pointer. Instead, they are written out as a forest of perfect binary
trees, the leftmost leaf of one tree pointing to the root of the previous
tree. This structure is used in the binlog reader to efficiently read out
the event group data consecutively for the binlog dump thread, needing to
maintain only O(log(N)) amount of memory during the reading.

As part of this commit, the existing binlog reader code is refactored and
greatly improved, with a much cleaner explicit state machine and handling of
chunk/page/file boundaries etc.

Also fixes some bugs in gtid_search::find_gtid_pos().

Signed-off-by: Kristian Nielsen <[email protected]>
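
The forest-of-perfect-binary-trees layout from the commit above can be modelled with the C++ sketch below. It is an assumption-laden illustration, not the patch's code: `OobNode`, `OobWriter` and `read_in_write_order()` are hypothetical names, vector indices stand in for (file, offset) positions, and every new leaf records the previous root even though only the link on a tree's leftmost leaf is ever followed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record header; -1 means "no link".
struct OobNode
{
  int left = -1, right = -1;  // children (internal nodes only)
  int prev = -1;              // leaf only: root of the preceding tree
  int height = 1;
};

struct OobWriter
{
  std::vector<OobNode> nodes;  // append-only log of records
  std::vector<int> roots;      // roots of the current forest

  int append()
  {
    OobNode n;
    const size_t k = roots.size();
    if (k >= 2 && nodes[roots[k - 1]].height == nodes[roots[k - 2]].height)
    {
      // Two trees of equal height: the new record becomes their parent.
      n.left = roots[k - 2];
      n.right = roots[k - 1];
      n.height = nodes[n.left].height + 1;
      roots.resize(k - 2);
    }
    else if (k >= 1)
      n.prev = roots[k - 1];  // new leaf links back to the previous root
    nodes.push_back(n);
    const int idx = static_cast<int>(nodes.size()) - 1;
    roots.push_back(idx);
    return idx;
  }
};

// Post-order within one tree equals write order; recursion depth <= height.
static void post_order(const std::vector<OobNode> &nodes, int idx,
                       std::vector<int> &out)
{
  if (nodes[idx].height > 1)
  {
    post_order(nodes, nodes[idx].left, out);
    post_order(nodes, nodes[idx].right, out);
  }
  out.push_back(idx);
}

// Starting from the last record (as linked from a commit record), list all
// records in write order. `out` is O(N) only because this sketch collects
// the result; a streaming reader would not need it.
std::vector<int> read_in_write_order(const std::vector<OobNode> &nodes,
                                     int last)
{
  // Pass 1: follow each tree's left spine to its leftmost leaf, then the
  // leaf's prev link, collecting the chain of roots (newest first).
  std::vector<int> roots;
  for (int r = last; r >= 0;)
  {
    roots.push_back(r);
    int n = r;
    while (nodes[n].height > 1)
      n = nodes[n].left;
    r = nodes[n].prev;
  }
  // Pass 2: emit the trees oldest first.
  std::vector<int> out;
  for (size_t i = roots.size(); i-- > 0;)
    post_order(nodes, roots[i], out);
  return out;
}
```

In this model the writer keeps only the list of current roots and the reader keeps the root chain plus one root-to-leaf path, both logarithmic in the number of records, and no record is ever updated after it has been written.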
Move high-level binlog code to handler/handler0binlog.cc and low-level code
to fsp/fsp0binlog.cc.

Signed-off-by: Kristian Nielsen <[email protected]>
When starting to read data from a specific GTID position, the logic for
skipping any initial, partial record that we start in the middle of,
was incorrect.

Signed-off-by: Kristian Nielsen <[email protected]>
Add option --binlog-directory, used to place the binlogs outside the data
directory (eg. to put them on different disk/file system).

Disallow specifying the binlog name in --log-bin when
--binlog-storage-engine is used, as the name is then not user configurable.

A ToDo (not implemented in this commit) is to use the --binlog-directory
value, if given, also for the legacy binlog implementation.

Signed-off-by: Kristian Nielsen <[email protected]>
DELETE_DOMAIN_ID is not supported yet; it will come in a later commit, after
PURGE is implemented.

Signed-off-by: Kristian Nielsen <[email protected]>

Enable binlog_in_engine as a default suite.

Fix embedded and Windows build failures.

Use sql_print_(error|warning) over ib::error() and ib::warn().

Use small_vector<> for the innodb_binlog_oob_reader instead of a custom
implementation.

Signed-off-by: Kristian Nielsen <[email protected]>
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from 2ef2e33 to aa64f73 Compare January 17, 2025 17:10
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from aa64f73 to 18932be Compare January 17, 2025 20:07
Fix missing WORDS_BIGENDIAN define in ut0compr_int.cc.

Fix misaligned read buffer for O_DIRECT.

Fix wrong/missing update_binlog_end_pos() in binlog group commit.

Fix race where active_binlog_file_no incremented too early.

Fix wrong assertion when reader reaches the very start of (active+1).

Signed-off-by: Kristian Nielsen <[email protected]>
@cvicentiu cvicentiu added the MariaDB Foundation Pull requests created by MariaDB Foundation label Jan 22, 2025