RFC: temporal partition rotation automation #17427

Open
shlomi-noach opened this issue Dec 24, 2024 · 4 comments
Labels
Component: schema management · Type: RFC (Request For Comment)

Comments

@shlomi-noach (Contributor) commented Dec 24, 2024

I'm proposing that Vitess can automate temporal partition rotation based on user-defined rules.

What are temporal partitioned tables?

Tables that are PARTITION BY RANGE over temporal (time-based) values. This would be either:

  • PARTITION BY RANGE COLUMNS (col_name) over a single column that is either DATE or DATETIME (technically MySQL also allows TIME, but that is not so interesting for partition rotation).
  • PARTITION BY RANGE (func(col_name)) where func is one of a selected set of functions such as TO_DAYS, and the column is again either DATE or DATETIME.

Note: with MySQL 8.4 we can also support UNIX_TIMESTAMP(timestamp_column)

For practical reasons, we will have a predefined set of allowed functions/expressions. For example, we will support PARTITION BY RANGE (TO_DAYS(my_column)) or likewise YEAR(my_column) but we will not support ROUND(SQRT(TO_DAYS(my_column)+3.14)). We aim for practical operational scenarios. Most users will rotate hourly, daily, possibly weekly, monthly, yearly.
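
For illustration only, a daily-rotated table of the TO_DAYS flavor might look like this (the table, column, and partition names here are made up):

CREATE TABLE notifications (
    `id`         bigint unsigned NOT NULL,
    `created_at` datetime        NOT NULL,
    `payload`    json,
    PRIMARY KEY (`id`, `created_at`)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20241224 VALUES LESS THAN (TO_DAYS('2024-12-25')),
    PARTITION p20241225 VALUES LESS THAN (TO_DAYS('2024-12-26')),
    PARTITION p20241226 VALUES LESS THAN (TO_DAYS('2024-12-27'))
)

Rotation on such a table only ever adds partitions at the high end and drops them at the low end; the table's schema itself never changes.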

What is the proposal?

We will have a per-table rule:

  • name of table
  • rotation interval (hourly, daily, ...)
  • number of ahead-of-time partitions (prepare this many futuristic empty partitions)
  • retention (e.g. 30 days)
  • control flags such as enabled/disabled

Vitess will periodically look at all the rules, and check all referenced tables. For each table, if applicable, it will generate a sequence of Online DDL migrations, with internal UUIDs and in-order execution, that ensure the table is in the required state. Since it will do this periodically, most of the time this will be a no-op: the table will already have all the required future partitions, and expired partitions will already have been dropped.

Where are rules to be stored?

I'm thinking of keeping them as part of the Keyspace record in topo, much like the throttler configuration. The config will be copied from Keyspace to SrvKeyspace as needed, again just like the throttler configuration.

What are expected problems?

I'm not sure how to handle errors. For example:

  • what if a table was dropped but the rule is still there?
  • what if the table is not partitioned?
  • what if the rule does not comply with the table definition? e.g. request to rotate hourly, but the table uses TO_DAYS expression, making the minimal interval 24h?

So where should these errors go? As this is a background operation, there's no occasion to respond to the user with the list of errors.

Alternative approaches

With rules still in place, maybe Vitess should not auto-rotate. Instead, maybe we should have a vtctldclient RotatePartitionedTables command that will:

  • analyze all the required changes
  • run them sequentially (either asynchronously via Online DDL or somehow directly)
  • report errors to the user

The partition analysis should still be made on a per-shard basis, to ensure independence of shards and eventual consistency, much like we delegate all Online DDL changes to shards.

Existing work

schemadiff PR to analyze temporal range partitioned tables and to generate required creation of ahead-of-time partitions, and purge expired partitions: #17426. This also validates intervals and other constraints.

Also related:

@mcrauwel (Contributor) commented Jan 6, 2025

I have several use-cases for this, having native Vitess support for this is a big 👍
I agree with all the points you mentioned.

About the expected problems:

  • if the table is dropped, I believe that should take precedence over running a rotation-migration re-introducing the old table
  • while the adding and dropping of partitions on an already partitioned table is a fast operation that does not require a table rebuild, I think we could support initiating partitioning of a non-partitioned table using an online-DDL table rebuild? I think that would be a nice way to opt in to partitioning.
  • if the user requests, say, hourly rotation but the table only supports daily, I would think that 23 out of 24 executions would just be a no-op?

I would be a fan of running this automatically but having a command to be triggered by some control process would also work.

Just my 2 cents.

Absolutely fabulous that this is being considered for implementation!

@shlomi-noach (Contributor, Author)

I think we could support initiating partitioning of a non-partitioned table using an online-DDL table rebuild?

That already works, just an implicit behavior of existing Online DDL!

@shlomi-noach (Contributor, Author)

if the table is dropped, I believe that should take precedence over running a rotation-migration re-introducting the old table

Absolutely. I'm wondering, though, how Vitess will communicate that there's an orphaned rule. Where will the error messages go?

if the user requests like hourly rotation but the table only supports daily, I would think that 23 out of 24 executions would be just a no-op?

That's an interesting idea. Giving this more thought!

@shlomi-noach (Contributor, Author)

Further design notes: what would this look like?

Config/rule

We'd have a rule per-table, with these fields:

  • table name
  • rule description (human readable)
  • rotation interval (hour/day/week/month/quarter/year)
  • interval mode (relevant for week interval, see docs)
  • create ahead count: number of empty future partitions to prepare ahead of time
  • expiration: duration after which partitions should be dropped (e.g. older than 1 month)
  • disable flag for creation (give more control to users over rotation)
  • disable flag for expiration (give more control to users over rotation)
  • possibly a flag for "only drop when rolling forward", creating a "tit for tat" behavior.

User facing commands

The user would have access to commands such as:

  • vtctldclient ShowPartitionRotationRules [--table <table>] (shows all rules or a specific table rule)
  • vtctldclient SetTablePartitionRotationRule to either create or update a rule. The user would be able to override settings, enable/disable, etc. Something like vtctldclient SetTablePartitionRotationRule --table notifications --interval day --expire-interval month --expire-interval-count 3 --create-ahead-count 7.
  • vtctldclient DropTablePartitionRotationRule to delete a rule.

Where rules are stored

There are two options:

  1. topo. This would be similar to how we embed throttler configuration within the Keyspace record. We would likewise introduce a subtree under Keyspace, and like the throttler, we would always copy updates from Keyspace to SrvKeyspace. Like the throttler, we would watch Keyspace for changes. I say "we" and mean vttablet.
  2. _vt tables: we would have something like this under _vt:
CREATE TABLE IF NOT EXISTS partition_rotation_rules
(
    `id`                        bigint unsigned  NOT NULL AUTO_INCREMENT,
    `keyspace`                  varchar(256)     NOT NULL,
    `shard`                     varchar(255)     NOT NULL,
    `mysql_schema`              varchar(128)     NOT NULL,
    `mysql_table`               varchar(128)     NOT NULL,
    `description`               text             NOT NULL,
    `interval_name`             varchar(32)      NOT NULL,
    `interval_mode`             tinyint unsigned NOT NULL DEFAULT '0',
    `create_disabled`           tinyint unsigned NOT NULL DEFAULT '0',
    `create_ahead_count`        int unsigned     NOT NULL DEFAULT '0',
    `expire_disabled`           tinyint unsigned NOT NULL DEFAULT '0',
    `expire_interval_name`      varchar(32)      NOT NULL,
    `expire_interval_count`     int unsigned     NOT NULL DEFAULT '0',
    `only_expire_on_prepare`    tinyint unsigned NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`)
)

Basically, map the table to a rule's config. We would then have a copy of a table's rule in each shard. When we create/update/drop a rule, we must ensure we submit the change to all shards. This will be done in a similar way to ApplySchema for Online DDL, and how migrations are submitted to _vt.schema_migrations. Some shards could be down or network isolated, in which case the vtctldclient command must return with an error, and it will be the user's responsibility to resubmit the rule (for all shards, or just for the missing shards). It is possible to get skewed config with this approach, though likely this will be only temporary and with no serious effects -- it all depends on the user's diligence.
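
Purely as an illustration (the keyspace, shard, and schema names below are made up, and the values mirror the earlier SetTablePartitionRotationRule example), a rule row submitted to one shard might look like:

INSERT INTO _vt.partition_rotation_rules
    (keyspace, shard, mysql_schema, mysql_table, description,
     interval_name, interval_mode, create_disabled, create_ahead_count,
     expire_disabled, expire_interval_name, expire_interval_count, only_expire_on_prepare)
VALUES
    ('commerce', '0', 'vt_commerce', 'notifications', 'daily rotation, 3 months retention',
     'day', 0, 0, 7,
     0, 'month', 3, 0)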

Further notes about rules

Whether the rules live in topo or in _vt tables, we must ensure rotation rules are copied over in MoveTables workflows.

How rotation is applied

This will be a task for the Online DDL executor, a component which already manifests as a state machine, and which routinely observes state of schema migrations.

Breakdown:

When?

There are multiple ways to go about "when to do the rotation". We could:

  1. Let the user specify specific time ranges ("I want the rotation to happen every Sunday between 2:00am and 4:00am")
  2. Automatically infer the best time for rotation based on the partitioning scheme (if rotating daily, rotate at about midnight; if rotating weekly on Sunday, rotate at about Saturday night, ...)
  3. Rotate whenever applicable.
  4. Don't rotate automatically, but only upon user's command, e.g. some vtctldclient ApplyPartitionRotations [--table=<table>]

I'm leaning towards (3) and I'll explain why: suppose we just check the situation every 30min (which is short enough for hourly rotation). Most of the time we will have nothing to do. Once in a while, we will have something to do. Imagine we need to do a daily rotation on table t, and that we need to (a) prepare 7 days ahead, and (b) drop partitions older than 30 days. Per #17426, we would just try to see if there's any need to create 7 partitions ahead. Most of the time the answer will be "nope! there's already 7 empty partitions ahead of time". But the first run right after midnight will find that "yes, there are only 6 empty partitions ahead of time, hence we need to create one more".

A 30-minute interval is naive. We can make the behavior more intelligent, given that schemadiff knows how to analyze a table's rotation interval. We could e.g. figure out that if t rotates daily, and at 00:12 we were able to prepare ahead all required partitions, then there's really no need to check again for the next 23+ hours.
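
To picture what such a periodic check evaluates, here is a rough, purely illustrative query (the schemadiff analysis in #17426 works off the table definition rather than information_schema, and the schema/table names are made up):

SELECT PARTITION_NAME, PARTITION_DESCRIPTION, TABLE_ROWS
  FROM information_schema.PARTITIONS
 WHERE TABLE_SCHEMA = 'vt_commerce'
   AND TABLE_NAME = 't'
 ORDER BY PARTITION_ORDINAL_POSITION

If the result already shows the required number of empty partitions with boundaries in the future, there is nothing to do; otherwise the missing ones need to be created.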

create/add ahead partitions

ALTER TABLE ... ADD PARTITION is a cheap operation. It essentially creates an internal InnoDB table. It does, however, take a DDL lock on the entire table, meaning it locks all partitions. As such, it can conflict with long-running queries. This can either create locks and delays on the primary, or create lag on replicas.

This is an argument for (1) - letting the user choose their preferred time of rotation, or (4) - again throwing the problem at the user.
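
For concreteness, a single ahead-of-time creation step would be along these lines (this assumes the table has no MAXVALUE catch-all partition; the table and partition names are made up):

-- add one empty future partition covering 2025-01-08
ALTER TABLE t ADD PARTITION (
    PARTITION p20250108 VALUES LESS THAN (TO_DAYS('2025-01-09'))
)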

drop/expire partitions

ALTER TABLE ... DROP PARTITION is a not-so-cheap operation. Under the hood, this translates to a DROP TABLE. With recent MySQL 8.0 versions, that's generally not a big deal -- except when it comes to partitions, this command again locks the entire table, including all other existing partitions. So if the partition is large and takes 5 seconds to DROP, that's a 5-second lock on the table.

We could do that, but we can employ a more sophisticated approach, utilizing the EXCHANGE PARTITION command. Basically, MySQL allows us to exchange a table's partition with an actual non-partitioned table that otherwise has the same schema design. So the trick (sketched below) is to:

  1. CREATE TABLE _vt_hld_... (<same schema design>), which is of course empty,
  2. followed by ALTER TABLE t EXCHANGE PARTITION p WITH _vt_hld_.... This operation is cheap - it does place a DDL lock on the table just as ADD PARTITION does, but no data is dropped here. _vt_hld_... becomes partition p, and partition p becomes _vt_hld_....
  3. Then, ALTER TABLE t DROP PARTITION p -- this partition is now empty so the drop is fast (note this again locks the table, 2nd time).
  4. As for _vt_hld_...? We just let the table garbage collector sweep it whenever. We can either make the table immediately recyclable, or we can make it recyclable after 24 hours. Either way, dropping the table will not impose any direct locks on our partitioned table t. It is subject to all normal DROP TABLE terms and conditions; it could create some lag, etc.
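
A minimal sketch of this flow, assuming an expired daily partition p20241124 on table t (the hold-table name below is made up; actual naming would follow the table garbage collector's conventions):

-- create an empty, non-partitioned table with the same schema design
CREATE TABLE _vt_hld_example_20241224 LIKE t;
ALTER TABLE _vt_hld_example_20241224 REMOVE PARTITIONING;
-- swap the expired partition's data into the hold table: brief DDL lock, no data dropped
ALTER TABLE t EXCHANGE PARTITION p20241124 WITH TABLE _vt_hld_example_20241224;
-- the partition is now empty, so this drop is fast (second brief lock)
ALTER TABLE t DROP PARTITION p20241124;
-- _vt_hld_example_... is left for the table garbage collector to drop at its leisure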

Scheduling and execution

Whatever the timing is, Online DDL would use schemadiff to compare the table definition with the rules, and determine whether there's any need to ADD/DROP partitions. For any such change (there could be multiple), it will create a schema migration. The UUID for each such migration will be a deterministic hash of the table name and partition name. Per #17426, partitions are named after their respective timestamps, which again makes this nicely deterministic.

This deterministic UUID evaluation conveniently eliminates any concerns for "what if Online DDL tries creating something that already exists due to PRS/ERS" family of questions. Online DDL always ignores migrations with an already existing UUID (and doesn't even submit such migrations).
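
Just to illustrate the determinism (this particular hashing scheme is purely hypothetical; the real derivation would live in the Online DDL executor code), hashing the table name plus partition name always yields the same UUID, so repeated submissions collapse into the same migration:

SELECT LOWER(CONCAT_WS('-',
         SUBSTR(h, 1, 8), SUBSTR(h, 9, 4), SUBSTR(h, 13, 4),
         SUBSTR(h, 17, 4), SUBSTR(h, 21, 12))) AS migration_uuid
  FROM (SELECT MD5('notifications/p20250108') AS h) AS digest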

Online DDL will then submit the migration onto itself, i.e. create a _vt.schema_migrations entry. We can further decorate the table with a couple of additional columns that explain why this was submitted (on behalf of what rule, etc.). The rest is just the migration.
