Replies: 2 comments 3 replies
-
I agree. I actually think we would never need to synchronize checkpoints in a distributed way. The worst case would be checkpoints taking up enough IO/CPU resources on the writer that ingestion latency is significantly impacted. To address that, we can just run checkpoints in a dedicated checkpoint worker that only performs checkpoints.
Instead of loading the table from scratch in the checkpoint thread, how about we just send the necessary fields of the `DeltaTable` struct over directly through the channel?
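For what it's worth, a minimal sketch of that idea using `std::sync::mpsc`, with a made-up `TableStateSnapshot` standing in for whichever `DeltaTable` fields the checkpoint actually needs:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the DeltaTable fields a checkpoint needs
// (version, active files, tombstones, etc. in the real struct).
#[derive(Debug)]
struct TableStateSnapshot {
    version: i64,
    files: Vec<String>,
}

fn main() {
    let (tx, rx) = mpsc::channel::<TableStateSnapshot>();

    // Dedicated checkpoint worker: receives the state directly instead of
    // reloading the table from storage.
    let worker = thread::spawn(move || {
        for snapshot in rx {
            // A real worker would write the parquet checkpoint here.
            println!("checkpointing version {} ({} files)", snapshot.version, snapshot.files.len());
        }
    });

    // Writer side: after committing a version, hand the in-memory state over.
    tx.send(TableStateSnapshot {
        version: 10,
        files: vec!["part-00000.parquet".into()],
    })
    .unwrap();

    drop(tx); // closing the channel lets the worker loop exit
    worker.join().unwrap();
}
```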
-
I've submitted an initial PR in #280 with some design-oriented notes. After review there, I'll update this discussion with next steps.
-
Summary
This is an open forum to discuss our solution for writing checkpoint files (See also #106).
Some up front opinions of mine that affect the design:

1. If the checkpoint worker crashes or a checkpoint write fails, that's not the end of the world - the worst that can happen is that readers will have to load some extra JSON log entries until a later checkpoint succeeds.
2. Overwriting `_last_checkpoint` with an older checkpoint version in a multi-writer scenario is also not _that_ bad. Again - the worst that can happen is that readers reading at the time will have to load some extra JSON log entries.

These two opinions have a strong impact on the design. Basically - I'd like to avoid the complexity of coordinating checkpoints in a distributed way and accept some trade-offs that do not break log correctness.
Design
Static Structure
The image below shows a static structure diagram that mentions some existing relationships in delta-rs and some additions to support writing checkpoints. Blue highlights are adds, red highlights are especially important deps. Each `DeltaTable` instance already holds the last checkpoint in memory. The checkpoint includes a `version` field which is useful for the snapshot logic. The added `DeltaTable` `last_checkpoint_version()` method just exposes the version of the last checkpoint to callers who need to determine whether they should run a checkpoint or not.

Delta writes must always go through a `DeltaTransaction`. The committed version is useful to allow callers to determine whether a checkpoint should be written or not.

The static structure also proposes a `CheckPointWriter` struct with a public `run_loop` method and a public `tx` (aka "sender") field. The `CheckPointWriter` `run_loop` method may be started on a separate thread for any application that wants to periodically create checkpoints and publish checkpoint intents as versions are written. This channel may be used internally by higher level writers or managed externally.
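To make the proposed shape concrete, a rough Rust sketch follows. The names `CheckPointWriter`, `run_loop`, and `tx` come from the description above; the `CheckpointIntent` message type, the channel choice, and all signatures are assumptions for illustration, not the actual delta-rs API:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Assumed message type: a writer publishes this after committing a version
// it thinks should be followed by a checkpoint.
struct CheckpointIntent {
    table_uri: String,
    version: i64,
}

// Sketch of the proposed CheckPointWriter: a public `tx` for publishing
// intents and a `run_loop` that drains them, typically on its own thread.
struct CheckPointWriter {
    pub tx: Sender<CheckpointIntent>,
    rx: Receiver<CheckpointIntent>,
}

impl CheckPointWriter {
    fn new() -> Self {
        let (tx, rx) = channel();
        CheckPointWriter { tx, rx }
    }

    pub fn run_loop(self) {
        let CheckPointWriter { tx, rx } = self;
        drop(tx); // drop our own sender so the loop ends once external senders are gone
        for intent in rx {
            // Real implementation: load the table at intent.version and
            // call delta_table.create_checkpoint().
            println!("checkpoint intent: {} @ v{}", intent.table_uri, intent.version);
        }
    }
}

fn main() {
    let writer = CheckPointWriter::new();
    let tx = writer.tx.clone();
    let handle = thread::spawn(move || writer.run_loop());

    tx.send(CheckpointIntent {
        table_uri: "s3://bucket/my_table".into(),
        version: 42,
    })
    .unwrap();

    drop(tx); // no more intents; run_loop will exit
    handle.join().unwrap();
}
```

Keeping `tx` public is what lets higher-level writers publish intents without knowing how (or whether) the loop is hosted on a separate thread.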
Interaction

Client Perspective
This next image shows a sequence diagram from the client perspective, walking through the steps a client (aka writer) will take.

To prevent checkpoint creation from slowing down data writes, the client can use the above-mentioned data points coupled with the static structure design to publish a checkpoint intention for a separate thread to handle. Alternatively (and not pictured), a client could run checkpoints explicitly without hosting a separate thread for the `CheckPointWriter`. In that case, rather than sending a checkpoint intention to the writer channel, the client could just invoke `delta_table.create_checkpoint()` when appropriate.
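As an illustration of that decision point, here is a hedged sketch with stand-in types. The `CHECKPOINT_INTERVAL` policy and every signature are assumptions; only the method names `last_checkpoint_version()` and `create_checkpoint()` come from the proposal above:

```rust
use std::sync::mpsc::Sender;

// Stand-in for DeltaTable; method names follow the proposal, everything
// else is assumed for the sake of the example.
struct DeltaTable {
    last_checkpoint_version: i64,
}

impl DeltaTable {
    fn last_checkpoint_version(&self) -> i64 {
        self.last_checkpoint_version
    }
    fn create_checkpoint(&self) {
        println!("checkpoint written inline");
    }
}

// Hypothetical policy: checkpoint every 10 commits.
const CHECKPOINT_INTERVAL: i64 = 10;

// Called by the client right after the DeltaTransaction commit returns a version.
fn after_commit(table: &DeltaTable, committed_version: i64, tx: Option<&Sender<i64>>) {
    if committed_version - table.last_checkpoint_version() >= CHECKPOINT_INTERVAL {
        match tx {
            // Preferred: publish an intent and let the CheckPointWriter thread do the work.
            Some(sender) => {
                let _ = sender.send(committed_version);
            }
            // Alternative (not pictured): no CheckPointWriter thread, checkpoint inline.
            None => table.create_checkpoint(),
        }
    }
}

fn main() {
    let table = DeltaTable { last_checkpoint_version: 30 };
    after_commit(&table, 42, None); // checkpoints inline since 42 - 30 >= 10
}
```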
CheckPointWriter Perspective

No pics here at the moment, but upon receiving a message to create a checkpoint, a `CheckPointWriter` instance must:

1. call `load_with_version` to load the table state at the requested version
2. call `delta_table.create_checkpoint()`
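In code form, that handler is roughly the following sketch. The types are stand-ins and the real `load_with_version` and `create_checkpoint` signatures may differ:

```rust
// Stand-in for DeltaTable; both methods are placeholders for the real calls.
struct DeltaTable {
    version: i64,
}

impl DeltaTable {
    fn load_with_version(&mut self, version: i64) -> Result<(), String> {
        // Replays the delta log so the in-memory state matches `version`.
        self.version = version;
        Ok(())
    }

    fn create_checkpoint(&self) -> Result<(), String> {
        // Writes the current DeltaTableState out as parquet checkpoint file(s).
        println!("checkpoint written for version {}", self.version);
        Ok(())
    }
}

// What the CheckPointWriter does for each received checkpoint intent.
fn handle_intent(table: &mut DeltaTable, version: i64) -> Result<(), String> {
    table.load_with_version(version)?;
    table.create_checkpoint()
}

fn main() {
    let mut table = DeltaTable { version: 0 };
    handle_intent(&mut table, 42).unwrap();
}
```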
DeltaTable.create_checkpoint Perspective
No pics here at the moment either, but ultimately this method just needs to write the associated `DeltaTableState` as one or more parquet files. The image under the "Static Structure" section describes details about the associations that make this possible and the fields of `DeltaTableState` that must be accounted for in the parquet file writes.

I'll repeat one more time - I think we can avoid synchronizing checkpoint writes in a distributed way for now and add this as an optimization later, because of the points mentioned in the summary regarding worker crash or backtracking `_last_checkpoint`.

Thoughts?