Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added doc for leases and the lease lifecycle. #1218

Merged
merged 6 commits into from
Nov 10, 2023

Conversation

stair-aws
Copy link
Contributor

@stair-aws stair-aws commented Oct 24, 2023

Issue #, if available:
N/A

Description of changes:
Added doc for leases and the lease lifecycle.

Rendered @ https://github.com/stair-aws/amazon-kinesis-client/blob/lease-doc/docs/lease-lifecycle.md

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

edit: fixed rendered link since the file was renamed.

@stair-aws stair-aws added the v2.x Issues related to the 2.x version label Oct 24, 2023

In KCL, a lease provides a temporal assignment between one Kinesis shard and an assigned worker.
Leases are persistent for the duration of shard processing (detailed later).
However, lease assignment is transient -- leases may be "stolen" by other workers in the same KCL application.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lease assignment is transient is a confusing statement, because the assignment has a consistent logic, it is the ownership that is transient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the assignment has a consistent logic

Pedantically, yes and no. "Yes", the programmed logic doesn't change, yet "no" in that X leases divvied up between Y workers will not produce consistent assignments due to intentional shuffling by KCL.

Per offline consensus, will reword to:

However, the worker that is processing a lease may change since leases may be "stolen" by other workers in the same KCL application.

docs/leases-and-lease-lifecycle.md Outdated Show resolved Hide resolved
## Lease Table

To persist metadata about lease state (e.g., last read checkpoint, current assigned worker), KCL creates a lease table in [DynamoDB][dynamodb].
Each KCL application will have its own distinct lease table that transcludes the application name.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great verbage, but "transcludes" I don't know will be accessible. I would suggest something more like includes.

We can keep this but I think most will have to look it up in the dictionary

docs/leases-and-lease-lifecycle.md Outdated Show resolved Hide resolved
docs/leases-and-lease-lifecycle.md Outdated Show resolved Hide resolved
docs/leases-and-lease-lifecycle.md Outdated Show resolved Hide resolved
end

PSS->PSS: runShardSync()
opt if not required to sync shards
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would a leader not be required to sync shards? Is this if there are no shards, and if so can we say that explicitly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some examples:

  1. KCL is a multi-threaded app, and the stream could be purged at some time before, or during, the runShardSync()
  2. the known shard list is complete (i.e., no holes in the hash range)
  3. the known shard list is incomplete, but the holes don't cross the threshold to trigger a sync
  4. some future edit that may add, or remove, another condition that affects sync logic

if so can we say that explicitly

The more specific/nuanced documentation is, the more likely it will suffer bitrot with an implementation change. We should balance documentation to convey ideas vs. repacking the implementation as English.

Update pending to provide some additional details, yet not a full enumeration.

![Abridged sequence diagram of the Shard Sync process.
Listed participants are the Scheduler, LeaseCoordinator, PeriodicShardSyncManager, ShardSyncTask,
Lease Table (DDB), LeaseRefresher, LeaseSynchronizer, HierarchicalShardSyncer, and ShardDetector.
](images/lease-shard-sync.png)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This image is hard to read and small. I don't know if we can increase text size. Also it introduces a lot of new terms that the customer maynot be familiar with

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This image is hard to read and small.

Agreed, and this is after creating the abridged version that omits several classes/calls. It's likely a safe assumption that KCL users know how to click-into an image, or zoom-in via their browsers.

it introduces a lot of new terms that the customer maynot be familiar with

Anything specific to call out? Code links are provided immediately after so any reader can dive deeper.

* In general, leases spend the majority of their life in this state.
1. `SHARD_END`: The associated shard is `SHARD_END` and all records have been processed by KCL for the shard.
1. `DELETION`: Since there are no more records to process, KCL will delete the lease from the lease table.
* Deletion will only be triggered after all parents of a child shard have converged to `SHARD_END`.
Copy link
Contributor

@akidambisrinivasan akidambisrinivasan Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not clear which is getting deleted here from reading the sentence. Maybe we should say a lease is deleted only after the child shard(s) has checkpointed atleast once. So its not just converging to SHARD_END along with other co-parent, but also that the child lease must have started processing and checkpoined atleast once.

* First time starting KCL with an empty lease table.
* Stream mutations (i.e., split, merge) that create child shards.
* In multi-stream mode, dynamic discovery of a new stream.
1. `CREATION`: Leases are created 1:1 for each discovered shard, and initialized at the configured initial position.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for root leases they are created with configured initial position, all other leases for mutations of shards have TRIM_HORIZON ?


Lease syncing is a complex responsibility owned by the "leader" host in a KCL application.
By invoking the [ListShards API][list-shards], KCL will identify the shards for the configured stream(s).
This process is scheduled at a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a section on this below, but I would recommend to add it here so it doesnt seem like ListShards happens periodically but only when the auditor determines leases are missing, othterwise childshard leases are created when parent reaches SHARD_END, or something along those lines ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid readers incorrectly assuming that ListShards is invoked every interval, rewording (per offline consensus):

This process is scheduled at a configurable interval so KCL can self-identify new shards introduced via stream mutations.

... to:

This process is scheduled at a configurable interval so KCL can decide whether it should query for potentially-new shards.

1. `LeaseRefresher` invokes a DDB `scan` against the lease table which has a cost proportional to the number of leases.
1. Frequent balancing may cause high lease turn-over which incurs DDB `write` costs, and potentially redundant work for stolen leases.
1. High `maxLeasesToStealAtOneTime` may cause churn.
* For example, worker `B` steals multiple leases from worker `A` creating a numerical imbalance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It always only steals to balance, but wont steal max if max is not needed to balance ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. If applicable, we can reintroduce this language later w/ a better example.

private final ShardInfo shardInfo;
private final ShardDetector shardDetector;

StreamIdentifier streamIdentifier;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes intended ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Per Lombok @Value:

all fields are made private and final by default, and setters are not generated

Explicit private final are unnecessary/redundant, and get flagged by IntelliJ as such.

+ decomposed shard sync UML into two separate diagrams (initialization, loop)
@stair-aws stair-aws merged commit 51a62a5 into awslabs:master Nov 10, 2023
1 check passed
@stair-aws stair-aws deleted the lease-doc branch November 10, 2023 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v2.x Issues related to the 2.x version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants