Retry more aggressively in VssStore. #368

Merged: 1 commit merged into lightningdevkit:main on Oct 16, 2024

Conversation

@G8XSU (Contributor) commented Oct 10, 2024

Since a failed persistence might cause LDK to panic.

```rust
	.with_max_total_delay(Duration::from_secs(2))   // previous value
let retry_policy = ExponentialBackoffRetryPolicy::new(Duration::from_millis(50))
	.with_max_attempts(6)
	.with_max_total_delay(Duration::from_secs(7))   // updated value
```
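As an illustrative aside, the following is a generic sketch of how a base delay, a max-attempts cap, and a max-total-delay cap interact in an exponential-backoff loop. It is not vss-client's implementation, just a minimal model of the policy being tuned above.

```rust
use std::time::{Duration, Instant};

// Illustrative only: not vss-client's implementation. Shows where the "give up"
// point sits once either the attempt cap or the total-delay cap is exceeded.
fn retry_with_backoff<T, E>(
	mut op: impl FnMut() -> Result<T, E>, base: Duration, max_attempts: u32,
	max_total_delay: Duration,
) -> Result<T, E> {
	let start = Instant::now();
	let mut delay = base;
	let mut attempts = 0u32;
	loop {
		let err = match op() {
			Ok(v) => return Ok(v),
			Err(e) => e,
		};
		attempts += 1;
		// Once either cap is hit, the error is surfaced to the caller; in
		// VssStore's case that is where a panic could follow.
		if attempts >= max_attempts || start.elapsed() + delay > max_total_delay {
			return Err(err);
		}
		std::thread::sleep(delay);
		delay *= 2; // exponential growth between retries
	}
}
```

With the values from the snippet above, such a loop gives up after at most 6 attempts or roughly 7 seconds of accumulated delay, whichever limit is hit first.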


Even seven seconds seems like a really short period before we give up and crash entirely :/

@G8XSU (Contributor Author):

Maybe. 7 was just calculated based on the backoff intervals between retries for approximately 6 attempts.

I wanted to increase it further, but increasing it too much leads to the node being stuck for that duration, so too much is also bad? (Difficult to determine; 10 secs?)

Open to other options: we could also remove the limit on max_total_delay and just keep something like 8 exponential-backoff retries, or use a naive retry with jitter and a fixed delay.
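That estimate can be sanity-checked under the assumption that the delay simply doubles after each retry starting from the 50 ms base (an assumption about the backoff formula, not taken from vss-client):

```rust
use std::time::Duration;

// Assumed model: the n-th retry waits base * 2^n, so six retries from a 50 ms base
// wait 100 + 200 + 400 + 800 + 1600 + 3200 ms = 6300 ms, just under the 7 s cap.
fn total_backoff(base: Duration, retries: u32) -> Duration {
	(1..=retries).map(|n| base * 2u32.pow(n)).sum()
}

fn main() {
	assert_eq!(total_backoff(Duration::from_millis(50), 6), Duration::from_millis(6300));
}
```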

Collaborator (tnull):

Mhh, so I agree with Matt that we had discussed retrying forever on persistence failure until we get a proper async-persistence interface, as we just can't arbitrarily crash when our connectivity drops. I guess a raised max delay is better than nothing, but we probably still want to drop the max attempts and max delay limits, or at the very least bump them way, way up?

@G8XSU (Contributor Author) Oct 16, 2024

I did bump them after the previous comment;
now it is 10 attempts and 15 secs max delay.
Let me know if you have a specific number in mind.
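Under the same doubling assumption as above, ten retries from a 50 ms base would accumulate far more than 15 seconds of raw backoff, so the max_total_delay cap is what would actually bound how long the node stays stuck (again an illustration of the assumed model, not vss-client's exact formula):

```rust
use std::time::Duration;

// Same assumed doubling model as in the earlier estimate: the n-th retry waits base * 2^n.
fn total_backoff(base: Duration, retries: u32) -> Duration {
	(1..=retries).map(|n| base * 2u32.pow(n)).sum()
}

fn main() {
	// 100 + 200 + ... + 51_200 ms, roughly 102 s of raw backoff for 10 retries,
	// so the 15 s max_total_delay would be the binding limit, not the attempt count.
	let raw = total_backoff(Duration::from_millis(50), 10);
	assert_eq!(raw, Duration::from_millis(102_300));
	println!("raw backoff sum: {:?} vs. cap: {:?}", raw, Duration::from_secs(15));
}
```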

Collaborator (tnull):

I think in the interim we still want to go the 'forever' route, as recently discussed offline. In any case, going to merge this as it's a step in the right direction; it's now tracked here: #380

@tnull (Collaborator) commented Oct 11, 2024

> Since a failed persistence might cause LDK to panic.

We also discussed reducing the number of potential panics, e.g., actually using ReplayEvent in event handling now that #358 has landed. Should we still do this here or in a follow-up PR?

@G8XSU (Contributor Author) commented Oct 14, 2024

> We also discussed reducing the number of potential panics, e.g., actually using ReplayEvent in event handling now that #358 has landed. Should we still do this here or in a follow-up PR?

I think those changes are independent of these.
I made changes to ReplayEvent handling in case of persistence failure, but I'm not entirely sure whether they are sufficient/exhaustive for ldk-node: #374
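For illustration, a minimal sketch of what using ReplayEvent on persistence failure can look like in an event handler, assuming an LDK version where the handler returns Result<(), ReplayEvent> (#358); persist_side_effects is a hypothetical stand-in for whatever the node persists while handling an event:

```rust
use lightning::events::{Event, ReplayEvent};

// Minimal sketch of an event handler that requests a replay on persistence failure
// instead of panicking. `persist_side_effects` is a hypothetical helper.
fn handle_event(event: Event) -> Result<(), ReplayEvent> {
	match persist_side_effects(&event) {
		Ok(()) => Ok(()),
		// Returning ReplayEvent asks LDK to hand this event back on a later
		// event-processing pass rather than considering it handled.
		Err(_e) => Err(ReplayEvent()),
	}
}

// Hypothetical stand-in for the per-event persistence the handler performs.
fn persist_side_effects(_event: &Event) -> Result<(), std::io::Error> {
	Ok(())
}
```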

Commit: Since a failed persistence might cause LDK to panic.
@tnull merged commit ca6c2fa into lightningdevkit:main on Oct 16, 2024
12 of 13 checks passed