Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(2.11) JetStream API routed queue changes #6342

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

neilalexander
Copy link
Member

This PR makes three changes:

  1. The routed JS API queue switches from a FIFO to a LIFO discipline. This ensures that the workers are always pulling the most recent tasks first, since they have the highest chance of being completed before the clients have given up waiting.
  2. Each API worker now only pulls off a single item from the IPQ at a time, rather than pulling all of them off in one go. This is important because in a scenario where the queue has built up, one single worker will end up pulling off all (or the majority) of the pending tasks and starving other workers, which slows down the recovery.
  3. The JS API queue no longer drains all requests when the limit is reached, instead trying to evict only the oldest entry to make way for a new one. This ensures that transient pauses/conditions have a relatively lower blast radius whilst maintaining a best-effort service level.

Signed-off-by: Neil Twigg [email protected]

@neilalexander neilalexander requested a review from a team as a code owner January 8, 2025 11:57
@Jarema
Copy link
Member

Jarema commented Jan 9, 2025

One concern I have with LIFO is - if the System is under managable, but constant pressure, causing the queue to be more or less at constant size, the old requests might never be handled.

@neilalexander
Copy link
Member Author

One concern I have with LIFO is - if the System is under managable, but constant pressure, causing the queue to be more or less at constant size, the old requests might never be handled.

With the LIFO approach it does mean that we prioritise more recent requests, yes, but consider the alternative: if the system builds up requests because of a pause or because it is consistently struggling, then we have a head-of-line blocking problem. The current FIFO approach will be pulling off the oldest requests and handling them after the clients have given up waiting, which in turn means that all requests behind them will also be handled after the clients have given up. The result is that we've affected a really large number of clients because none of the work got completed in time.

The LIFO approach turns it on its head and continues to serve the most recent requests as soon as possible, which gets a number of "happy" clients out of the way as soon as possible. (They were never head-of-line blocked and therefore do not need to come back and retry later, which just compounds the problem and makes everything worse.)

We then optimistically try to satisfy the older requests if we have the resources available to do so. If we don't then that's unavoidable, but that's no worse than FIFO today. At least with LIFO, we prioritised getting some work done that clients will still be waiting for and won't need to retry.

Also worth noting this is only affecting the routed JS API requests, i.e. ones that have to be routed to the metaleader for completion. For requests that can be handled inline from a client connection, this doesn't change anything. It also doesn't change anything for a system that is operating normally.

@neilalexander neilalexander changed the title JetStream API routed queue changes (2.11) JetStream API routed queue changes Jan 9, 2025
derekcollison added a commit that referenced this pull request Jan 9, 2025
This updates the JS API workers to use `popOne`, so that if the queue
stacks up at all, the recovery is fairer across workers, rather than
single workers stealing the entire queue.

This was extracted from #6342.

Signed-off-by: Neil Twigg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants