-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(backend): handle client side HTTP timeouts to fix crashes of metadata-writer. Fixes #8200 #11361
Conversation
… adjust timeouts according to the recommendations of the Kubernetes client authors. Signed-off-by: Evgeniy Mamchenko <[email protected]>
Hi @OutSorcerer. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OutSorcerer just want to clarify a few things.
except Exception as e: | ||
import traceback | ||
print(traceback.format_exc()) | ||
# Server side timeout, continue watching. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this line meant to be a comment for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment marks the place in code which is only reached when the server side timeout occurs, it was not meant to comment a particular statement.
Should I remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe for clarity we can explain in the comment that the "continue[d] watching" occurs by getting a new stream on the next iteration of the while
loop — that way it is clear the comment is referring to the logical flow rather than a particular statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comment indeed seems misplaced.
@OutSorcerer isn't possible to handle the server-side exception the same way you did for the client-side one and keep the original exception handling (except Exception as e:
)?
Something like:
try:
...
except <Server-side error> as e:
# Server side timeout, continue watching.
pass
except urllib3.exceptions.ReadTimeoutError as e:
# Client side timeout, continue watching.
pass
except Exception as e:
import traceback
print(traceback.format_exc())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the behaviour in case of a server-side timeout is that the loop for event in pod_stream
just finishes normally (because the underlying HTTP request to Kubernetes API also finishes) so it cannot be handled by try-except
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks for clarifying @OutSorcerer.
Then how about the following?
# Server side timeout, continue watching. | |
# If the for loop ended, a server-side timeout occurred. Continue watching. | |
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see, now it will be clear to which line that comment belongs. I made a new commit that adds pass
under the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment text to If the for loop ended, a server-side timeout occurred. Continue watching.
as well.
try: | ||
for event in pod_stream: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to understand why we need to retry the entire iterator on every error and thus create a new one?
Or does the iterator returned by pod_stream
become "poisoned" when it fails, so calling __next__
on it will never return a new item in the stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling next on it will never return a new item in the stream?
As I understand, yes, this is what happens here.
In case of a network error causing a client timeout it was leading to an unhandled exception in metadata-writer
, so Kubertenetes was restarting it and increasing the restart counter.
In the stack trace the error was happening on this line:
for event in pod_stream: |
Traceback (most recent call last):
File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
for event in pod_stream:
File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 449, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='34.118.224.1', port=443): Read timed out.
# Client side timeout, continue watching. | ||
pass | ||
|
||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we were already catching all these errors before, but I am struggling to see why catching Exception
won't get us sometimes stuck where we never actually crash.
Especially because you proposed above that we have this try around the for loop, meaning any iteration errors will endlessly retry on a pod_stream
that may never work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep the amount of changes small originally.
But in my understanding removing this catch-all except clause improves the code. The unhalded exceptions will still be printed to the console and they will not be hidden from a user anymore. A user will see that restart counter increases and will be able to get the logs by a commnad like kubectl -n kubeflow logs --previous metadata-writer-6d5b8456-78265
. Kubernetes will restart metadata-writer
pod and it will continue handling events if it is possible.
So I made a new commit now that removes the catch-all except clause.
/ok-to-test |
… users. Signed-off-by: Evgeniy Mamchenko <[email protected]>
Signed-off-by: Evgeniy Mamchenko <[email protected]>
Signed-off-by: Evgeniy Mamchenko <[email protected]>
3f1d534
to
553e97c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Is there an estimate on when this could be approved and merged, and/or is there anything I can do to help? Just curious as my deployment is running into the same issue. This is more severe than just a pod restarting repeatedly, as when the pod is down, Kubeflow is seemingly unable to properly authorize users for namespaces. |
@kubeflow/pipelines maintainers can we get some eyes on this important PR (It needs some work, but is important as it fixes a critical issue that prevents KFP metadata-writer working on some Kubernetes distros). For more context, please see my comment here: But the issue is simply that some TCP sockets are timing out when we are watching Kubernetes resources from python. |
Signed-off-by: Evgeniy Mamchenko <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve Thanks @OutSorcerer ! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: HumairAK The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@thesuperzapper, @hbelmiro, @HumairAK, thanks for reviewing the PR! |
Description of your changes:
urllib3.exceptions.ReadTimeoutError
) ofk8s_watch.stream
to prevent crashes ofmetadata-writer
pod. Without thismetadata-writer
pod fails in cases when a connection error causes a client timeout. This should fix [backend] Metadata writer pod always restarting #8200.metadata-writer
use case is that, currently, with a client side timeout of 2000 seconds (33.3 minutes) if a connection error happens during this period,metadata-writer
stops doing its job for up to 33.3 minutes. After this changemetadata-writer
will only be disconneted for up to the new client timeout, 60 seconds.ReadTimeoutError
is thrown and caught even in the absence of errors.Checklist: