Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(backend): handle client side HTTP timeouts to fix crashes of metadata-writer. Fixes #8200 #11361

Merged
merged 5 commits into from
Nov 26, 2024

Conversation

OutSorcerer
Copy link
Contributor

@OutSorcerer OutSorcerer commented Nov 6, 2024

Description of your changes:

  • Handles client side timeouts (urllib3.exceptions.ReadTimeoutError) of k8s_watch.stream to prevent crashes of metadata-writer pod. Without this metadata-writer pod fails in cases when a connection error causes a client timeout. This should fix [backend] Metadata writer pod always restarting  #8200.
  • Adjusts the client side timeout according to the recommendations of Kubernetes Python client authors.

    It is recommended to set this [server] timeout value to a higher number such as 3600 seconds (1 hour).
    It is recommended to set this [client] timeout value to a lower number (for eg. ~ maybe 60 seconds).

    • The benefit of this for the metadata-writer use case is that, currently, with a client side timeout of 2000 seconds (33.3 minutes) if a connection error happens during this period, metadata-writer stops doing its job for up to 33.3 minutes. After this change metadata-writer will only be disconneted for up to the new client timeout, 60 seconds.
    • Since the client timeout is now smaller than the server timeout, if no events occur within the client timeout period, a ReadTimeoutError is thrown and caught even in the absence of errors.
  • Makes client side and server side timeout configurable via environment variables.
  • Updates Kubernetes Python client to the latest version.

Checklist:

… adjust timeouts according to the recommendations of the Kubernetes client authors.

Signed-off-by: Evgeniy Mamchenko <[email protected]>
Copy link

Hi @OutSorcerer. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot added the ci-passed All CI tests on a pull request have passed label Nov 6, 2024
Copy link
Member

@thesuperzapper thesuperzapper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OutSorcerer just want to clarify a few things.

except Exception as e:
import traceback
print(traceback.format_exc())
# Server side timeout, continue watching.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this line meant to be a comment for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment marks the place in code which is only reached when the server side timeout occurs, it was not meant to comment a particular statement.

Should I remove it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for clarity we can explain in the comment that the "continue[d] watching" occurs by getting a new stream on the next iteration of the while loop — that way it is clear the comment is referring to the logical flow rather than a particular statement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment indeed seems misplaced.
@OutSorcerer isn't possible to handle the server-side exception the same way you did for the client-side one and keep the original exception handling (except Exception as e:)?
Something like:

    try:
        ...
    except <Server-side error> as e:
         # Server side timeout, continue watching.
         pass
    except urllib3.exceptions.ReadTimeoutError as e:
         # Client side timeout, continue watching.
         pass
    except Exception as e:
         import traceback
         print(traceback.format_exc())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the behaviour in case of a server-side timeout is that the loop for event in pod_stream just finishes normally (because the underlying HTTP request to Kubernetes API also finishes) so it cannot be handled by try-except.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for clarifying @OutSorcerer.
Then how about the following?

Suggested change
# Server side timeout, continue watching.
# If the for loop ended, a server-side timeout occurred. Continue watching.
pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see, now it will be clear to which line that comment belongs. I made a new commit that adds pass under the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment text to If the for loop ended, a server-side timeout occurred. Continue watching. as well.

Comment on lines +167 to +168
try:
for event in pod_stream:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand why we need to retry the entire iterator on every error and thus create a new one?

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling __next__ on it will never return a new item in the stream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling next on it will never return a new item in the stream?

As I understand, yes, this is what happens here.

In case of a network error causing a client timeout it was leading to an unhandled exception in metadata-writer, so Kubertenetes was restarting it and increasing the restart counter.

In the stack trace the error was happening on this line:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 449, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='34.118.224.1', port=443): Read timed out.

# Client side timeout, continue watching.
pass

except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we were already catching all these errors before, but I am struggling to see why catching Exception won't get us sometimes stuck where we never actually crash.

Especially because you proposed above that we have this try around the for loop, meaning any iteration errors will endlessly retry on a pod_stream that may never work?

Copy link
Contributor Author

@OutSorcerer OutSorcerer Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep the amount of changes small originally.

But in my understanding removing this catch-all except clause improves the code. The unhalded exceptions will still be printed to the console and they will not be hidden from a user anymore. A user will see that restart counter increases and will be able to get the logs by a commnad like kubectl -n kubeflow logs --previous metadata-writer-6d5b8456-78265. Kubernetes will restart metadata-writer pod and it will continue handling events if it is possible.

So I made a new commit now that removes the catch-all except clause.

@hbelmiro
Copy link
Contributor

hbelmiro commented Nov 7, 2024

/ok-to-test

@google-oss-prow google-oss-prow bot added size/S and removed size/M labels Nov 7, 2024
Copy link
Contributor

@hbelmiro hbelmiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ishaan-mehta
Copy link

ishaan-mehta commented Nov 19, 2024

Is there an estimate on when this could be approved and merged, and/or is there anything I can do to help? Just curious as my deployment is running into the same issue.

This is more severe than just a pod restarting repeatedly, as when the pod is down, Kubeflow is seemingly unable to properly authorize users for namespaces.

@thesuperzapper
Copy link
Member

@kubeflow/pipelines maintainers can we get some eyes on this important PR (It needs some work, but is important as it fixes a critical issue that prevents KFP metadata-writer working on some Kubernetes distros).

For more context, please see my comment here:

But the issue is simply that some TCP sockets are timing out when we are watching Kubernetes resources from python.

@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Nov 25, 2024
Copy link
Contributor

@hbelmiro hbelmiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Nov 26, 2024
@HumairAK
Copy link
Collaborator

/approve

Thanks @OutSorcerer !

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@OutSorcerer
Copy link
Contributor Author

@thesuperzapper, @hbelmiro, @HumairAK, thanks for reviewing the PR!

@google-oss-prow google-oss-prow bot merged commit 94a21cc into kubeflow:master Nov 26, 2024
17 checks passed
@github-actions github-actions bot added ci-passed All CI tests on a pull request have passed and removed ci-passed All CI tests on a pull request have passed labels Nov 26, 2024
@HumairAK HumairAK modified the milestones: KFP SDK 2.11, KFP 2.4.0 Dec 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed All CI tests on a pull request have passed lgtm ok-to-test size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[backend] Metadata writer pod always restarting
5 participants