fix(backend): handle client side HTTP timeouts to fix crashes of metadata-writer. Fixes #8200 #11361

Merged
merged 5 commits into from Nov 26, 2024
Changes from 4 commits
2 changes: 1 addition & 1 deletion backend/metadata_writer/requirements.txt
@@ -22,7 +22,7 @@ grpcio==1.66.1
# via ml-metadata
idna==3.10
# via requests
kubernetes==10.0.1
kubernetes==31.0.0
# via -r -
lru-dict==1.3.0
# via -r -
27 changes: 17 additions & 10 deletions backend/metadata_writer/src/metadata_writer.py
@@ -19,6 +19,7 @@
import collections
import kubernetes
import yaml
import urllib3
from time import sleep
import lru

@@ -30,7 +31,10 @@
workflow_name_to_context_id_size = os.environ.get('WORKFLOW_NAME_TO_CONTEXT_ID_SIZE', 5000)
pods_with_written_metadata_size = os.environ.get('PODS_WITH_WRITTEN_METADATA_SIZE', 5000)
debug_files_size = os.environ.get('DEBUG_FILES_SIZE', 5000)

# See the documentation on setting k8s_watch timeouts:
# https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
k8s_watch_server_side_timeout = os.environ.get('K8S_WATCH_SERVER_SIDE_TIMEOUT', 1800)
k8s_watch_client_side_timeout = os.environ.get('K8S_WATCH_CLIENT_SIDE_TIMEOUT', 60)

kubernetes.config.load_incluster_config()
k8s_api = kubernetes.client.CoreV1Api()
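
For reference, a minimal self-contained sketch of what the two new settings control in the kubernetes Python client (the int() coercion and the print are illustrative additions, not part of this diff):

import os
import kubernetes

server_side_timeout = int(os.environ.get('K8S_WATCH_SERVER_SIDE_TIMEOUT', 1800))
client_side_timeout = int(os.environ.get('K8S_WATCH_CLIENT_SIDE_TIMEOUT', 60))

kubernetes.config.load_incluster_config()
k8s_api = kubernetes.client.CoreV1Api()
k8s_watch = kubernetes.watch.Watch()

# timeout_seconds is sent to the API server: when it elapses, the server closes
# the watch and the for-loop below simply ends.
# _request_timeout is enforced locally by urllib3: if no data arrives within that
# window, urllib3.exceptions.ReadTimeoutError is raised out of the iterator.
for event in k8s_watch.stream(
    k8s_api.list_pod_for_all_namespaces,
    timeout_seconds=server_side_timeout,
    _request_timeout=client_side_timeout,
):
    print(event['type'], event['object'].metadata.name)
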
@@ -150,18 +154,18 @@ def is_kfp_v2_pod(pod) -> bool:
k8s_api.list_namespaced_pod,
namespace=namespace_to_watch,
label_selector=ARGO_WORKFLOW_LABEL_KEY,
timeout_seconds=1800, # Sometimes watch gets stuck
_request_timeout=2000, # Sometimes HTTP GET gets stuck
timeout_seconds=k8s_watch_server_side_timeout,
_request_timeout=k8s_watch_client_side_timeout,
)
else:
pod_stream = k8s_watch.stream(
k8s_api.list_pod_for_all_namespaces,
label_selector=ARGO_WORKFLOW_LABEL_KEY,
timeout_seconds=1800, # Sometimes watch gets stuck
_request_timeout=2000, # Sometimes HTTP GET gets stuck
timeout_seconds=k8s_watch_server_side_timeout,
_request_timeout=k8s_watch_client_side_timeout,
)
for event in pod_stream:
try:
try:
for event in pod_stream:
Member commented on lines +167 to +168:
I am trying to understand why we need to retry the entire iterator on every error and thus create a new one?

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling __next__ on it will never return a new item in the stream?

Contributor Author replied:

Or does the iterator returned by pod_stream become "poisoned" when it fails, so calling __next__ on it will never return a new item in the stream?

As I understand, yes, this is what happens here.

In the case of a network error causing a client-side timeout, this led to an unhandled exception in metadata-writer, so Kubernetes was restarting the pod and increasing its restart counter.

In the stack trace the error was happening on this line:

Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 163, in <module>
    for event in pod_stream:
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 449, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='34.118.224.1', port=443): Read timed out.
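
The "poisoned" behaviour is standard Python generator semantics rather than anything specific to the kubernetes client: once an exception propagates out of a generator, the generator is closed and will never yield again, so the only way to keep watching is to create a new stream. A small self-contained illustration:

def events():
    yield 'first'
    raise RuntimeError('simulated read timeout')  # stands in for the ReadTimeoutError above

stream = events()
print(next(stream))               # 'first'
try:
    next(stream)                  # the exception escapes and closes the generator
except RuntimeError:
    pass
print(next(stream, 'exhausted'))  # 'exhausted': no further items will ever be produced
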

obj = event['object']
print('Kubernetes Pod event: ', event['type'], obj.metadata.name, obj.metadata.resource_version)
if event['type'] == 'ERROR':
@@ -393,6 +397,9 @@ def is_kfp_v2_pod(pod) -> bool:

pods_with_written_metadata[obj.metadata.name] = None

except Exception as e:
import traceback
print(traceback.format_exc())
# If the for loop ended, a server-side timeout occurred. Continue watching.
pass

except urllib3.exceptions.ReadTimeoutError as e:
# Client side timeout, continue watching.
pass
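
A rough reconstruction of how the pieces fit together after this change. The indentation and nesting are inferred from the flattened diff above, and the enclosing retry loop and per-pod body are abbreviated from the surrounding unchanged code, so treat this as a sketch of the approach rather than a verbatim copy of the file:

while True:
    pod_stream = k8s_watch.stream(
        k8s_api.list_pod_for_all_namespaces,
        label_selector=ARGO_WORKFLOW_LABEL_KEY,
        timeout_seconds=k8s_watch_server_side_timeout,
        _request_timeout=k8s_watch_client_side_timeout,
    )
    try:
        for event in pod_stream:
            try:
                obj = event['object']
                # ... existing per-pod metadata handling ...
            except Exception:
                # A failure on one pod should not stop the watch; log and move on.
                import traceback
                print(traceback.format_exc())
        # If the for loop ended, a server-side timeout occurred. Continue watching.
    except urllib3.exceptions.ReadTimeoutError:
        # Client-side timeout: the old stream is unusable, so loop around and create a new one.
        pass
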