WebSocketBadStatusException Handshake status 500 Internal Server Error: read: connection timed out #36

Open
craigwalton-dsit opened this issue Dec 20, 2024 · 0 comments
Labels
3rd party errors Errors observed from 3rd party code such as websocket or SSL errors


craigwalton-dsit commented Dec 20, 2024

Migrated from internal repo.
Complete stack trace and logs (sensitive): https://github.com/AI-Safety-Institute/aisi-inspect-tools/issues/180
Original date: 06 Nov 2024

│ <redacted>.venv/lib/python3.12/site-packages/websocket/_handshake.py:150 in _get_resp_headers    │
│                                                                                                                      │
│   147 │   │   │   )  # read the body of the HTTP error message response and include it in the                        │
│   148 │   │   else:                                                                                                  │
│   149 │   │   │   response_body = None                                                                               │
│ > 150 │   │   raise WebSocketBadStatusException(                                                                     │
│   151 │   │   │   f"Handshake status {status} {status_message} -+-+- {resp_headers} -+-+- {res                       │
│   152 │   │   │   status,                                                                                            │
│   153 │   │   │   status_message,                                                                                    │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
WebSocketBadStatusException: Handshake status 500 Internal Server Error -+-+- {'audit-id': '89c3bf06-0c1c-4a49-804a-8756578da43d', 'cache-control': 'no-cache, private', 'content-type': 'application/json', 'date': 'Tue, 05 Nov 2024 23:39:52 GMT', 'content-length': '200'} -+-+- b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"error dialing backend: read tcp 192.168.105.37:40968-\\u003e192.168.160.39:10250: read: connection timed out","code":500}\n'
...
│ <redacted>/.venv/lib/python3.12/site-packages/kubernetes/stream/ws_client.py:538 in               │
│ websocket_call                                                                                                       │
│                                                                                                                      │
│   535 │   │   else:                                                                                                  │
│   536 │   │   │   return WSResponse('%s' % ''.join(all))                                                             │
│   537 │   except (Exception, KeyboardInterrupt, SystemExit) as e:                                                    │
│ > 538 │   │   raise ApiException(status=0, reason=str(e))                                                            │
│   539                                                                                                                │
│   540                                                                                                                │
│   541 def portforward_call(configuration, _method, url, **kwargs):                                                   │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
ApiException: (0)
Reason: Handshake status 500 Internal Server Error -+-+- {'audit-id': '89c3bf06-0c1c-4a49-804a-8756578da43d', 'cache-control': 'no-cache, private', 'content-type': 'application/json', 'date': 'Tue, 05 Nov 2024 23:39:52 GMT', 'content-length': '200'} -+-+- b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"error dialing backend: read tcp 192.168.105.37:40968-\\u003e192.168.160.39:10250: read: connection timed out","code":500}\n'
...
K8sError: Error during: Execute command in pod. {"pod": "agent-env-krcyegzg-default-0", "task_name": "<redacted>", "cmd": "['bash', '-c', "<redacted>", "stdin": "None", "cwd": "None", "timeout": "300"}
2024-11-05 22:51:14,385 - SANDBOX - K8S: Starting: Execute command in pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:51:14,594 - SANDBOX - K8S: Completed: Execute command in pod. {"result": "ExecResult(success=True, returncode=0, ...
2024-11-05 22:51:14,594 - SANDBOX - K8S: Starting: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:51:15,019 - SANDBOX - K8S: Completed: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:52:19,320 - SANDBOX - K8S: Starting: Execute command in pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:52:19,532 - SANDBOX - K8S: Completed: Execute command in pod. {"result": "ExecResult(success=True, returncode=0, ...
2024-11-05 22:52:19,533 - SANDBOX - K8S: Starting: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:52:19,661 - SANDBOX - K8S: Completed: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:53:28,437 - SANDBOX - K8S: Starting: Execute command in pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:53:33,670 - SANDBOX - K8S: Completed: Execute command in pod. {"result": "ExecResult(success=True, returncode=0, ...
2024-11-05 22:53:33,670 - SANDBOX - K8S: Starting: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:53:33,813 - SANDBOX - K8S: Completed: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:55:34,132 - SANDBOX - K8S: Starting: Execute command in pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:55:34,402 - SANDBOX - K8S: Completed: Execute command in pod. {"result": "ExecResult(success=True, returncode=0, ...
2024-11-05 22:55:34,402 - SANDBOX - K8S: Starting: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:55:34,549 - SANDBOX - K8S: Completed: Read file from pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 22:56:15,437 - SANDBOX - K8S: Starting: Execute command in pod. {"pod": "agent-env-krcyegzg-default-0", ...
2024-11-05 23:39:53,808 - ERROR - K8S: Error during: Execute command in pod. {"cause": "(0)\nReason: Handshake status 500 Internal Server Error -+-+- {'audit-id': '89c3bf06-0c1c-4a49-804a-8756578da43d', 'cache-control': 'no-cache, private', 'content-type': 'application/json', 'date': 'Tue, 05 Nov 2024 23:39:52 GMT', 'content-length': '200'} -+-+- b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"error dialing backend: read tcp 192.168.105.37:40968-\\\\u003e192.168.160.39:10250: read: connection timed out\",\"code\":500}\\n'\n", "pod": "agent-env-krcyegzg-default-0", ...
2024-11-05T22:50:10Z   Warning   agent-env-krcyegzg-default-0                  FailedScheduling        0/46 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 2 node(s) had untolerated taint {aisi.gov.uk/dev: true}, 2 node(s) had untolerated taint {aisi.gov.uk/devpods: true}, 2 node(s) had untolerated taint {node.kubernetes.io/unreachable: }, 38 Insufficient memory. preemption: 0/46 nodes are available: 38 No preemption victims found for incoming pod, 8 Preemption is not helpful for scheduling.
2024-11-05T22:50:20Z   Normal    agent-env-krcyegzg-default-0                  Scheduled               Successfully assigned agent/agent-env-krcyegzg-default-0 to ip-192-168-160-39.eu-west-2.compute.internal
2024-11-05T22:50:21Z   Normal    agent-env-krcyegzg-default-0                  Created                 Created container resolve-coredns-ip
2024-11-05T22:50:21Z   Normal    agent-env-krcyegzg-default-0                  Pulled                  Container image "toolbelt/dig:2024-09-23" already present on machine
2024-11-05T22:50:21Z   Normal    agent-env-krcyegzg-default-0                  Started                 Started container resolve-coredns-ip
2024-11-05T22:50:22Z   Normal    agent-env-krcyegzg-default-0                  Created                 Created container default
2024-11-05T22:50:22Z   Normal    agent-env-krcyegzg-default-0                  Pulled                  Container image "redacted" already present on machine
2024-11-05T22:50:22Z   Normal    agent-env-krcyegzg-default-0                  Started                 Started container default
2024-11-05T22:56:46Z   Warning   agent-env-krcyegzg-default-0                  NodeNotReady            Node is not ready
2024-11-05T23:01:51Z   Normal    agent-env-krcyegzg-default                    SuccessfulCreate        create Pod agent-env-krcyegzg-default-0 in StatefulSet agent-env-krcyegzg-default successful
2024-11-05T23:01:51Z   Normal    agent-env-krcyegzg-default-0                  TaintManagerEviction    Marking for deletion Pod agent/agent-env-krcyegzg-default-0
2024-11-05T23:01:51Z   Normal    agent-env-krcyegzg-default-0                  TaintManagerEviction    Cancelling deletion of Pod agent/agent-env-krcyegzg-default-0
2024-11-05T23:02:22Z   Warning   agent-env-krcyegzg-default-0                  FailedScheduling        0/46 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 2 node(s) had untolerated taint {aisi.gov.uk/dev: true}, 2 node(s) had untolerated taint {aisi.gov.uk/devpods: true}, 36 Insufficient memory, 4 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/46 nodes are available: 10 Preemption is not helpful for scheduling, 36 No preemption victims found for incoming pod.
2024-11-05T23:02:27Z   Normal    agent-env-krcyegzg-default-0                  Scheduled               Successfully assigned agent/agent-env-krcyegzg-default-0 to ip-192-168-104-100.eu-west-2.compute.internal
2024-11-05T23:02:28Z   Normal    agent-env-krcyegzg-default-0                  Created                 Created container resolve-coredns-ip
2024-11-05T23:02:28Z   Normal    agent-env-krcyegzg-default-0                  Started                 Started container resolve-coredns-ip
2024-11-05T23:02:28Z   Normal    agent-env-krcyegzg-default-0                  Pulled                  Container image "toolbelt/dig:2024-09-23" already present on machine
2024-11-05T23:02:29Z   Normal    agent-env-krcyegzg-default-0                  Pulled                  Container image "redacted" already present on machine
2024-11-05T23:02:30Z   Normal    agent-env-krcyegzg-default-0                  Started                 Started container default
2024-11-05T23:02:30Z   Normal    agent-env-krcyegzg-default-0                  Created                 Created container default
2024-11-05T23:39:55Z   Normal    agent-env-krcyegzg-default-0                  Killing                 Stopping container default
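
Lining the sandbox log up against the pod events: the last exec call started at 22:56:15, the node was reported NotReady at 22:56:46, and the error only surfaced at 23:39:53, even though the command was issued with "timeout": "300". A minimal Python sketch of that arithmetic, using the two log timestamps above; the helper name `parse_log_ts` is hypothetical (not part of the sandbox code), and treating the timeout as seconds is an assumption:

```python
from datetime import datetime

# Timestamp format used by the sandbox log lines above (comma before the milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_ts(text: str) -> datetime:
    """Parse a sandbox log timestamp such as '2024-11-05 22:56:15,437'."""
    return datetime.strptime(text, LOG_TS_FORMAT)


# Copied from the log above: start of the last exec call and the error line.
exec_started = parse_log_ts("2024-11-05 22:56:15,437")
error_logged = parse_log_ts("2024-11-05 23:39:53,808")

# Assumption: the "timeout": "300" in the K8sError payload is in seconds.
requested_timeout_s = 300

elapsed_s = (error_logged - exec_started).total_seconds()
print(f"exec call was outstanding for {elapsed_s:.0f}s "
      f"(requested timeout: {requested_timeout_s}s)")
```

The call was outstanding for roughly 43 minutes, well past the requested timeout, which suggests the timeout did not take effect while the connection to the kubelet was stalled.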

Observed during a large eval set whose max_samples was larger than the cluster could accommodate.

@craigwalton-dsit craigwalton-dsit added the 3rd party errors Errors observed from 3rd party code such as websocket or SSL errors label Dec 20, 2024