Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OSH tasks end up INTERRUPTED if scan results tarball is too big #270

Open
kdudka opened this issue Jan 21, 2025 · 0 comments
Open

OSH tasks end up INTERRUPTED if scan results tarball is too big #270

kdudka opened this issue Jan 21, 2025 · 0 comments

Comments

@kdudka
Copy link
Contributor

kdudka commented Jan 21, 2025

I am forwarding an issue from Red Hat internal Jira that I was debugging in June 2024 but that I have not resolved yet.

Current Behavior:
If the scan results tarball is too big, the OSH task ends up in the INTERRUPTED state despite it was successful:

  1. The scan results tarball is successfully uploaded and unpacked.
  2. A scan notification about a CLOSED task is sent.
  3. Then for some reason the worker tries to finish the task again, which breaks the state machine in kobo.

Expected Behavior:

The task status remains CLOSED.

Impact Statement:
Scan results are successfully uploaded and unpacked. Only the task status is misleading.

worker.log contains:

2024-06-20 14:49:50 [INFO    ] Task #586227 exited with status 0
2024-06-20 14:49:50 [INFO    ] Task has finished: 586227
2024-06-20 14:49:50 [INFO    ] kill_group: Process (pgrp 1747847) exited
2024-06-20 14:49:53 [INFO    ] Attempting to take task #586418 [MockBuild]
2024-06-20 14:49:53 [INFO    ] Task forked #586418 [MockBuild]: pid=2020326
2024-06-20 14:50:14 [INFO    ] Waking up task #585987 [ErrataDiffBuild].
2024-06-20 14:50:14 [WARNING ] Closing interrupted tasks: [586227]

hub.log contains:

INFO 2024-06-20 16:46:47,047    worker.py:86 finish_task        Finishing task 586227
INFO 2024-06-20 16:46:47,067    results_loader.py:33 __init__   Processing tasks '#586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]' and 'None'
WARNING 2024-06-20 16:46:47,070 path.py:60 get_json_results     json results not found: '[]', task #586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]
DEBUG 2024-06-20 16:46:47,070   results_loader.py:49 unpack_results     Unpacking /var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz
DEBUG 2024-06-20 16:46:47,070   csmock_parser.py:57 extract_tarball     Running command ['tar', '-xf', '/var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz', '-C', '/var/lib/osh/hub/tasks/0/580000/586227', '--wildcards', '--wildcards-match-slash', '--exclude=*cov', '--exclude=*debug']
[...]
INFO 2024-06-20 16:48:17,470    worker.py:86 finish_task        Finishing task 586227
INFO 2024-06-20 16:48:17,486    results_loader.py:33 __init__   Processing tasks '#586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]' and 'None'
WARNING 2024-06-20 16:48:17,489 path.py:60 get_json_results     json results not found: '[]', task #586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]
DEBUG 2024-06-20 16:48:17,489   results_loader.py:49 unpack_results     Unpacking /var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz
DEBUG 2024-06-20 16:48:17,489   csmock_parser.py:57 extract_tarball     Running command ['tar', '-xf', '/var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz', '-C', '/var/lib/osh/hub/tasks/0/580000/586227', '--wildcards', '--wildcards-match-slash', '--exclude=*cov', '--exclude=*debug']
[...]
INFO 2024-06-20 16:49:47,876    worker.py:86 finish_task        Finishing task 586227
INFO 2024-06-20 16:49:47,885    results_loader.py:33 __init__   Processing tasks '#586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]' and 'None'
INFO 2024-06-20 16:49:47,887    results_loader.py:46 unpack_results     Results are already unpacked for task #586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]
INFO 2024-06-20 16:49:51,340    worker.py:86 finish_task        Finishing task 586314
INFO 2024-06-20 16:49:51,352    results_loader.py:33 __init__   Processing tasks '#586314 [method: MockBuild, state: OPEN, worker: osh-worker-015.osh-001.prod.iad2.dc.redhat.com]' and 'None'
WARNING 2024-06-20 16:49:51,355 path.py:60 get_json_results     json results not found: '[]', task #586314 [method: MockBuild, state: OPEN, worker: osh-worker-015.osh-001.prod.iad2.dc.redhat.com]
DEBUG 2024-06-20 16:49:51,356   results_loader.py:49 unpack_results     Unpacking /var/lib/osh/hub/tasks/0/580000/586314/ose-clusterresourceoverride-container-v4.12.0-202406191416.p0.ge85aaf7.assembly.stream.el8.tar.xz
DEBUG 2024-06-20 16:49:51,356   csmock_parser.py:57 extract_tarball     Running command ['tar', '-xf', '/var/lib/osh/hub/tasks/0/580000/586314/ose-clusterresourceoverride-container-v4.12.0-202406191416.p0.ge85aaf7.assembly.stream.el8.tar.xz', '-C', '/var/lib/osh/hub/tasks/0/580000/586314', '--wildcards', '--wildcards-match-slash', '--exclude=*cov', '--exclude=*debug']
DEBUG 2024-06-20 16:49:51,478   worker.py:80 email_task_notification    email_task_notification for 586227
[...]
DEBUG 2024-06-20 16:50:16,023   worker.py:80 email_task_notification    email_task_notification for 586227

Cause:
The worker calls finish_task() over XML-RPC each 90 seconds again if the previous call does not finish till then. This is caused by retry_request_decorator:

def retry_request_decorator(transport_class):

I guess the XML-RPC connection times out after 60 seconds, then the decorator sleeps for 30 seconds. The messages from kobo can be seen at the and of stdout.log (rather than in worker.log where I looked for them initially). I think we need to distinguish network-related timeouts from the time actually spent by a remote operation.

It is also a bad idea to repeat an XML-RPC call that has already been (at least partially) handled by the remote service.

As figured out by @lzaoral, this is a regression introduced by #173.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant