Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CTRL+C in new asyncio daemon while running a process does not kill it #4649

Open
giovannipizzi opened this issue Jan 9, 2021 · 23 comments
Open

Comments

@giovannipizzi
Copy link
Member

This is for instance an output I get, when running a CalcJob in ipython and then pressing CTRL+C multiple times:

Running a PwCalculation...
^C01/08/2021 07:38:04 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:05 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:05 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:05 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:06 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:06 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:06 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:06 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:07 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:13 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:13 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420
^C01/08/2021 07:38:14 PM <67614> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 104420

To kill it I had to do a kill -9 instead.

In the previous implementation (v1.5.x and previous) this would kill the process and I could get back control e.g. if I was in ipython.

@sphuber @muhrin @unkcpz @chrisjsewell

@unkcpz
Copy link
Member

unkcpz commented Jan 13, 2021

I am not sure I understand it correctly, it confused me a lot. I try to explain what I understand. @muhrin could you have a look and give some comments?

It is not a real 'hang up'. In fact, in this case, the program is waiting for the end of the current state(Waiting), since process.kill is designed to not interrupt execution of a method call in the Running or Waiting state. The following example explains this more clearly:

from aiida.engine import run, calcfunction

@calcfunction
def dummy():
    import time
    time.sleep(60)

run(dummy)

verdi run the script and ctrl+c to stop it. Same as Giovanni's example, in asyncio version the program 'hang' there I have to wait until about 60s(the time that the process finished) or have to kill -9 to stop it. But in origin aiida_core version, ctrl+c stop it immediately.
Somehow, I can understand the behaviour of hang up there and get control back after a while. It is the process run in the loop in a block way for time.sleep(60) until it finishes. The kill method of process only set the _interrupt_action and _interrupt_action will be run in next step after Running state.
My question here is why in old tornado version aiida_core, it can kill the program immediately.

BTW I think the code touched in this issue also relevant to #4298, could be taken care of at the same time.

@giovannipizzi
Copy link
Member Author

Hi @unkcpz I totally see your point. But from an user experience point of view, I think what I really want is that also the time.sleep(60) is killed, quickly (and that the state of the calcfunction node is correctly updated: goes in kill state, is sealed etc.)

I don't know why the behaviour was different before (pinging also @sphuber) but I find it more desirable (if I press CTRL+C, I want to stop as soon as possible, say within 3-5 seconds, but also setting up correctly the state of the calcjob, something that does not happen with kill -9.

@chrisjsewell chrisjsewell added this to the v1.6.0 milestone Jan 13, 2021
@muhrin
Copy link
Contributor

muhrin commented Jan 13, 2021

I'm sorry, I'm not quite sure how all this works. I know Sebastiaan did some signal handler magic to try and clean things up a little more cleanly but I don't remember many of the details.

@chrisjsewell
Copy link
Member

chrisjsewell commented Jan 26, 2021

Just want to link this to some potentially related issues and comments: #4603 (comment) and #3776 (comment)

In particular, this certainly looks to relate to the fact that, even after a calcjob (run not submitted) has finished in an IPython shell (i.e. the cell finishes execution), if you CTRL-C in a later executed cell, you still get this log message!

In [1]: from aiida_sleep.cli import run_calc                                                                                              
In [2]: run_calc(payload=int(1e6), output=int(1e6))     
In [5]: import time; time.sleep(60)                                                                                                                  
^C01/26/2021 03:42:51 PM <243> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 5

I don't think the calcjobs are ever actually finalising properly

@chrisjsewell
Copy link
Member

chrisjsewell commented Jan 26, 2021

Also, I don't know if this is an issue with my calcjob, I start getting a bunch of error messages when I try to directly CTRL-C (the computer for this calcjob is a slurm server):

In [6]: run_calc(time=60)                                                                                                                            
^C01/26/2021 04:54:35 PM <426> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 67
01/26/2021 04:54:38 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/26/2021 04:54:38 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 1 of do_upload excepted, retrying after 20 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/26/2021 04:54:38 PM <426> aiida.scheduler.slurm: [ERROR] Error in _parse_kill_output: retval=1; stdout=; stderr=scancel: error: Invalid job id None

01/26/2021 04:54:38 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found

01/26/2021 04:54:38 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 1 of do_kill excepted, retrying after 20 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found
01/26/2021 04:54:58 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/26/2021 04:54:58 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 2 of do_upload excepted, retrying after 40 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/26/2021 04:55:04 PM <426> aiida.scheduler.slurm: [ERROR] Error in _parse_kill_output: retval=1; stdout=; stderr=scancel: error: Invalid job id None

01/26/2021 04:55:04 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found

01/26/2021 04:55:04 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 2 of do_kill excepted, retrying after 40 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found
^C01/26/2021 04:55:06 PM <426> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 67
^C01/26/2021 04:55:07 PM <426> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 67
01/26/2021 04:55:38 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/26/2021 04:55:38 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 3 of do_upload excepted, retrying after 80 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/26/2021 04:55:49 PM <426> aiida.scheduler.slurm: [ERROR] Error in _parse_kill_output: retval=1; stdout=; stderr=scancel: error: Invalid job id None

01/26/2021 04:55:49 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found

01/26/2021 04:55:49 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 3 of do_kill excepted, retrying after 80 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found
01/26/2021 04:56:58 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/26/2021 04:56:58 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 4 of do_upload excepted, retrying after 160 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/26/2021 04:57:14 PM <426> aiida.scheduler.slurm: [ERROR] Error in _parse_kill_output: retval=1; stdout=; stderr=scancel: error: Invalid job id None

01/26/2021 04:57:14 PM <426> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found

01/26/2021 04:57:14 PM <426> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 4 of do_kill excepted, retrying after 160 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 317, in do_kill
    return execmanager.kill_calculation(node, transport)
  File "/root/aiida-core/aiida/engine/daemon/execmanager.py", line 426, in kill_calculation
    running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
  File "/root/aiida-core/aiida/schedulers/scheduler.py", line 341, in get_jobs
    retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
  File "/root/aiida-core/aiida/schedulers/plugins/slurm.py", line 235, in _get_joblist_command
    command.append(f"--jobs={','.join(joblist)}")
TypeError: sequence item 0: expected str instance, NoneType found

This happens if I do CTRL-C quite quickly after the run starts, but actually if I wait a bit then it kills correctly:

In [3]: run_calc(time=60)                                                                                                                            
^C01/26/2021 05:01:51 PM <477> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 70
01/26/2021 05:01:51 PM <477> aiida.engine.processes.calcjobs.tasks: [WARNING] CalcJob<70> killed, it was in the CalcJobState.SUBMITTING state
01/26/2021 05:01:51 PM <477> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 137, in do_submit
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 384, in execute
    await self._launch_task(task_submit_job, node, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

---------------------------------------------------------------------------
KilledError                               Traceback (most recent call last)
<ipython-input-3-f7a7fd3871de> in <module>
----> 1 run_calc(time=60)

/opt/venv/lib/python3.8/site-packages/aiida_sleep/cli.py in run_calc(code, time, payload, output, fail, submit)
     61         node = submit_func(builder)
     62     else:
---> 63         node = run_get_node(builder).node
     64 
     65     return node

~/aiida-core/aiida/engine/launch.py in run_get_node(process, *args, **inputs)
     56         runner = manager.get_manager().get_runner()
     57 
---> 58     return runner.run_get_node(process, *args, **inputs)
     59 
     60 

~/aiida-core/aiida/engine/runners.py in run_get_node(self, process, *args, **inputs)
    266         :return: tuple of the outputs of the process and the calculation node
    267         """
--> 268         result, node = self._run(process, *args, **inputs)
    269         return ResultAndNode(result, node)
    270 

~/aiida-core/aiida/engine/runners.py in _run(self, process, *args, **inputs)
    242             signal.signal(signal.SIGTERM, kill_process)
    243 
--> 244             process_inited.execute()
    245             return process_inited.outputs, process_inited.node
    246 

/opt/venv/lib/python3.8/site-packages/plumpy/processes.py in func_wrapper(self, *args, **kwargs)
     77         if self._closed:
     78             raise exceptions.ClosedError('Process is closed')
---> 79         return func(self, *args, **kwargs)
     80 
     81     return func_wrapper

/opt/venv/lib/python3.8/site-packages/plumpy/processes.py in execute(self)
   1148             self.loop.run_until_complete(self.step_until_terminated())
   1149 
-> 1150         return self.future().result()
   1151 
   1152     @ensure_not_closed

KilledError: Process was killed because the runner received an interrupt

@chrisjsewell
Copy link
Member

To note, now that #4682 has been merged, the CTRL-C handler will be correctly returned to ipython, once the process has finished execution.
This does may quite solve this problem though, the kill operation will still take some time to stop.

Firtly, it might be good to add to:

def kill_process(_num, _frame):
"""Send the kill signal to the process in the current scope."""
LOGGER.critical('runner received interrupt, killing process %s', process_inited.pid)

something like:

            def kill_process(_num, _frame):
                """Send the kill signal to the process in the current scope."""
                if process_inited._killing:
                    LOGGER.critical('process %s is already being killed', process_inited.pid)
                else:
                    LOGGER.critical('runner received interrupt, killing process %s', process_inited.pid)

so it does not seem like the interrupt did not work.

In terms of the time it takes to kill, it only seems to take a few seconds for me, e.g. if I run:

In [1]: run_calc(code="sleep@local", time=300)  

which is a calculation that calls sleep for 300 seconds, then CTRL-C after about 10 seconds, it only takes a few seconds to kill.

@giovannipizzi, if you only do CTRL-C ones, how long does it actually take to kill?

@chrisjsewell
Copy link
Member

chrisjsewell commented Jan 27, 2021

One thing that does seem to be an additional issue, is that the exponentional backoff does not appear to respect the KillInterruption.
If I start running, then CTRL-C immediately:

In [7]: run_calc(code="sleep@local", time=80)                                                                                             
^C01/27/2021 05:20:38 PM <224> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 31
01/27/2021 05:20:40 PM <224> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/27/2021 05:20:40 PM <224> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 1 of do_upload excepted, retrying after 20 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/27/2021 05:20:40 PM <224> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/27/2021 05:20:40 PM <224> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 5 of do_upload excepted
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/27/2021 05:20:40 PM <224> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] maximum attempts 5 of calling do_upload, exceeded
01/27/2021 05:20:40 PM <224> aiida.engine.processes.calcjobs.tasks: [WARNING] uploading CalcJob<28> failed
01/27/2021 05:20:42 PM <224> aiida.scheduler.direct: [ERROR] Error in _parse_kill_output: retval=1; stdout=; stderr=bash: line 0: kill: None: arguments must be process or job IDs

01/27/2021 05:20:42 PM <224> aiida.execmanager: [WARNING] scheduler.kill() failed but job<{None}> no longer seems to be running regardless
---------------------------------------------------------------------------
KilledError                               Traceback (most recent call last)
<ipython-input-7-64a797b43b27> in <module>
----> 1 run_calc(code="sleep@local", time=80)

/opt/venv/lib/python3.8/site-packages/aiida_sleep/cli.py in run_calc(code, time, payload, output, fail, submit)
     61         node = submit_func(builder)
     62     else:
---> 63         node = run_get_node(builder).node
     64 
     65     return node

~/aiida-core/aiida/engine/launch.py in run_get_node(process, *args, **inputs)
     56         runner = manager.get_manager().get_runner()
     57 
---> 58     return runner.run_get_node(process, *args, **inputs)
     59 
     60 

~/aiida-core/aiida/engine/runners.py in run_get_node(self, process, *args, **inputs)
    273         :return: tuple of the outputs of the process and the calculation node
    274         """
--> 275         result, node = self._run(process, *args, **inputs)
    276         return ResultAndNode(result, node)
    277 

~/aiida-core/aiida/engine/runners.py in _run(self, process, *args, **inputs)
    245                 signal.signal(signal.SIGINT, kill_process)
    246                 signal.signal(signal.SIGTERM, kill_process)
--> 247                 process_inited.execute()
    248             finally:
    249                 signal.signal(signal.SIGINT, original_handler_int)

/opt/venv/lib/python3.8/site-packages/plumpy/processes.py in func_wrapper(self, *args, **kwargs)
     77         if self._closed:
     78             raise exceptions.ClosedError('Process is closed')
---> 79         return func(self, *args, **kwargs)
     80 
     81     return func_wrapper

/opt/venv/lib/python3.8/site-packages/plumpy/processes.py in execute(self)
   1148             self.loop.run_until_complete(self.step_until_terminated())
   1149 
-> 1150         return self.future().result()
   1151 
   1152     @ensure_not_closed

KilledError: Process was killed because the runner received an interrupt

In [8]: 01/27/2021 05:21:00 PM <224> aiida.engine.transports: [ERROR] Exception whilst using transport:                                   
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/27/2021 05:21:00 PM <224> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 2 of do_upload excepted, retrying after 40 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/27/2021 05:21:40 PM <224> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/27/2021 05:21:40 PM <224> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 3 of do_upload excepted, retrying after 80 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt
01/27/2021 05:23:00 PM <224> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 105, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

01/27/2021 05:23:00 PM <224> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 4 of do_upload excepted, retrying after 160 seconds
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1171, in step
    next_state = await self._run_task(self._state.execute)
  File "/opt/venv/lib/python3.8/site-packages/plumpy/processes.py", line 549, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 376, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 434, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Process was killed because the runner received an interrupt

@sphuber
Copy link
Contributor

sphuber commented Jan 27, 2021

I think the latter may simply be a problem only with the do_upload task. All or at least most of the others seem to define ignore_exceptions=plumpy.process_states.Interruption when calling the EBM and KillInterruption is a subclass of Interruption. Fix would be to simply add this ignored exception also to do_upload. It already defines two but is missing this one

@chrisjsewell
Copy link
Member

Fix would be to simply add this ignored exception also to do_upload

sounds like a plan 👍

@chrisjsewell
Copy link
Member

The last additional issue (which I alluded to in #4649 (comment)), is that it seems you can kill the process before a Job ID is assigned, and so here:

job_id = calculation.get_job_id()

it returns None, which ends up manifesting with the slurm scheduler as:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/aiida-core/aiida/engine/processes/calcjobs/tasks.py in task_kill_job(node, transport_queue, cancellable)
    320         logger.info(f'scheduled request to kill CalcJob<{node.pk}>')
--> 321         result = await exponential_backoff_retry(do_kill, initial_interval, max_attempts, logger=node.logger)
    322     except plumpy.process_states.Interruption:

~/aiida-core/aiida/engine/utils.py in exponential_backoff_retry(fct, initial_interval, max_attempts, logger, ignore_exceptions)
    187         try:
--> 188             result = await coro()
    189             break  # Finished successfully

~/aiida-core/aiida/engine/processes/calcjobs/tasks.py in do_kill()
    316             transport = await cancellable.with_interrupt(request)
--> 317             return execmanager.kill_calculation(node, transport)
    318 

~/aiida-core/aiida/engine/daemon/execmanager.py in kill_calculation(calculation, transport)
    425         # Failed to kill because the job might have already been completed
--> 426         running_jobs = scheduler.get_jobs(jobs=[job_id], as_dict=True)
    427         job = running_jobs.get(job_id, None)

~/aiida-core/aiida/schedulers/scheduler.py in get_jobs(self, jobs, user, as_dict)
    340         with self.transport:
--> 341             retval, stdout, stderr = self.transport.exec_command_wait(self._get_joblist_command(jobs=jobs, user=user))
    342 

~/aiida-core/aiida/schedulers/plugins/slurm.py in _get_joblist_command(self, jobs, user)
    234 
--> 235             command.append(f"--jobs={','.join(joblist)}")
    236 

TypeError: sequence item 0: expected str instance, NoneType found

@unkcpz
Copy link
Member

unkcpz commented Jan 28, 2021

hi @chrisjsewell I think what you did in #4682 is similar to 3771945 for ProcessFunction.

original_handler = signal.getsignal(kill_signal)
signal.signal(kill_signal, kill_process)

However, since ProcessFunction already reset the signal handler to the original one, ctrl+C not stop the running ProcessFunction and give control back to user in IPython.

As I mentioned #4649 (comment)

from aiida.engine import run, calcfunction

@calcfunction
def dummy():
    import time
    time.sleep(60)

run(dummy)

Calling ctrl+C in old tornado aiida_core give the control back immediately while asyncio version not.

Moreover, what I understand is in asyncio the recommend way of signal handling is by adding signal handler to the loop. The catch in the runner case is the process actually doing synchronous task (like the sleep here) and will block before it's done.

Or am I missing something?

@chrisjsewell
Copy link
Member

chrisjsewell commented Jan 28, 2021

I'm not quite sure what you are asking @unkcpz 😬

The key point for me is that the signal handlers are generally working as expected (you press CTRL-C and it initiates the kill mechanism), however, the process kill mechanism they call is definitely "broken", because it does not kill children (#3776).
So essentially working out how to fix that is the priority, but also keeping in mind this issue when tackling it.

(I also just want to link a comment by @sphuber here, with some additional discussion: #4603 (comment))

@giovannipizzi
Copy link
Member Author

@chrisjsewell

which is a calculation that calls sleep for 300 seconds, then CTRL-C after about 10 seconds, it only takes a few seconds to kill.
@giovannipizzi, if you only do CTRL-C ones, how long does it actually take to kill?

You mean the code version of the original issue at the top, or in the most recent develop after the fixes?
I don't remember/know, but from the logs above it took > 10s (and considering I had to open a new terminal, find the PID and run kill -9, it must have taken >20s).

Let me know if you need me to do more testing (and on which version) of this info is enough.

@unkcpz
Copy link
Member

unkcpz commented Jan 29, 2021

@chrisjsewell Sorry for not clearly express my question with necessary context. My question comes with two aspects:

  1. With aiida-core 1.5.2 you can test by when running following calcfunction no matter in IPython or as a script, ctrl-C stop it immediately and user get CMDLINE control back. However, it is not the behaviour in asyncio aiida-core even after your fix Fix: release signal handlers after run execution #4682 .
  2. When I did the migration to asyncio, I notice the recommended way to handle the signal is using loop.add_signal_handler to gracefully shutdown the tasks in the loop then I did this in daemon ruuner. I just wondering is that also need to do it in runner._run and process_function.run_get_node?

@chrisjsewell
Copy link
Member

chrisjsewell commented Feb 23, 2021

Ok so firstly:

I notice the recommended way to handle the signal is using loop.add_signal_handler

I not sure if this is necessary yet, one issue with it appears that you cannot retrieve the current signal handler and so reinstate them at the end of the function, as we do now (see https://docs.python.org/3/library/asyncio-eventloop.html#unix-signals)


Secondly, to do this right it is actually a bit more complicated.
There are two ways in which this can be used: via the cmdline (e.g. verdi run) and via ipython (e.g. verdi shell).
For IPython, as @giovannipizzi originally requested, we may be ok to halt instantly and let the full kill process complete in the background.
How via the cmdline, if we want the process to finish properly we need to wait for the process to reach a terminal state AND all of its descendants. Currently this does not work since the function terminates as soon as the parent reaches a terminal state and does not wait for the children to complete, e.g.

$ aiida-sleep workchain -nc 2
<CTRL-C>
...
$ verdi process list -ap1
10616  26s ago    SleepWorkChain    ☠ Killed         Process was killed because the runner received an interrupt
10617  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: upload
10618  25s ago    SleepCalculation  ⏵ Waiting        Waiting for transport task: upload

Here is a proof of concept, whereby on the first CTRL-C the kill process is started and it waits for all processes to finalise, but on second CTRL-C the function terminates straight away:

    def _run(self, process: TYPE_RUN_PROCESS, *args: Any, **inputs: Any) -> Tuple[Dict[str, Any], ProcessNode]:
        """
        Run the process with the supplied inputs in this runner that will block until the process is completed.
        The return value will be the results of the completed process

        :param process: the process class or process function to run
        :param inputs: the inputs to be passed to the process
        :return: tuple of the outputs of the process and the calculation node
        """
        assert not self._closed

        if utils.is_process_function(process):
            result, node = process.run_get_node(*args, **inputs)  # type: ignore[union-attr]
            return result, node

        with utils.loop_scope(self.loop):
            process_inited = self.instantiate_process(process, *args, **inputs)

            run_future = self.loop.create_future()
            def _callback(_future):
                if not run_future.done():
                    run_future.set_result(True)
            process_inited.future().add_done_callback(_callback)

            class _sig_count:
                value = 0

            def kill_process(_num, _frame):
                """Send the kill signal to the process in the current scope."""

                _sig_count.value += 1
                if _sig_count.value > 1:
                    LOGGER.critical('runner received second interrupt, terminating')
                    run_future.set_exception(RuntimeError('kill process terminated'))
                    return
                LOGGER.critical('runner received interrupt, killing process %s', process_inited.pid)
                
                process_inited.future().remove_done_callback(_callback)
                process_inited.kill(msg='Process was killed because the runner received an interrupt')
                
                import gc
                futures = [asyncio.wrap_future(o.future()) for o in gc.get_objects() if hasattr(o, '__class__') and isinstance(o, Process)]
                futures = asyncio.gather(*futures, return_exceptions=True)
                futures.add_done_callback(_callback)

            original_handler_int = signal.getsignal(signal.SIGINT)
            original_handler_term = signal.getsignal(signal.SIGTERM)

            async def _run():
                await run_future

            try:
                signal.signal(signal.SIGINT, kill_process)
                signal.signal(signal.SIGTERM, kill_process)
                process_inited.loop.create_task(process_inited.step_until_terminated())
                self.loop.run_until_complete(_run())
            finally:
                signal.signal(signal.SIGINT, original_handler_int)
                signal.signal(signal.SIGTERM, original_handler_term)

            return process_inited.outputs, process_inited.node

thoughts?

@chrisjsewell
Copy link
Member

chrisjsewell commented Feb 23, 2021

note at present there is an issue that the exceptions are not caught and the processes try to go into an excepted state:

root@ae5122fc408b:~# aiida-sleep workchain -nc 2
setting up and running workchain 1
^C02/23/2021 01:09:36 PM <1474067> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 10634
02/23/2021 01:09:36 PM <1474067> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Killed by parent<10634>

02/23/2021 01:09:36 PM <1474067> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Killed by parent<10634>

Task exception was never retrieved
future: <Task finished name='Task-63' coro=<interruptable_task.<locals>.execute_coroutine() done, defined at /root/aiida-core/aiida/engine/utils.py:117> exception=KillInterruption('Killed by parent<10634>')>
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Killed by parent<10634>
Task exception was never retrieved
future: <Task finished name='Task-74' coro=<interruptable_task.<locals>.execute_coroutine() done, defined at /root/aiida-core/aiida/engine/utils.py:117> exception=KillInterruption('Killed by parent<10634>')>
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Killed by parent<10634>
uuid: 1c3d292f-3caa-4c93-abfb-71ed6e625df0 (pk: 10634) (aiida.workflows:sleep)
Task exception was never retrieved
future: <Task finished name='Task-158' coro=<ProcessCallback.run() done, defined at /root/venv/lib/python3.8/site-packages/plumpy/events.py:99> exception=RuntimeError('Cannot transition from ProcessState.KILLED to ProcessState.EXCEPTED')>
Traceback (most recent call last):
  File "/root/venv/lib/python3.8/site-packages/plumpy/events.py", line 103, in run
    await self._callback(*self._args, **self._kwargs)
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 548, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/venv/lib/python3.8/site-packages/plumpy/utils.py", line 246, in wrap
    return coro_or_fn(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/workchains/workchain.py", line 305, in on_process_finished
    self.resolve_awaitable(awaitable, value)
  File "/root/aiida-core/aiida/engine/processes/workchains/workchain.py", line 170, in resolve_awaitable
    self._update_process_status()
  File "/root/aiida-core/aiida/engine/processes/workchains/workchain.py", line 187, in _update_process_status
    self.node.set_process_status(status)
  File "/root/aiida-core/aiida/orm/nodes/process/process.py", line 210, in set_process_status
    return self.set_attribute(self.PROCESS_STATUS_KEY, status)
  File "/root/aiida-core/aiida/orm/utils/mixins.py", line 176, in set_attribute
    raise exceptions.ModificationNotAllowed('attributes of a sealed node are immutable')
aiida.common.exceptions.ModificationNotAllowed: attributes of a sealed node are immutable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/root/venv/lib/python3.8/site-packages/plumpy/events.py", line 106, in run
    self._process.callback_excepted(self._callback, exc_info[1], exc_info[2])
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 513, in callback_excepted
    self.fail(exception, trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 97, in transition
    result = wrapped(self, *a, **kw)
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1080, in fail
    self.transition_to(process_states.ProcessState.EXCEPTED, exception, trace_back)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
    raise exception.with_traceback(trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 317, in transition_to
    self._exit_current_state(new_state)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 376, in _exit_current_state
    raise RuntimeError('Cannot transition from {} to {}'.format(self._state.LABEL, next_state.label))
RuntimeError: Cannot transition from ProcessState.KILLED to ProcessState.EXCEPTED
Task exception was never retrieved
future: <Task finished name='Task-160' coro=<ProcessCallback.run() done, defined at /root/venv/lib/python3.8/site-packages/plumpy/events.py:99> exception=RuntimeError('Cannot transition from ProcessState.KILLED to ProcessState.EXCEPTED')>
Traceback (most recent call last):
  File "/root/venv/lib/python3.8/site-packages/plumpy/events.py", line 103, in run
    await self._callback(*self._args, **self._kwargs)
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 548, in _run_task
    result = await coro(*args, **kwargs)
  File "/root/venv/lib/python3.8/site-packages/plumpy/utils.py", line 246, in wrap
    return coro_or_fn(*args, **kwargs)
  File "/root/aiida-core/aiida/engine/processes/workchains/workchain.py", line 305, in on_process_finished
    self.resolve_awaitable(awaitable, value)
  File "/root/aiida-core/aiida/engine/processes/workchains/workchain.py", line 170, in resolve_awaitable
    self._update_process_status()
  File "/root/aiida-core/aiida/engine/processes/workchains/workchain.py", line 189, in _update_process_status
    self.node.set_process_status(None)
  File "/root/aiida-core/aiida/orm/nodes/process/process.py", line 202, in set_process_status
    self.delete_attribute(self.PROCESS_STATUS_KEY)
  File "/root/aiida-core/aiida/orm/utils/mixins.py", line 193, in delete_attribute
    raise exceptions.ModificationNotAllowed('attributes of a sealed node are immutable')
aiida.common.exceptions.ModificationNotAllowed: attributes of a sealed node are immutable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/root/venv/lib/python3.8/site-packages/plumpy/events.py", line 106, in run
    self._process.callback_excepted(self._callback, exc_info[1], exc_info[2])
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 513, in callback_excepted
    self.fail(exception, trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 97, in transition
    result = wrapped(self, *a, **kw)
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1080, in fail
    self.transition_to(process_states.ProcessState.EXCEPTED, exception, trace_back)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
    raise exception.with_traceback(trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 317, in transition_to
    self._exit_current_state(new_state)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 376, in _exit_current_state
    raise RuntimeError('Cannot transition from {} to {}'.format(self._state.LABEL, next_state.label))
RuntimeError: Cannot transition from ProcessState.KILLED to ProcessState.EXCEPTED

but it does correctly set all processes as killed:

$ verdi process list -ap1
10622  10s ago    SleepWorkChain    ☠ Killed         Process was killed because the runner received an interrupt
10623  10s ago    SleepCalculation  ☠ Killed         Killed by parent<10622>
10624  10s ago    SleepCalculation  ☠ Killed         Killed by parent<10622>

@chrisjsewell
Copy link
Member

chrisjsewell commented Feb 23, 2021

ok if we catch the exceptions, this works "properly" for verdi run; both CTRL-Cx1 and CTRL-Cx2

        with utils.loop_scope(self.loop):
            self.loop.set_exception_handler(lambda loop, context: None)

it also works for verdi shell CTRL-Cx1,
but for CTRL-Cx2 exceptions stop being caught (because we return to the ipython event loop):

In [7]: run_workchain(2, time=10)
^C02/23/2021 03:28:37 PM <1502764> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 10658
02/23/2021 03:28:37 PM <1502764> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Killed by parent<10658>

02/23/2021 03:28:37 PM <1502764> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/root/aiida-core/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 78, in do_upload
    transport = await cancellable.with_interrupt(request)
  File "/root/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
    result = await next(wait_iter)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 608, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()
plumpy.process_states.KillInterruption: Killed by parent<10658>

^C02/23/2021 03:28:38 PM <1502764> aiida.engine.runners: [CRITICAL] runner received second interrupt, terminating
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-5d65dc24cb6b> in <module>
----> 1 run_workchain(2, time=10)

~/venv/lib/python3.8/site-packages/aiida_sleep/cli.py in run_workchain(number, code, time, payload, output_dict, output_array, fail, submit)
    215         node = submit_func(builder)
    216     else:
--> 217         node = run_get_node(builder).node
    218 
    219     return node

~/aiida-core/aiida/engine/launch.py in run_get_node(process, *args, **inputs)
     56         runner = manager.get_manager().get_runner()
     57 
---> 58     return runner.run_get_node(process, *args, **inputs)
     59 
     60 

~/aiida-core/aiida/engine/runners.py in run_get_node(self, process, *args, **inputs)
    304         :return: tuple of the outputs of the process and the calculation node
    305         """
--> 306         result, node = self._run(process, *args, **inputs)
    307         return ResultAndNode(result, node)
    308 

~/aiida-core/aiida/engine/runners.py in _run(self, process, *args, **inputs)
    276                 signal.signal(signal.SIGTERM, kill_process)
    277                 process_inited.loop.create_task(process_inited.step_until_terminated())
--> 278                 self.loop.run_until_complete(_run())
    279             finally:
    280                 signal.signal(signal.SIGINT, original_handler_int)

~/venv/lib/python3.8/site-packages/nest_asyncio.py in run_until_complete(self, future)
     96                 raise RuntimeError(
     97                     'Event loop stopped before Future completed.')
---> 98             return f.result()
     99         finally:
    100             self._thread_id = old_thread_id

/usr/lib/python3.8/asyncio/futures.py in result(self)
    176         self.__log_traceback = False
    177         if self._exception is not None:
--> 178             raise self._exception
    179         return self._result
    180 

/usr/lib/python3.8/asyncio/tasks.py in __step(***failed resolving arguments***)
    280                 result = coro.send(None)
    281             else:
--> 282                 result = coro.throw(exc)
    283         except StopIteration as exc:
    284             if self._must_cancel:

~/aiida-core/aiida/engine/runners.py in _run()
    270 
    271             async def _run():
--> 272                 await run_future
    273 
    274             try:

/usr/lib/python3.8/asyncio/futures.py in __await__(self)
    258         if not self.done():
    259             self._asyncio_future_blocking = True
--> 260             yield self  # This tells Task to wait for completion.
    261         if not self.done():
    262             raise RuntimeError("await wasn't used with future")

/usr/lib/python3.8/asyncio/tasks.py in __wakeup(self, future)
    347     def __wakeup(self, future):
    348         try:
--> 349             future.result()
    350         except BaseException as exc:
    351             # This may also be a cancellation.

/usr/lib/python3.8/asyncio/futures.py in result(self)
    176         self.__log_traceback = False
    177         if self._exception is not None:
--> 178             raise self._exception
    179         return self._result
    180 

RuntimeError: kill process terminated


Unhandled exception in event loop:
  File "/usr/lib/python3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/root/venv/lib/python3.8/site-packages/plumpy/events.py", line 106, in run
    self._process.callback_excepted(self._callback, exc_info[1], exc_info[2])
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 513, in callback_excepted
    self.fail(exception, trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 97, in transition
    result = wrapped(self, *a, **kw)
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1080, in fail
    self.transition_to(process_states.ProcessState.EXCEPTED, exception, trace_back)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
    raise exception.with_traceback(trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 317, in transition_to
    self._exit_current_state(new_state)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 376, in _exit_current_state
    raise RuntimeError('Cannot transition from {} to {}'.format(self._state.LABEL, next_state.label))

Exception Cannot transition from ProcessState.KILLED to ProcessState.EXCEPTED
Press ENTER to continue...

Unhandled exception in event loop:
  File "/usr/lib/python3.8/asyncio/tasks.py", line 280, in __step
    result = coro.send(None)
  File "/root/venv/lib/python3.8/site-packages/plumpy/events.py", line 106, in run
    self._process.callback_excepted(self._callback, exc_info[1], exc_info[2])
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 513, in callback_excepted
    self.fail(exception, trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 97, in transition
    result = wrapped(self, *a, **kw)
  File "/root/venv/lib/python3.8/site-packages/plumpy/processes.py", line 1080, in fail
    self.transition_to(process_states.ProcessState.EXCEPTED, exception, trace_back)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
    raise exception.with_traceback(trace)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 317, in transition_to
    self._exit_current_state(new_state)
  File "/root/venv/lib/python3.8/site-packages/plumpy/base/state_machine.py", line 376, in _exit_current_state
    raise RuntimeError('Cannot transition from {} to {}'.format(self._state.LABEL, next_state.label))

Exception Cannot transition from ProcessState.KILLED to ProcessState.EXCEPTED
Press ENTER to continue...

Unhandled exception in event loop:
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()

Exception Killed by parent<10658>
Press ENTER to continue...

Unhandled exception in event loop:
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 367, in execute
    skip_submit = await self._launch_task(task_upload_job, self.process, transport_queue)
  File "/root/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 425, in _launch_task
    result = await self._task
  File "/usr/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
    future.result()

Exception Killed by parent<10658>
Press ENTER to continue...

@chrisjsewell
Copy link
Member

actually the error above is the same as #3888

@chrisjsewell
Copy link
Member

so I've opened a draft PR #4784 for a more complete way to deal with killing Processes +children,
but also we have just found in #4648 that there are issues with CancelledError specific to python 3.7 (#4648 (comment))
So perhaps this is the source of the original issue that @giovannipizzi, that I could not replicate, whereby CTRL-C was failing to work at all. Will investigate....

@chrisjsewell
Copy link
Member

I tried this (using CTRL-C in verdi shell), and #4666, both using python 3.7 and python 3.8, to see if I could replicate a full on "hang" when killing a process, but still cannot achieve it.

So for now I am going to move both to the v1.6.1 milestone, as not being a "critical issue".
But perhaps @ltalirz / @mbercx you could also try this and see if you get a different outcome?

@chrisjsewell chrisjsewell modified the milestones: v1.6.0, v1.6.1 Mar 9, 2021
@mbercx
Copy link
Member

mbercx commented Mar 10, 2021

Tried in Quantum Mobile v20.11.2a, with Python version 3.7.9, and was not able to reproduce the hanging issue:

$ aiida-common-workflows launch relax -S Al -- quantum_espresso
Running a QuantumEspressoRelaxWorkChain...
03/10/2021 06:42:52 AM <3226715> aiida.orm.nodes.process.workflow.workchain.WorkChainNode: [REPORT] [5132|PwRelaxWorkChain|setup]: No change in volume possible for the provided base input parameters. Meta convergence is turned off.
03/10/2021 06:42:53 AM <3226715> aiida.orm.nodes.process.workflow.workchain.WorkChainNode: [REPORT] [5132|PwRelaxWorkChain|run_relax]: launching PwBaseWorkChain<5135>
/home/max/codes/aiida-quantumespresso/aiida_quantumespresso/utils/pseudopotential.py:35: AiidaDeprecationWarning: this function is deprecated and will be removed in `v4.0.0`.
  warnings.warn('this function is deprecated and will be removed in `v4.0.0`.', AiidaDeprecationWarning)
03/10/2021 06:42:53 AM <3226715> aiida.orm.nodes.process.workflow.workchain.WorkChainNode: [REPORT] [5135|PwBaseWorkChain|run_process]: launching PwCalculation<5140> iteration #1
^C03/10/2021 06:42:56 AM <3226715> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 5131
Traceback (most recent call last):
  File "/home/max/.virtualenvs/aiida/bin/aiida-common-workflows", line 11, in <module>
    load_entry_point('aiida-common-workflows', 'console_scripts', 'aiida-common-workflows')()
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/max/codes/aiida-common-workflows/aiida_common_workflows/cli/launch.py", line 120, in cmd_relax
    utils.launch_process(builder, daemon)
  File "/home/max/codes/aiida-common-workflows/aiida_common_workflows/cli/utils.py", line 59, in launch_process
    _, node = launch.run_get_node(process, **inputs)
  File "/home/max/codes/aiida-core/aiida/engine/launch.py", line 58, in run_get_node
    return runner.run_get_node(process, *args, **inputs)
  File "/home/max/codes/aiida-core/aiida/engine/runners.py", line 278, in run_get_node
    result, node = self._run(process, *args, **inputs)
  File "/home/max/codes/aiida-core/aiida/engine/runners.py", line 250, in _run
    process_inited.execute()
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/plumpy/processes.py", line 81, in func_wrapper
    return func(self, *args, **kwargs)
  File "/home/max/.virtualenvs/aiida/lib/python3.7/site-packages/plumpy/processes.py", line 1166, in execute
    return self.future().result()
plumpy.exceptions.KilledError: Process was killed because the runner received an interrupt

@chrisjsewell
Copy link
Member

chrisjsewell commented Mar 10, 2021

cheers @mbercx, and for completeness, what do you see after you do this with verdi process list -ap1? (I assume the PwCalculation will not have been killed correctly, i.e. #4784 is required)

@mbercx
Copy link
Member

mbercx commented Mar 10, 2021

Indeed:

$ verdi process list -ap1
  PK  Created    Process label                  Process State    Process status
----  ---------  -----------------------------  ---------------  -----------------------------------------------------------
5131  58m ago    QuantumEspressoRelaxWorkChain  ☠ Killed         Process was killed because the runner received an interrupt
5132  58m ago    PwRelaxWorkChain               ⏵ Waiting        Waiting for child processes: 5135
5135  58m ago    PwBaseWorkChain                ⏵ Waiting        Waiting for child processes: 5140
5136  58m ago    create_kpoints_from_distance   ⏹ Finished [0]
5140  58m ago    PwCalculation                  ⏵ Waiting        Monitoring scheduler: job state QUEUED

Total results: 5

Info: last time an entry changed state: 58m ago (at 06:42:56 on 2021-03-10)

@sphuber sphuber modified the milestones: Preparations for the new repository, v2.0.0 Apr 28, 2021
@chrisjsewell chrisjsewell self-assigned this May 5, 2021
@sphuber sphuber modified the milestones: v2.0.0, Post 2.0 Sep 22, 2021
@sphuber sphuber removed this from the v2.3.0 milestone Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants