Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler test failing #1888

Closed
ljstephenson opened this issue May 6, 2022 · 13 comments · May be fixed by #1976
Closed

Scheduler test failing #1888

ljstephenson opened this issue May 6, 2022 · 13 comments · May be fixed by #1976

Comments

@ljstephenson
Copy link
Contributor

Bug Report

One-Line Summary

python -m unittest -v artiq.test.test_scheduler.SchedulerCase.test_pending_priority fails.

Issue Details

Disabling the assertion and printing out the order in which events come in shows that they are only in a slightly different order; I'm not sure if behaviour has changed recently for this to be expected. The order I see is:

{'action': 'setitem', 'path': [], 'key': 0, 'value': {'pipeline': 'main', 'expid': {'log_level': 50, 'file': '\\\\jake.nist.gov\\688\\users\\ljs3\\scratch\\artiq\\artiq\\test\\test_scheduler.py', 'class_name': 'CheckPauseBackgroundExperiment', 'arguments': {}}, 'priority': 1, 'due_date': None, 'flush': False, 'status': 'pending', 'repo_msg': None}}
{'action': 'setitem', 'path': [], 'key': 1, 'value': {'pipeline': 'main', 'expid': {'log_level': 30, 'file': '\\\\jake.nist.gov\\688\\users\\ljs3\\scratch\\artiq\\artiq\\test\\test_scheduler.py', 'class_name': 'EmptyExperiment', 'arguments': {}}, 'priority': 3, 'due_date': 1651976855.5527163, 'flush': False, 'status': 'pending', 'repo_msg': None}}
{'action': 'setitem', 'path': [], 'key': 2, 'value': {'pipeline': 'main', 'expid': {'log_level': 30, 'file': '\\\\jake.nist.gov\\688\\users\\ljs3\\scratch\\artiq\\artiq\\test\\test_scheduler.py', 'class_name': 'EmptyExperiment', 'arguments': {}}, 'priority': 2, 'due_date': 1651876856.5527163, 'flush': False, 'status': 'pending', 'repo_msg': None}}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'preparing'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'prepare_done'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'preparing'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'running'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'prepare_done'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'paused'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'running'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'run_done'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'running'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'analyzing'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'deleting'}
{'action': 'delitem', 'path': [], 'key': 2}

These two items are reversed compared to the output expected by the test case:

{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'preparing'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'running'}

Steps to Reproduce

  1. Run python -m unittest -v artiq.test.test_scheduler.SchedulerCase.test_pending_priority

Expected Behavior

Test passes.

Actual (undesired) Behavior

Test fails.

>python -m unittest -v artiq.test.test_scheduler.SchedulerCase.test_pending_priority
test_pending_priority (artiq.test.test_scheduler.SchedulerCase)
Check due dates take precedence over priorities when waiting to ... C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\asyncio_tools.py:23: DeprecationWarning: There is no current event loop
  self.task = asyncio.ensure_future(log_exceptions(self._do))
C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\asyncio_tools.py:101: DeprecationWarning: There is no current event loop
  self._loop = asyncio.get_event_loop()
Unhandled exception in TaskObject task body
Traceback (most recent call last):
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\asyncio_tools.py", line 16, in log_exceptions
    return await awaitable()
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\artiq\master\scheduler.py", line 215, in _do
    run.status = RunStatus.preparing
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\artiq\master\scheduler.py", line 87, in status
    self._notifier[self.rid]["status"] = self._status.name
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\sync_struct.py", line 242, in __setitem__
    self.root.publish({"action": ModAction.setitem.value,
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\artiq\test\test_scheduler.py", line 266, in notify
    self.assertEqual(mod, expect[expect_idx])
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\unittest\case.py", line 845, in assertEqual
    assertion_func(first, second, msg=msg)
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\unittest\case.py", line 1144, in assertDictEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: {'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'preparing'} != {'path': [0], 'action': 'setitem', 'value': 'running', 'key': 'status'}
- {'action': 'setitem', 'key': 'status', 'path': [2], 'value': 'preparing'}
?                                                 ^             - ^^^^

+ {'action': 'setitem', 'key': 'status', 'path': [0], 'value': 'running'}
?

Your System (omit irrelevant parts)

  • Operating System: W10, conda
  • ARTIQ version: (with recent versions of ARTIQ, run artiq_client --version): ARTIQ v7.8044.c440f9f.beta
  • Version of the gateware and runtime loaded in the core device: (in the output of artiq_coremgmt -D .... log) (not relevant)
  • If using Conda, output of conda list (please submit as a file attachment, as this tends to be long)
    conda-list.txt
  • Hardware involved: N/A
@sbourdeauducq
Copy link
Member

Probably Python 3.10 related, like your other issue. We didn't test anything with 3.10.

@ljstephenson
Copy link
Contributor Author

Just checked - same failure in environments with 3.8 and 3.9
conda-list-39.txt
conda-list-38.txt

@sbourdeauducq
Copy link
Member

Strange - it's passing on the server e.g. https://nixbld.m-labs.hk/build/120442/log
This code has not changed recently.

@dnadlinger
Copy link
Collaborator

Could this just be a timing/event ordering issue? I haven't had a look at this particular test, but some of them do rely on implicitly defining order via sleeps.

@ljstephenson
Copy link
Contributor Author

Could this just be a timing/event ordering issue?

Certainly seems to be, but I don't know how that order could be defined more rigidly. The "issue" seems to be that experiment 2 is taking longer to prepare on my machine than the test server, so experiment 0 is reaching the run stage before 2 is done preparing; but the prepare stage is empty so I don't see how it can be made faster.

Strange

Agreed. I repro'd this on a second W10 machine with python=3.9, package list generated was identical except for zipp. I don't think this is high priority but I do think it's bad juju for a test to pass on some machines and not others...

@dnadlinger
Copy link
Collaborator

dnadlinger commented May 12, 2022

Certainly seems to be, but I don't know how that order could be defined more rigidly.

I haven't looked into what this particular test would require, but for testing Scheduler.idle (which we haven't gotten around to properly redesigning and upstreaming yet), I had introduced some asyncio events to ensure the requisite ordering:

    def _make_status_events(self, rid, previous_notify=None):
        status_events = {
            key: asyncio.Event()
            for key in ("prepare_done", "running", "paused", "run_done",
                        "deleting", "idle")
        }

        def notify(mod):
            if (mod["path"] == [rid] and mod["action"] == "setitem"
                    and mod["key"] == "status"):
                event = status_events.get(mod["value"], None)
                if event:
                    event.set()
            if previous_notify:
                previous_notify(mod)

        return status_events, notify

    def _make_scheduler_for_idle_test(self):
        class Flags:
            def __init__(self):
                self.be_idle = False
                self.take_long_in_idle = False
                self.termination_ok = False

        flags = Flags()

        def get_dataset(name):
            return getattr(flags, name)

        def check_termination(mod):
            self.assertEqual(
                mod, {
                    "action": "setitem",
                    "key": "termination_ok",
                    "value": (False, True),
                    "path": []
                })
            flags.termination_ok = True

        handlers = {
            "get_dataset": get_dataset,
            "update_dataset": check_termination
        }
        scheduler = Scheduler(_RIDCounter(0), handlers, None, None)

        status_events, notify = self._make_status_events(0)
        scheduler.notifier.publish = notify

        scheduler.start()

        expid_idle = _get_expid("IdleExperiment")
        # Suppress the SystemExit backtrace when worker process is killed.
        expid_idle["log_level"] = logging.CRITICAL
        scheduler.submit("main", expid_idle, 0, None, False)

        return scheduler, status_events, flags

    def test_close_with_idle_running(self):
        """Check scheduler exits with experiment still idle"""
        loop = self.loop
        scheduler, status_events, flags = self._make_scheduler_for_idle_test()
        loop.run_until_complete(status_events["running"].wait())
        flags.be_idle = True
        loop.run_until_complete(status_events["idle"].wait())
        loop.run_until_complete(scheduler.stop())

    def test_close_with_prepared_bg(self):
        """Check scheduler exits when there is still a prepare_done experiment"""
        loop = self.loop
        scheduler, status_events, _ = self._make_scheduler_for_idle_test()

        loop.run_until_complete(status_events["running"].wait())

        # Submit lower-priority experiment that is still waiting to be run when
        # the scheduler is terminated.
        expid = _get_expid("EmptyExperiment")
        scheduler.submit("main", expid, -1, None, False)

        loop.run_until_complete(scheduler.stop())

    def test_resume_idle(self):
        """Check scheduler resumes previously idle experiments"""
        loop = self.loop
        scheduler, status_events, flags = self._make_scheduler_for_idle_test()
        loop.run_until_complete(status_events["running"].wait())

        # Make sure we can un-idle by returning False from the idle callback.
        flags.be_idle = True
        loop.run_until_complete(status_events["idle"].wait())

        status_events["running"].clear()
        flags.be_idle = False
        loop.run_until_complete(status_events["running"].wait())

        # Make sure we can un-idle by requesting termination.
        flags.be_idle = True
        loop.run_until_complete(status_events["idle"].wait())

        self.assertFalse(flags.termination_ok)
        scheduler.request_termination(0)
        loop.run_until_complete(status_events["deleting"].wait())
        self.assertTrue(flags.termination_ok)

        loop.run_until_complete(scheduler.stop())

    def test_idle_bg(self):
        """Check scheduler runs lower-priority experiments while idle"""
        loop = self.loop
        scheduler, idle_status_events, flags = self._make_scheduler_for_idle_test()

        bg_status_events, notify = self._make_status_events(
            1, scheduler.notifier.publish)
        scheduler.notifier.publish = notify

        loop.run_until_complete(idle_status_events["running"].wait())

        # Submit experiment with lower priority.
        expid = _get_expid("BackgroundExperiment")
        scheduler.submit("main", expid, -1, None, False)

        check_pause_1 = lambda: loop.run_until_complete(scheduler.check_pause(1))

        loop.run_until_complete(bg_status_events["prepare_done"].wait())

        flags.be_idle = True
        loop.run_until_complete(bg_status_events["running"].wait())
        self.assertFalse(check_pause_1())

        idle_status_events["running"].clear()
        flags.be_idle = False
        loop.run_until_complete(idle_status_events["running"].wait())

        bg_status_events["running"].clear()
        flags.be_idle = True
        self.assertFalse(check_pause_1())
        loop.run_until_complete(bg_status_events["running"].wait())

        idle_status_events["running"].clear()
        flags.be_idle = False
        self.assertTrue(check_pause_1())
        loop.run_until_complete(idle_status_events["running"].wait())

        scheduler.request_termination(0)
        scheduler.request_termination(1)
        loop.run_until_complete(bg_status_events["deleting"].wait())

        loop.run_until_complete(scheduler.stop())

    def test_idle_timeout(self):
        """Check that blocking is_idle_callback times out"""
        loop = self.loop
        scheduler, status_events, flags = self._make_scheduler_for_idle_test()

        # Submit empty experiment with lower priority.
        empty_status_events, notify = self._make_status_events(
            1, scheduler.notifier.publish)
        scheduler.notifier.publish = notify

        expid = _get_expid("EmptyExperiment")
        scheduler.submit("main", expid, -1, None, False)

        # Make the is_idle_callback block; silencing the expected error
        # message in the log output.
        logging.disable(logging.ERROR)
        flags.be_idle = True
        flags.take_long_in_idle = True
        loop.run_until_complete(status_events["deleting"].wait())
        logging.disable(logging.NOTSET)

        # Make sure EmptyExperiment completes normally now.
        loop.run_until_complete(empty_status_events["run_done"].wait())

        loop.run_until_complete(scheduler.stop())

@sbourdeauducq
Copy link
Member

Oh, I see. In this case it is probably simpler to have the test check relevant properties of the schedule instead of comparing it exactly to an expected schedule. Synchronizing the experiments to enforce a fully deterministic schedule would need complicated IPC.

@sbourdeauducq
Copy link
Member

some of them do rely on implicitly defining order via sleeps.

Which ones? Only moninj tests do that (to ensure a scan and event report by the firmware) IIRC.

@dnadlinger
Copy link
Collaborator

Oh, just checked the Git history, and apparently I did write the offending test myself two years ago – how embarassing (in fairness, I was probably just copying test_steps() without thinking about what ordering guarantees the scheduler actually makes).

The above is an example on using get_dataset to do simple synchronisation, but yes, just testing for the expected properties might be cleaner and shorter.

Which ones [rely on sleep]?

It might be that the 0.2 second intervals in which pause/check_pause are called by the test experiments don't actually matter, only some scheduler internals – it's been a while.

@sbourdeauducq
Copy link
Member

Those are just to prevent 100% CPU utilization while it waits for check_pause() to return True.

@ljstephenson
Copy link
Contributor Author

Agree on checking properties; given the current overlap between this test and both test_steps() and test_pause(), it seems like the relevant check is that rid 2 goes through all stages before rid 1 does anything?

@marmeladapk
Copy link
Contributor

I had this test fail, when my laptop cpu was forced to lowest frequency and other processes were running. With proper frequency restored this test passed.

@dnadlinger
Copy link
Collaborator

Should be fixed (without a degradation in the behaviour for this) in 530f67f.

#1976 would still be good to be able to test more complex scenarios more easily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants