Qsync times out a lot when tests are running sequentially #6965

Open
Gerold103 opened this issue Mar 28, 2022 · 2 comments · May be fixed by tarantool/test-run#449

@Gerold103
Collaborator

When I run python test-run.py qsync -j -1, I get several qsync timeout errors. When I run just python test-run.py qsync, everything passes. That is quite strange and deserves investigation. The failing tests are:

replication/gh-5163-qsync-restart-crash
Test failed! Result content mismatch:
--- replication/gh-5163-qsync-restart-crash.result	Fri Feb 11 00:24:42 2022
+++ var/rejects/replication/gh-5163-qsync-restart-crash.reject	Tue Mar 29 00:24:49 2022
@@ -22,13 +22,13 @@
 
 box.space.sync:replace{1}
  | ---
- | - [1]
+ | - error: Quorum collection for a synchronous transaction is timed out
  | ...
 test_run:cmd('restart server default')
  | 
 box.space.sync:select{}
  | ---
- | - - [1]
+ | - []
  | ...
 box.space.sync:drop()
replication/gh-5288-qsync-recovery
Test failed! Result content mismatch:
--- replication/gh-5288-qsync-recovery.result	Fri Feb 11 00:24:42 2022
+++ var/rejects/replication/gh-5288-qsync-recovery.reject	Tue Mar 29 00:25:01 2022
@@ -17,7 +17,7 @@
  | ...
 s:insert{1}
  | ---
- | - [1]
+ | - error: Quorum collection for a synchronous transaction is timed out
  | ...
 box.snapshot()
replication/gh-5298-qsync-recovery-snap
Test failed! Result content mismatch:
--- replication/gh-5298-qsync-recovery-snap.result	Fri Feb 11 00:24:42 2022
+++ var/rejects/replication/gh-5298-qsync-recovery-snap.reject	Tue Mar 29 00:25:17 2022
@@ -22,6 +22,7 @@
  | ...
 for i = 1, 10 do box.space.sync:replace{i} end
  | ---
+ | - error: Quorum collection for a synchronous transaction is timed out
  | ...
 
 -- Local rows could affect this by increasing the signature.
@@ -46,7 +47,7 @@
 -- Could hang if the limbo would incorrectly handle the snapshot end.
 box.space.sync:replace{11}
  | ---
- | - [11]
+ | - error: Quorum collection for a synchronous transaction is timed out
  | ...
 
 old_synchro_quorum = box.cfg.replication_synchro_quorum
@@ -79,7 +80,6 @@
  | ...
 box.space.sync:get({11})
  | ---
- | - [11]
  | ...
 box.space.sync:get({12})
replication/gh-5446-qsync-eval-quorum (almost no output)
Test failed! Result content mismatch:
--- replication/gh-5446-qsync-eval-quorum.result	Fri Feb 11 00:24:42 2022
+++ var/rejects/replication/gh-5446-qsync-eval-quorum.reject	Tue Mar 29 00:27:19 2022
@@ -94,258 +94,3 @@
 
 -- Only one master node -> 1/2 + 1 = 1
 s:insert{1} -- should pass
- | ---
- | - [1]
- | ...

And nothing below this line: all red.

Then I gave up, because the following tests hang.

@Gerold103
Collaborator Author

I think I have tracked down the problem: the _cluster space is not cleared between tests. Here is a minimal reproducer: python test-run.py -j -1 gh-5140 gh-5163 --conf memtx. The test gh-5163 runs second and sees 2 rows in _cluster, which affects the automatic synchro quorum calculation. It should see only one, because this test doesn't start any new instances.
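
To make the effect concrete, here is a minimal Python sketch of the quorum arithmetic. It assumes the automatic replication_synchro_quorum is "N / 2 + 1" rounded down, where N is the number of rows in _cluster (the gh-5446 test comment "1/2 + 1 = 1" implies the rounding). The helper names are hypothetical and this is a model, not Tarantool's implementation. With no replicas running, only the local instance acknowledges a synchronous transaction, so it can commit only while the quorum is 1.

```python
def auto_synchro_quorum(cluster_rows: int) -> int:
    """Automatic quorum, assuming "N / 2 + 1" rounded down (hypothetical model)."""
    return cluster_rows // 2 + 1


def can_commit(acks: int, quorum: int) -> bool:
    """A synchronous transaction commits once it has collected `quorum` acks."""
    return acks >= quorum


# gh-5163 starts no replicas, so only the local instance acks.
local_acks = 1

# Clean _cluster: only the default instance is registered.
clean_quorum = auto_synchro_quorum(1)
# A stale row left over from gh-5140: two rows in _cluster.
stale_quorum = auto_synchro_quorum(2)

print(clean_quorum, can_commit(local_acks, clean_quorum))
print(stale_quorum, can_commit(local_acks, stale_quorum))
```

Under these assumptions a single leftover _cluster row raises the quorum from 1 to 2, and the lone local ack can never satisfy it, which matches the "Quorum collection for a synchronous transaction is timed out" errors above.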

Apparently, it is a leftover from gh-5140, after which test-run for some reason couldn't clear _cluster. I see some relevant code in tarantool/test-run/lib/tarantool-python/test/suites/lib/tarantool_python_ci.lua, but I don't know whether it is called.

The solution is either to fix the auto-clear, or to find all the tests that have the problem and stop using the automatic synchro quorum in them, setting it explicitly to 1, 2, etc.
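
The second option can be modelled the same way: a test that pins the quorum does not depend on how many rows happen to be left in _cluster. The sketch below is a hypothetical Python model (effective_quorum is not part of Tarantool or test-run). It again assumes that the automatic value is "N / 2 + 1" rounded down.

```python
from typing import Optional


def effective_quorum(cluster_rows: int, explicit: Optional[int] = None) -> int:
    """Quorum a test would actually get (hypothetical model).

    An explicitly configured quorum wins; otherwise the automatic
    "N / 2 + 1" formula (rounded down) is applied to the _cluster size.
    """
    if explicit is not None:
        return explicit
    return cluster_rows // 2 + 1


for rows in (1, 2, 3):
    auto = effective_quorum(rows)
    pinned = effective_quorum(rows, explicit=1)
    print(f"_cluster rows={rows}: automatic={auto}, explicit=1 -> {pinned}")
```

In a Tarantool test, pinning the quorum means calling box.cfg{replication_synchro_quorum = 1} (or whatever value the test needs) before the synchronous writes. The automatic value shifts as soon as a stale row appears, but the pinned one does not.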

@Totktonada
Member

test-run restarts the default server before running each test. All non-default servers should be stopped at the end of the test (even if the test itself doesn't do that). Data files (xlog, snap, vylog) should be cleaned up between test runs.
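
The data-file cleanup step described above can be sketched in Python as follows. This is a minimal illustration, not test-run's actual code. The helper name cleanup_data_files is hypothetical, and the file names in the demo are placeholders that only mimic the usual 20-digit Tarantool naming.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions of the Tarantool data files listed above.
DATA_SUFFIXES = {".xlog", ".snap", ".vylog"}


def cleanup_data_files(workdir: Path) -> List[str]:
    """Delete xlog/snap/vylog files under `workdir`; return their names (hypothetical)."""
    removed = []
    for path in sorted(workdir.rglob("*")):
        if path.is_file() and path.suffix in DATA_SUFFIXES:
            path.unlink()
            removed.append(path.name)
    return removed


# Demo on a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    for name in ("00000000000000000000.xlog", "00000000000000000000.snap",
                 "00000000000000000000.vylog", "default.log"):
        (workdir / name).write_text("")
    removed = cleanup_data_files(workdir)
    remaining = sorted(p.name for p in workdir.iterdir())

print(removed)
print(remaining)
```

Only the data files are deleted and the log stays in place. If any such file from a previous test survived (including the snapshot holding _cluster), the next instance would recover the leftover rows on startup.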

In other words, if one test really can leave something in a space and another test sees it, that is a bug in test-run.


5 participants