[RSDK-9365] fix flaky test #372
Conversation
micro-rdk/src/common/conn/viam.rs
Outdated
@@ -1357,7 +1357,7 @@ mod tests {
        let _task = cloned_exec.spawn(async move {
            viam_server.run().await;
        });
-       let _ = Timer::after(Duration::from_millis(500)).await;
+       let _ = Timer::after(Duration::from_millis(1000)).await;
I don't see a way you can completely eliminate the possibility of a race between the _fake_server_task and the viam_server.run() task given what we have now, but I think you can significantly shrink the window. Put some sort of condvar-like object into play, and have the app.auth_fn signal it when it is invoked. Then, wait on it before asserting on the state of cloned_ram_storage. That will at least guarantee that the auth endpoint on run_fake_app_server was invoked, meaning that the viam server has progressed far enough to have attempted it. You will still need a timeout, but I'll bet a much smaller one than 500 milliseconds will be effective in practice, rather than going the other way and just doubling the timeout to one second.
A fully deterministic solution would probably require adding hooks into ViamServer to allow tests to wait for it to have progressed to certain well-known states, and we don't have that now. We probably should at some point.
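For reference, the condvar-and-timeout shape suggested above looks roughly like this as a standalone sketch. The names (`signaler`, the 100 ms budget, the spawned thread standing in for `auth_fn`) are illustrative assumptions, not the actual test's code:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // Shared pair: the bool records whether the auth endpoint has fired.
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let signaler = Arc::clone(&pair);

    // Stand-in for app.auth_fn signaling that it was invoked.
    thread::spawn(move || {
        let (lock, cvar) = &*signaler;
        *lock.lock().unwrap() = true;
        cvar.notify_one();
    });

    // Test side: wait with a timeout much smaller than 500 ms, then assert.
    // wait_timeout_while re-checks the flag first, so a signal that arrives
    // before we start waiting is not lost.
    let (lock, cvar) = &*pair;
    let (invoked, timeout_result) = cvar
        .wait_timeout_while(lock.lock().unwrap(), Duration::from_millis(100), |fired| !*fired)
        .unwrap();
    assert!(!timeout_result.timed_out(), "auth endpoint was never invoked");
    assert!(*invoked);
    println!("auth endpoint invoked; safe to assert on storage state");
}
```

Note this sketch uses a real OS thread; as discussed further down, on a single-threaded executor a blocking condvar wait behaves quite differently.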
micro-rdk/src/common/conn/viam.rs
Outdated
@@ -1355,6 +1363,18 @@ mod tests {
        let _fake_server_task =
            cloned_exec.spawn(async move { run_fake_app_server(other_clone, app).await });
        let _task = cloned_exec.spawn(async move {
+           let (lock, cvar) = &*pair;
+           for _ in 0..3 {
+               // lock within loop to drop mutex guard between (a)waits
micro-rdk/src/common/conn/viam.rs
Outdated
@@ -1346,7 +1346,15 @@ mod tests {
        let mut app = AppServerInsecure::default();
+       // condvar to serialize fake-server-task auth and viam_server.run()
Could you use tokio::sync::Notify, rather than needing to roll a Mutex and CondVar pair?
it seems like we could, but would need to use async-compat since this is a futures executor and not a tokio one.
https://docs.rs/async-compat/latest/async_compat/
Tokio’s types cannot be used outside tokio context, so any attempt to use them will panic. Solution: If you apply the Compat adapter to a future, the future will manually enter the context of a global tokio runtime. If a runtime is already available via tokio thread-locals, then it will be used. Otherwise, a new single-threaded runtime will be created on demand. That does not mean the future is polled by the tokio runtime - it only means the future sets a thread-local variable pointing to the global tokio runtime so that tokio’s types can be used inside it.
This would look something like (sketch; notify2 would be a clone of an Arc<Notify> whose notify_one() is called from auth_fn):

use std::sync::Arc;
use async_compat::Compat;
use tokio::sync::Notify;

fn futures_test() {
    let notify = Arc::new(Notify::new());
    let notify2 = notify.clone();
    // auth_fn would call notify.notify_one() to release this wait
    futures::executor::block_on(Compat::new(async {
        notify2.notified().await;
        println!("received notification");
    }))
}
Still need to see if this would compile, but given the above info, still want me to try implementing it?
No, if it isn't a drop-in simplification, don't bother.
micro-rdk/src/common/conn/viam.rs
Outdated
@@ -1355,6 +1363,18 @@ mod tests {
        let _fake_server_task =
            cloned_exec.spawn(async move { run_fake_app_server(other_clone, app).await });
        let _task = cloned_exec.spawn(async move {
+           let (lock, cvar) = &*pair;
+           for _ in 0..3 {
The iteration count is limited so the test will eventually fail if there are legitimate errors.
Just a concern about using sync primitives in an async context, where I am not sure they're doing what we want.
Generally, if this test fails it's because we couldn't reset the credentials before the timeout. If we use a notifier in fn_auth, we should make sure to leave enough time for the code clearing the creds to run.
micro-rdk/src/common/conn/viam.rs
Outdated
            .unwrap();
        started = result.0;
        if *started {
            break;
Does this work the way we expect it to? This code seems to prevent viam_server from running until the condition is cleared, but the condition can only be cleared if viam_server makes a request to app_client.
it appears to 🤷‍♂️ ? tests pass consistently.
I haven't dived into how auth_fn and everything is supposed to work, so maybe this is 'working' but not for the right reasons.
From offline discussion: as the executor is not multithreaded, the condvar isn't doing what we think it's doing. In lieu of a hook/channel/flag to determine the state of a given request/response, the 0..10 loop of 50 milliseconds equates to the original timeout.
LGTM! Just add a comment about the loop to explain what it's doing and why it's written this way:

As the executor is not multithreaded, the condvar isn't doing what we think it's doing, and is in fact just blocking the whole executor. In lieu of a hook/channel/flag to determine the state of a given request/response, we are just using the fact that storage credentials are erased as the determining factor that the request was processed correctly. The 0..10 loop of 50 milliseconds equates to the original timeout duration, but also gives more opportunities (via the timeout's .await) for the server and fake_server to make progress.
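The polling shape described in that comment can be sketched standalone like this. The `credentials_cleared` closure is a hypothetical stand-in for checking cloned_ram_storage, and the sleep stands in for the `Timer::after(...).await` that lets the single-threaded executor make progress between checks:

```rust
use std::time::{Duration, Instant};

fn main() {
    // Hypothetical stand-in for the server erasing credentials: here the
    // flag flips after roughly 120 ms of elapsed time.
    let start = Instant::now();
    let credentials_cleared = || start.elapsed() >= Duration::from_millis(120);

    // 10 slices of 50 ms = the original 500 ms budget, but each iteration's
    // wait point gives the executor a chance to run the server tasks.
    let mut cleared = false;
    for _ in 0..10 {
        std::thread::sleep(Duration::from_millis(50)); // Timer::after(...).await in the real test
        if credentials_cleared() {
            cleared = true;
            break;
        }
    }
    assert!(cleared, "credentials were not erased within the timeout");
    println!("credentials erased; request was processed");
}
```

The early `break` is what makes this faster than a single fixed sleep in the common case: the test proceeds as soon as the condition holds rather than always paying the full timeout.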
afb40ff to edb7c2d
LGTM. I think we will still want a way to do the sort of waiting that I originally suggested, but for the purposes of stabilizing this test, I'm fine with this approach for now.
the test has occasionally tripped, poisoning the global_network_lock (fixed). It's unclear how much time would be needed to ensure it doesn't trip (or if other manageable factors are involved), but 1 sec seems like a still-acceptable upper bound on a single test; still faster than running the whole CI workflow again.
If other tests with similar timeouts have this issue, it might make sense to make the timeout value global for tests.