[RSDK-9365] fix flaky test #372

mattjperez · 2025-01-07T16:41:40Z

the test has occasionally tripped, poisoning the global_network_lock (fixed).
It's unclear how much time it would be needed to ensure it doesn't trip (or if other manageable factors are involved), but 1sec seems like a still-acceptable upper-bound on a single test. still faster than running the whole CI workflow again.

If other tests with similar timeouts have this issue it might make sense to make the timeout value global for tests.

acmorrow · 2025-01-07T17:05:01Z

micro-rdk/src/common/conn/viam.rs

@@ -1357,7 +1357,7 @@ mod tests {
            let _task = cloned_exec.spawn(async move {
                viam_server.run().await;
            });
-            let _ = Timer::after(Duration::from_millis(500)).await;
+            let _ = Timer::after(Duration::from_millis(1000)).await;


I don't see a way you can completely eliminate the possibility of a race between the _fake_server_task and the viam_server.run() task given what we have now, but you can I think significantly shrink the window. Put some sort of condvar like object into play, and have the app.auth_fn signal it when it. Then, wait on it before asserting on the state of cloned_ram_storage. That will at least guarantee that the auth endpoint on run_fake_app_server was invoked, meaning that the viam server has progressed far enough to have attempted it. You will still need a timeout, but I'll bet a much smaller one than 500 milliseconds will be effective in practice, rather than goign the other way and just doubling the timeout to one second.

A fully deterministic solution would probably require adding hooks into ViamServer to allow tests to wait for it to have progressed to certain will known states, and we don't have that now. We probably should at some point.

mattjperez · 2025-01-21T14:18:50Z

micro-rdk/src/common/conn/viam.rs

@@ -1355,6 +1363,18 @@ mod tests {
            let _fake_server_task =
                cloned_exec.spawn(async move { run_fake_app_server(other_clone, app).await });
            let _task = cloned_exec.spawn(async move {
+                let (lock, cvar) = &*pair;
+                for _ in 0..3 {
+                    // lock within loop to drop mutex guard between (a)waits


refer to On using std::sync::Mutex and tokio::sync::Mutex.

mattjperez · 2025-01-21T14:19:22Z

micro-rdk/src/common/conn/viam.rs

@@ -1346,7 +1346,15 @@ mod tests {

        let mut app = AppServerInsecure::default();

+        // condvar to serialize fake-server-task auth and viam_server.run()


code taken from https://doc.rust-lang.org/std/sync/struct.Condvar.html#method.wait_timeout

Could you use tokio::sync::Notify, rather than needing to roll a Mutex and CondVar pair?

it seems like we could, but would need to use async-compat since this is a futures executor and not a tokio one.

https://docs.rs/async-compat/latest/async_compat/
Tokio’s types cannot be used outside tokio context, so any attempt to use them will panic.

Solution: If you apply the Compat adapter to a future, the future will manually enter the context of a global tokio runtime. If a runtime is already available via tokio thread-locals, then it will be used. Otherwise, a new single-threaded runtime will be created on demand. That does not mean the future is polled by the tokio runtime - it only means the future sets a thread-local variable pointing to the global tokio runtime so that tokio’s types can be used inside it.

This would look something like

use async_compat::{Compat, CompatExt}; use tokio::sync::Notify; fn futures_test() { futures::executor::block_on(Compat::new(async { notify2.notified().await; println!("received notification"); })) }

Still need to see if this would compile, but given the above info, still want me to try implementing it?

No, if it isn't a drop-in simplification, don't bother.

mattjperez · 2025-01-21T14:54:54Z

micro-rdk/src/common/conn/viam.rs

@@ -1355,6 +1363,18 @@ mod tests {
            let _fake_server_task =
                cloned_exec.spawn(async move { run_fake_app_server(other_clone, app).await });
            let _task = cloned_exec.spawn(async move {
+                let (lock, cvar) = &*pair;
+                for _ in 0..3 {


limited to eventually fail if there are legitimate errors

npmenard

Just a concern about using sync in an async context where I am not sure it's doing what we want to do.
Generally if this test fails it's because we couldn't reset the credentials before the timeout if we use a notifier in fn_auth we should make sure to leave enough time for the code clearing the creds to be ran

npmenard · 2025-01-22T14:53:43Z

micro-rdk/src/common/conn/viam.rs

+                        .unwrap();
+                    started = result.0;
+                    if *started {
+                        break;


does this work the way we expecting to? This code seems to prevent viam_server to be run until the condition is cleared. But the condition can only be cleared if the viam_server makes a request to app_client.

it appears to 🤷‍♂️ ? tests pass consistently.
I haven't dove into how auth_fn and everything is supposed to work, so maybe this is 'working' but not for the right reasons.

mattjperez · 2025-01-22T19:22:42Z

from offline discussion:

as the executor is not multithreaded, the condvar isn't doing what we
think it's doing, and is in fact just blocking the whole executor.

in lieu of a hook/channel/flag to determine the state of a given
request/response, we are just using the fact that storage credentials
are erased as the flag that the request was processed
correctly.

the 0..10 loop of 50 milliseconds equates to the original timeout
duration but also gives more opportunities (via the timeout's .await)
for the server and fake_server to make progress

npmenard

LGTM! just add a comment about the loop to explain what's its doing and why it's written this way

as the executor is not multithreaded, the condvar isn't doing what we think it's doing, and is in fact just blocking the whole executor. in lieu of a hook/channel/flag to determine the state of a given request/response, we are just using the fact that storage credentials are erased as the determining factor that the request was processed correctly. the 0..10 loop of 50 milliseconds equates to the original timeout duration but also gives more opportunities (via the timeout's .await) for the server and fake_server to make progress

acmorrow

LGTM. I think we will still want a way to do the sort of waiting that I originally suggested, but for the purposes of stabilizing this test, I'm fine with this approach for now.

mattjperez requested a review from a team as a code owner January 7, 2025 16:41

acmorrow requested changes Jan 7, 2025

View reviewed changes

mattjperez commented Jan 21, 2025

View reviewed changes

mattjperez requested a review from acmorrow January 21, 2025 14:19

mattjperez commented Jan 21, 2025

View reviewed changes

mattjperez changed the title ~~[RSDK-9365] increase timeout on flaky test~~ [RSDK-9365] fix flaky test Jan 21, 2025

npmenard reviewed Jan 22, 2025

View reviewed changes

mattjperez requested a review from npmenard January 22, 2025 19:22

npmenard approved these changes Jan 22, 2025

View reviewed changes

mattjperez added 5 commits January 23, 2025 10:20

increase timeout

6d27401

add condvar suggestion, revert total test duration

e823f09

clarifying comments

67b0e8e

update comment

edb7c2d

mattjperez force-pushed the rsdk-9365-flaky-test-app-permission-denied branch from afb40ff to edb7c2d Compare January 23, 2025 15:20

acmorrow approved these changes Jan 23, 2025

View reviewed changes

mattjperez merged commit 7efaeb8 into viamrobotics:main Jan 23, 2025
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RSDK-9365] fix flaky test #372

[RSDK-9365] fix flaky test #372

mattjperez commented Jan 7, 2025 •

edited

Loading

acmorrow Jan 7, 2025

mattjperez Jan 21, 2025

mattjperez Jan 21, 2025

acmorrow Jan 21, 2025

mattjperez Jan 21, 2025

acmorrow Jan 21, 2025

mattjperez Jan 21, 2025

npmenard left a comment

npmenard Jan 22, 2025

mattjperez Jan 22, 2025

mattjperez commented Jan 22, 2025

npmenard left a comment

acmorrow left a comment

		@@ -1346,7 +1346,15 @@ mod tests {

		let mut app = AppServerInsecure::default();

		// condvar to serialize fake-server-task auth and viam_server.run()

[RSDK-9365] fix flaky test #372

[RSDK-9365] fix flaky test #372

Conversation

mattjperez commented Jan 7, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

npmenard left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattjperez commented Jan 22, 2025

npmenard left a comment

Choose a reason for hiding this comment

acmorrow left a comment

Choose a reason for hiding this comment

mattjperez commented Jan 7, 2025 •

edited

Loading