Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-9365] fix flaky test #372

Conversation

mattjperez
Copy link
Member

@mattjperez mattjperez commented Jan 7, 2025

the test has occasionally tripped, poisoning the global_network_lock (fixed).
It's unclear how much time it would be needed to ensure it doesn't trip (or if other manageable factors are involved), but 1sec seems like a still-acceptable upper-bound on a single test. still faster than running the whole CI workflow again.

If other tests with similar timeouts have this issue it might make sense to make the timeout value global for tests.

@mattjperez mattjperez requested a review from a team as a code owner January 7, 2025 16:41
@@ -1357,7 +1357,7 @@ mod tests {
let _task = cloned_exec.spawn(async move {
viam_server.run().await;
});
let _ = Timer::after(Duration::from_millis(500)).await;
let _ = Timer::after(Duration::from_millis(1000)).await;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a way you can completely eliminate the possibility of a race between the _fake_server_task and the viam_server.run() task given what we have now, but you can I think significantly shrink the window. Put some sort of condvar like object into play, and have the app.auth_fn signal it when it. Then, wait on it before asserting on the state of cloned_ram_storage. That will at least guarantee that the auth endpoint on run_fake_app_server was invoked, meaning that the viam server has progressed far enough to have attempted it. You will still need a timeout, but I'll bet a much smaller one than 500 milliseconds will be effective in practice, rather than goign the other way and just doubling the timeout to one second.

A fully deterministic solution would probably require adding hooks into ViamServer to allow tests to wait for it to have progressed to certain will known states, and we don't have that now. We probably should at some point.

@@ -1355,6 +1363,18 @@ mod tests {
let _fake_server_task =
cloned_exec.spawn(async move { run_fake_app_server(other_clone, app).await });
let _task = cloned_exec.spawn(async move {
let (lock, cvar) = &*pair;
for _ in 0..3 {
// lock within loop to drop mutex guard between (a)waits
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1346,7 +1346,15 @@ mod tests {

let mut app = AppServerInsecure::default();

// condvar to serialize fake-server-task auth and viam_server.run()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use tokio::sync::Notify, rather than needing to roll a Mutex and CondVar pair?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like we could, but would need to use async-compat since this is a futures executor and not a tokio one.

https://docs.rs/async-compat/latest/async_compat/
Tokio’s types cannot be used outside tokio context, so any attempt to use them will panic.

Solution: If you apply the Compat adapter to a future, the future will manually enter the context of a global tokio runtime. If a runtime is already available via tokio thread-locals, then it will be used. Otherwise, a new single-threaded runtime will be created on demand. That does not mean the future is polled by the tokio runtime - it only means the future sets a thread-local variable pointing to the global tokio runtime so that tokio’s types can be used inside it.

This would look something like

use async_compat::{Compat, CompatExt};
use tokio::sync::Notify;

fn futures_test() {
    futures::executor::block_on(Compat::new(async {
        notify2.notified().await;
        println!("received notification");
    }))
}

Still need to see if this would compile, but given the above info, still want me to try implementing it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if it isn't a drop-in simplification, don't bother.

@mattjperez mattjperez requested a review from acmorrow January 21, 2025 14:19
@@ -1355,6 +1363,18 @@ mod tests {
let _fake_server_task =
cloned_exec.spawn(async move { run_fake_app_server(other_clone, app).await });
let _task = cloned_exec.spawn(async move {
let (lock, cvar) = &*pair;
for _ in 0..3 {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

limited to eventually fail if there are legitimate errors

@mattjperez mattjperez changed the title [RSDK-9365] increase timeout on flaky test [RSDK-9365] fix flaky test Jan 21, 2025
Copy link
Member

@npmenard npmenard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a concern about using sync in an async context where I am not sure it's doing what we want to do.
Generally if this test fails it's because we couldn't reset the credentials before the timeout if we use a notifier in fn_auth we should make sure to leave enough time for the code clearing the creds to be ran

.unwrap();
started = result.0;
if *started {
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this work the way we expecting to? This code seems to prevent viam_server to be run until the condition is cleared. But the condition can only be cleared if the viam_server makes a request to app_client.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears to 🤷‍♂️ ? tests pass consistently.
I haven't dove into how auth_fn and everything is supposed to work, so maybe this is 'working' but not for the right reasons.

@mattjperez
Copy link
Member Author

from offline discussion:

as the executor is not multithreaded, the condvar isn't doing what we
think it's doing, and is in fact just blocking the whole executor.

in lieu of a hook/channel/flag to determine the state of a given
request/response, we are just using the fact that storage credentials
are erased as the flag that the request was processed
correctly.

the 0..10 loop of 50 milliseconds equates to the original timeout
duration but also gives more opportunities (via the timeout's .await)
for the server and fake_server to make progress

@mattjperez mattjperez requested a review from npmenard January 22, 2025 19:22
Copy link
Member

@npmenard npmenard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! just add a comment about the loop to explain what's its doing and why it's written this way

as the executor is not multithreaded, the condvar isn't doing what we
think it's doing, and is in fact just blocking the whole executor.

in lieu of a hook/channel/flag to determine the state of a given
request/response, we are just using the fact that storage credentials
are erased as the determining factor that the request was processed
correctly.

the 0..10 loop of 50 milliseconds equates to the original timeout
duration but also gives more opportunities (via the timeout's .await)
for the server and fake_server to make progress
@mattjperez mattjperez force-pushed the rsdk-9365-flaky-test-app-permission-denied branch from afb40ff to edb7c2d Compare January 23, 2025 15:20
Copy link
Member

@acmorrow acmorrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think we will still want a way to do the sort of waiting that I originally suggested, but for the purposes of stabilizing this test, I'm fine with this approach for now.

@mattjperez mattjperez merged commit 7efaeb8 into viamrobotics:main Jan 23, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants