Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test flakiness #388

Merged
merged 3 commits into from
Dec 31, 2024
Merged

Fix test flakiness #388

merged 3 commits into from
Dec 31, 2024

Conversation

spacebear21
Copy link
Collaborator

This change seems to reduce the frequency of the lock contention on TCP sockets in tests but doesn't fix it entirely and I'm not yet sure why. Leaving in Draft status while I debug further.

@DanGould
Copy link
Contributor

Have you been able to reproduce the bug in a local environment or are you still relying on CI to produce it?

@spacebear21
Copy link
Collaborator Author

I'm able to reproduce locally but not as frequently as it occurs in CI (possibly just because CI runs 3x as many jobs).

@spacebear21 spacebear21 linked an issue Nov 20, 2024 that may be closed by this pull request
@spacebear21
Copy link
Collaborator Author

Suggestion from @nothingmuch : Make the ohttp relay and payjoin directory port parameters optional, and simply find a free port during initialization if one is not provided.

@spacebear21
Copy link
Collaborator Author

spacebear21 commented Nov 21, 2024

Found an issue that explains why sharing infrastructure across tests isn't working: tokio-rs/tokio#2374. The approach of sharing a global connection pool seems generally discouraged by the tokio maintainers...

TL;DR:

[tokio::test] creates a new runtime for each test, so:

  • the async connection pool of the application is initialized when the first test accesses it
  • the async connection pool is then bound to the runtime of the first test
  • when the first test ends, its runtime is closed and consequently, the pool is also closed
  • then, all other tests fail because the pool is closed.

@spacebear21
Copy link
Collaborator Author

Summarizing the problem:

Each v2 test initializes a OHTTP relay and payjoin directory. These processes run on a random "free" port selected by find_free_port. There is a race condition between finding a free port and actually initializing the relay/directory, resulting in the occasional "Address in use" error when the relay/directory process tries to bind to that port.

I tried the following approaches, none of which fix the issue:

  • Use a Mutex to keep track of which ports have already been reserved. This doesn't work, afaict because it doesn't account for ports used by other processes (e.g. the redis and http clients).
  • Spin up the test infrastructure (ohttp relay + directory) once and share the connections across tests. This doesn't work because tokio tests don't share a runtime (see Fix test flakiness #388 (comment))
  • Don't specify a port and let the directory/ohttp relay select a free port on initialization (as suggested by @nothingmuch ). The issue with this is that the tests need to know the relay & directory ports so that they know where to direct requests.

Other things I considered:

  • Use named unix sockets instead of tcp but this seems quite involved...
  • Use testcontainers to containerize the relay and directory processes, but this would necessitate making custom docker images and I'm not even sure it would work.

I've spent what feels like too many hours on this and am truly stumped... Happy to take another stab if anyone has new suggestions, but until then I'm putting this on the back burner.

@DanGould
Copy link
Contributor

DanGould commented Nov 27, 2024

It can be painful to be stuck on something for a long time. Clearly what you've done so far is valuable and you've put lots of thought into the approach and potential remedies.

I was wondering most why @nothingmuch's comment didn't work and took a stab at it myself (to no avail)

Don't specify a port and let the directory/ohttp relay select a free port on initialization (#388 (comment) by @nothingmuch ). The issue with this is that the tests need to know the relay & directory ports so that they know where to direct requests.

And the problem with this is that it's easy to get a deadlock passing both a handle and a port up from payjoin-directory. I'm sure there's a way to make this work, but my screwing around with it did not readily figure it out. The GET request in the wait_for_service_ready loop seemed never to be able to send(). Easy to understand how you get into a black hole with this one.

In the meantime, might you share any scripts you used to test the tests? So we can pick this up at a time of rest?

@0xBEEFCAF3 just shared today how stoked he was on the clean cut-through interface. That took a long while to get right. This isn't the first time we've run into serious snags and I know we'll get through it.

@spacebear21
Copy link
Collaborator Author

In the meantime, might you share any scripts you used to test the tests?

It pretty much amounts to this bad boy:

export untilfail ()
{
  count=0
  while "$@"; do
    (( count++ ))
    echo "###### RUN COUNT: $count ######"
  done && say "failed after $count runs"
}

I run it like this and wait until the error occurs:

untilfail cargo test --test integration --features=v2,danger-local-https -- --nocapture

Note: on my old machine this fails reliably within ~40 runs - on my new/much faster machine it ~never fails.

@nothingmuch
Copy link
Collaborator

I was wondering most why @nothingmuch's comment didn't work and took a stab at it myself (to no avail)

hacky fix: DanGould/pull/8

opening up some followup issues

@coveralls
Copy link
Collaborator

coveralls commented Dec 2, 2024

Pull Request Test Coverage Report for Build 12563819354

Details

  • 43 of 52 (82.69%) changed or added relevant lines in 1 file are covered.
  • 5 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.01%) to 61.374%

Changes Missing Coverage Covered Lines Changed/Added Lines %
payjoin-directory/src/lib.rs 43 52 82.69%
Files with Coverage Reduction New Missed Lines %
payjoin-directory/src/lib.rs 5 77.81%
Totals Coverage Status
Change from base Build 12538199892: -0.01%
Covered Lines: 2895
Relevant Lines: 4717

💛 - Coveralls

@spacebear21
Copy link
Collaborator Author

spacebear21 commented Dec 2, 2024

Pulled @DanGould and @nothingmuch 's changes into this PR. I ran tests locally with those changes on my old machine for ~1 hour (178 run-throughs) with no failures.

TODOs:

@nothingmuch
Copy link
Collaborator

To do this more cleanly, i guess would suggest introducing some kind of TestServices struct, make init_directory into something that constructs that (combining with relay initialization i suppose), and endow it with informational getters for e.g. the port/host numbers, as well as the owner of the Container/docker CLI structs that need to remain alive?

#421 should probably be addressed as part of such cleanup as well

@spacebear21 spacebear21 force-pushed the fix-test-flakiness branch 4 times, most recently from 7765327 to 4ee1faa Compare December 3, 2024 15:41
@DanGould
Copy link
Contributor

DanGould commented Dec 4, 2024

This must be a priority for test fixtures downstream in payjoin-ffi by binding to #422 so that we can replicate tests in bound implementations

@DanGould DanGould added this to the Enable payjoin-ffi Testing milestone Dec 4, 2024
@DanGould
Copy link
Contributor

This is in a relatively clean state addressing flakiness, however, it can't be merged until payjoin/ohttp-relay#41 is and a new ohttp_relay-0.0.9 version is released. This version depends on the pull request's branch that introduces the same solution as we do on the directory to have it give US a port.

I'm not sure the status of #421 but I'm inclined to merge PRs in piecemeal since our major effort going forward is paying back tech debt. Best if we can target one issue and land it rather than block things that are huge packages of changes.

@DanGould DanGould requested a review from nothingmuch December 30, 2024 22:08
Copy link
Collaborator

@nothingmuch nothingmuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like there's a few new prints from tests, most of which don't print potentially very useful information anymore.

some of the pre-existing ones (that i did not mark with a comment) could/should also be cleaned up, either removed or replaced with tracing calls.

if you agree with tracing in general feel free to merge and open an issue to address this separately, but please first remove the ones that don't seem to have long term value from this PR, otherwise the noise can discourage people from using the logs when debugging tests

payjoin-cli/tests/e2e.rs Outdated Show resolved Hide resolved
payjoin/tests/integration.rs Outdated Show resolved Hide resolved
payjoin-cli/tests/e2e.rs Outdated Show resolved Hide resolved
payjoin/tests/integration.rs Outdated Show resolved Hide resolved
payjoin/tests/integration.rs Outdated Show resolved Hide resolved
payjoin/tests/integration.rs Outdated Show resolved Hide resolved
@DanGould DanGould force-pushed the fix-test-flakiness branch 3 times, most recently from febfae7 to 36f80fc Compare December 31, 2024 03:39
@DanGould
Copy link
Contributor

I realize now when I requested your review I only pushed my changes to my own branch and not the one this PR is based off of. DOH! I still would have left the print statements in without your comment however, so that catch was good. We definitely DO want tracing.

I take it as a very good sign that 4cffe9c passed all tests without flaking :D

@DanGould
Copy link
Contributor

If it passes the smell test, I suppose we could merge with the git ref and update to ohttp-relay 0.0.9 once that is released without issue. It's not actually a hard block

@DanGould DanGould marked this pull request as ready for review December 31, 2024 03:48
@DanGould DanGould removed the blocked label Dec 31, 2024
@DanGould DanGould requested a review from nothingmuch December 31, 2024 03:49
ensure that the db: testcontainer::Container variable does not go out of
scope while the directory is running.

previously the directory task itself was awaited on by init_directory,
whereas in the modified code it is instead returned as part of the result
due to the different return value of listen_tcp_with_tls_on_free_port.
this indirection de-coupled the db variable's lifetime from that of the
directory, allowing it to go out of scope earlier than expected.

Co-authored-by: Yuval Kogman <[email protected]>
Co-authored-by: spacebear <[email protected]>
Previously, the ohttp_relay::listen_tcp, similar to the directory task,
attempted to listen on a given free port, which occasionally caused
test crashes due to port contention. The new version of ohttp_relay
follows the same pattern as payjoin-directory so that we can test
reliably.
payjoin/Cargo.toml Show resolved Hide resolved
@DanGould DanGould merged commit d940ed2 into payjoin:master Dec 31, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Fix flaky integration tests: "Ohttp relay is long running"
4 participants