cirrus-run thinks the build is running long after it's done on Cirrus CI side #8

Open
sio opened this issue Oct 15, 2021 · 1 comment
Labels
help wanted Extra attention is needed

Comments

Owner

sio commented Oct 15, 2021

Fewer than 1% of cirrus-run invocations keep waiting long after Cirrus CI has finished the corresponding build. Eventually CIRRUS_TIMEOUT is reached and the job is reported as failed.

This issue needs further investigation. Does the API server sometimes report an incorrect build status? Is this some kind of cache/CDN issue?

Troubleshooting is difficult because of the rarity of this failure and because most invocations of cirrus-run happen non-interactively (via another CI service, e.g. GitLab).

Anyone who encounters a cirrus-run timeout needs to act quickly:

  • Confirm that the build has in fact finished on the Cirrus CI side. A link to the build is usually printed to stdout by cirrus-run.
  • Check the API response for that particular build's status:
    • Optional: Ensure that the CIRRUS_API_TOKEN environment variable is set to a correct value. Without a token only public repos will be viewable, and API rate limits will probably be stricter.
    • Execute make debug/build_status DEBUG_BUILD_ID=5735044040884224 from the repo's top-level directory (replace the number with your build ID)
  • Report the output here. If the script just keeps repeating the same "EXECUTING" status, you can interrupt it with Ctrl+C or keep it running to see when/if it fails.
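If running make is not an option, the status check in the steps above can be approximated with a single GraphQL query. This is a minimal sketch and not the actual debug/build_status.py: it assumes the Cirrus CI GraphQL endpoint at https://api.cirrus-ci.com/graphql, Bearer-token authorization, and an `ID!` argument type for `build`. The helper names (`build_request`, `extract_status`, `fetch_status`) are made up for illustration.

```python
import json
import os
import urllib.request

# Assumption: public Cirrus CI GraphQL endpoint.
API_URL = "https://api.cirrus-ci.com/graphql"

# Assumption: the argument type of build(id:) is ID!.
QUERY = """
query BuildStatus($id: ID!) {
  build(id: $id) {
    id
    status
    durationInSeconds
    clockDurationInSeconds
    buildCreatedTimestamp
    changeTimestamp
  }
}
"""


def build_request(build_id, token=None):
    """Prepare (but do not send) the POST request for one build."""
    payload = json.dumps({"query": QUERY, "variables": {"id": build_id}})
    headers = {"Content-Type": "application/json"}
    if token:
        # Without a token only public repos are viewable.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        API_URL, data=payload.encode("utf-8"), headers=headers, method="POST"
    )


def extract_status(response_body):
    """Pull the build status out of a standard GraphQL JSON reply."""
    reply = json.loads(response_body)
    if reply.get("errors"):
        raise RuntimeError(f"API returned errors: {reply['errors']}")
    return reply["data"]["build"]["status"]


def fetch_status(build_id):
    """Send the query; the token is read from CIRRUS_API_TOKEN if set."""
    request = build_request(build_id, os.environ.get("CIRRUS_API_TOKEN"))
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_status(response.read())
```

Calling `fetch_status("5735044040884224")` from a second host while cirrus-run is still waiting would show whether both hosts see the same status.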
sio added a commit that referenced this issue Oct 15, 2021
sio added the "help wanted" label Oct 15, 2021
sio added a commit that referenced this issue Jan 19, 2022
Owner Author

sio commented Mar 23, 2022

I caught another snowflake yesterday.

It looks like some kind of caching issue. I was running a lot of concurrent Cirrus CI jobs (30+) and most of them were delayed by the community cluster scheduler, so a lot of cirrus-run instances were just sitting there, each querying the API every few seconds. It appears that after some number of repeated queries the reply got cached somewhere and was not updated when the job finished successfully. Running the same query from a different host (my workstation) produced the correct result immediately.

There is no caching built into cirrus-run, so the stale reply must be coming from the API itself or from some middleware in between (CDN?). I'm not sure if/how this is fixable on our side.
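One client-side experiment that might help narrow this down, purely as an untested idea: ask intermediaries not to serve a cached reply and make every request URL unique. The sketch below assumes the stale reply comes from an HTTP-level cache that honors request headers or keys on the URL, which has not been confirmed. The endpoint URL, the `nocache` query parameter and the `uncached_request` helper are illustrative and not part of cirrus-run.

```python
import json
import urllib.request
import uuid

# Assumption: public Cirrus CI GraphQL endpoint.
API_URL = "https://api.cirrus-ci.com/graphql"


def uncached_request(query, variables):
    """Prepare a POST request that tries to bypass HTTP caches."""
    # Hypothetical parameter: a random value makes each URL unique, so a
    # cache keyed on the URL cannot match an earlier request.
    nonce = uuid.uuid4().hex
    url = f"{API_URL}?nocache={nonce}"
    payload = json.dumps({"query": query, "variables": variables})
    headers = {
        "Content-Type": "application/json",
        # Standard request headers asking caches to revalidate.
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
    }
    return urllib.request.Request(
        url, data=payload.encode("utf-8"), headers=headers, method="POST"
    )
```

If the hang persists with requests like these, an HTTP-level cache becomes a less likely culprit and the stale status is more likely produced by the API server itself.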


  • GitLab CI log: cirrus-run hangs indefinitely. Output verbosity is set to low, unfortunately.
  • Cirrus CI build
  • Debugging script returned correct status immediately:
$ .venv/bin/python debug/build_status.py 6671363116105728
https://cirrus-ci.com/build/6671363116105728
2022-03-22 14:57:52+00:00 (UTC)
{
  "build": {
    "id": "6671363116105728",
    "durationInSeconds": 1193,
    "clockDurationInSeconds": 1264,
    "status": "COMPLETED",
    "buildCreatedTimestamp": 1647958124677,
    "changeTimestamp": 1647958124677
  }
}
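
The numbers in this output also give a rough idea of how long the build had already been finished when the debugging script ran. The sketch below makes three assumptions that are not confirmed here: the timestamps are milliseconds since the Unix epoch, clockDurationInSeconds is wall-clock time counted from build creation, and the date line printed by the script is the time of the query. The `finished_ago` helper is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def finished_ago(build, queried_at):
    """Return (created_at, finished_at, time elapsed since finish)."""
    # Assumption: timestamp is in milliseconds since the Unix epoch.
    created_at = EPOCH + timedelta(milliseconds=build["buildCreatedTimestamp"])
    # Assumption: clock duration is measured from build creation.
    finished_at = created_at + timedelta(seconds=build["clockDurationInSeconds"])
    return created_at, finished_at, queried_at - finished_at


build = {"clockDurationInSeconds": 1264, "buildCreatedTimestamp": 1647958124677}
queried_at = datetime(2022, 3, 22, 14, 57, 52, tzinfo=timezone.utc)

created_at, finished_at, stale_for = finished_ago(build, queried_at)
print(created_at.isoformat(timespec="seconds"))   # 2022-03-22T14:08:44+00:00
print(finished_at.isoformat(timespec="seconds"))  # 2022-03-22T14:29:48+00:00
print(int(stale_for.total_seconds() // 60))       # 28
```

Under those assumptions the build had been complete for about 28 minutes when the correct status was fetched from the workstation.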
