store: Only close self-pipe when we're done #672
Seems to fix it for me on my single-CPU reproducer, although it looks like the CI test failed because now it's getting stuck during pull?
This took a crazy long time to debug but after lots of false starts I think this is right.

Basically what's going on is we have async tasks that are talking over a `pipe()` inside our own process. We must not close the read side of the pipe until the writer is done.

I believe this is dependent on tokio task scheduling order, and it's way easier to reproduce when pinned to a single CPU.

Closes: ostreedev#657
Signed-off-by: Colin Walters <[email protected]>
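For anyone following along, here is a minimal sketch of the failure mode described above. It is not the code from this PR: it assumes a Unix target, uses `UnixStream::pair()` from the standard library as a stand-in for the `pipe()`, uses plain threads instead of tokio tasks, and `write_through_pair` is a hypothetical helper name. It only shows that a writer fails once the read side has been closed underneath it, and succeeds when the reader stays open until the writer is done.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;

/// Hypothetical helper (not from ostree-ext): writes 1 MiB through a socket
/// pair and returns the writer's result plus the number of bytes the reader saw.
fn write_through_pair(close_reader_early: bool) -> (io::Result<()>, usize) {
    // Assumption: a socket pair stands in for the in-process pipe().
    let (mut reader, mut writer) = UnixStream::pair().expect("socketpair failed");
    let payload = vec![0u8; 1 << 20];

    if close_reader_early {
        // The bug pattern: the read side goes away before the writer is done.
        drop(reader);
        let res = writer.write_all(&payload);
        (res, 0)
    } else {
        // The intended ordering: the reader stays open until it sees EOF.
        let t = thread::spawn(move || {
            let mut buf = Vec::new();
            reader.read_to_end(&mut buf)
        });
        let res = writer.write_all(&payload);
        drop(writer); // closing the write side is what signals EOF to the reader
        let n = t.join().expect("reader thread panicked").expect("read failed");
        (res, n)
    }
}

fn main() {
    let (res, n) = write_through_pair(false);
    println!("reader kept open: writer ok = {}, bytes read = {}", res.is_ok(), n);

    let (res, _) = write_through_pair(true);
    match res {
        Ok(()) => println!("reader closed early: write unexpectedly succeeded"),
        Err(e) => println!("reader closed early: write failed with {:?}", e.kind()),
    }
}
```

In the async version the ordering depends on which task the scheduler polls first, which would fit the observation that pinning to a single CPU makes the problem much easier to hit.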
374fa43 to 3b2c098
Yeah that's concerning. I have no working theory for why it'd work when run from A hang is definitely worse than a flake here and symptomatic of a larger problem.
@jeckersb do you want to play with this more and see if you can reproduce with bootc and if so get the stack trace etc?
Yeah I'll try with bootc. I let it run all morning with just ostree-ext and didn't have any problems fwiw.
First stab with gdb, helpful but it also gets confused by the re-exec dance.
Also the progress gets stuck, it hangs for maybe 5 seconds, and then some threads exit (skopeo?):
Actually it's just other tokio worker threads. I C-c'd it after it hung but before they terminated to get their stacks too:
Relevant bits with
Meanwhile
So if I'm following correctly, I think we've lost the handle on whatever is supposed to be driving things with skopeo? The skopeo process is still there, and we're still waiting in the blocking task for the skopeo process to exit, but we don't have any stack frames on the bootc side to drive the next API action with skopeo?
Yeah I don't think this is right in practice...or at least I'm not confident in it enough. Will comment back in the issue