Read & Write race condition #75

Closed
captainjapeng opened this issue Jun 6, 2022 · 5 comments · Fixed by #76

Comments

@captainjapeng
Contributor

Following up on the investigation in jaydenseric/graphql-upload#318: there is a chance that fs.read returns zero bytes when processing a lot of files. I was able to replicate the issue with files of 100, 1024*8, 1024*16, 1024*16+1, and 1024*16*2 bytes.

I have pinpointed the issue to these lines: https://github.com/mike-marcacci/fs-capacitor/blob/main/src/index.ts#L61-L70. Essentially, the stream is ended whenever the write stream has already finished, even if the read stream has not yet fully read the file, as seen in the screenshot below.

[screenshot]
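To spell the pattern out (a simplified TypeScript sketch of the logic described above, not the actual fs-capacitor source; all names here are made up):

```ts
import fs from "fs";

// Simplified sketch: the read side decides it has reached the end of the data
// based only on whether the write stream has finished.
function readChunk(
  fd: number,
  pos: number,
  size: number,
  writeFinished: () => boolean,         // e.g. backed by a private _writableState flag
  push: (chunk: Buffer | null) => void, // push(null) signals end-of-stream
  retryLater: () => void,               // wait for the writer to make progress
  fail: (error: Error) => void
): void {
  const buf = Buffer.allocUnsafe(size);
  fs.read(fd, buf, 0, size, pos, (error, bytesRead) => {
    if (error) return fail(error);

    if (bytesRead > 0) {
      // Normal case: forward whatever was read.
      push(buf.subarray(0, bytesRead));
      return;
    }

    // The race: fs.read can report zero bytes even though more data is on its
    // way to this position. If the write stream has already flipped to
    // "finished" by the time this callback runs, the check below ends the
    // stream early and the remaining bytes are never delivered.
    if (writeFinished()) {
      push(null);
      return;
    }

    // Otherwise, try again once the writer has added more data.
    retryLater();
  });
}
```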

@mike-marcacci
Owner

This is a really good catch – I do wish that there was an official way to make this check (both the initial implementation and your fix rely on private/undocumented properties), but I know that at the time of original authorship at least, there wasn't one.

Just to clarify your reproduction of this, were payloads of the listed sizes able to consistently produce this error, or was it still nondeterministic? I'm just thinking in terms of adding tests for this case.

@captainjapeng
Contributor Author

Hi @mike-marcacci, the payload size is irrelevant. I've tried it with 100 bytes, 16 KiB, 16 KiB + 1 B, and, if I recall correctly, another higher value.

The issue is really non-deterministic, so I can't think of a way to create a test for it either. I still don't understand what can cause fs.read to return zero bytes without any error; my only hunch is that the read runs before write is called and the read callback only fires afterwards.
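As an aside on that hunch: fs.read reports bytesRead === 0 with no error whenever the requested position is at or past the file's current end, so a read that races ahead of the writer can legitimately come back empty. A minimal standalone demo (the temp-file path is made up):

```ts
import fs from "fs";

// Reading at a position where nothing has been written yet yields zero bytes
// and no error, which is exactly what the race above would produce.
const path = "/tmp/zero-read-demo"; // made-up path, only for this demo
const fd = fs.openSync(path, "w+"); // creates an empty file

const buf = Buffer.alloc(16);
fs.read(fd, buf, 0, buf.length, 0, (err, bytesRead) => {
  console.log(err, bytesRead); // => null 0
  fs.closeSync(fd);
  fs.unlinkSync(path);
});
```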

I was only able to replicate it reliably under high concurrency on my local machine, but as discussed in the graphql-upload issue, it also occurs on a low-traffic production deployment on GKE. It can probably also be triggered by a slow disk, although our current setup uses SSD persistent disks throughout.

@captainjapeng
Contributor Author

If you modify this repo to add a counter, https://github.com/captainjapeng/graphql-upload-zero-buffer, as in the screenshot below, you'll see that the error sometimes occurs at an index below 1000 but can also occur near the end of the load test. One thing to note is that the problem is less reproducible when the Node debugger is attached, probably because it adds overhead and slows everything down.

Another thing: sometimes the last printed data index is less than the index at which the zero-length buffer appears, so could it be related to the maximum number of open files?

data 38
data 39
data 40
buffer length 42 0

[screenshot]
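For reference, the counter amounts to roughly the following (a rough sketch only; the real change is in the repo linked above and may count things differently):

```ts
import { Readable } from "stream";

// Rough sketch of the instrumentation: tag each emitted chunk with an index
// so a zero-length buffer can be matched against the last logged "data" index.
function logZeroBuffers(stream: Readable): void {
  let index = 0;
  stream.on("data", (chunk: Buffer) => {
    const i = index++;
    if (chunk.length === 0) {
      // The symptom: an empty buffer emitted mid-stream.
      console.log("buffer length", i, chunk.length);
    } else {
      console.log("data", i);
    }
  });
}
```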

captainjapeng added a commit to captainjapeng/fs-capacitor that referenced this issue Jun 10, 2022
@mike-marcacci
Owner

Ah, this is really helpful – thanks for all the additional details. I doubt there's any test we can add to protect against regressions in future versions of Node, but I'm glad to have this documented here.

@haggholm

> I doubt there's any test we can add to protect against regressions in future versions of Node

This could potentially be a lead:

I encountered what appears to be exactly this bug. The fix is only in the latest versions of fs-capacitor, which are ESM only, and I'm not currently able to migrate the project I'm working on to ESM.

Since I wanted to prove to myself that upgrading to the latest version of fs-capacitor would actually fix the problem (by copying it into my project, since I can't import it), I had to find some way of provoking the issue. Based on your earlier discussion, I tried things like using an SMB-mounted NAS to get a slower filesystem, and so on. That worked, but not deterministically, and (worse yet) with a pretty low success rate ("success" being defined as exposing the bug).

What does seem to work for me is simply concurrency. The test below isn't directly reusable for you, but the gist should be pretty obvious; it references a MultiStream that uses fs-capacitor for internal buffering. It's definitely not perfect, since it's still probabilistic, but it fails a majority of the time with version 7 of fs-capacitor and never fails with version 8.

it("...", async () => {
    // This parameter is critical. There's a bug in older versions of fs-capacitor that causes
    // some bytes to be potentially missed; guard against any possible version (or upstream) regression.
    const N = 15;
    const streams = Array.from({ length: N }).map(() => getMultiStream());

    const results = await Promise.all(streams.map((s) => new LazyPipeline(s).toArray()));
    expect(results).toEqual(Array.from({ length: N }).map(() => objects.flat()));
  },
)

By the way, if the above is of any help (and even if it isn't), would you consider backporting the fix for this bug to a version of the library that's compatible with CommonJS? ESM may be the way of the future, but an awful lot of people work on projects stuck in CJS land. Right now, nearly 80% of your users are on a version of the library with a known data-corruption bug, which is not maximally user friendly.
