[unlimited:waifu2x] Multithreading is possible but not configured properly #34

Open
3 tasks
LoganDark opened this issue May 5, 2023 · 16 comments

@LoganDark

LoganDark commented May 5, 2023

Problem

ONNX runtime supports multithreaded model execution, and it will automatically be enabled.

However, that can only happen when SharedArrayBuffer is available, which requires these HTTP headers to be set:

  • Cross-Origin-Embedder-Policy: require-corp
  • Cross-Origin-Opener-Policy: same-origin

https://unlimited.waifu2x.net does not send these headers, so ONNX runtime cannot use multiple threads. I will perform an experiment to show that this is a mistake.

Experiment

I will add these headers for testing by using a Chrome extension.

image

These headers will make SharedArrayBuffer available, and ONNX runtime will automatically use multiple threads.

Parameters for the experiment

  • Model: swin_unet.art_scan
  • Denoise: 3 (highest)
  • Scale: 1 (1x)
  • Tile size: 256 (console: tile size = 256)
  • TTA level: 0 (disabled)
  • Detect alpha: false (no alpha channel)
  • Size of the image: 42 tiles

Performed using the version of unlimited:waifu2x that is currently live at https://unlimited.waifu2x.net.

Result of the experiment

Chromium

  • 1 main thread that performs the execution (no changes)

    1 thread

    388556.5769042969 ms (approx. 9251.347069149926 ms per tile)

  • 12 worker threads that perform the execution (with headers)

    12 threads

    143964.38818359375 ms (approx. 3427.723528180805 ms per tile)

Using 12 threads divides the time taken by 2.698977030408252, a 2.7x improvement.

Firefox

  • 1 main thread that performs the execution (no changes)

    image

    169983ms (approx. 4047.214285714286ms per tile)
    Superseded earlier attempt: DNF (slow); 109955ms for 3 tiles; estimated 1539370ms for 42 tiles (approx. 36651ms per tile)

  • 12 worker threads that perform the execution (enabled dom.postMessage.sharedArrayBuffer.bypassCOOP_COEP.insecure.enabled)

    image

    58643ms (approx. 1396.261904761905ms per tile)
    Superseded earlier attempt: DNF (slow); 147402ms for 18 tiles; estimated 343938ms for 42 tiles (approx. 8189ms per tile)

Using 12 threads divides the time taken by 2.898606824343911, a 2.9x improvement, even larger than in Chromium.
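
The speedup figures for both browsers come from dividing the single-threaded time by the 12-thread time. A minimal sketch of that arithmetic, using only the timings reported above (the names `runs`, `speedup`, and `perTile` are illustrative and not part of unlimited:waifu2x):

```javascript
// Timings copied from the results above (milliseconds, 42 tiles per run).
const TILES = 42;
const runs = {
  chromium: { single: 388556.5769042969, multi: 143964.38818359375 },
  firefox: { single: 169983, multi: 58643 },
};

// Speedup is the single-threaded time divided by the multithreaded time.
function speedup({ single, multi }) {
  return single / multi;
}

// Average time spent on one tile for a given total run time.
function perTile(totalMs) {
  return totalMs / TILES;
}

for (const [browser, run] of Object.entries(runs)) {
  console.log(
    `${browser}: ${speedup(run).toFixed(2)}x faster, ` +
      `${perTile(run.multi).toFixed(0)} ms per tile with 12 threads`
  );
}
```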

Implementation steps

  • Instruct the server to send the required HTTP headers
  • Define ort.env.wasm.numThreads = navigator.hardwareConcurrency before initialization, or else it will default to only 4 threads
  • Enjoy the free speedup
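
The second step could look roughly like the following. This is a minimal sketch, not the project's actual code: `chooseNumThreads` is a hypothetical helper, and the commented usage assumes `ort` is the global exposed by onnxruntime-web. It falls back to a single thread when the page is not cross-origin isolated, since SharedArrayBuffer is unavailable in that case.

```javascript
// Hypothetical helper: decide how many WASM threads to request.
// `env` stands in for the browser globals so the logic can be tried outside a browser.
function chooseNumThreads(env) {
  // Without cross-origin isolation there is no SharedArrayBuffer,
  // so only one thread is usable.
  const canShareMemory =
    env.crossOriginIsolated === true &&
    typeof env.SharedArrayBuffer !== 'undefined';
  if (!canShareMemory) return 1;
  // hardwareConcurrency may be missing in some environments; fall back to 1.
  return Math.max(1, env.hardwareConcurrency || 1);
}

// In the page, before creating any inference session
// (assumes `ort` is the loaded onnxruntime-web global):
//
// ort.env.wasm.numThreads = chooseNumThreads({
//   crossOriginIsolated: self.crossOriginIsolated,
//   SharedArrayBuffer: self.SharedArrayBuffer,
//   hardwareConcurrency: navigator.hardwareConcurrency,
// });
```

A commit message later in this thread reports that half the core count performed best on one machine, so the value may be worth tuning rather than always using every core.
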
@LoganDark
Author

It's astonishing how fast unlimited:waifu2x can get in Firefox with 12 threads. Seems Firefox really is the best at WebAssembly JIT.

It's possible to make it even faster by making the models compatible with ONNX runtime's WebGL or WebGPU backends, so that they can be executed on the GPU, just like with CUDA. In fact, the WebGPU backend might already be compatible (but I have not looked into this yet)

Compatibility mostly consists of removing operators that WebGL doesn't support, like ConstantOfShape. Some optimizers can already recognize and remove these. utils/pad.onnx pictured below, official on left, optimized on right:

[Screenshot: utils/pad.onnx, official on left, optimized on right]

But you also have to adjust int64 values so that they fit in 32 bits (utils/alpha_border_padding.onnx pictured below):

[Image: utils/alpha_border_padding.onnx]

This can be done manually in a Python debugger (as I have successfully done for some models). Not all of the actual int64 types have to be converted to int32 (although the int64 casts need to be removed); the values just need to fit in an int32.
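
To illustrate what "fit in an int32" means for the values themselves, here is a minimal sketch that clamps int64 data into the int32 range while keeping the int64 element type. It is only an illustration of the value adjustment, not the Python procedure used above: `clampToInt32Range` is a hypothetical helper, the sample data is made up, and whether clamping preserves a given model's behavior depends on the operator consuming the values.

```javascript
const INT32_MIN = -(2n ** 31n);
const INT32_MAX = 2n ** 31n - 1n;

// Clamp every element of int64 tensor data into the int32 range.
// The element type stays int64 (BigInt64Array); only the magnitudes change.
function clampToInt32Range(values) {
  const out = new BigInt64Array(values.length);
  for (let i = 0; i < values.length; i++) {
    const v = values[i];
    out[i] = v > INT32_MAX ? INT32_MAX : v < INT32_MIN ? INT32_MIN : v;
  }
  return out;
}

// Made-up example data: a very large int64 value (here INT64_MAX)
// next to small values that already fit.
const original = BigInt64Array.from([0n, 9223372036854775807n, -1n]);
const clamped = clampToInt32Range(original);
console.log(Array.from(clamped, String).join(', '));
```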

I have successfully gotten some of the utility models to load in ONNX runtime's WebGL backend, but unfortunately, this isn't very useful because the utility models are mostly precalculations, and the most expensive part is the actual upscaling model, which uses operators that WebGL doesn't support, namely ConstantOfShape, Where, Expand.

I'm also looking into seeing if I can actually add support for ConstantOfShape into the WebGL backend myself, but of course this is not very easy since I cannot build ONNX runtime from source yet. Maybe I will modify the minified JS (hehehe....). My personal version of unlimited:waifu2x is based on a TypeScript translation/rewrite of reverse engineered minified code.

@nagadomi
Owner

nagadomi commented May 5, 2023

Thank you for sharing.

Multithreading
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

I tried this before but did not apply it, as it was slower than the original code on Chrome.
(I first tried ort.env.wasm.numThreads=4 but it didn't seem to work, so I tried microsoft/onnxruntime#9681 )
I may need to try again.

WebGL

I gave up on using WebGL backend because of the many unsupported functions.
(int32 conversion was possible with a slight modification of https://github.com/aadhithya/onnx-typecast )

WebGPU

I tried it recently, but it does not work yet. microsoft/onnxruntime#15796


As of now, I am hoping to get the WebGPU backend to work.
So I think that the WebGL backend does not have to work.
It would be nice to make the WebAssembly backend faster for users who don't have a GPU.

@LoganDark
Author

I tried this before but did not apply it, as it was slower than the original code on Chrome.

This is clearly not true anymore

(int32 conversion was possible with a slight modification of https://github.com/aadhithya/onnx-typecast )

I also tried modifying that script. But int32 conversion is not required, only reducing magnitude of the values. And full conversion causes the model to fail validation anyway, because some operators require int64 attributes.

I tried it recently, but it does not work yet. microsoft/onnxruntime#15796

Good to know~

As of now, I am hoping to get the WebGPU backend to work.
So I think that the WebGL backend does not have to work.

You're right, it doesn't. This issue itself is about WASM multithreading, not WebGL (that was just a slightly related comment).

It would be nice to make the WebAssembly backend faster for users who don't have a GPU.

Absolutely

@nagadomi
Owner

nagadomi commented May 5, 2023

Also, the PyTorch version (CLI/server and training) runs in 16-bit float (half float).
If 16-bit float can be used in some way, it can be faster without degradation. However, when I previously investigated it, it seemed difficult to use it in JavaScript.

@LoganDark
Author

LoganDark commented May 5, 2023

If 16-bit float can be used in some way, it can be faster without degradation. However, when I previously investigated it, it seemed difficult to use it in JavaScript.

It should be sufficient to convert the input tensor to float16 and back each time you run the model. You can probably use these converter functions and use Uint16Array tensors as float16. Then use a model that expects float16. I will probably perform my own experiments once my codebase is functional
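
The converter functions referred to above are not reproduced here, so the following is only a sketch of what such a conversion could look like. All function names are hypothetical. It packs float32 values into IEEE 754 half-precision bit patterns stored in a Uint16Array and unpacks them again. Rounding here is half away from zero for simplicity; a production converter would more likely round to nearest even.

```javascript
// Shared scratch buffers for reading the raw bits of a float32.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

// Convert one number to a 16-bit half-precision bit pattern.
function float32ToFloat16Bits(value) {
  f32[0] = value;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp32 = (x >>> 23) & 0xff;
  let mant = x & 0x7fffff;

  // Infinity and NaN keep an all-ones exponent.
  if (exp32 === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0);

  // Rebias the exponent from 127 (float32) to 15 (float16).
  let exp16 = exp32 - 127 + 15;
  if (exp16 <= 0) {
    if (exp16 < -10) return sign; // too small: signed zero
    // Subnormal half: restore the implicit 1 and shift into place.
    mant = (mant | 0x800000) >> (1 - exp16);
    if (mant & 0x1000) mant += 0x2000; // round on the first dropped bit
    return sign | (mant >> 13);
  }
  if (mant & 0x1000) {
    mant += 0x2000; // round on the first dropped bit
    if (mant & 0x800000) {
      mant = 0; // mantissa overflowed into the exponent
      exp16 += 1;
    }
  }
  if (exp16 >= 0x1f) return sign | 0x7c00; // overflow: infinity
  return sign | (exp16 << 10) | (mant >> 13);
}

// Convert a half-precision bit pattern back to a JavaScript number.
function float16BitsToFloat32(bits) {
  const sign = bits & 0x8000 ? -1 : 1;
  const exp = (bits >> 10) & 0x1f;
  const mant = bits & 0x3ff;
  if (exp === 0) return sign * mant * 2 ** -24; // zero and subnormals
  if (exp === 0x1f) return mant ? NaN : sign * Infinity;
  return sign * (1 + mant / 1024) * 2 ** (exp - 15);
}

// Pack a whole Float32Array into float16 bit patterns.
function toFloat16Array(float32Values) {
  return Uint16Array.from(float32Values, (v) => float32ToFloat16Bits(v));
}
```

The resulting Uint16Array would then be passed to the runtime as float16 tensor data, assuming the runtime accepts that representation as suggested above, and the output converted back with `float16BitsToFloat32`.
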

@nagadomi
Owner

nagadomi commented May 5, 2023

OK, I have confirmed that it is faster with multithreading.

// google-chrome
// default (test on unlimited.waifu2x.net)
tile size = 256
script.js:38 render: 38275.5 ms
tile size = 256
script.js:38 render: 38714.466064453125 ms

// numThreads=16 (test on localhost)
tile size = 256
script.js:489 render: 12700.81005859375 ms
tile size = 256
script.js:489 render: 12487.656005859375 ms

// Firefox
// default
tile size = 256 script.js:28:27
render: 35854ms - timer ended
tile size = 256 script.js:28:27
render: 35756ms - timer ended

// numThreads=16
tile size = 256 script.js:362:17
render: 12039.94ms - timer ended
tile size = 256 script.js:362:17
render: 11316.12ms - timer ended

My earlier measurement may have been mistaken, as execution gets very slow when DevTools is open.

One thing that is not great is that all javascript files must be hosted locally to enable SharedArrayBuffer.

@LoganDark
Author

LoganDark commented May 5, 2023

all javascript files must be hosted locally

You mean vendored (on your server that serves the correct HTTP headers)? You should have been doing that anyway. You should not depend on CDNs for your website's main functionality. You host the models on your server so why not host the runtime to execute them?

@nagadomi
Owner

nagadomi commented May 5, 2023

Ahh, I remember. One of the reasons I didn't use it is because it would not work with Google Analytics or Adsense.

@LoganDark
Author

Ahh, I remember. One of the reasons I didn't use it is because it would not work with Google Analytics or Adsense.

Why not? Can't you vendor those scripts as well?

@nagadomi
Owner

nagadomi commented May 5, 2023

It needs to load scripts from third-party servers.
related to https://stackoverflow.com/questions/68683903/is-there-a-way-to-use-google-adsense-with-cross-origin-isolation

@LoganDark
Author

LoganDark commented May 5, 2023

If you are ok with only being compatible with Chrome 96 and higher, setting Cross-Origin-Embedder-Policy: credentialless should work to keep google ads functional.

https://chromestatus.com/feature/4918234241302528

The header works on my chrome and SharedArrayBuffer exists with it.

But this does not enable multithreading in Firefox (Firefox does not support credentialless).

Also try adding the crossorigin attribute to the script tag. It probably won't work, but it may be worth a try.
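
For reference, the two header combinations discussed in this thread can be written down as a small helper. This is a sketch under assumptions: `isolationHeaders` is a hypothetical function, and how the headers are actually attached depends on the server that hosts the site.

```javascript
// Hypothetical helper: response headers for a given COEP mode.
// 'require-corp' is the strict mode from the original issue;
// 'credentialless' is the variant suggested for keeping third-party scripts working.
function isolationHeaders(mode) {
  if (mode !== 'require-corp' && mode !== 'credentialless') {
    throw new Error(`unknown COEP mode: ${mode}`);
  }
  return {
    'Cross-Origin-Opener-Policy': 'same-origin',
    'Cross-Origin-Embedder-Policy': mode,
  };
}

// Example: print the headers for the credentialless variant.
for (const [name, value] of Object.entries(isolationHeaders('credentialless'))) {
  console.log(`${name}: ${value}`);
}

// Client-side check that the headers took effect (run in the page, not on the server):
// const threadsPossible =
//   self.crossOriginIsolated && typeof SharedArrayBuffer !== 'undefined';
```

With a Node.js server, for instance, each entry could be applied with `res.setHeader(name, value)` before the response is sent.
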

nagadomi added a commit that referenced this issue May 5, 2023
@nagadomi
Owner

nagadomi commented May 5, 2023

For now, I have not been able to get AdSense to work in a cross-origin isolated environment.
I registered the website for the Chrome Origin Trial (SharedArrayBuffer), and it works on Chrome.

@LoganDark
Author

Is there any way to get firefox support as well?

jpohhhh added a commit to Telosnex/fonnx that referenced this issue Dec 15, 2023
- Inference is defined as an async method, but, it blocks.
After a couple days of trying all avenues and looking at sample
apps, it looks like it is synchronous in that it will consume the
attention of the thread the `await session.run` is called on.
- Using Squadron to handle multi-threading didn't work. Now that
the JS function in index.html is loading the model and passing it
to a worker, it's possible it might.
- In any case, this shows exactly how to set up a worker that
A) does inference without blocking UI rendering
B) allows Dart code to `await` the result without blocking UI
- This process was frustrating and fraught, there's a surprising
lack of info and examples around ONNX web. Most seem to consume
it via diffusers.js/transformers.js. ONNX web was a separate library from the
rest of the ONNX runtime until sometime around late 2022. The examples
still use that library, and the examples use simple enough models that it's
hard to catch whether they are blocking without falling back to dev tools.
- It's absolutely crucial when debugging speed locally to make sure you're loading
the ONNX version you expect (i.e. wasm AND threaded AND simd). The easiest
way to check is network loads in Dev Tools, sort by size, and look for the .wasm
file to A) be loaded B) include wasm, simd, and threaded in the filename.
- Two things can prevent that:
-- CORS nonsense with Flutter serving itself in debug mode:
--- see here, nagadomi/nunif#34
--- note that the extension became adware, you should have Chrome set up its
permissions such that it isn't run until you click it. Also, note that you have to do
that each time the Flutter web app in debug mode's port changes.
-- MIME type issues
--- Even after that, I would see errors in console logs about the MIME type of the
.wasm being incorrect and starting with the wrong bytes. That, again, seems due to
local Flutter serving of the web app. To work around that, you can download the
WASM files from the same CDN folder that hosts ort.min.js (see worker.js) and
also in worker.js, remove the // in front of ort.env.wasm.wasmPaths = "". That
indicates you've placed the WASM files next to index.html, which you should.
Note you just need the 4 .wasm files, no more, from the CDN.

Some performance review notes:
- `webgpu` as execution provider completely errors out, says "JS executor
not supported in the ONNX version" (1.16.3)
- `webgl` throws "Cannot read properties of null (reading 'irVersion')"
- Tested perf by varying wasm / simd / thread and thread count on M2 MacBook Air 16 GB ram, Chrome 120
- Landed on simd & thread count = 1/2 of cores as best performing
-- first # is minilm l6v2, second is minilm l6v3, average inference time for 200 / 400 words
-- 4 threads: 526 ms / 2196 ms
-- simd 4 threads: 86 ms / 214 ms
-- simd 8 threads:  106 ms / 260 ms
-- simd 128 threads: 2879 ms / skipped
-- simd navigator.hardwareConcurrency threads (8): 107 ms / 222 ms
@IceTank

IceTank commented Jan 20, 2025

Any update or progress on this? Has multithreading already been implemented?

@LoganDark
Author

LoganDark commented Jan 20, 2025

@IceTank If you use a browser extension to add the HTTP headers as I described in the issue, then yes, multithreading can be used today. Alternatively, you can use that trick with my fork (repo here).

@IceTank

IceTank commented Jan 20, 2025

alternatively you can use that trick with my fork (repo here)

Yes, this is already a lot faster than the original version. Thanks for sharing, and thanks for your work!
