[js/webgpu] Optimize ConvTranspose (Continue) #23429
Conversation
This needs to add padding or recalculate the correct offset to get the right result.
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.
/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI
Azure Pipelines successfully started running 1 pipeline(s).
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.
lgtm.
BUG #23273

This PR makes the following optimizations (a sketch of the resulting loop structure follows below):

1. When the number of output channels is one:
   1) Calculate the offset before the in-channel loop, reducing the number of indices-to-offset calculations.
   2) Split `inputChannelsPerGroup` into an `inputChannelsPerGroupInt` part and an `inputChannelsRemainder` part, so that 4 values can always be read at once over `inputChannelsPerGroupInt`.
2. Use a precise initial value to skip useless loop iterations. Thanks to @jiangzhaoming for the suggestions on this.

With this PR, ConvTranspose goes from 8.4 s to 3.7 s on Intel Meteor Lake. On an NV RTX 2000 Ada, it goes from 2.7 s to 1.6 s.
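The two optimizations are easiest to see in scalar form. Below is a minimal TypeScript sketch, not the actual WGSL shader generated by js/web: the function names, signatures, and index arithmetic are hypothetical stand-ins, and only `inputChannelsPerGroupInt` and `inputChannelsRemainder` come from the PR.

```ts
// Hypothetical scalar sketch of the loop split. In the real shader the
// 4-wide reads would be vec4 loads; `input`, `weight`, and the offset
// arguments are illustrative stand-ins.
function innerChannelSum(
  input: Float32Array,
  weight: Float32Array,
  inputOffset: number, // computed once, BEFORE the in-channel loop
  weightOffset: number, // computed once, BEFORE the in-channel loop
  inputChannelsPerGroup: number,
): number {
  // Split the channel count into a multiple-of-4 part and a remainder, so
  // the main loop can always read 4 consecutive values.
  const inputChannelsRemainder = inputChannelsPerGroup % 4;
  const inputChannelsPerGroupInt = inputChannelsPerGroup - inputChannelsRemainder;

  let sum = 0;
  // Main loop: always reads 4 consecutive channels per step.
  for (let c = 0; c < inputChannelsPerGroupInt; c += 4) {
    for (let i = 0; i < 4; i++) {
      sum += input[inputOffset + c + i] * weight[weightOffset + c + i];
    }
  }
  // Tail loop: the 0..3 leftover channels, read one value at a time.
  for (let c = inputChannelsPerGroupInt; c < inputChannelsPerGroup; c++) {
    sum += input[inputOffset + c] * weight[weightOffset + c];
  }
  return sum;
}

// "Precise initial value" (hypothetical 1-D illustration, stride `s`,
// pad `p`, dilation 1): at output position `o`, only kernel taps `k` with
// (o + p - k) % s === 0 and 0 <= (o + p - k) / s < inputSize contribute,
// so the loop can start at the first valid tap and step by `s` instead of
// testing every kernel position.
function kernelLoopStart(o: number, p: number, s: number): number {
  return (o + p) % s; // smallest k >= 0 with (o + p - k) divisible by s
}
```

The split ensures the vectorized main loop never reads past the end of a channel group, while hoisting the offset computation out of the loop removes the per-iteration indices-to-offset conversion.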
### Description

This PR updates the win-ort-main branch to the tip of the main branch as of 2025-01-23.

### PR List

ddf0d37 [QNN EP] Add LoggingManager::HasDefaultLogger() to provider bridge API (#23467)
05fbbdf [QNN EP] Make QNN EP a shared library (#23120)
1336566 Add custom vcpkg ports (#23456)
2e1173c Update the compile flags for vcpkg packages (#23455)
1f628a9 [Mobile] Add BrowserStack Android MAUI Test (#23383)
009cae0 [js/webgpu] Optimize ConvTranspose (Continue) (#23429)
04a4a69 Use onnx_protobuf.h to suppress some GCC warnings (#23453)
2e3b62b Suppress some strict-aliasing related warnings in WebGPU EP (#23454)
b708f9b Bump ruff from 0.9.1 to 0.9.2 (#23427)
c0afc66 [WebNN] Remove workarounds for TFLite backend (#23406)
8a821ff Bump vite from 6.0.7 to 6.0.11 in /js/web/test/e2e/exports/testcases/vite-default (#23446)
220c1a2 Make ORT and Dawn use the same protobuf/abseil source code (#23447)
b7b5792 Change MacOS-13 to ubuntu on for android-java-api-aar-test.yml. (#23444)
19d0d2a WIP: Dp4MatMulNBits accuracy level 4 matmul for WebGPU EP (#23365)
95b8eff [QNN EP]: Clean up QNN logging resources if an error occurs during initialization (#23435)
626134c Bump clang-format from 19.1.6 to 19.1.7 (#23428)
0cf9753 Fix eigen external deps (#23439)
f9440ae Moving RN_CI Android Testing to Linux (#23422)
1aa5902 [QNN EP] workaround for QNN validation bug for Tanh with uint16 quantized output (#23432)
7f5582a Seperate RN andriod and IOS into 2 separated Stages. (#23400)
73deac2 Implement some missing element wise Add/Sub/Mul/Div/Neg operations for CPU and CUDA EPs (#23090)
949fe42 Upgrade Java version from react-native/android to Java 17 (#23066)
0892c23 Update Qnn SDK default version to 2.30 (#23411)
94c099b Fix type cast build error (#23423)
d633e57 [WebNN EP] Fix AddInitializersToSkip issues (#23354)
e988ef0 [QNN EP] Fix regression for MatMul with two quantized/dynamic uint16 inputs (#23419)
7538795 Update onnxruntime binary size checks ci pipeline's docker image (#23405)
6c5ea41 Revert "[QNN EP] Clean up correctly from a partial setup (#23320)" (#23420)
e866804 Enable comprehension simplification in ruff rules (#23414)
0a5f1f3 bugfix: string_view of invalid memory (#23417)
4cc38e0 fix crash when first input of BatchNormalization is 1-D (#23387)
0334414 Target py310 and modernize codebase with ruff (#23401)
87341ac [QNN EP] Fix segfault when unregistering HTP shared memory handles (#23402)

### Motivation and Context

This update includes the change to make QNN-EP a shared library.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
Co-authored-by: Peishen Yan <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: Hector Li <[email protected]>
Co-authored-by: Jian Chen <[email protected]>
Co-authored-by: Alexis Tsogias <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sushraja-msft <[email protected]>
Co-authored-by: Wanming Lin <[email protected]>
Co-authored-by: Jiajia Qin <[email protected]>
Co-authored-by: Caroline Zhu <[email protected]>