Loads of issues resulting from use of VkFFT #129

Open
3 of 5 tasks
inducer opened this issue Aug 5, 2022 · 40 comments

inducer (Owner) commented Aug 5, 2022

Issues showed up after #114:

cc @isuruf

inducer (Owner, Author) commented Aug 5, 2022

Pytential's Mac CI on main is failing now, too: https://github.com/inducer/pytential/runs/7694158164?check_suite_focus=true#step:3:1057

A number of @alexfikl's PRs had picked up similar failures, and this establishes that the issue is not specific to them.

I conjecture that this is connected to the #114 merge. To test that conjecture, here's a run without that merge: inducer/pytential#175. If it passes, that's further evidence that #114 is the cause.

The CI runs don't necessarily back up the slowness claim.

At least the overall CI times are relatively similar. @alexfikl, could you provide a way to reproduce your Stokes run that got substantially slower?

The missing barrier is something I encountered in work with @rckirby. I still have to work out what's happening; I'll report back here.

inducer (Owner, Author) commented Aug 5, 2022

The first run of inducer/pytential#175 passed. I've just started another run as an additional data point.

inducer (Owner, Author) commented Aug 5, 2022

Second run passed as well.

alexfikl (Contributor) commented Aug 6, 2022

Just to add more information here from my side.

First, regarding the slowdown: I noticed it on the 3D Stokes operator from inducer/pytential#29 (slightly modified and merged with main), which uses the Laplace formulation from
https://github.com/isuruf/pytential/blob/feec35e928767dd6add0cba55883cd15bed93445/pytential/symbolic/stokes.py#L578-L639
I'll try to reproduce something and come back with some numbers.

Then, for the warnings in sumpy, they were:

/mnt/data/code/inducer/loopy/loopy/kernel/creation.py:1909: LoopyWarning: in kernel m2l_generate_translation_classes_dependent_data: The single-writer dependency heuristic added dependencies on instruction ID(s) 'set_d' to instruction ID 'm2l__insn_2' after kernel creation is complete. This is deprecated and may stop working in the future. To fix this, ensure that instruction dependencies are added/resolved as soon as possible, ideally at kernel creation time. (add 'single_writer_after_creation' to silenced_warnings kernel attribute to disable)

with different m2l__insn_XX in there. From what I can tell, it comes from somewhere around here:

sumpy/sumpy/e2e.py, lines 560 to 565 in 4ace8ea:

    [idata]: m2l_translation_classes_dependent_data[
            itr_class, idata] = \
        m2l_data(
            src_rscale,
            [idim]: d[idim],
        ) {id=update,dep=set_d}

but the dependencies seem correctly declared at least.
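
For context, here's a minimal, self-contained loopy sketch (a toy kernel, not sumpy's actual m2l one) of the pattern the warning asks for: the writer instruction gets an id and the consumer declares dep= on it at kernel creation time, so the single-writer heuristic has nothing to add afterwards.

```python
# Toy loopy kernel (illustrative only): "update" declares its dependency
# on "set_d" explicitly via dep= when the kernel is created.
import numpy as np
import loopy as lp

knl = lp.make_kernel(
    "{[i]: 0 <= i < n}",
    """
    d[i] = 2*i           {id=set_d}
    out[i] = d[i] + 1    {id=update,dep=set_d}
    """,
    lang_version=(2018, 2),
)
knl = lp.add_dtypes(knl, {"d": np.float64, "out": np.float64})
print(knl)
```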

alexfikl (Contributor) commented Aug 6, 2022

For a reproducer of my slowness claim, this should work:
https://gist.github.com/alexfikl/d16750b2407ccbc064151976c5473c53
It just evaluates the 3D Stokeslet using the Laplace kernel (Tornberg).

EDIT: Just started running it pre-#114 and it seems to be equally slow. I'll post the numbers once it finishes 😢

EDIT2: I take that back; something definitely seems off. Pre-#114 it was giving

elapsed: 2094.80s wall 0.91x CPU
elapsed: 1917.34s wall 1.01x CPU

so about 30 min to compile and evaluate the Stokeslet, but now it's over an hour. I'm running this on dunkel with everything installed through conda.

EDIT3: OK, something is very fishy. I just ran that script on main and got

elapsed: 23082.66s wall 1.00x CPU

which is somewhere around 6 hours! Let me know if you get a chance to take a look or try it out; maybe there's something silly in there :\
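
In case it helps interpret the numbers: the "wall / x CPU" lines can be produced with a wrapper along these lines (a sketch with a placeholder workload; the gist times the actual Stokeslet bind/eval instead).

```python
# Sketch of the wall-clock / CPU-ratio measurement behind lines like
# "elapsed: 2094.80s wall 0.91x CPU"; sleep() stands in for the real workload.
import time


def timed(f):
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = f()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    print(f"elapsed: {wall:.2f}s wall {cpu / wall:.2f}x CPU")
    return result


timed(lambda: time.sleep(1.0))  # placeholder for the Stokeslet evaluation
```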

isuruf (Collaborator) commented Aug 6, 2022

Warnings about write races are false positives. See inducer/loopy#564

isuruf (Collaborator) commented Aug 6, 2022

The macOS failure is a segfault in pocl.

inducer (Owner, Author) commented Aug 6, 2022

Are you able to get a backtrace? Is it in the pocl runtime or in a kernel?

isuruf (Collaborator) commented Aug 6, 2022

(lldb) bt
* thread #4, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000ff87240)
  * frame #0: 0x000000014f69227e VkFFT_main.so`_pocl_kernel_VkFFT_main_workgroup + 7886
    frame #1: 0x0000000136ce7c47 libpocl-devices-pthread.so`work_group_scheduler + 727
    frame #2: 0x0000000136ce723e libpocl-devices-pthread.so`pocl_pthread_driver_thread + 334
    frame #3: 0x00007fff5137c661 libsystem_pthread.dylib`_pthread_body + 340
    frame #4: 0x00007fff5137c50d libsystem_pthread.dylib`_pthread_start + 377
    frame #5: 0x00007fff5137bbf9 libsystem_pthread.dylib`thread_start + 13

inducer (Owner, Author) commented Aug 6, 2022

So does pocl miscompile vkfft?

inducer (Owner, Author) commented Aug 7, 2022

Is there an easy way to reproduce this? Will it reproduce on the CEESD M1? Should we start thinking about reverting #114 for now?

inducer (Owner, Author) commented Aug 7, 2022

Or maybe turn off vkfft on Mac?

isuruf (Collaborator) commented Aug 7, 2022

I can reproduce on appletini.

isuruf (Collaborator) commented Aug 7, 2022

Running the test with oclgrind passes and shows no out-of-bounds accesses.

isuruf (Collaborator) commented Aug 7, 2022

VkFFT_main.so in the backtrace is the OpenCL kernel compiled to a binary by pocl.

isuruf (Collaborator) commented Aug 7, 2022

There's a warning just before the segfault (which is due to a bad access), but I don't think the two are related.

POCL: in fn clEnqueueNDRangeKernel at line 282:
  |   WARNING |  readonly buffer used as kernel arg, but arg type is not const

isuruf (Collaborator) commented Aug 7, 2022

Using an unoptimized loopy kernel instead of VkFFT makes the test pass. See #130.
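
The shape of that fix is roughly a backend switch like the sketch below, falling back to a loopy-generated FFT where the pocl-compiled VkFFT kernel is known to crash. The names here are illustrative, not necessarily sumpy's actual API.

```python
# Hypothetical sketch of an FFT backend switch in the spirit of #130;
# names are illustrative, not sumpy's actual API.
import platform
import sys
from enum import Enum, auto


class FFTBackend(Enum):
    pyvkfft = auto()
    loopy = auto()


def default_fft_backend() -> FFTBackend:
    # x86 macOS is where the pocl-compiled VkFFT kernel segfaulted.
    if sys.platform == "darwin" and platform.machine() == "x86_64":
        return FFTBackend.loopy
    return FFTBackend.pyvkfft
```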

inducer (Owner, Author) commented Aug 7, 2022

Thanks! Does this reproduce when calling pyvkfft on its own? I am asking with an eye towards potentially reporting this upstream to pocl.

isuruf (Collaborator) commented Aug 7, 2022

Yes, running the example https://github.com/vincefn/pyvkfft/blob/master/examples/opencl-test.py segfaults.
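
For reference, a stripped-down standalone check along the lines of that example looks roughly like this (assuming pyvkfft's VkFFTApp OpenCL API; treat it as a sketch rather than the exact upstream script).

```python
# Minimal pyvkfft/OpenCL smoke test, a reduced sketch of the upstream
# opencl-test.py example; app.fft() is where the pocl-compiled kernel
# segfaulted on macOS.
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyvkfft.opencl import VkFFTApp

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = cla.to_device(queue, np.random.rand(256, 256).astype(np.complex64))
app = VkFFTApp(a.shape, a.dtype, queue=queue, ndim=2)
b = app.fft(a)
print(abs(b.get()).sum())
```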

isuruf (Collaborator) commented Aug 7, 2022

Running the example as a script doesn't reproduce the issue, but opening a Jupyter terminal and running the code does. (No idea why.)

isuruf added a commit to isuruf/sumpy that referenced this issue Aug 8, 2022
inducer (Owner, Author) commented Aug 9, 2022

@hirish99 reported another VkFFT issue occurring on a Mac: https://gist.github.com/hirish99/16d2888092595283b0f698bf5d8106c0

I do not know whether this is Apple silicon or not.

inducer changed the title from "Funny stuff happening since FFT/vkfft merge" to "Loads of issues resulting from use of VkFFT" on Aug 9, 2022
hirish99 (Contributor) commented Aug 9, 2022

When I try to run helmholtz-dirichlet.py on my Mac, I run into the failure shown in this gist: https://gist.github.com/hirish99/16d2888092595283b0f698bf5d8106c0. (System details, in case they matter: macOS Catalina 10.15.7, 2.6 GHz 6-core Intel Core i7, 16 GB 2400 MHz DDR4, Radeon Pro 560X 4 GB / Intel UHD Graphics 630 1536 MB.)

inducer (Owner, Author) commented Aug 9, 2022

Quite possibly this is another symptom of the presumed miscompilation described in pocl/pocl#1084.

isuruf (Collaborator) commented Aug 9, 2022

hirish99 (Contributor) commented Aug 9, 2022

isuruf (Collaborator) commented Aug 9, 2022

Thanks. What do you get when you run ./a.out; echo $?

hirish99 (Contributor) commented Aug 9, 2022

I have not modified sumpy yet, btw. I updated the gist to show the output of ./a.out; echo $?.

isuruf (Collaborator) commented Aug 9, 2022

Thanks. Can you try the updated program at https://gist.github.com/isuruf/17f6b210cf4cf8c8b103c18e155e00d6?

hirish99 (Contributor) commented Aug 9, 2022

inducer pushed a commit that referenced this issue Aug 9, 2022
* use loopy fft

* fix inverse

* Implement broadcasting FFT

* use enum for fft backend

* Add gh-129 link

* unit test for loopy_fft

* Unit test for loopy_fft and fix warnings

* don't use vkfft only if x86 mac

* Add missing import

* Fix platform.machine()
inducer (Owner, Author) commented Aug 9, 2022

@hirish99 #130 just landed. Could you check whether this resolves the issue you ran into?

inducer (Owner, Author) commented Aug 10, 2022

Pytential picked up a failure as well on Gitlab CI: https://gitlab.tiker.net/inducer/sumpy/-/jobs/443459

Edit: That seems to be intermittent. https://gitlab.tiker.net/inducer/sumpy/-/pipelines/324267

inducer (Owner, Author) commented Aug 17, 2022

@alexfikl #132 avoids at least one instance of lengthy checks during Loopy "scheduling" for me. How does it do on the slowness you report?

alexfikl (Contributor) commented Aug 17, 2022

> @alexfikl #132 avoids at least one instance of lengthy checks during Loopy "scheduling" for me. How does it do on the slowness you report?

Can confirm that it works around the slowness nicely for that Stokeslet case! I just ran #132 against fd355eb (before pyvkfft) with the script from #129 (comment) and got

fd355eb: elapsed: 3351.24s wall 1.04x CPU
#132:    elapsed: 728.44s wall 1.02x CPU

So that seems to be even faster than before.

inducer (Owner, Author) commented Aug 17, 2022

#133 could help further.

FWIW, #132 and #133 are workarounds more than solutions. Instead, we should fix the code generation to be in the style of #131.

alexfikl (Contributor) commented

> #133 could help further.

Yep, ran the same benchmark script and got

#133:    elapsed: 427.10s wall 0.98x CPU

inducer (Owner, Author) commented Aug 18, 2022

Between #132 and #131, I think we've more or less dealt with the slowness. I've checked off that item up top.

inducer (Owner, Author) commented Aug 18, 2022

That leaves the intermittent VkFFT errors. I'm still a bit lost there. @isuruf mentioned he also can't seem to reproduce them.

isuruf (Collaborator) commented Aug 18, 2022

The macOS errors are gone because we don't use VkFFT on macOS anymore.

As for the VkFFT OpenCL compile failure in https://gitlab.tiker.net/inducer/sumpy/-/jobs/443459, I've tried many times and failed to reproduce it. The only explanation I can think of is that the machine was particularly busy and the OS killed the pocl process compiling the OpenCL kernel.

inducer (Owner, Author) commented Aug 18, 2022

Let's see how often they recur. I feel like I've seen them more frequently than could be explained by heavy machine load.

alexfikl (Contributor) commented

That pipeline was running on dunkel from what I can tell. I've been running my Stokes stuff there too with the latest main and haven't seen any crashes at all, so I have no idea what could be causing them.
