
Speedups #15

Merged
merged 6 commits into from
May 17, 2023

Conversation

milankl
Contributor

@milankl milankl commented May 10, 2023

implements some ideas of #14

@milankl
Contributor Author

milankl commented May 10, 2023

With the last commit, the RK4 integration (excluding the comp_u_v_eta_t(nx, rhs, params, interp, grad, advec) call) is allocation-free

julia> ExplicitSolver.integrate(10,100,100);
  0.284759 seconds

however, computing the tendencies currently allocates 16MB on every call

julia> using BenchmarkTools
julia> @btime ExplicitSolver.advance($a...)
  16.884 ms (424 allocations: 16.08 MiB)

Here, a is a Tuple of all the arguments that advance takes, meaning that 1000 time steps would allocate 16GB... 🙈

@milankl
Contributor Author

milankl commented May 10, 2023

I think I'm starting to understand: operations like

    rhs.h .= rhs.eta1 .+ params.H 

    rhs.U .= rhs.u1 .* rhs.h_u 
    rhs.V .= rhs.v1 .* rhs.h_v 

    rhs.kinetic_sq .= sqrt.(rhs.kinetic)

can be done in-place thanks to the broadcasting dot; however, the sparse matrix-vector multiplications currently cannot be done in-place, i.e. an operation like

rhs.h_u .= interp.ITu * rhs.h

conceptually does something more like this:

temporary_array = interp.ITu * rhs.h
rhs.h_u .= temporary_array
garbage_collect(temporary_array)

I know the Python swm code does the same, but I'd need to benchmark that to see how fast this gets in comparison.

@milankl milankl marked this pull request as draft May 10, 2023 22:57
@milankl
Contributor Author

milankl commented May 11, 2023

One way to avoid that is to use SparseArrays.mul!, which has an interface for in-place sparse matrix-dense vector multiplication, as follows

julia> using SparseArrays, BenchmarkTools
julia> M = spdiagm(rand(100));
julia> v = ones(100);
julia> o = zero(v);
julia> function f!(o,M,v)     # what is currently done
           o .= M*v
       end
f! (generic function with 1 method)

julia> function g!(o,M,v)     # in-place version calling mul! directly
           SparseArrays.mul!(o,M,v,true,false)
       end
g! (generic function with 1 method)

julia> @btime f!($o,$M,$v);    # currently
  215.717 ns (1 allocation: 896 bytes)

julia> @btime g!($o,$M,$v);    # allocation free
  158.479 ns (0 allocations: 0 bytes)

The speed difference here is not massive, but given that this is done over and over again, you can quickly allocate the >20GB in ~40s as mentioned in #15, so the overall savings are likely >2x

@milankl
Contributor Author

milankl commented May 11, 2023

@swilliamson7 I thought it might come in handy to have a macro that translates c = A*b into the in-place call to SparseArrays.mul!, so that we can continue to write c = A*b but directly store the result in c and avoid the problem outlined above (see the sketch below).
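A minimal sketch of how such a macro could look (illustrative only, not necessarily the implementation that ends up in the repo; the name @inplacemul matches what's used further down):

using SparseArrays

macro inplacemul(expr)
    # expects an assignment of the form  c = A*b
    @assert expr.head == :(=) "expected an assignment like c = A*b"
    c, rhs = expr.args[1], expr.args[2]
    @assert rhs isa Expr && rhs.head == :call && rhs.args[1] == :* "expected a product A*b on the right-hand side"
    A, b = rhs.args[2], rhs.args[3]
    # rewrite into the allocation-free 5-argument mul!, i.e. c = A*b*true + c*false
    return esc(:(SparseArrays.mul!($c, $A, $b, true, false)))
end

@inplacemul c = A * b    # stores A*b directly in c, no temporary array

That way we keep the readable c = A*b notation in the tendency code but get the in-place mul! call after macro expansion.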

@milankl
Contributor Author

milankl commented May 11, 2023

The rhs (without advection) is now allocation-free:

julia> M = ExplicitSolver.integrate(10,100,100);
  6.769857 seconds

@milankl
Contributor Author

milankl commented May 11, 2023

The last commit implements the Sadourny advection scheme, just to illustrate how this can be done allocation-free too. However, this line

@inplacemul adv_v = Iqv * q  # v-component -qhu

is currently commented out as I'm getting a DimensionMismatch error; could you check the Iqv/Ivq interpolation operators?
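One quick way to narrow that down (just a sketch, using the names from the line above):

size(Iqv)                    # (rows, columns) = (output length, input length) of the operator
(length(adv_v), length(q))   # must equal size(Iqv) for adv_v = Iqv * q to work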

@milankl
Contributor Author

milankl commented May 11, 2023

With the last commit, which moves the dissipative terms from RK4 to Euler (option 1 in #15), we're at

julia> M = ExplicitSolver.integrate(10,100,100);
  5.513422 seconds

So almost 10x faster than before. Sure, Arakawa and Lamb would make it a bit slower again.

@swilliamson7
Collaborator

Sorry it's taken so long for me to get to this, other work came up. From what I can see the code actually runs substantially faster now, which is amazing! I'm going to try running the changes on my computer, and am specifically curious whether Enzyme can differentiate through the macro you wrote (pretty sure it should be possible; I don't see any reason it wouldn't be). Referring to the issue with the interpolation operators, that should be fixed now (I had flipped the order in which they were passed to the structure by accident), thanks for catching it.

Another question: is there a reason this was only a draft pull request? Or, better phrased, I'm not quite sure what it means that it was opened as a draft, other than that it's a work in progress...

@milankl
Contributor Author

milankl commented May 17, 2023

No worries!

  • Enzyme: I believe there shouldn't be a difference for Enzyme because, as far as I understand it, it works on the compiler level, and a macro is one of the first things to be expanded, so very high up in Julia's parsing chain (see the sketch below) - but I might be wrong!
  • draft: I was expecting you to add a commit with a fix for the Iqv/Ivq interpolation operators, hence it shouldn't be merged immediately as it would break your code; and because you haven't set up any tests yet, it might have looked ready to merge had it not been marked as a draft.
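For what it's worth, you can check what Enzyme would actually see with @macroexpand; with a macro along the lines of the sketch above, the assignment is already rewritten into a plain mul! call before anything is compiled (illustrative output):

julia> @macroexpand @inplacemul c = A * b    # would show something like
:(SparseArrays.mul!(c, A, b, true, false))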

@swilliamson7
Collaborator

Ah, that makes sense! I fixed it a couple days ago but I'll double check to be safe

@milankl
Contributor Author

milankl commented May 17, 2023

If you didn't do it here in this PR then git will throw a conflict; let's merge it correctly

@milankl milankl marked this pull request as ready for review May 17, 2023 15:45
@milankl
Contributor Author

milankl commented May 17, 2023

Hmm, there are no conflicts with the base branch; which branch did you apply the changes to?

@swilliamson7
Collaborator

I'd done it in a different branch and merged to main, if that makes sense? It's here: #16.

I'm still learning GitHub, if you can't tell... :)

@milankl
Contributor Author

milankl commented May 17, 2023

Ah yeah, sorry, I didn't affect the lines you changed in that PR, that's why git doesn't see a conflict - so smart ;)

@milankl
Contributor Author

milankl commented May 17, 2023

Then feel free to merge it!

@swilliamson7 swilliamson7 merged commit 6cbc0bf into DJ4Earth:main May 17, 2023
@swilliamson7
Collaborator

Not sure of the best place to contact you, but it seems that something in the speedups made the model unstable: I was getting NaNs for both velocity fields and the displacement

For now I've reverted to the code as it was before, but do you know what might have caused it / was it diverging when you ran it on your computer? I saw that the assertion checks had been commented out, so that might have hidden the problem

@milankl
Contributor Author

milankl commented May 17, 2023

Sorry, yeah, put them back in again! What's your default diffusion? Maybe that's too low? If you were at the edge of stability before, then evaluating the diffusion only once every time step may push you over the edge, that's true. What does the vorticity field look like?

@swilliamson7
Collaborator

It's all good! For diffusion I mimicked the choice in your Python code, which ultimately comes out to $$\nu = \frac{128 \cdot 540 \cdot \max(\Delta x, \Delta y)^2}{\min(N_x, N_y)}.$$ For a 70 by 70 grid this turns into something on the order of 1e12, so rather large (unless I'm misunderstanding what you mean by diffusion? I think it's the term with the viscosity coefficient). From what I understood this choice should be sufficient to ensure stability, but I'm definitely still learning how to choose these parameters

The vorticity is small; after 10 days on a 70 by 70 grid it's on the order of 1e-7

@milankl
Contributor Author

milankl commented May 18, 2023

viscosity and diffusion are often used interchangeably in the ocean/atmosphere modelling community. However, there's quite a nuance to it:

  • If you were to simulate a viscous fluid (Newtonian) then the viscous forces take the form of a Laplacian (divergence of the stress tensor, $\nabla \cdot \nu \nabla = \nu \nabla^2$) with a viscosity coefficient that derives from the material properties of the fluid (think honey vs water).
  • Diffusion, however, is the general concept of spreading things out (e.g. heat diffusion in a solid), so you can think of viscosity as being the reason that momentum diffuses inside a fluid.

Now, when we simulate fluid motion at a resolution larger than the viscous scale, where the actual viscosity of water/air would remove energy, we need something that balances the energy cascade at the grid scale. Otherwise we'd pile up energy there and the model blows up due to numerical instability. One way of doing this is adding a Laplacian with a coefficient that, in most cases (unless you do DNS), would need to be several orders of magnitude larger than the viscosity coefficient that would represent the viscous forces inside water/air. So in simulations we need diffusion for numerical stability, which, however, has conceptually little to do with viscosity, unless you take honey as a good approximation for water/air. But there's somewhat of an equivalence here, hence the ambiguity most people have between viscosity and diffusion. I try to separate these terms by what they are supposed to do: represent viscous forces or ensure numerical stability.

For viscosity there's no reason to use a biharmonic Laplacian: it's the divergence of the stress tensor, not the Laplacian of the divergence of the stress tensor. For numerical stability, however, there are all sorts of good arguments why you'd want a higher power of your Laplacian: it's stronger on smaller scales and leaves the larger scales more unaffected. And you may want to choose your diffusion coefficient as a function of the state of your flow (what you're optimising ;) to only have higher diffusion where you actually need it for stability. But if you think of the diffusion in your model as viscosity, then there's little reason to go from $\nu\nabla^2$ as it is in the Navier-Stokes equations to something more like $\nu(u,v) \nabla^4$, which would be a biharmonic diffusion operator with a coefficient that depends on the current state of $(u,v)$. I hope this makes sense?

@milankl
Contributor Author

milankl commented May 18, 2023

Regarding the scaling of the diffusion coefficient, I took $\nu_0 = 540 m^2/s$ for the Laplacian from Fenwick's 2015 paper (cited in my thesis somewhere) at 128x128 grid points with $\Delta x = 30km$. In the shallow water system it's fine to linearly decrease the harmonic diffusion $\nu_2$ with increased resolution, i.e. at 256x256 you can use $\nu_2 = 270 m^2/s$, at 512x512 you can use $\nu_2 = 135 m^2/s$, etc. The reason my grid points were powers of two is only that Fenwick's model was spectral; going to 100x100 etc. is absolutely fine. So the scaling for the harmonic should be (domain size $L_x \times L_y$ unchanged)
$$\nu_2 = \frac{128}{\min(N_x, N_y)} \cdot 540 m^2/s$$
The $\min$ is just so that you have the diffusion determined by the larger grid spacing, not the smaller. You then multiply by the grid scale squared to go from the harmonic $\nu_2$ to $\nu_4$ in units of $m^4/s$. I probably wrote exactly that in my thesis, so sorry if I'm just repeating myself.
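As a small sanity check, here's that scaling written out (a sketch with illustrative variable names, assuming a 3840 km square domain for concreteness):

nx = ny = 70                   # grid points
Lx = Ly = 3840e3               # domain size in m (unchanged)
dx, dy = Lx/nx, Ly/ny          # grid spacing in m

nu2 = 128/min(nx, ny) * 540    # harmonic coefficient in m^2/s, here about 990 m^2/s
nu4 = nu2 * max(dx, dy)^2      # biharmonic coefficient in m^4/s, here about 3e12 m^4/s

which is roughly the order-1e12 value quoted above for the 70 by 70 grid (the exact number depends on the domain size).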

@milankl
Contributor Author

milankl commented May 18, 2023

For comparison with ShallowWaters.jl

using ShallowWaters
run_model(nx=100,output=true,L_ratio=1,Lx=3840e3,g=9.81,bc="nonperiodic",wind_forcing_x="double_gyre",seasonal_wind_x=false,topography="flat",Ndays=200)
Starting ShallowWaters run 1 on Wed, 17 May 2023 23:00:17
Model integration will take approximately 21.0s,
and is hopefully done on Wed, 17 May 2023 23:00:38
100% Integration done in 16.8s.
All data stored.

produces

out.mp4

So interesting stuff should only start happening after 10 days (the simulation runs for 200)! I can always recommend looking at videos. I use ncvis to create the frames and then ffmpeg to turn them into a video:

ffmpeg -r 30 -i ncvis%06d.png -c:v libx264 -crf 20 -pix_fmt yuv420p out.mp4

unfortunately, ncvis does not seem to adapt to the square aspect ratio...

@swilliamson7
Collaborator

This was extremely helpful! I think I'm starting to understand a little bit. I'm going to be confusing and, from here on, move back to the notation you used in your original Python code.

If I'm understanding correctly, what you're saying is that the choice of $\nu_B$ (which is what I wrote out in my first comment) is only because you'd taken a biharmonic Laplacian for your diffusion term. In my code I didn't intend to implement a biharmonic Laplacian; I thought I was implementing the simpler term $\nu \nabla^2 \mathbf{u}$ to avoid building the stress tensor, but I now see that I'd always been using LLu in my code, so I was accidentally implementing a biharmonic diffusion. Your explanation also helps me better understand the documentation: I was admittedly a little confused why there was both a $\nu_A$ and a $\nu_B$, but now I see that the choice was based on what was chosen to represent diffusion in the model. I'm still learning the physics behind all of this, but the first comment you wrote out definitely helps!

Regardless though, since I'm taking $h$ to be a constant, I think what I did should have been consistently stable. Namely, I was choosing $\nu_B \cdot \nabla^4 \mathbf{u}$ (eq. 22 in your documentation for constant $h$) in my code, which based on the above explanation should have made the code more, not less, stable, yeah? So I guess I'm still confused about why computing the diffusion only once every time step made the code unstable, unless there's some explanation I'm missing

I'm going to re-add the changes and see if moving to $\nu_2 \cdot \nabla^2 \mathbf{u}$ changes anything, but I'm inclined to believe that I'll still diverge

@milankl
Contributor Author

milankl commented May 19, 2023

Maybe we've introduced another typical source of error: I usually think of the diffusion coefficients as being positive but if you go from harmonic to biharmonic you'd need to flip the sign.

The 1D discrete Laplacian has a 1 -2 1 stencil, so you take away from the middle and put it to the sides, which is what you want for diffusion, so you'd do $+\nu\nabla^2$ in the equations. The 1D discrete biharmonic Laplacian (4th derivative) however is 1 -4 6 -4 1, with a positive coefficient in the middle. So you'll need to flip the sign to get a diffusion (and not anti-diffusion!): $-\nu\nabla^4$
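A quick numerical way to see that sign flip (just a sketch, independent of the model code):

using SparseArrays

n = 9
lap = spdiagm(-1 => ones(n-1), 0 => -2*ones(n), 1 => ones(n-1))   # 1D Laplacian: 1 -2 1 stencil
biharm = lap * lap                                                # 1D biharmonic: 1 -4 6 -4 1 stencil

Matrix(biharm)[5, 3:7]    # [1.0, -4.0, 6.0, -4.0, 1.0], positive centre coefficient,
                          # so the diffusive term needs the opposite sign: $-\nu\nabla^4$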

@milankl
Contributor Author

milankl commented May 19, 2023

So I guess I'm still confused about why computing the diffusion only once every timestep made the code unstable, unless there's some explanation I'm missing

Just to say this is numerically definitely possible, as this is what I'm doing in ShallowWaters.jl too ☝🏼 so the question is just how to find the bug 😃

@swilliamson7
Collaborator

swilliamson7 commented May 19, 2023

I don't think it's a sign issue in the Laplacian operator; I've like quadruple-checked that my time derivatives matched the ones in the original Python code 😶 so at least for the implementation of constant $h$ and $\nu_B \cdot \nabla^4 \mathbf{u}$ everything was okay. I would have gotten an incorrect $u_t$ and/or $v_t$ if I was computing $\mathbf{M}$ wrong, right?

find the bug 😃

ah yes, the best part of coding

@swilliamson7
Collaborator

adding that I also quadruple checked every discrete operator as I was building them!

@milankl
Contributor Author

milankl commented May 19, 2023

I don't know how you've checked equalities between the Python swm and your tendencies, but note that the diffusion is really small in the first 10 days and only really kicks in when you start to develop strong gradients in u,v, which isn't the case that early on...

@swilliamson7
Collaborator

The discrete operators are constant, so they're easy to compare. For the time derivatives I just ran the model for some amount of time, I believe maybe 30 days, and then compared to my code for the same setup. I can do this again and see, but I'm confused about why they would match for a set amount of time and only diverge after X days; that seems unlikely to me

@milankl
Contributor Author

milankl commented May 19, 2023

But you're dealing with a chaotic system, that's exactly one of its properties! Unless you are executing exactly the same float operations in exactly the same order, your solutions will always diverge eventually; the question is just when that is. The first tendencies you compute will be dominated by waves: if you were to output every time step you should see gravity waves bouncing around that eventually die out, then there are Kelvin waves that will zoom around the boundaries (where they are confined to) that kick off Rossby waves in the East, which you can even see in the video propagating westward throughout the domain. All of these waves require only a correct implementation of the pressure gradient and Coriolis terms, but the non-linear terms are small and diffusion usually too. So if you compare the tendencies before eddies actually start to develop, you may think they're the same apart from rounding errors even though something is still wrong!

@milankl
Contributor Author

milankl commented May 19, 2023

Can you point me to the code that you are running that blows up? I struggle to find that version in your repo...

@swilliamson7
Collaborator

But you're dealing with a chaotic system, that's exactly one of its properties! Unless you are executing exactly the same float operation in exactly the same order you're solutions will always diverge eventually, the question is just when that is.

Yes yes, I understand that, but if all my forcings, parameters, discrete operators, and initial conditions match then I still don't see why they would diverge. I don't see how the physics in the model would be different if the setups are the same

Can you point me to the code that you are running that blows up? I struggle to find that version in your repo...

Not surprised, it's not currently on the repo, give me a couple minutes to send it

@swilliamson7
Collaborator

swilliamson7 commented May 19, 2023

Okay, I'm finished. Could you do me a favor and open a new draft pull request where we can discuss the code and keep track of changes? I can email a file with all the scripts and stuff that I'd want to use; it has all your changes. I needed to move things around a bit. Putting a couple of notes here on what I've just learned:

(1) Moving my old code to a harmonic Laplacian and $\nu_A$ as the viscosity parameter led to instabilities (which makes me panic a little, not sure if that's warranted), but leaving it as a biharmonic Laplacian with $\nu_B$ was still okay

(2) The speedup code with a harmonic Laplacian and $\nu_A$ was stable where my old code wasn't, which is interesting. This was for 100 days on a 50 by 50 grid

I knew that the diffusion term affected stability; I just always begin to worry when I see my code diverge, I don't like it when things break

@swilliamson7
Collaborator

If you're okay with my request could you send your email?
