
Caching for Repeated Vector Assembly #1029

Closed
hgpeterson opened this issue Sep 9, 2024 · 8 comments

@hgpeterson

hgpeterson commented Sep 9, 2024

Hello,

I was wondering if there is any way to save a cache for assemble_vector(l, V) calls, in a similar way to how the evaluate!(cache, f, x...) function works here?

To explain what I mean, let me set up an example. Suppose you are trying to solve an advection problem for a tracer c(x, t) where the flow field is a function of space and time u = u(x, t). Then you might want to be able to assemble a "right-hand-side" vector for the linear form,

l(d) = ∫( c*d - Δt*u⋅∇(c)*d )dΩ

Since u and c change with time and the advection term is non-linear, we'll need to recompute assemble_vector(l, D) quite frequently in order to step c forward in time. This is very costly for large systems.
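
To make the pattern concrete, here is a rough sketch of the time loop I have in mind (nsteps and solve_step are just placeholders for the number of steps and for whatever linear solve/update produces the new tracer field; c0, u, Δt, D, and dΩ stand for the FE objects described above):

using Gridap

function advect(c0, u, Δt, D, dΩ, nsteps, solve_step)
    c = c0
    for step in 1:nsteps
        # the form depends on the current c (and, in general, u), so it changes every step
        l(d) = ∫( c*d - Δt*u⋅∇(c)*d )*dΩ
        b = assemble_vector(l, D)    # re-assembled from scratch at every time step
        c = solve_step(b)            # placeholder for the solve/update that yields the new c
    end
    return c
end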

Since u is also a FEFunction, shouldn't we be able to store some kind of cache to speed up this assembly? Naively, one could compute the rank-3 tensor

∫( φ_j*φ_i - Δt*ψ_k⋅∇(φ_j)*φ_i )dΩ

for each of the basis functions φ_i, φ_j, and ψ_k and then multiply this by c_j and u_k to get our vector, but the tensor is probably too large for realistic systems. Maybe there is something cheaper memory-wise to save so that assemble_vector does not become such a bottleneck for these problems? I suppose one could throw GridapDistributed at the problem, though I'm not sure how that would work in practice.
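
To spell out the tensor idea above (writing M for the mass matrix and T for the rank-3 tensor): with c = Σ_j c_j*φ_j and u = Σ_k u_k*ψ_k, the entries of the assembled vector split as

b_i = Σ_j M_ij*c_j - Δt*Σ_jk T_ijk*c_j*u_k,   with   M_ij = ∫( φ_j*φ_i )dΩ   and   T_ijk = ∫( ψ_k⋅∇(φ_j)*φ_i )dΩ

so M could be assembled once and for all, but T is the piece that blows up in memory.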

Anyways, perhaps this is something you all have already thought of and I just haven't read the docs well enough... let me know what you think!

-Henry

hgpeterson changed the title from "Caching for Repeated Matrix Assembly" to "Caching for Repeated Vector Assembly" on Sep 9, 2024
@JordiManyer
Member

There are in-place assembly functions provided in the API. In particular, you should have a look at assemble_vector! and assemble_vector_add! (see here)

If you are using an FEOperator, the function residual! provides this functionality for you (using those functions under the hood). TransientFEOperators also do that for you.
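
For example, a minimal sketch (with V your test space and l your linear form; V is used as both trial and test space for simplicity, and depending on the Gridap version you may need to qualify the call as Gridap.FESpaces.assemble_vector!):

a = SparseMatrixAssembler(V, V)
b = zeros(num_free_dofs(V))
assemble_vector!(l, b, a, V)   # recomputes the entries of b in place

# at later time steps, once the form has been updated with the current c and u,
# the same call reuses the storage of b:
assemble_vector!(l, b, a, V)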

@hgpeterson
Author

Hi @JordiManyer,

Thanks for the quick reply! I noticed the assemble_vector! function, but from what I can tell, it doesn't help much for this problem. Here's an MWE to explain what I mean:

using Gridap
using BenchmarkTools

n = 10
domain = (0, 2π, 0, 2π)
partition = (n, n)
model = CartesianDiscreteModel(domain, partition)

order = 2
reffe = ReferenceFE(lagrangian, Float64, order)
D = TestFESpace(model, reffe, conformity=:H1, dirichlet_tags="boundary")
C = TrialFESpace(D, 0)

degree = order^2
Ω = Triangulation(model)
dΩ = Measure(Ω, degree)

u = VectorValue(1, 1)
c = interpolate_everywhere(x->sin(x[1])*cos(x[2]), C)
l(d) = ∫( u⋅∇(c)*d )*dΩ

@benchmark assemble_vector($l, $D)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  281.972 μs …   5.908 ms  ┊ GC (min … max): 0.00% … 90.77%
#  Time  (median):     304.259 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   325.728 μs ± 242.140 μs  ┊ GC (mean ± σ):  5.12% ±  6.34%

#          ▄▆███▆▅▄▃▂▁                                             
#   ▁▁▁▁▂▄▆███████████▇▆▅▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁ ▃
#   282 μs           Histogram: frequency by time          386 μs <

#  Memory estimate: 239.08 KiB, allocs estimate: 2110.

a = SparseMatrixAssembler(D, D)
rhs = zeros(C.space.nfree)
@benchmark Gridap.FESpaces.assemble_vector!($l, $rhs, $a, $D)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  269.974 μs …   3.949 ms  ┊ GC (min … max): 0.00% … 87.31%
#  Time  (median):     304.235 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   325.446 μs ± 239.634 μs  ┊ GC (mean ± σ):  5.11% ±  6.34%

#               ▃▆███▇▆▅▃▂▁                                        
#   ▁▁▁▁▁▁▁▁▁▂▄█████████████▇▆▅▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
#   270 μs           Histogram: frequency by time          385 μs <

#  Memory estimate: 235.86 KiB, allocs estimate: 2101.

The timings and allocations are the same between the two methods. Am I missing something here?

@JordiManyer
Member

Those "caches" are allocations for a single cell. We allocate the elemental matrix/vector for a single cell (and all intermediary steps) and reuse that for every cell. Not much more can be optimised, unless you go very deep into the code.

I am quite sure that if you increase the size of your problem, those allocations will stay the same (or they should). It's a constant overhead, so to speak.

Something you can try is using "performance mode", which deactivates all checks in the code. Those checks can be relatively expensive, especially for small problems. Try setting

Gridap.Helpers.set_performance_mode()

Then close Julia and open it again. The code should recompile without any checks.

@hgpeterson
Author

I see. You're right that the memory allocations for the in-place method stay the same as I increase n. Unfortunately, this doesn't lead to any noticeable speed-up; it still takes just as long as the memory-intensive method.

I'm okay with this computation taking a while the first time, but since it needs to be called at least once every time step, I was hoping there would be some way to store a cache that would improve performance... This should be possible, since it's really just doing the same computations over and over (see the integral in the original post). Alternatively, if you know of a way to parallelize this function, please let me know.

I'll give set_performance_mode() a try, but it looks like it was not yet added to the version of Gridap that I am currently using.

Thanks again!

@JordiManyer
Member

JordiManyer commented Sep 11, 2024

If you are willing to go low-level, these are the functions you have to look at:

function numeric_loop_vector!(b,a::SparseMatrixAssembler,vecdata)
  strategy = get_assembly_strategy(a)
  for (cellvec, _cellids) in zip(vecdata...)
    cellids = map_cell_rows(strategy,_cellids)
    if length(cellvec) > 0
      rows_cache = array_cache(cellids)
      vals_cache = array_cache(cellvec)
      vals1 = getindex!(vals_cache,cellvec,1)
      rows1 = getindex!(rows_cache,cellids,1)
      add! = AddEntriesMap(+)
      add_cache = return_cache(add!,b,vals1,rows1)
      caches = add_cache, vals_cache, rows_cache
      _numeric_loop_vector!(b,caches,cellvec,cellids)
    end
  end
  b
end

The caches you are looking for are created by

rows_cache = array_cache(cellids)
vals_cache = array_cache(cellvec)
add_cache = return_cache(add!,b,vals1,rows1)
caches = add_cache, vals_cache, rows_cache

In order, these are the caches for

  1. Accessing the row ids where each cell has to be assembled
  2. Computing the reference vectors for each cell
  3. Adding the reference vectors to the global vector

When all checks are disabled, most allocations should come from those caches. If nothing changes in your problem, those caches could be reused.
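
For reference, here is a rough sketch of how those caches could be built once and the inner loop driven directly, using the D and l from your MWE above. It relies on internal, version-dependent names (collect_cell_vector, map_cell_rows, AddEntriesMap, _numeric_loop_vector!), assumes a single domain contribution in vecdata, and is meant as a starting point rather than a supported API:

using Gridap
using Gridap.Arrays, Gridap.FESpaces

a  = SparseMatrixAssembler(D, D)
b  = zeros(num_free_dofs(D))
dv = get_fe_basis(D)

vecdata = collect_cell_vector(D, l(dv))
cellvec, _cellids = first(zip(vecdata...))
cellids = FESpaces.map_cell_rows(FESpaces.get_assembly_strategy(a), _cellids)

# build the three caches once: row ids, cell vectors, and global addition
rows_cache = array_cache(cellids)
vals_cache = array_cache(cellvec)
add! = Arrays.AddEntriesMap(+)   # module path may differ between Gridap versions
add_cache = return_cache(add!, b, getindex!(vals_cache, cellvec, 1),
                         getindex!(rows_cache, cellids, 1))
caches = add_cache, vals_cache, rows_cache

# whenever the cell vector has to be re-assembled (re-collect vecdata if c or u
# changed; the caches can be reused as long as the array types stay the same):
fill!(b, zero(eltype(b)))
FESpaces._numeric_loop_vector!(b, caches, cellvec, cellids)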

@JordiManyer
Member

JordiManyer commented Sep 11, 2024

Still, I maintain that when dealing with real problems (i.e. 10^5–10^6 cells, non-Cartesian meshes) those allocations should be negligible compared to the weak-form evaluation, integration, and assembly.

@JordiManyer
Member

JordiManyer commented Sep 11, 2024

> I'll give set_performance_mode() a try, but it looks like it was not yet added to the version of Gridap that I am currently using.

It's new to Gridap v0.18, and it's still not fully taken advantage of (I have a branch somewhere with more changes to come soon).

@hgpeterson
Author

> Still, I maintain that when dealing with real problems (i.e. 10^5–10^6 cells, non-Cartesian meshes) those allocations should be negligible compared to the weak-form evaluation, integration, and assembly.

Yes, I think you're right; after timing each of the individual steps with a "real" problem, it looks like the _numeric_loop_vector! function here is where the meat of the calculation lies. Perhaps I can try parallelizing this piece.

I still feel that, given the repetitive nature of this problem, there must be some clever way to save time here, but at the moment I'm not familiar enough with the code to see how.

hgpeterson closed this as not planned on Sep 24, 2024