
Caching for Repeated Vector Assembly #1029

Closed
hgpeterson opened this issue Sep 9, 2024 · 8 comments

@hgpeterson

hgpeterson commented Sep 9, 2024

Hello,

I was wondering if there is any way to save a cache for assemble_vector(l, V) calls, in a similar way to how the evaluate!(cache, f, x...) function works here?

To explain what I mean, let me set up an example. Suppose you are trying to solve an advection problem for a tracer c(x, t) where the flow field is a function of space and time u = u(x, t). Then you might want to be able to assemble a "right-hand-side" vector for the linear form,

l(d) = ∫( c*d - Δt*u⋅∇(c)*d )dΩ

Since u and c change with time and the advection term is non-linear, we'll need to recompute assemble_vector(l, D) quite frequently in order to step c forward in time. This is very costly for large systems.
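
To make the pattern concrete, here is a rough sketch of the time loop I have in mind (nsteps and solve_step are just placeholders for the number of steps and for whatever linear solve/update produces the new tracer field; c0, u, Δt, D, and dΩ stand for the FE objects described above):

using Gridap

function advect(c0, u, Δt, D, dΩ, nsteps, solve_step)
    c = c0
    for step in 1:nsteps
        # the form depends on the current c (and, in general, u), so it changes every step
        l(d) = ∫( c*d - Δt*u⋅∇(c)*d )*dΩ
        b = assemble_vector(l, D)    # re-assembled from scratch at every time step
        c = solve_step(b)            # placeholder for the solve/update that yields the new c
    end
    return c
end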

Since u is also a FEFunction, shouldn't we be able to store some kind of cache to speed up this assembly? Naively, one could compute the rank-3 tensor

∫( φ_j*φ_i - Δt*ψ_k⋅∇(φ_j)*φ_i )dΩ

for each of the basis functions φ_i, φ_j, and ψ_k and then multiply this by c_j and u_k to get our vector, but the tensor is probably too large for realistic systems. Maybe there is something cheaper memory-wise to save so that assemble_vector does not become such a bottleneck for these problems? I suppose one could throw GridapDistributed at the problem, though I'm not sure how that would work in practice.
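
To spell out the tensor idea above (writing M for the mass matrix and T for the rank-3 tensor): with c = Σ_j c_j*φ_j and u = Σ_k u_k*ψ_k, the entries of the assembled vector split as

b_i = Σ_j M_ij*c_j - Δt*Σ_jk T_ijk*c_j*u_k,   with   M_ij = ∫( φ_j*φ_i )dΩ   and   T_ijk = ∫( ψ_k⋅∇(φ_j)*φ_i )dΩ

so M could be assembled once and for all, but T is the piece that blows up in memory.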

Anyways, perhaps this is something you all have already thought of and I just haven't read the docs well enough... let me know what you think!

-Henry

hgpeterson changed the title from "Caching for Repeated Matrix Assembly" to "Caching for Repeated Vector Assembly" on Sep 9, 2024
@JordiManyer
Member

There are in-place assembly functions provided in the API. In particular, you should have a look at assemble_vector! and assemble_vector_add! (see here)

If you are using an FEOperator, the function residual! provides this functionality for you (using those functions under the hood). TransientFEOperators also do that for you.
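
For example, a minimal sketch (with V your test space and l your linear form; V is used as both trial and test space for simplicity, and depending on the Gridap version you may need to qualify the call as Gridap.FESpaces.assemble_vector!):

a = SparseMatrixAssembler(V, V)
b = zeros(num_free_dofs(V))
assemble_vector!(l, b, a, V)   # recomputes the entries of b in place

# at later time steps, once the form has been updated with the current c and u,
# the same call reuses the storage of b:
assemble_vector!(l, b, a, V)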

@hgpeterson
Author

Hi @JordiManyer,

Thanks for the quick reply! I noticed the assemble_vector! function, but from what I can tell, it doesn't help much for this problem. Here's an MWE to explain what I mean:

using Gridap
using BenchmarkTools

n = 10
domain = (0, 2π, 0, 2π)
partition = (n, n)
model = CartesianDiscreteModel(domain, partition)

order = 2
reffe = ReferenceFE(lagrangian, Float64, order)
D = TestFESpace(model, reffe, conformity=:H1, dirichlet_tags="boundary")
C = TrialFESpace(D, 0)

degree = order^2
Ω = Triangulation(model)
dΩ = Measure(Ω, degree)

u = VectorValue(1, 1)
c = interpolate_everywhere(x->sin(x[1])*cos(x[2]), C)
l(d) = ∫( u⋅∇(c)*d )*dΩ

@benchmark assemble_vector($l, $D)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  281.972 μs …   5.908 ms  ┊ GC (min … max): 0.00% … 90.77%
#  Time  (median):     304.259 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   325.728 μs ± 242.140 μs  ┊ GC (mean ± σ):  5.12% ±  6.34%

#          ▄▆███▆▅▄▃▂▁                                             
#   ▁▁▁▁▂▄▆███████████▇▆▅▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁ ▃
#   282 μs           Histogram: frequency by time          386 μs <

#  Memory estimate: 239.08 KiB, allocs estimate: 2110.

a = SparseMatrixAssembler(D, D)
rhs = zeros(C.space.nfree)
@benchmark Gridap.FESpaces.assemble_vector!($l, $rhs, $a, $D)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  269.974 μs …   3.949 ms  ┊ GC (min … max): 0.00% … 87.31%
#  Time  (median):     304.235 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   325.446 μs ± 239.634 μs  ┊ GC (mean ± σ):  5.11% ±  6.34%

#               ▃▆███▇▆▅▃▂▁                                        
#   ▁▁▁▁▁▁▁▁▁▂▄█████████████▇▆▅▅▄▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
#   270 μs           Histogram: frequency by time          385 μs <

#  Memory estimate: 235.86 KiB, allocs estimate: 2101.

The timings and allocations are the same between the two methods. Am I missing something here?

@JordiManyer
Member

Those "caches" are allocations for a single cell. We allocate the elemental matrix/vector for a single cell (and all intermediary steps) and reuse that for every cell. Not much more can be optimised, unless you go very deep into the code.

I am quite sure that if you increase the size of your problem, those allocations will stay the same (or they should). It's a constant overhead, so to speak.

Something you can try is using "performance mode", which deactivates all checks in the code. Those checks can be relatively expensive, especially for small problems. Try setting

Gridap.Helpers.set_performance_mode()

Then close Julia and open it again. The code should recompile without any checks.

@hgpeterson
Author

I see. You're right that the memory allocations for the in-place method stay the same as I increase n. Unfortunately, this doesn't lead to any noticeable speed-up; it still takes just as long as the memory-intensive method.

I'm okay with this computation taking a while the first time, but since it needs to be called at least once every time step, I was hoping there would be some way to store a cache that would improve performance... This should be possible, since it's really just doing the same computations over and over (see the integral in the original post). Alternatively, if you know of a way to parallelize this function, please let me know.

I'll give set_performance_mode() a try, but it looks like it was not yet added to the version of Gridap that I am currently using.

Thanks again!

@JordiManyer
Member

JordiManyer commented Sep 11, 2024

If you are willing to go low-level, these are the functions you have to look at:

function numeric_loop_vector!(b,a::SparseMatrixAssembler,vecdata)
  strategy = get_assembly_strategy(a)
  for (cellvec, _cellids) in zip(vecdata...)
    cellids = map_cell_rows(strategy,_cellids)
    if length(cellvec) > 0
      rows_cache = array_cache(cellids)
      vals_cache = array_cache(cellvec)
      vals1 = getindex!(vals_cache,cellvec,1)
      rows1 = getindex!(rows_cache,cellids,1)
      add! = AddEntriesMap(+)
      add_cache = return_cache(add!,b,vals1,rows1)
      caches = add_cache, vals_cache, rows_cache
      _numeric_loop_vector!(b,caches,cellvec,cellids)
    end
  end
  b
end

The caches you are looking for are created by

rows_cache = array_cache(cellids)
vals_cache = array_cache(cellvec)
add_cache = return_cache(add!,b,vals1,rows1)
caches = add_cache, vals_cache, rows_cache

In order, these are the caches for

  1. Accessing the row ids where each cell has to be assembled
  2. Computing the reference vectors for each cell
  3. Adding the reference vectors to the global vector

When all checks are disabled, most allocations should come from those caches. If nothing changes in your problem, those caches could be reused.
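
For reference, here is a rough sketch of how those caches could be built once and the inner loop driven directly, using the D and l from your MWE above. It relies on internal, version-dependent names (collect_cell_vector, map_cell_rows, AddEntriesMap, _numeric_loop_vector!), assumes a single domain contribution in vecdata, and is meant as a starting point rather than a supported API:

using Gridap
using Gridap.Arrays, Gridap.FESpaces

a  = SparseMatrixAssembler(D, D)
b  = zeros(num_free_dofs(D))
dv = get_fe_basis(D)

vecdata = collect_cell_vector(D, l(dv))
cellvec, _cellids = first(zip(vecdata...))
cellids = FESpaces.map_cell_rows(FESpaces.get_assembly_strategy(a), _cellids)

# build the three caches once: row ids, cell vectors, and global addition
rows_cache = array_cache(cellids)
vals_cache = array_cache(cellvec)
add! = Arrays.AddEntriesMap(+)   # module path may differ between Gridap versions
add_cache = return_cache(add!, b, getindex!(vals_cache, cellvec, 1),
                         getindex!(rows_cache, cellids, 1))
caches = add_cache, vals_cache, rows_cache

# whenever the cell vector has to be re-assembled (re-collect vecdata if c or u
# changed; the caches can be reused as long as the array types stay the same):
fill!(b, zero(eltype(b)))
FESpaces._numeric_loop_vector!(b, caches, cellvec, cellids)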

@JordiManyer
Member

JordiManyer commented Sep 11, 2024

Still, I maintain that when dealing with real problems (i.e. 10^5–10^6 cells, non-Cartesian meshes) those allocations should be negligible compared to the weak-form evaluation, integration, and assembly.

@JordiManyer
Member

JordiManyer commented Sep 11, 2024

> I'll give set_performance_mode() a try, but it looks like it was not yet added to the version of Gridap that I am currently using.

It's new to Gridap v0.18, and it's still not fully taken advantage of (I have a branch somewhere with more changes to come soon).

@hgpeterson
Author

> Still, I maintain that when dealing with real problems (i.e. 10^5–10^6 cells, non-Cartesian meshes) those allocations should be negligible compared to the weak-form evaluation, integration, and assembly.

Yes, I think you're right; after timing each of the individual steps with a "real" problem, it looks like the _numeric_loop_vector! function here is where the meat of the calculation lies. Perhaps I can try parallelizing this piece.

I still feel that, given the repetitive nature of this problem, there must be some clever way to save time here, but at the moment I'm not familiar enough with the code to see how.

hgpeterson closed this as not planned on Sep 24, 2024