Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

50% productivity on particle simulation #140

Open
ocharles opened this issue Sep 13, 2016 · 22 comments
Open

50% productivity on particle simulation #140

ocharles opened this issue Sep 13, 2016 · 22 comments

Comments

@ocharles
Copy link
Collaborator

Hi! I've put together a little benchmark of doing some good old 90s particle simulations: https://github.com/ocharles/reactive-banana-fireworks-benchmark.

Essentially every clock tick I spawn a new particle, which has position/velocity/life time, and I move all particles around on each clock tick. This exhibits some pretty disappointing performance, so hopefully we can work out what's going wrong and get it nice and speedy.

[nix-shell:~/work/physics-2d]$ ./dist/build/physics-2d/physics-2d +RTS -s
benchmarking Sparks
time                 35.56 ms   (33.94 ms .. 37.54 ms)
                     0.984 R²   (0.964 R² .. 0.994 R²)
mean                 32.63 ms   (31.16 ms .. 34.14 ms)
std dev              3.098 ms   (2.399 ms .. 4.069 ms)
variance introduced by outliers: 36% (moderately inflated)

   7,448,025,048 bytes allocated in the heap
   2,634,634,600 bytes copied during GC
       9,666,432 bytes maximum residency (238 sample(s))
         221,744 bytes maximum slop
              29 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     14201 colls,     0 par    2.319s   2.321s     0.0002s    0.0012s
  Gen  1       238 colls,     0 par    0.644s   0.646s     0.0027s    0.0079s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    3.658s  (  3.660s elapsed)
  GC      time    2.963s  (  2.967s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    6.624s  (  6.627s elapsed)

  %GC     time      44.7%  (44.8% elapsed)

  Alloc rate    2,036,032,028 bytes per MUT second

  Productivity  55.3% of total user, 55.2% of total elapsed

I've attached the result of running with -p here, but it's not very enlightening (to me) - here's the header:

        Tue Sep 13 15:11 2016 Time and Allocation Profiling Report  (Final)

           physics-2d +RTS -p -RTS

        total time  =        4.24 secs   (4239 ticks @ 1000 us, 1 processor)
        total alloc = 3,356,123,352 bytes  (excludes profiling overheads)

COST CENTRE       MODULE                              %time %alloc

getOverhead       Criterion.Monad                      21.3    5.7
getGCStats        Criterion.Measurement                11.5    0.0
insertEdge        Reactive.Banana.Prim.Graph            4.2    9.4
minViewWithKey    Data.PQueue.Prio.Internals            3.7   13.6
dfs'              Reactive.Banana.Prim.Graph            3.5    7.2
readPulseP        Reactive.Banana.Prim.Plumbing         3.5    1.1
insert            Data.Vault.ST.Lazy                    3.2    7.3
go                Reactive.Banana.Prim.Evaluation       3.1    2.7
deRefWeaks        Reactive.Banana.Prim.Util             2.5    3.0
hash              Data.HashMap.Base                     2.5    1.8
lookup            Data.Vault.ST.Lazy                    2.4    0.7
evaluateNodeP     Reactive.Banana.Prim.Evaluation       2.2    0.0
applyP            Reactive.Banana.Prim.Combinators      2.1    1.7
liftCached2       Reactive.Banana.Prim.Cached           2.1    2.9
rwsT              Control.Monad.Trans.RWSIO             1.9    2.3
cachedLatch       Reactive.Banana.Prim.Plumbing         1.6    1.1
readLatchIO       Reactive.Banana.Prim.Plumbing         1.5    1.3
listen            Control.Monad.Trans.ReaderWriterIO    1.5    2.4
newUnique         Data.Unique.Really                    1.3    1.6
tell              Control.Monad.Trans.RWSIO             1.3    2.5
writePulseP       Reactive.Banana.Prim.Plumbing         1.2    0.7
copy              Data.HashMap.Array                    1.1    0.9
getChildren       Reactive.Banana.Prim.Graph            1.0    0.2
buildDependencies Reactive.Banana.Prim.Dependencies     1.0    0.0
liftBuildP        Reactive.Banana.Prim.Plumbing         1.0    1.3
insert            Data.PQueue.Prio.Internals            1.0    4.9
tell              Control.Monad.Trans.ReaderWriterIO    0.8    1.9
new_              Data.HashMap.Array                    0.7    3.0
cache             Reactive.Banana.Prim.Cached           0.7    2.9
addChild          Reactive.Banana.Prim.Dependencies     0.2    1.3


@ocharles
Copy link
Collaborator Author

It's worth noting that things aren't much better with a static FRP network:

fireworks :: Event Double -> MomentIO (Behavior [SparkData])
fireworks stepPhysics = do
  sparks <- for (replicate 200 ()) $ \_ -> newSpark stepPhysics
  return (sequenceA (map observeSpark sparks))
  where
    observeSpark Spark {..} =
      SparkData <$> particlePosition sparkParticle <*> sparkIntensity
    deleteOne i s = deleteAt i <$ sparkFaded s

Gives

benchmarking Sparks
time                 825.5 ms   (718.7 ms .. 917.2 ms)
                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 805.1 ms   (783.2 ms .. 816.3 ms)
std dev              18.99 ms   (0.0 s .. 19.34 ms)
variance introduced by outliers: 19% (moderately inflated)

   9,378,473,584 bytes allocated in the heap
   5,185,469,432 bytes copied during GC
      13,970,456 bytes maximum residency (246 sample(s))
         471,616 bytes maximum slop
              41 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     17125 colls,     0 par    4.550s   4.558s     0.0003s    0.0015s
  Gen  1       246 colls,     0 par    1.909s   1.912s     0.0078s    0.0154s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    7.900s  (  7.908s elapsed)
  GC      time    6.458s  (  6.470s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   14.360s  ( 14.378s elapsed)

  %GC     time      45.0%  (45.0% elapsed)

  Alloc rate    1,187,168,848 bytes per MUT second

  Productivity  55.0% of total user, 55.0% of total elapsed

@HeinrichApfelmus
Copy link
Owner

I don't know what exactly is going on here, but I do know that reactive-banana currently uses monad transformers internally and that last time that I've worked on performance improvements, it turned out that they are responsible for performance problems. Essentially, the problem is that many uses of >>= are not inlined (though for good reason), which leads to closures being generated on the heap and discarded again, which leads to a high GC load.

I have also thought of a solution, but haven't gotten around to implementing it yet. For the record: The key idea is to remove the EvalP monad layer (and others) as much as possible, and supply most of the reader/writer/state contained in the monad directly via function arguments to the evaluatePulses function. The key step for this is a technique called "defunctionalization"; in this case, the step is to replace the higher order argument to newPulse by an algebraic data type that captures all the possible primitive combinators. In other words, the call to the function _evalP directly when determining the current value of a pulse node in the graph should be replaced by a pattern match on different node types, essentially moving the formulas for computing node values (e.g. how to compute an fmap, filter etc.) from the Prim.Combinators to the Prim.Evaluation module.

In summary: Making an algebraic data type for the different nodes types (e.g. fmap, filter, union etc.) allows us to get rid of EvalP closures, which should speed up things considerably.

@ocharles
Copy link
Collaborator Author

Interesting, and that all loosely makes sense.

I'm also surprised to see pqueue allocating so much. I briefly tried replacing it with psqueues, but that requires us to explicitly key every entry, so I didn't have time to explore that more. Do you think it's worth investigating further?

@ocharles
Copy link
Collaborator Author

I think ocharles@37fca51 is what you are describing with replacing _evalP, did I understand correctly?

@HeinrichApfelmus
Copy link
Owner

I think ocharles/reactive-banana@37fca51 is what you are describing with replacing _evalP, did I understand correctly?

Yes, that's exactly what I meant! (This is called "defunctionalization", because you replace a general closure (EvalP) by a data constructor (PulseCalculation) which is only later mapped to a function/closure.)

Now, it's possible to unwrap the EvalP type in the result of runPulseCalculation and evaluateNode, i.e. to write a new definition

runPulseCalculationNew r calc = unwrapEvalP r $ runPulseCalculationOld calc

and expand the right-hand side of this line. The point would be to make sure that every application of >>=, <$> etc. acts on values of IO type, not on values of EvalP type. This may not seem like much, but the difference between

(\r1 -> x1 r1) >>= (\a r2 -> x2 r2 a)

and

\r -> x1 r >>= x2 r a

is precisely building a closure on the heap and discarding it again.

Note that it may not be necessary to do all of this tedious rewriting by hand. If you look at the generated core, you can see whether the inliner can already perform it, or at least parts of it. The sign to look for is whether the core contains calls to the function >>= (bad, because the second argument is a closure), or whether it contains only functions that pass RealWorld tokens areound (good).

(Also, it may help to write runPulseCalculation as a big case statement, this way, you don't have to mention the argument r again and again.)

Do use a performance test case to check whether these changes to the code actually make things faster. :-)

@ocharles
Copy link
Collaborator Author

Ok, well ocharles@72ba09b "pushes" EvalP further out, but unfortunately this makes no difference on the benchmark :( I'm just surprised at the amount of allocation which would be my first port of call. Running with -pA points to

deRefWeaks                Reactive.Banana.Prim.Util              11.2   34.1   2634 4096401360
minViewWithKey            Data.PQueue.Prio.Internals              1.7   12.6    402 1512666920

I successfully swapped out pqueue for psqueues, but it doesn't really make a dent. But why is deRefWeaks allocating so much? Presumably because everything is stored as a Weak Ref, so to do anything we have to perform a lookup on the weak reference.

@HeinrichApfelmus
Copy link
Owner

Ok, well ocharles/reactive-banana@72ba09b "pushes" EvalP further out, but unfortunately this makes no difference on the benchmark :(

I think that goes in the right direction, but it's probably necessary to unwrap everything inside the runPulseCalculation function as well. In other words, I think it may be necessary to inline all calls to unwrapEvalP, so that the operators >>=, <*> and so on are only applied to values of type IO, never to values of type EvalP.

Whether this has to be done or not can only be judged by inspecting the compiled core. The Makefile* contains two (phony) targets profile and core.

  • The profile target will compile and run a small example program, I use this for checking whether my changes are making progress or not. Note that one has to use the -auto-all option to get more cost centers, as resources are usually not spent where you think they are.
  • The core makefile target will compile the Prim.Evaluation module and output the Haskell core functions. Inspecting this for let statements with values of type EvalP helps to understand where the issues with the monad transformers are.

*The Makefile assumes that a cabal sandbox is used.

@HeinrichApfelmus
Copy link
Owner

(By the way, I have written a couple of notes on the issue of optimizing monad transformer stacks, which is relevant to what we are trying to do here. It also shows a Haskell Core output example and what to look for.)

@ocharles
Copy link
Collaborator Author

Thanks, I do get what's being suggested and I'll have a dig further. By the way, are you aware that the latest transformers adds significantly more INLINEs now? You can quite often use a concrete stack now without any observation of it in the core - not sure if that happened before or after you wrote those notes.

@ocharles
Copy link
Collaborator Author

An idle thought that does come out when I stare at Evaluation though, is that just planning execution is (at worst?) linear with the amount of nodes. In the case of the physics benchmark, everything depends on the tick of a physics clock. Whenever the clock ticks, we replan all the work that needs to be done, building up and tearing down a huge priority search queue. Do you think there is any advantage to having some mutable state somewhere such that firing add handlers become distributing some information, rather than building up a data structure?

@ocharles
Copy link
Collaborator Author

ocharles commented Sep 15, 2016

Alright, I got a bit more organised and took a stab at doing some serious optimisations.

My master branch has all my work so far, here's a comparison. In short, I've:

  • Removed the use of vault. Instead, Each Pulse now directly has an IORef that contains its value at a particular clock tick. I then clear everything at the end of a tick. There's a substantial amount of work happening building up and tearing down the internal HashMap in vault, so if we can avoid it - we should. The only thing I'm concerned about is event unions, but all tests pass.
  • Removed RWSIO - we don't have any state now, so we can just use ReaderWriterIO
  • A lot of INLINE annotations, and now compile with -O2.
  • Defunctionalized pulse and output calculations, still need to do latch.
  • Do not build a queue for evaluation, but just proceed depth-first from the input pulse that's firing. This one is my most recent addition, and again - tests pass, but I'm nervous. Will need to think more about this.

With the "john lato" benchmark, I move from 0.120s total time to 0.046s - with 1/3 bytes allocated.

With my own criterion based benchmark, I move from a mean execution time of 6.456 μs to 1.363 μs.

Unfortunately, my particle physics application is still only 70% productive, so I will continue to see what I can do.

@HeinrichApfelmus
Copy link
Owner

HeinrichApfelmus commented Sep 16, 2016

Impressive! 😄

It would be nice to know which of the changes has the most impact, i.e. which yield the biggest reduction in runtime. After all, each of them has the tendency to make the code more "spaghetti", so I would like to limit myself to those that give the best bang for the buck.

• A lot of INLINE annotations, and now compile with -O2.

Did you have a chance to look at the generated Core? My (admittedly limited) experience was that GHC is pretty good at inlining things, so it's usually not necessary to add INLINE pragmas everywhere.

• Removed the use of vault. Instead, Each Pulse now directly has an IORef that contains its value at a particular clock tick. I then clear everything at the end of a tick.

The nice thing about the Vault approach is that it is easy to prove that the values are, in fact, all cleared at the end. Proof: The Vault is simply discarded from scope! So, this would be a change that negatively affects the structure of the code. But I'd be happy to accept if it it brings a significant performance benefit.

• Do not build a queue for evaluation, but just proceed depth-first from the input pulse that's firing. This one is my most recent addition, and again - tests pass, but I'm nervous. Will need to think more about this.

Unfortunately, I'm quite sure that this is not correct. It should be possible to construct a counterexample along the following lines: If e1 is an event, then the pulse corresponding to an event e4 defined as

e2 = fmap something e1
e3 = fmap something' e1
e4 = unionWith (+) e2 e3

might get evaluated by the depth-first approach before the pulse corresponding to e3 has been evaluated, giving an incorrect result.

I think it's hard to beat the explicit priority queue compared to an implict "an IORef per node" approach, so that's probably not a low-hanging fruit.

Unfortunately, my particle physics application is still only 70% productive, so I will continue to see what I can do.

I still suspect that this is due to closures being allocated and destroyed in the tight loop that is go. Only Core can tell.

@HeinrichApfelmus
Copy link
Owner

HeinrichApfelmus commented Oct 11, 2016

Any further luck with this? I have to finish a couple of other things first, but then I'm happy to take a closer look at this as well.

@ocharles
Copy link
Collaborator Author

Still working on different approaches. At the moment I'm exploring running with +RTS -xc -K10K to see if we're using excessive stack, which implies overly lazy data structures. It needs a lot of stack space just to create a network of 200 particles, which is making me suspicious. For example, the Graph data structure isn't strict in its arguments, and there are many other places where I think a strict tuple is beneficial. No finish work just yet though!

@mitchellwrosen
Copy link
Collaborator

Hello! Any updates on this work?

@ocharles
Copy link
Collaborator Author

ocharles commented Aug 21, 2017 via email

@mitchellwrosen
Copy link
Collaborator

It would be great to get some of your work merged, I think the only thing @HeinrichApfelmus objected to was ripping out the priority queue of nodes to evaluate.

@ocharles
Copy link
Collaborator Author

I'll have to dig out the branch again and see if I can make some sensible commits from it. I'll see if I can put some time aside.

@ocharles
Copy link
Collaborator Author

I resurrected this and applied all my current fixes. Unfortunately there's still a significant space leak:

image

As this uses switchB with stepper that holds on to a common event, I think this space leak might be #261

@HeinrichApfelmus
Copy link
Owner

As #261 is now fixed — perhaps the situation has improved here as well?

@mitchellwrosen
Copy link
Collaborator

Things seem pretty good on current master, actually! I'm seeing around 88% productivity.

@mitchellwrosen
Copy link
Collaborator

mitchellwrosen commented Feb 25, 2023

Oh, actually only around 80%; I was getting higher productivity when doing profiled runs, oddly.

Here's the heap graph:
Screen Shot 2023-02-25 at 8 57 57 AM

And the worst offenders:
Screen Shot 2023-02-25 at 8 58 58 AM

Unfortunately there are nameless things. I believe they're all from base, but might include boot libraries, too. I'm not totally sure what's up with that - do I have to build ghc itself with -finfo-table-map -fdistinct-constructor-tables?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants