[kokkos] Work around performance issue by using only 'unsigned long' in AtomicPairCounter #333

Open: wants to merge 2 commits into master

makortel
Collaborator

@makortel makortel commented Mar 8, 2022

This is a better workaround than #309; see kokkos/kokkos#4780 for more details.
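For readers unfamiliar with the counter named in the title, here is a minimal host-side sketch of the general idea of a pair counter. It is not taken from this PR's diff. It assumes the counter packs two 32-bit values into one 64-bit `unsigned long` so that both advance in a single atomic add. The class name `PairCounterSketch`, its member names, and the use of `std::atomic` in place of the Kokkos atomics are all hypothetical choices for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a pair counter (not the code in this PR).
// Two 32-bit counters share one 64-bit word, so a single atomic add
// updates both of them consistently.
class PairCounterSketch {
public:
  using word_type = unsigned long;
  static_assert(sizeof(word_type) == 8,
                "sketch assumes a 64-bit unsigned long (LP64 platforms)");

  struct Counters {
    std::uint32_t count;  // number of add() calls (high 32 bits of the word)
    std::uint32_t total;  // sum of the sizes passed to add() (low 32 bits)
  };

  // Atomically bump `count` by one and `total` by `size`.
  // Returns the counters as they were *before* this call, so the caller
  // learns its slot index (`count`) and its offset (`total`).
  Counters add(std::uint32_t size) {
    const word_type delta = (word_type{1} << 32) + size;
    return unpack(word_.fetch_add(delta, std::memory_order_relaxed));
  }

  // Snapshot of the current counters.
  Counters get() const { return unpack(word_.load(std::memory_order_relaxed)); }

private:
  static Counters unpack(word_type w) {
    return {static_cast<std::uint32_t>(w >> 32),
            static_cast<std::uint32_t>(w & 0xffffffffUL)};
  }

  std::atomic<word_type> word_{0};
};
```

With this sketch, calling `add(3)` and then `add(5)` on a fresh counter returns `{0, 0}` and `{1, 3}` respectively, and `get()` afterwards reports `{2, 8}`. The sketch does not guard against `total` overflowing into the high half.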

In addition, this PR adds support for using Kokkos' profiling tools via the KOKKOS_PROFILE_LIBRARY environment variable (functionality that we were missing because of our heavily customized initialization of Kokkos).

@fwyzard
Contributor

fwyzard commented Mar 9, 2022

Mhm, this doesn't seem to be working as intended, at least in my test on a GTX 1080 Ti:

master

taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 2.791611e+01 seconds, throughput 716.432 events/s, CPU usage per thread: 62.2%

master + #309

taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 2.793876e+01 seconds, throughput 715.851 events/s, CPU usage per thread: 62.4%

master + #333

taskset -c 0-15 ./kokkos --cuda --numberOfThreads 16 --numberOfStreams 16 --maxEvents 20000
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 4.445962e+01 seconds, throughput 449.846 events/s, CPU usage per thread: 61.1%

Though I'm still at my first coffee, so I cannot guarantee there weren't any mistakes...

@makortel
Collaborator Author

makortel commented Mar 9, 2022

I ran similar tests (1 thread with 1k events, and 16 threads with 20k events) on

RTX 2080 SUPER

master

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 3.812683e+00 seconds, throughput 262.282 events/s, CPU usage per thread: 113.2%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 4.519340e+01 seconds, throughput 442.543 events/s, CPU usage per thread: 51.1%

master + #309

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 1.846951e+00 seconds, throughput 541.433 events/s, CPU usage per thread: 122.8%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.564639e+01 seconds, throughput 1278.25 events/s, CPU usage per thread: 72.0%

master + #333

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 1.848977e+00 seconds, throughput 540.84 events/s, CPU usage per thread: 117.9%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.646719e+01 seconds, throughput 1214.54 events/s, CPU usage per thread: 72.9%

GTX 1050 Ti

master

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 7.080183e+00 seconds, throughput 141.239 events/s, CPU usage per thread: 96.4%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.020225e+02 seconds, throughput 196.035 events/s, CPU usage per thread: 54.9%

master + #309

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 6.535374e+00 seconds, throughput 153.013 events/s, CPU usage per thread: 89.3%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.004504e+02 seconds, throughput 199.103 events/s, CPU usage per thread: 59.1%

master + #333

Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 6.933106e+00 seconds, throughput 144.235 events/s, CPU usage per thread: 87.1%
Processing 20000 events, of which 16 concurrently, with 16 threads.
Processed 20000 events in 1.102847e+02 seconds, throughput 181.349 events/s, CPU usage per thread: 58.9%

Earlier I had tested only on the RTX 2080 SUPER and was happy that #333 gave performance similar to #309. But my GTX 1050 Ti test reproduces the GTX 1080 Ti result in #333 (comment), so this appears to be a real effect.

@makortel
Collaborator Author

Here is a plot on V100 (about 2 minutes of running for each point, all on the same CoriGPU node):
[Image: kokkos_cuda_throughput]

So on Volta both fixes work, but disabling the "new atomics" yields slightly higher throughput for >= 3 concurrent events. Perhaps it would be best to go with #309 for now, then rebase this PR on top of that and leave it open for the time being.

@makortel makortel force-pushed the kokkosAtomicPairCounter branch from b9755a1 to 0882c02 Compare March 11, 2022 21:55
@makortel
Collaborator Author

Rebased following the merge of #309.
