[alpaka] Refactor prefixScan implementation #220

Open
wants to merge 1 commit into
base: master

antoniopetre
Contributor

The prefixScan algorithm is currently implemented in Alpaka using two kernels, while the native CUDA version uses a single kernel.

I refactored the Alpaka prefixScan implementation to use a single kernel as well, similar to the native CUDA implementation.
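
For readers unfamiliar with the pattern, here is a minimal sketch of a multi-block inclusive prefix scan, written as a sequential CPU emulation in plain C++. It is not the code from this PR: the names `scanBlock` and `multiBlockScan` are hypothetical, and the "last block to finish runs the second phase" logic is an assumption about how a single-kernel version is commonly structured.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: inclusive scan of in[begin, end) into out[begin, end).
// It stands in for the work that one block of threads would do.
void scanBlock(const std::vector<int>& in, std::vector<int>& out, std::size_t begin, std::size_t end) {
  int running = 0;
  for (std::size_t i = begin; i < end; ++i) {
    running += in[i];
    out[i] = running;
  }
}

// Sequential emulation of a multi-block inclusive prefix scan.
// Phase 1: every block scans its own chunk independently.
// Phase 2: the block that finishes last adds the totals of the preceding blocks.
std::vector<int> multiBlockScan(const std::vector<int>& in, std::size_t blockSize) {
  assert(blockSize > 0);
  std::vector<int> out(in.size());
  if (in.empty())
    return out;

  const std::size_t nBlocks = (in.size() + blockSize - 1) / blockSize;
  std::atomic<std::size_t> blocksDone{0};  // plays the role of a global counter

  // On a GPU these iterations would run concurrently, one per block.
  for (std::size_t b = 0; b < nBlocks; ++b) {
    const std::size_t begin = b * blockSize;
    const std::size_t end = std::min(begin + blockSize, in.size());
    scanBlock(in, out, begin, end);

    // A two-kernel version would end the first kernel here and launch a
    // second kernel for phase 2. In this sketch, the last block to finish
    // detects that fact through the counter and runs phase 2 itself.
    const bool isLastBlockDone = (blocksDone.fetch_add(1) == nBlocks - 1);
    if (!isLastBlockDone)
      continue;

    int offset = 0;  // sum of all elements in the blocks before block k
    for (std::size_t k = 0; k < nBlocks; ++k) {
      const std::size_t kBegin = k * blockSize;
      const std::size_t kEnd = std::min(kBegin + blockSize, in.size());
      const int blockTotal = out[kEnd - 1];  // block-local total, read before it is shifted
      for (std::size_t i = kBegin; i < kEnd; ++i)
        out[i] += offset;
      offset += blockTotal;
    }
  }
  return out;
}
```

With this sketch, `multiBlockScan({1, 2, 3, 4, 5, 6, 7}, 3)` gives `1, 3, 6, 10, 15, 21, 28`. On a real device the counter update would also need the appropriate memory fence so that the last block sees the partial results written by the others.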

makortel added the alpaka label on Sep 9, 2021
fwyzard force-pushed the refactor_prefixScan branch 2 times, most recently from 20130b8 to fb7bd6f, on October 12, 2021 09:49
Contributor

fwyzard commented Oct 12, 2021

Fixed conflicts and applied code formatting.

fwyzard force-pushed the refactor_prefixScan branch 3 times, most recently from be2894f to d427564, on October 14, 2021 08:33
Contributor

fwyzard commented Oct 14, 2021

Rebased and fixed conflicts.

fwyzard force-pushed the refactor_prefixScan branch from d427564 to f8a75ea on October 15, 2021 12:32
Contributor

fwyzard commented Oct 15, 2021

Rebased and fixed conflicts.

fwyzard requested review from makortel and waredjeb on October 15, 2021 12:35
Collaborator

makortel left a comment

In general looks ok.

src/alpaka/AlpakaCore/prefixScan.h (outdated, resolved)
Collaborator

makortel commented Oct 15, 2021

On Cori (with CUDA 11.2) I got the following failure when running ./alpaka --cuda:

Processing 1000 events, of which 1 concurrently, with 1 threads.
terminate called after throwing an instance of 'std::runtime_error'
  what():  .../pixeltrack-standalone/external/alpaka/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(101) 'cudaFree(reinterpret_cast<void*>(memPtr))' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!

I'm really puzzled about what BufUniformCudaHipRt is doing here (OK, maybe it is a buffer implementation shared between CUDA and HIP). The master version runs fine.

@makortel
Collaborator

Here is a stack trace of the exception:

#0  __cxxabiv1::__cxa_throw (obj=obj@entry=0xb42e930, tinfo=0x2aaaac5909d0 <typeinfo for std::runtime_error>, dest=0x2aaaac2c3b90 <std::runtime_error::~runtime_error()>) at ../../.././libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00002aaaab8f1ad8 in alpaka::uniform_cuda_hip::detail::rtCheck (line=<optimized out>, file=<optimized out>, desc=<optimized out>, error=<optimized out>) at .../pixeltrack-standalone/external/alpaka/include/alpaka/core/UniformCudaHip.hpp:67
#2  alpaka::uniform_cuda_hip::detail::rtCheckIgnore<>(cudaError const&, char const*, char const*, int const&) (error=<optimized out>, cmd=<optimized out>, file=<optimized out>, line=<optimized out>) at .../pixeltrack-standalone/external/alpaka/include/alpaka/core/UniformCudaHip.hpp:88
#3  0x00002aaab6cf3324 in alpaka::traits::CurrentThreadWaitFor<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtBase, void>::currentThreadWaitFor (queue=...) at /global/common/cori_cle7/software/sles15_cgpu/gcc/8.3.0/include/c++/8.3.0/bits/shared_ptr_base.h:1018
#4  alpaka::wait<alpaka::QueueUniformCudaHipRtNonBlocking> (awaited=...) at .../pixeltrack-standalone/external/alpaka/include/alpaka/wait/Traits.hpp:38
#5  alpaka_cuda_async::gpuVertexFinder::Producer::makeAsync (this=this@entry=0xbd5db8, tksoa=tksoa@entry=0x2aaae6000000, ptMin=<optimized out>, queue=...) at .../pixeltrack-standalone/src/alpaka/plugin-PixelVertexFinding/alpaka/gpuVertexFinder.cc:179
#6  0x00002aaab6cf952a in alpaka_cuda_async::PixelVertexProducerAlpaka::produce (this=0xbd5da8, iEvent=..., iSetup=...) at .../pixeltrack-standalone/src/alpaka/plugin-PixelVertexFinding/alpaka/PixelVertexProducerAlpaka.cc:53
#7  0x00002aaab6cfa1d4 in edm::EDProducer::doProduce (eventSetup=..., event=..., this=<optimized out>) at .../pixeltrack-standalone/src/alpaka/Framework/EDProducer.h:19
#8  edm::WorkerT<alpaka_cuda_async::PixelVertexProducerAlpaka>::doWorkAsync(edm::Event&, edm::EventSetup const&, edm::WaitingTask*)::{lambda(std::__exception_ptr::exception_ptr const*)#1}::operator()(std::__exception_ptr::exception_ptr const*) (iPtr=<optimized out>, this=<optimized out>)
    at .../pixeltrack-standalone/src/alpaka/Framework/Worker.h:69
#9  edm::FunctorWaitingTask<edm::WorkerT<alpaka_cuda_async::PixelVertexProducerAlpaka>::doWorkAsync(edm::Event&, edm::EventSetup const&, edm::WaitingTask*)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() (this=0x2aaab7f3fd40) at .../pixeltrack-standalone/src/alpaka/Framework/WaitingTask.h:78
#10 0x00002aaaabd6d07d in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x2aaab7f93e00, context_guard=..., t=t@entry=0x2aaab7f3fd40, isolation=isolation@entry=0) at ../../include/tbb/task.h:992
#11 0x00002aaaabd6d375 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x2aaab7f93e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:992
#12 0x000000000041bf5b in tbb::task::wait_for_all (this=0x2aaab7f97d40) at .../pixeltrack-standalone/external/tbb/include/tbb/task.h:992
#13 edm::EventProcessor::runToCompletion (this=this@entry=0x7fffffff5960) at .../pixeltrack-standalone/src/alpaka/bin/EventProcessor.cc:37
#14 0x00000000004112ce in main (argc=<optimized out>, argv=<optimized out>) at .../pixeltrack-standalone/src/alpaka/bin/main.cc:176

Contributor

fwyzard commented Oct 20, 2021

Fixed conflicts, rebased, etc.

fwyzard force-pushed the refactor_prefixScan branch from 3ba0e0d to 63cae86 on October 20, 2021 07:04
Contributor

fwyzard commented Oct 20, 2021

The validation still passes, but I now see a small but systematic loss in performance.

Before:

$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.000880287 (all within tolerance)
 Average absolute vertex difference 0.0007 (all within tolerance)
Processed 10000 events in 4.353466e+01 seconds, throughput 229.702 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.096583e+01 seconds, throughput 244.106 events/s.
Processed 10000 events in 4.049791e+01 seconds, throughput 246.926 events/s.
Processed 10000 events in 4.007989e+01 seconds, throughput 249.502 events/s.
Processed 10000 events in 4.102423e+01 seconds, throughput 243.758 events/s.

After:

$ CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --validation --maxEvents 10000; echo; for N in 1 2 3 4; do CUDA_VISIBLE_DEVICES=0 numactl -N 0 ./alpaka --cuda --numberOfThreads 8 --numberOfStreams 16 --maxEvents 10000; done
Processing 10000 events, of which 16 concurrently, with 8 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.00088813 (all within tolerance)
 Average absolute vertex difference 0.0004 (all within tolerance)
Processed 10000 events in 4.477171e+01 seconds, throughput 223.355 events/s.

Processing 10000 events, of which 16 concurrently, with 8 threads.
Processed 10000 events in 4.160250e+01 seconds, throughput 240.37 events/s.
Processed 10000 events in 4.151133e+01 seconds, throughput 240.898 events/s.
Processed 10000 events in 4.216559e+01 seconds, throughput 237.16 events/s.
Processed 10000 events in 4.186934e+01 seconds, throughput 238.838 events/s.

So the refactored version is about 2-3% slower.
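
As a quick cross-check of that estimate, the sketch below averages the four non-validation throughput figures quoted above and computes the relative loss. The helper names `mean` and `slowdownPercent` are hypothetical, and averaging only four runs is a rough comparison, not a statistical analysis.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of measurements.
double mean(const std::vector<double>& values) {
  assert(!values.empty());
  return std::accumulate(values.begin(), values.end(), 0.0) / static_cast<double>(values.size());
}

// Relative throughput loss of "after" with respect to "before", in percent.
double slowdownPercent(const std::vector<double>& before, const std::vector<double>& after) {
  const double reference = mean(before);
  return 100.0 * (reference - mean(after)) / reference;
}

// Throughput in events/s, copied from the four non-validation runs above.
const std::vector<double> kThroughputBefore{244.106, 246.926, 249.502, 243.758};
const std::vector<double> kThroughputAfter{240.37, 240.898, 237.16, 238.838};
```

Calling `slowdownPercent(kThroughputBefore, kThroughputAfter)` gives a value of roughly 2.7, which falls inside the quoted 2-3% range.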

4 participants