numba.tex

\section{NUMBA ASSEMBLY OF INTEGRAL OPERATORS}
The main focus of Numba \cite{numba} within Bempp-cl is to provide accelerated implementations of routines with linear complexity, such as grid iterations, integration of functions over grids, or assembly of certain sparse matrices. However, we also provide a fall-back implementation of the OpenCL dense operator assembly in Numba.

With Numba, we use loop parallelism: each loop iteration is the assembly of one test triangle with all trial triangles. We then parallelize over the test triangles through a parallel for-loop.
Within each loop we try to optimize for auto-vectorization by linearly passing through the data in memory order for the individual operations. However, a much smaller fraction of operations is SIMD optimized due to the lack of targeted SIMD constructs in Numba.

We stress that while Numba provides backends not only for CPU, but also for ROCm and CUDA, we currently only use the CPU component of Numba.