# Using Libraries for Accelerator Offload

This section discusses using efficient functions from libraries
like oneAPI Math Kernel Library (oneMKL) or oneAPI Deep Neural Network Library (oneDNN) instead of hand-coded alternatives.
Unless
you’re an expert studying a particular mathematical operation, it’s usually a
bad idea to write your own version of that operation. For example, matrix
multiplication is a common, straightforward mathematical operation:

It’s also easy to implement with just a few lines of code:

```
// Multiply matrices A and B
for (m = 0; m < M; m++) {
for (n = 0; n < N; n++) {
C[m][n] = 0.0;
for (k = 0; k < K; k++) {
C[m][n] += A[m][k] * B[k][n];
}
}
} // End matrix multiplication
```

However, this naive implementation won’t give the best possible performance.
Simple visual inspection of the inner loop shows non-contiguous memory access
for matrix B. Cache reuse, and hence performance, will be poor.

It’s not difficult to port the naive algorithm to Data Parallel C++ (DPC++)
to offload the matrix multiplication kernel to an accelerator. The following
code initializes the queue to submit work to the default device and allocates
space for the matrices in unified shared memory (USM):

```
// Initialize SYCL queue
sycl::queue Q(sycl::default_selector{});
auto sycl_device = Q.get_device();
auto sycl_context = Q.get_context();
std::cout << "Running on: "
<< Q.get_device().get_info<sycl::info::device::name>() << std::endl;
// Allocate matrices A, B, and C in USM
auto A = sycl::malloc_shared<float *>(M, sycl_device, sycl_context);
for (m = 0; m < M; m++)
A[m] = sycl::malloc_shared<float>(K, sycl_device, sycl_context);
auto B = sycl::malloc_shared<float *>(K, sycl_device, sycl_context);
for (k = 0; k < K; k++)
B[k] = sycl::malloc_shared<float>(N, sycl_device, sycl_context);
auto C = sycl::malloc_shared<float *>(M, sycl_device, sycl_context);
for (m = 0; m < M; m++)
C[m] = sycl::malloc_shared<float>(N, sycl_device, sycl_context);
// Initialize matrices A, B, and C
```

Data in USM can be moved between host and device memories by the DPC++ runtime.
Explicit buffering is not required. To offload the computation to the default
accelerator, it is converted to a DPC++ kernel and submitted to the queue:

```
// Offload matrix multiplication kernel
Q.parallel_for(sycl::range<2>{M, N}, [=](sycl::id<2> id) {
unsigned int m = id[0];
unsigned int n = id[1];
float sum = 0.0;
for (unsigned int k = 0; k < K; k++)
sum += A[m][k] * B[k][n];
C[m][n] = sum;
}).wait(); // End matrix multiplication
```

However, simply offloading such code to an accelerator is unlikely to restore
performance. In fact, performance gets worse. Badly written code is still badly
written whether it runs on the host or a device. The table below shows how the
naive matrix multiplication implementation performs on matrices of different
shapes.

Common, computationally-demanding operations like matrix multiplication are
well-studied. Experts have devised a number of algorithms that give better
performance than naive implementations of the basic mathematical formulas.
They also use tuning techniques like cache blocking and loop unrolling to
achieve performance regardless of the shapes of matrices A and B.

oneMKL provides an optimized general matrix
multiplication function (

oneapi::mkl::blas::gemm

) that gives high
performance on the host processor or a variety of accelerator devices. The
matrices are allocated in USM as before, and passed to the gemm

function
along with the device queue, matrix dimensions, and various other options:```
// Offload matrix multiplication
float alpha = 1.0, beta = 0.0;
oneapi::mkl::transpose transA = oneapi::mkl::transpose::nontrans;
oneapi::mkl::transpose transB = oneapi::mkl::transpose::nontrans;
sycl::event gemm_done;
std::vector<sycl::event> gemm_dependencies;
gemm_done = oneapi::mkl::blas::gemm(Q, transA, transB, M, N, K, alpha, A, M,
B, K, beta, C, M, gemm_dependencies);
gemm_done.wait();
```

As expected, the library function gives better performance and is more
versatile than the naive implementations. For example, the library function
can transpose one or both matrices before multiplication, if necessary.

Implementation | 4000 x 4000 | 8000 x 2000 | 2000 x 8000 | Processor |
---|---|---|---|---|

Naive DPC++ | 19.2 | 38.1 | 9.8 | Gen9 |

oneMKL gemm | 0.9 | 1.3 | 0.8 | Gen9 |

This simple example illustrates the separation of concerns between application
developers and tuning experts. The former should rely on the latter to
encapsulate common computations in highly-optimized libraries. The oneAPI
specification defines many libraries to help create accelerated applications,
e.g.:

- oneMKL for math operations
- oneDAL for data analytics and machine learning
- oneDNN for the development of deep learning frameworks
- oneVPL for video processing

Check whether your required operation is already available in a oneAPI library
before implementing it from scratch.