ASEArch - Algorithms and Software for Emerging Architectures

OpenACC

OpenACC is an OpenMP-like, directive-based approach to GPU programming. The OpenACC API specification supports accelerator programming through a set of compiler directives in C, C++ and Fortran. An accelerator is, in general, any processor that supports offloading computation from the host processor, such as GPUs, Intel MIC, AMD APUs, FPGAs and future technologies. Compilers from PGI, Cray and CAPS have built-in support for specification v1.0. Performance may vary between these compilers since each implementation is vendor specific. The aim of OpenACC is to provide portability across operating systems, hosts and accelerators.

The syntax of OpenACC directives is the following:

C syntax:

#pragma acc directive [clause [[,] clause] ] new-line

Fortran syntax:

!$acc directive [clause [[,] clause] ]

These directives must be followed by a structured code block or a loop. A simple C example for adding two arrays:

#pragma acc kernels
for(i=0; i<N; i++) {
  c[i] = a[i] + b[i];
}

The parallel and kernels constructs mark regions for which parallelized (vectorized) accelerator code is generated. The loop construct specifies the (possibly nested) loops to be parallelized. The data construct defines the region of code in which data is accessible to the accelerator; explicit data movement (similar to cudaMemcpy) can also be specified through its clauses. Other directives and API functions exist for finer-grained optimization.
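
The following sketch combines the data, parallel and loop constructs (the function name vadd_scale and the use of two compute regions are illustrative, not taken from the specification): a and b are copied to the accelerator once, both loops run there, and c is copied back only when the data region ends.

/* Illustrative sketch: one data region enclosing two compute regions. */
void vadd_scale(int N, float *a, float *b, float *c)
{
  int i;

  /* Allocate device copies of a, b, c; copy a and b in, copy c out at the end. */
  #pragma acc data copyin(a[0:N], b[0:N]) copyout(c[0:N])
  {
    #pragma acc parallel loop
    for (i = 0; i < N; i++)
      c[i] = a[i] + b[i];

    #pragma acc parallel loop
    for (i = 0; i < N; i++)
      c[i] = 2.0f * c[i];
  }
}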

Data management use cases:

Here we attempt to explain the two typical use cases of data management. The two cases can be separated as follows:

  1. "Implicit" memory management: one declares and allocates data on the host and then uses data directives to perform memory allocation on the accelerator and the copies between host and accelerator. The data management is then done "implicitly" by the compiler. Clauses involved in this use case are: present, copy, copyin, copyout, present_or_copy, create, update etc. For explicit use of the device pointer one may need #pragma acc host_data use_device(ptr).
  2. "Explicit" memory management: in this case the user explicitly declares, allocates, uses and frees device memory. Allocation and deallocation are done through API calls such as acc_malloc() and acc_free(); device pointers are used with #pragma acc data deviceptr(ptr). In OpenACC v2.0 device data lifetimes can also be managed with #pragma acc enter data and #pragma acc exit data. Both use cases are sketched in the example below.
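
The sketch below illustrates both use cases on a simple array scaling. The function names are illustrative only, and acc_memcpy_to_device()/acc_memcpy_from_device() used in the explicit variant are part of the OpenACC v2.0 API, so that part assumes a v2.0 implementation.

#include <openacc.h>

/* Use case 1, "implicit" management: the compiler allocates a device copy
   of a for the duration of the data region and copies it in and out. */
void scale_implicit(int n, float *a, float factor)
{
  int i;

  #pragma acc data copy(a[0:n])
  {
    #pragma acc parallel loop
    for (i = 0; i < n; i++)
      a[i] *= factor;
  }
}

/* Use case 2, "explicit" management: the user owns the device allocation
   and passes the raw device pointer in with deviceptr(), so the compiler
   performs no allocation or copying of its own. */
void scale_explicit(int n, float *a, float factor)
{
  int i;
  float *d_a = (float *) acc_malloc(n * sizeof(float));

  acc_memcpy_to_device(d_a, a, n * sizeof(float));    /* host -> device */

  #pragma acc parallel loop deviceptr(d_a)
  for (i = 0; i < n; i++)
    d_a[i] *= factor;

  acc_memcpy_from_device(a, d_a, n * sizeof(float));  /* device -> host */
  acc_free(d_a);
}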

Useful links, materials:

OpenACC v2.0

  • Specification - extensive specification
  • Reference Guide - quick guide
  • PGI introduced OpenACC v2.0 support in compiler suite v14.1. The new compiler utilises the LLVM back-end and appears to support CUDA 5.5 from this release onwards.
  • Some novelties in v2.0:
    • Support for NVIDIA and AMD GPUs through LLVM
    • routine construct (see the sketch after this list):
      • to create a device function that can be called within a parallel or kernels region
      • nested parallelism is also supported
    • dynamic data lifetimes with the enter data and exit data constructs
    • support for multiple host threads to handle separate GPUs
    • tile clause for the loop construct to support loop tiling and thus better data locality
    • async queues, identified by a unique ID, to handle CUDA-like streams
    • atomic operations with atomic construct
    • new API functions and PGI extensions beyond the standard API
  • OpenACC v2.0 slides with simple use cases (without explanation) by Michael Wolfe
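
The following sketch (function names are illustrative only) combines several of these v2.0 features: a routine seq device function called from a compute region, an unstructured data lifetime managed with enter data / exit data, and an asynchronous launch on queue 1.

/* routine construct: compile saxpy_elem as a device function callable
   from within compute regions; seq means each call runs sequentially. */
#pragma acc routine seq
static float saxpy_elem(float a, float x, float y)
{
  return a * x + y;
}

void saxpy(int n, float a, float *x, float *y)
{
  int i;

  /* enter data: begin an unstructured device data lifetime. */
  #pragma acc enter data copyin(x[0:n], y[0:n])

  /* async(1): enqueue the kernel on queue 1 (a CUDA-stream-like queue)
     and return to the host immediately. */
  #pragma acc parallel loop async(1) present(x[0:n], y[0:n])
  for (i = 0; i < n; i++)
    y[i] = saxpy_elem(a, x[i], y[i]);

  /* exit data: copy the result back, discard x, and end the data lifetime
     once the work on queue 1 has completed. */
  #pragma acc exit data copyout(y[0:n]) delete(x[0:n]) async(1)
  #pragma acc wait(1)
}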

OpenACC v1.0