ASEArch - Algorithms and Software for Emerging Architectures

AVX Vectorisation and Cilk Plus

 

The parallel computing curriculum developed by the University of Oregon IPCC (Intel Parallel Computing Center) is available here.

Some Knights Corner training materials developed by Intel are available here.

 


 

AVX (Advanced Vector eXtensions) is an extension to the standard x86 ISA to support algorithms that demand heavy floating-point computation; it is well suited to multimedia applications. Compared to the ISA of NVIDIA GPUs, AVX is not capable of handling non-consecutive memory access patterns and loses significant performance when the data is contiguous but non-aligned.

Performance provided by AVX in CPUs can be utilized by:

  1. embedding AVX assembly code into the source,
  2. using AVX intrinsic instructions defined in the header file immintrin.h,
  3. auto-vectorization by the GNU and Intel compilers, potentially helping the compilers with #pragma directives,
  4. implicit vectorization with AMD's or Intel's OpenCL driver/compiler suite,
  5. using Intel's array notation or the CilkPlus language extension.

A brief overview and further references are given here.

An introduction to AVX, programming techniques and a tutorial can be found here.

Vectorization by Kent Milfeld from Texas Advanced Computing Center

Intel - Introduction to Vectorization

A Guide to Vectorization with Intel C++ compilers

Intel - Essential programming techniques for the Intel Xeon Phi coprocessor by Stephen Blair-Chappell

Intel - User-mandated or SIMD Vectorization

Intel - Scicomp 2013 Tutorial Intel Xeon Phi Product Family Vectorization - Klaus-Dieter Oertel


AVX intrinsics

Intrinsics are functions that wrap inline assembly instructions to reduce programming complexity. They are defined in header files (e.g. immintrin.h) and are available in both the GNU and Intel compilers for both Fortran and C/C++.

Intel's Intrinsics Guide is an on-line tool for finding specific AVX instructions among the various vector instruction sets. Available here.

The performance of different AVX instructions on Sandy Bridge hardware is presented here.
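
As an illustration, here is a minimal sketch of adding two float arrays with AVX intrinsics (the function name add_avx is made up for this example; the intrinsics themselves are the standard ones from immintrin.h):

#include <immintrin.h>  /* AVX intrinsics */

/* Adds two float arrays using 256-bit AVX registers, 8 floats per step.
   Assumes a, b and c are 32-byte aligned and n is a multiple of 8. */
void add_avx(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);   /* aligned load of 8 floats */
        __m256 vb = _mm256_load_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);   /* 8 additions in one instruction */
        _mm256_store_ps(c + i, vc);          /* aligned store */
    }
}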


Auto-vectorization and #pragma directives

Auto-vectorization is a general feature of compilers and is switched on by compiler optimization flags (in the Intel compiler it is enabled by default from optimization level -O2), optionally combined with a target-architecture flag. E.g.:

icc -O2 -xAVX source.c

Learn more...

This link has an interesting article on OpenMP 4.0, including the new simd directive.

About the vector pragma: link
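
As a brief sketch of how such directives are used (the function name scale_add is hypothetical; #pragma omp simd is the OpenMP 4.0 directive mentioned above, and Intel's #pragma ivdep similarly asserts the absence of loop-carried dependences):

/* The directive asserts that the loop iterations are independent,
   so the compiler may vectorize without a run-time dependence test. */
void scale_add(float * restrict y, const float * restrict x, float a, int n)
{
#pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

With GCC this requires -fopenmp (or -fopenmp-simd); with the Intel compiler, OpenMP support must likewise be enabled.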

 


OpenCL

Intel's and AMD's OpenCL (Open Computing Language) drivers provide "implicit vectorization" using CPU vector instructions. OpenCL provides better code portability across parallel platforms. To achieve vectorized code with OpenCL it is essential to optimize the code specifically for SIMD-like vector instructions and multithreaded execution. One may use the Intel Offline Compiler to learn more about what prevents vectorization in the code being developed.
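
For illustration, a minimal OpenCL C kernel might look as follows (the kernel name vadd is arbitrary); with Intel's or AMD's CPU runtime, the implicit vectorizer may pack several work-items into one AVX register, provided the per-work-item control flow stays uniform:

// one work-item per array element
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}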

 


CilkPlus, array notation and elemental functions

These three notions in the Intel compiler can be used to help ILP and thread-level parallel implementation of algorithms, and they can be used independently of each other. Cilk is a task-parallel (multi-threading) feature, provided in Intel's Cilk run-time library (libcilkrts), for scheduling a parallel problem (e.g. a for loop) onto multiple cores of a CPU. It is similar to OpenMP's parallel for, with the major difference that the parallel job is handled as a set of problem chunks, and these chunks are scheduled onto threads by a work-stealing scheduler. An example of for-loop parallelisation:

In OpenMP one would use the following pragma:

#pragma omp parallel for
for(i=0; i<N; i++) { ... }
Note: the compiler needs the -fopenmp option to recognise the omp pragmas.

In Cilk, task-level parallelism is implemented in the following way:

_Cilk_for(i=0; i<N; i++) { ... }

Note: the compiler needs the Cilk run-time library (-lcilkrts) to link the code generated for the _Cilk_for keyword.
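
For completeness, a minimal compilable sketch is given below (assuming the Intel compiler, which links the Cilk run-time automatically; with GCC 4.9+ use -fcilkplus -lcilkrts). cilk_for from <cilk/cilk.h> is the portable spelling of the _Cilk_for keyword:

#include <cilk/cilk.h>   /* defines cilk_for as an alias of _Cilk_for */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];
    /* the iteration space is split into chunks which the work-stealing
       scheduler distributes over the worker threads */
    cilk_for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
    printf("%f\n", c[0]);
    return 0;
}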

CilkPlus is Intel's attempt to help the compiler perform better parallelization at the ILP and task level through language extensions. CilkPlus is essentially a suite of Cilk's thread/task-level parallelism capabilities and the array notation language extension. The array notation is introduced to help the compiler utilize SIMD instructions by adding more information about the data structures, and to simplify implementation. Using array notation gives better code readability and clarity, e.g.:

#define N 256
int a[N];
int b[N];
int c[N];
c[:] = a[:] + b[:]; // equivalent: for(i=0;i<N;i++) c[i]=a[i]+b[i];
c[0:128:2] = a[0:128:2] + b[0:128:2]; // equivalent: for(n=0; n<128; n++) c[2*n] = a[2*n] + b[2*n]; (array section syntax is [start:length:stride])

From the compiler's point of view this is in some sense similar to Fortran's array handling. A tutorial on CilkPlus can be found here. For Cilk and CilkPlus example source codes click here.

Elemental functions are a key construct for vectorisation. They are functions that are vectorized by the Intel compiler and are inlined into the calling loop in a later step of the compilation. An example of defining an elemental function:

__attribute__((vector(linear(a), linear(b))))
inline
void foo(float *a, float *b) { ... } // Compiler will report: FUNCTION WAS VECTORIZED

int main() {
...
for(i=0; i<N; i++) foo(&a[i], &b[i]);

//OR

foo(&a[:], &b[:]); // array notation may be used instead of the for loop
}
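
A complete, compilable sketch of the above (assuming the Intel compiler on Linux, where the elemental-function annotation is __attribute__((vector(...))); the body of foo is invented for this example):

#include <stdio.h>

#define N 1024

/* Elemental function: besides the scalar version, the compiler generates
   a vector variant processing several consecutive (a,b) pairs per call;
   linear(a), linear(b) promise that the pointers advance by one element
   per logical iteration. */
__attribute__((vector(linear(a), linear(b))))
void foo(float *a, float *b)
{
    *b = 2.0f * (*a) + 1.0f;
}

int main(void)
{
    float a[N], b[N];
    int i;
    for (i = 0; i < N; i++) a[i] = (float)i;

    for (i = 0; i < N; i++)   /* the compiler maps this loop onto the vector variant */
        foo(&a[i], &b[i]);

    printf("%f\n", b[N - 1]);
    return 0;
}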

More on the following topics:

Intel - Elemental Functions

Intel - Function Annotations and the SIMD Directive for Vectorization

Intel - Usage of linear and uniform clause in Elemental function (SIMD enabled function)

Cilk Plus homepage

 


Tips and Tricks for AVX vectorization

  • Reduction within 2 YMM registers in parallel: Stackoverflow post
  • Manual loop unrolling for better performance. Although in some cases the Intel compiler is able to vectorize a loop, the performance can often be increased further by manual loop unrolling. The reader is encouraged to visit the following blog post. Note: the code in the post only works for N > stepsize. In general one might consider using

#define ROUND_DOWN(N,step) (((N)/(step))*(step))

instead of the proposed one:

#define ROUND_DOWN(N,step) ((N) & ~((step)-1))

If N and step are declared const int, or are given as C/C++ macros or C++ template parameters, then the expression with division and multiplication is evaluated at compile time. Otherwise one might consider the latter bit-mask form, which is cheaper at run time but is only correct when step is a power of two.
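
To make the pattern concrete, here is a sketch of how such a round-down macro is typically combined with a scalar remainder loop (the function name add_tail is made up; a step of 8 matches one 256-bit AVX register of floats):

#include <immintrin.h>

#define ROUND_DOWN(N,step) (((N)/(step))*(step))

void add_tail(const float *a, const float *b, float *c, int n)
{
    int i, n8 = ROUND_DOWN(n, 8);
    for (i = 0; i < n8; i += 8) {            /* vectorized main loop */
        __m256 v = _mm256_add_ps(_mm256_loadu_ps(a + i),
                                 _mm256_loadu_ps(b + i));
        _mm256_storeu_ps(c + i, v);
    }
    for (; i < n; i++)                       /* scalar remainder, 0..7 iterations */
        c[i] = a[i] + b[i];
}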