
CUDA tips

There are some little things which can make a big difference to the performance of your CUDA application:

  • Be careful about floating point constant types 

    If the variables a and b are single precision floats, then the code 
    a = 1.0 / b; 
    promotes b to double precision, calculates a double precision inverse, and then stores it as a single precision result. This is much more costly than a single precision division, so instead you should use 
    a = 1.0f / b;
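
    As a minimal sketch of how this looks inside a kernel (the kernel and variable names here are just for illustration):

    __global__ void reciprocal(float *a, const float *b, int n)
    {
      int i = threadIdx.x + blockIdx.x*blockDim.x;
      if (i < n) a[i] = 1.0f / b[i];   // writing 1.0/b[i] here would force a double precision division
    }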

 

  • Don't always use default shared memory / cache settings 

    When working in double precision on Kepler hardware, it is almost always best to put the shared memory into 64-bit mode through the host code command 
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte); 

    When using a lot of shared memory, it is usually best to get 48kB of shared memory through the host code command 
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared); 
    Alternatively, when using very little shared memory it is usually better to take only 16kB of shared memory, leaving 48kB for the L1 cache, through the host code command 
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
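
    For example, the host-side set-up just before a kernel launch might look like the following sketch (the kernel name, arguments and launch configuration are made up for illustration):

    // use 64-bit shared memory banks for double precision data
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    // ask for the 48kB shared memory / 16kB L1 split for a shared-memory-heavy kernel
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    my_kernel<<<nblocks, nthreads>>>(d_u, npoints);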

 

  • Process multiple elements with each thread 

    A simple, highly parallel CUDA implementation often uses a separate thread to process each point in a computational grid, or each item in a list. 

    The drawback of this approach is that there can be a lot of initialisation overhead, which can limit the overall performance. To reduce this, it is often better to process several elements with each thread, so that the cost of the initialisation is effectively spread across those elements.
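
    One common way to do this is a grid-stride loop; here is a minimal sketch using an illustrative saxpy-style kernel:

    __global__ void saxpy(float a, const float *x, float *y, int n)
    {
      // each thread handles several elements, so the per-thread
      // index set-up cost is spread over all of them
      for (int i = threadIdx.x + blockIdx.x*blockDim.x; i < n; i += blockDim.x*gridDim.x)
        y[i] = a*x[i] + y[i];
    }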

 

  • Use the best nvcc compilation options 

    When compiling for a GTX 670, I use 

    nvcc -O -arch=sm_30 -use_fast_math 

    The -O flag turns on optimisation, the -arch=sm_30 flag tells nvcc to generate executable code for GPUs of Compute Capability 3.0, and the -use_fast_math flag tells nvcc to replace some single precision math functions with faster, slightly less accurate intrinsic versions. 

    It is very useful to get traceback information when something goes wrong. Instead of using -G, which also turns off all optimisation, an alternative is to use the flag -lineinfo.
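
    Putting these flags together with -lineinfo, a complete compilation command (with a made-up source file name) might look like:

    nvcc -O -arch=sm_30 -use_fast_math -lineinfo prac1.cu -o prac1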