There are some little things which can make a big difference to the performance of your CUDA application:
- Be careful about floating point constant types
If the variables a and b are single precision floats, then the code
a = 1.0 / b;
promotes b to double precision, performs a double precision division, and then converts the result back to single precision. This is much more costly than a single precision division, so you should instead use
a = 1.0f / b;
- Don't always use default shared memory / cache settings
When working in double precision on Kepler hardware, it is almost always best to put the shared memory banks into 64-bit mode through the host code command
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
When using a lot of shared memory, it is usually best to get 48kB of shared memory (and 16kB of L1 cache) through the host code command
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
Alternatively, when using very little shared memory it is usually better to use only 16kB of shared memory (and 48kB of L1 cache) through the host code command
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
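The cache preference can also be set for an individual kernel rather than for the whole device, using the standard CUDA runtime call cudaFuncSetCacheConfig. A brief host-code sketch (the kernel name my_kernel is a placeholder, not from the original text):

```cuda
// Illustrative host code: my_kernel is a placeholder for your own kernel.
// cudaFuncSetCacheConfig sets the shared memory / L1 split for one kernel,
// overriding any device-wide cudaDeviceSetCacheConfig setting when that
// kernel is launched.
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared); // prefer 48kB shared memory
```

This is useful when one kernel in an application needs lots of shared memory while the others benefit from a larger L1 cache.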
- Process multiple elements with each thread
A simple, highly parallel CUDA implementation often uses a separate thread to process each point in a computational grid, or each item in a list.
The drawback of this approach is that there can be a lot of initialisation overhead which can limit the overall performance. To reduce this, it is often better to process several elements with each thread, so the cost of the initialisation is effectively spread across the multiple elements.
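A common way to process several elements per thread is a grid-stride loop; here is a minimal sketch (the kernel and its parameters are illustrative, not taken from the original text):

```cuda
// Illustrative kernel: scales n elements of x by a.
// Each thread handles every (blockDim.x * gridDim.x)-th element, so a
// single launch covers an arbitrary n, and the per-thread setup cost
// (index arithmetic, constants loaded into registers) is amortised
// over several elements.
__global__ void scale(float *x, float a, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] = a * x[i];
}
```

A further benefit is that the launch configuration can be chosen to suit the hardware (e.g. a fixed multiple of the number of SMs) independently of the problem size.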
- Use the best nvcc compilation options
When compiling for a GTX 670, I use
nvcc -O -arch=sm_30 -use_fast_math
The -O flag turns on optimisation, the -arch=sm_30 flag tells nvcc to generate executable code for GPUs of Compute Capability 3.0, and the -use_fast_math flag tells nvcc to replace some single precision math functions with their faster, lower-accuracy intrinsic versions (such as __sinf and __expf).
It is very useful to get traceback information when something goes wrong. Instead of using -G, which also turns off all device code optimisation, an alternative is to use the flag -lineinfo, which embeds source line information without disabling optimisation. For more information, see the nvcc documentation.
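For example, a debug-friendly build that keeps optimisation on might look like this (the source and output file names are illustrative):

```shell
# keep optimisation, but embed source line information for profilers
# and error traceback; file names here are hypothetical
nvcc -O -arch=sm_30 -use_fast_math -lineinfo laplace3d.cu -o laplace3d
```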