CUDA compiles your C-like .cu code into intermediate PTX code, which is then further compiled into cubin binary code. It can be very useful to look at the intermediate code, which you can keep around by using the "-keep" option. Furthermore, one can have the .cu source code interleaved into the PTX code by using the "-Xopencc -LIST:source=on" option.
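For example (with a hypothetical kernel file mykernel.cu; the flags are the ones mentioned above):

```shell
# Keep the intermediate files (.ptx, .cubin, etc.) alongside the output:
nvcc -keep mykernel.cu -o mykernel

# Additionally interleave the original .cu source lines into the PTX,
# which makes the intermediate code much easier to follow:
nvcc -keep -Xopencc -LIST:source=on mykernel.cu -o mykernel
```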
The programming guide and software development kit documentation and source code contain lots of useful advice and examples on how to make code work well under CUDA.
However, I have found optimizing CUDA code to be rather frustrating, in part because Nvidia releases few details of what is actually going on in the hardware. This could be because the next generation of cards may do things differently, or simply because things such as the thread and block schedulers are very complicated and messy, so there is nothing very general to say about them. I would have appreciated basic timing and scaling information on kernel launches as a function of threads per block and blocks per grid, for example. The only way to see whether my code went faster with one kernel launching O(10^5) blocks that each did a little work, or launching O(10^3) blocks that each did a lot more work, was to laboriously code each up and measure (the latter was almost a factor of two faster). Also, there is special texture memory that is "optimized for 2D spatial locality", but without more information it is difficult to know where to start in using it to code up, say, an efficient block matrix multiplication routine.
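The two decompositions I had to time against each other look roughly like this (a minimal sketch with hypothetical kernel names, not my actual code):

```cuda
// The same total work split two ways.

// Fine-grained: O(10^5) blocks, each doing a little work.
__global__ void work_fine(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;                     // one element per thread, one pass
}

// Coarse-grained: O(10^3) blocks, each looping over many elements.
__global__ void work_coarse(float *data, int per_block)
{
    int base = blockIdx.x * per_block;
    for (int j = threadIdx.x; j < per_block; j += blockDim.x)
        data[base + j] += 1.0f;          // much more work per block
}

// Timing each variant (with cudaEvent_t, or just wall-clock time around
// a cudaThreadSynchronize()) is, frustratingly, the only way to find out
// which one wins on a given card.
```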
One has to focus on the "big" parts of the problem and not worry much about how slowly the "little" parts run. I've spent some time trying to do things like averaging the final results efficiently across a thread block; good for learning how to "think in CUDA", but probably not the most efficient use of my time.
Many "clever" things that I've tried to increase performance (e.g. in a matrix problem, prefetching the next submatrix into shared memory while working on the current one) haven't helped much, if at all, with the exception of loop unrolling, which has significantly helped in some cases. It seems best to keep things as simple, clean and straightforward as possible, while of course following Nvidia's advice about e.g. memory access and block/grid configuration, and to give the compiler and hardware free rein over code generation and execution. Algorithm optimization and effective parallelization of the problem are probably far more important and profitable than code optimization per se.
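For what it's worth, the unrolling can also be requested explicitly with the #pragma unroll directive described in the programming guide; a sketch (assuming a fixed 16x16 tile, as in the SDK's matrix multiplication example):

```cuda
__global__ void tile_dot(const float *A, const float *B, float *C)
{
    __shared__ float As[16][16], Bs[16][16];
    // ... load the tiles from A and B into shared memory, __syncthreads() ...

    float sum = 0.0f;
    #pragma unroll      // ask nvcc to fully unroll this fixed-count loop
    for (int k = 0; k < 16; ++k)
        sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];

    // ... accumulate sum into the appropriate element of C ...
}
```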
I haven't found much performance improvement in fiddling around with either the nvopencc ("-Xopencc -O?") or ptxas ("-Xptxas -O?") optimization settings; the defaults have basically always been the best.
The "-use_fast_math" option sometimes helps if you have operations like sine and cosine on the device for which you haven't explicitly requested the fast versions (e.g. by writing __sinf(x)).
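A small sketch of the two ways of getting the fast version (hypothetical kernel; the flag and intrinsic names are as in the programming guide):

```cuda
__global__ void phases(float *out, const float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // sinf() compiles to the slower, more accurate routine by default;
    // under "-use_fast_math" it is replaced by the fast __sinf() version.
    out[i] = sinf(x[i]);

    // Equivalently, one can request the fast version explicitly,
    // leaving the rest of the code at full accuracy:
    // out[i] = __sinf(x[i]);
}
```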
The (now-deprecated?) "-Xptxas -fastimul" option forces all integer multiplies to be fast 24-bit ones rather than slower 32-bit ones. If you're sure this won't break your code, it sometimes helps and saves the use of __mul24 etc. I am not sure whether the option affects only explicit multiplies in the code or whether it also speeds up 2D array indexing and the like.
Perhaps in decreasing order of importance, one should make sure that one's code coalesces its global memory accesses, avoids shared memory bank conflicts, and launches enough threads per block and blocks per grid to hide memory latency (see the programming guide for discussions of all of these).
A "magic number" in CUDA on G80-series cards is 16, this being the size of a half-warp (and thus determining global memory coalescing) and also the number of shared memory banks (and thus determining conflict-free shared memory access). If you have a 2D array in shared memory, another magic number is 16+1=17: declared with a row width of 17, the array arr can be accessed both by row (...arr[.][tx], where "tx" is the x component of the thread ID or label) and by column (...arr[tx][.]) without bank conflicts.
I was worried about the speed of using thread-local arrays (which are automatically put into "slow" global memory) for my eternal inflation project, where they would otherwise be a very natural way of doing things. I spent a lot of time thinking about whether I could split up the shared memory and use that instead, but there just wasn't enough of it for what I wanted to record. So in the end I implemented the local-array version. My concerns were unjustified: the program ran 500-1000 times quicker than the CPU version, even using these local arrays! I think the compiler must be being "clever" and ensuring that such arrays are accessed in a coalesced manner by the threads, and the thread scheduler must be doing its job properly, switching between "warps" of threads to hide the memory latency.
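The pattern in question, roughly (hypothetical names, not my actual project code; a per-thread array like hist below is placed in off-chip "local" memory when it is too big for registers or shared memory):

```cuda
#define NBINS 64   // 64 floats/thread: too big for 16 KB of shared
                   // memory across a full block of threads

__global__ void accumulate(float *out, const float *samples, int n)
{
    float hist[NBINS];          // thread-local array -> "slow" local memory
    for (int b = 0; b < NBINS; ++b)
        hist[b] = 0.0f;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        hist[(int)samples[i] % NBINS] += 1.0f;   // per-thread accumulation

    // The latency of local memory is hidden as long as the scheduler
    // has plenty of other warps to switch to while loads are in flight.
    for (int b = 0; b < NBINS; ++b)
        out[tid * NBINS + b] = hist[b];
}
```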
A very useful (unofficial) tool for optimization has been developed that disassembles the "cubin" binary files into the native instruction set of the cards. It is called "decuda" and is available here. It has been used for example in understanding performance issues in accessing shared memory for matrix multiplication; see towards the end of this thread.
One major problem in CUDA 1.0 that I came across early on was that code such as "z=sinhf(.5f);" failed to compile on the device: there seem to be problems with hyperbolic trig functions with small immediate arguments. I got around this by writing something like "volatile float y=.5f; z=sinhf(y);".
There is also a problem with the "pow" function with integers that is documented in the 1.0 release notes.
Note that the compiler supports the "long long" data type; very useful in 32-bit mode if you want a 64-bit integer.
I've had partial success in running both 32- and 64-bit code on 64-bit linux by installing the 32-bit toolkit to a separate directory and then compiling with something like "nvcc __.cu -L/usr/local/cuda32/lib -m32".