A quick overview of CUDA

CUDA is Nvidia's vision for using graphics processing units for general purpose computing ('GPGPU'). A key component is of course a graphics card, but also needed are a suitable driver to control the card for GPGPU and a way of actually writing software that exploits the card to do something useful.

The best place to learn about CUDA is Nvidia's Programming Guide, but a brief description here might be useful.

In the past, to use a graphics card for computation one apparently had to recast every problem as a graphics transformation and express it through graphics functions. With CUDA, one can program the card in what is essentially C.

You should think of the card as a collection of O(100) simple processors, running at O(1 GHz) and capable of performing certain floating point operations (sin, cos, ln, exp, ...) extremely quickly; certain integer and other operations are fast as well. The processors are organised into groups called multiprocessors, and processors within a group can share information very quickly. All processors can also access "device" memory (i.e. the graphics card's main memory).
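These hardware figures can be queried at run time. As a minimal sketch using the CUDA runtime API, the following program reports the multiprocessor count, clock rate, device memory and per-block shared memory of device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */

    printf("Name:              %s\n",        prop.name);
    printf("Multiprocessors:   %d\n",        prop.multiProcessorCount);
    printf("Clock rate:        %d kHz\n",    prop.clockRate);
    printf("Device memory:     %zu bytes\n", prop.totalGlobalMem);
    printf("Shared mem/block:  %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```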

You first copy any data from the "host" computer to the graphics card's main memory, then call a "kernel" function on the graphics card to operate on the data. You then either copy the data back to the host for use in the rest of your program, or call another "kernel" to process the data further on the graphics card.
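This copy / kernel / copy-back cycle can be sketched as follows (an illustrative example, not taken from the text: the kernel simply scales an array in place):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Kernel: each thread scales one element of the array. */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i)
        h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    /* host -> device */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    /* launch the kernel: 4 blocks of 256 threads each */
    scale<<<4, 256>>>(d, 2.0f, n);

    /* device -> host */
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[10] = %f\n", h[10]);   /* 10 doubled, i.e. 20.0 */
    cudaFree(d);
    return 0;
}
```

Note that the second cudaMemcpy also acts as a synchronisation point: the kernel launch is asynchronous, but the copy does not begin until the kernel has finished.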

A useful way to think about a kernel function is that it lets loose a large number of more-or-less independent "threads" to operate on your data. Mirroring the physical hardware, these threads are grouped into "blocks", which are themselves grouped into a "grid". Threads within a block are all processed on one multiprocessor, can share data with each other via fast shared memory, and can have their execution coordinated.

Blocks, however, execute independently of one another; indeed, even the order in which blocks execute is not controllable. A block can therefore never rely on the results of another block within the same kernel. Only one kernel function can execute on the card at a time, though, so by calling kernel functions sequentially from the host code one can impose some global order on the program.
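The block model above can be illustrated with a small kernel (a sketch, assuming 256 threads per block): each block sums its own slice of the input using shared memory and __syncthreads() to coordinate its threads, but since blocks cannot see each other's partial results within a kernel, each block writes its own sum, and combining those sums requires a second kernel call or a copy back to the host.

```cuda
/* Each block sums 256 consecutive elements of 'in' and writes its
   partial sum to block_results[blockIdx.x].  Assumes blockDim.x == 256
   and a power-of-two block size. */
__global__ void block_sum(const float *in, float *block_results)
{
    __shared__ float buf[256];          /* visible within this block only */
    int tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    /* coordinate threads in the block */

    /* tree reduction within the block */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_results[blockIdx.x] = buf[0];
}
```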