Cholesky factorization in CUDA

Update:

30 Mar 2009: I've now developed and am beginning to test multi-gpu versions of the codes; see mpitranschol.cu and mpidoublechol.cu. MPI is used to spread the work between cpu/gpu pairs. See compiling.txt for the rough idea of how to compile and run. The codes should be improving significantly after proper testing.

14 Oct 2008 Note: I have corrected an error in the position the d_choldc_strip kernel writes to in the "row-wise" codes, transchol.cu and doublechol.cu.

9 Oct 2008: I have revised my routine to operate "block row-wise" and return the upper triangular (in C row-major order) part of the matrix, and as hoped the slowdowns for certain matrix sizes mentioned below have pretty much gone away. See transchol.cu, and for further discussion of the slowdowns see posts some way along this thread on the Nvidia discussion forums.

I have also just done a first port to double precision of the routine. See doublechol.cu. Performance is very encouraging, at 40 GFLOP/s on a GTX 260, already 2/3 of double precision peak!

I plan to tweak and tidy up the code making it straightforward to use very soon. I have plans for a multi-gpu version too.

A common bottleneck in certain computations is matrix inversion/determinant calculation. For example, one approach I am thinking of to approximating the stochastic path integral in my eternal inflation project involves calculating the determinant of the matrix representing small flucutations away from the saddle-point history. In my main line of work, CMB extraction from sky maps, one often needs to calculate the probability for various models given the data, again involving inverse matrices and determinants. A useful step to calculating these quantities (for symmetric positive-definite matrices) is to first "factorize" the matrix in a certain way, expressing it as the product of "lower" matrix and its transpose. (A "lower" matrix is one that only has entries on and below its leading diagonal). This is the Cholesky factorization; see e.g. Wikipedia for more details and references.

So I thought I'd have a go at implementing a Cholesky factorization routine on the graphics and see how it performed.

The first thing to say is that this took a lot longer than I had anticipated, the first issue being to think a suitable way to parallelize the computation, paying attention to synchronization issues. The next point is that the cards are currently only single precision and do not support subnormal numbers, so accuracy might be a concern.

Having said all that, I have been very impressed with the performance; my routine can factorize a 12288x12288 matrix on an 8800 GTS in about 11s! To compare, 4 3GHz Intel cores of the local supercomputer take about 17s to do the same calculation, so I am very happy with the performance. (If I understand the NVIDIA documentation correctly, the theoretical time required to do the required number of multiply-add operations is 2.7s on my card (N^3/6 for an NxN matrix), up to delays due to the operands being from shared memory -- see this thread.)

The code only currently works for square matrices whose side length is a multiple of 16. One strange thing I came across is that when the side length is a multiple of 20x16 performance drops by about a factor of two. I have suspicisions that this is due to the way the card's physical memory is partitioned, and thus that the number 20 could be different on different cards, perhaps being 24 on 8800 GTX/Ultra, 16 on 8800 GT, and 8 or even 4 on some of the low-end cards. I hope to test this out when I get another card.

If you would like to take a glance at the code, a recentish version is here. As this is a work in progress, it is not suitable for proper use, not having been thoroughly tested or documented. I plan to tidy it up and make it more properly available after double precision cards become available and I have it working on them, making it more generally useful in science. If you do have any feedback on the routine of course I'd be glad to hear it.

Incidentally, HEALPix (a pixellization scheme for the sphere) users will notice the choice of side length 12288 for the matrix mentioned above. This corresponds to "res5" or "nside=32", and in single precision such a matrix takes 576 MB and so just fits onto my 640 MB card (unfortunately a 512 MB 8800 GT wouldn't be able to hold it...). Experimenting, the largest matrix I could get on to the card was 596 MB in size; as might be expected not quite all of the card's memory is available. In double precision, memory requirements double, so res4/nside=16 might be the simple limit for future cards, or res5/nside=32 if they come with a lot of memory. (I am considering writing a version of the code that works on pieces of the matrix at a time, perhaps across multiple cards, which would remove this memory-imposed size restriction, but that won't be for a while.)

Back to main CUDA page.