AMD Stream Computing for Cosmology


Update:

Oct 2011: In trying to achieve close-to-optimal performance on a new HD6950 card, I've ended up writing my own assembler for 69XX-series cards. It is currently unfinished, buggy, and in need of refactoring, and of course should only be used at one's own risk, but I thought others might be interested in taking a look!

I've developed it on a Windows machine, but it should work on Linux too with minor modifications to the compile lines. Windows users will need MinGW installed.

The assembler supports labels and vectorized macros. Writing assembly for such a card is a rare chance to program a modern VLIW machine directly!

To have a look, extract the appropriate .tgz file below and follow the instructions.

Any feedback would be appreciated!

To download: Windows: asm69.tgz, Linux: asm69linux.tgz.


On these pages I'll try to document my experiences with using AMD/ATI graphics cards for general-purpose computing, with a particular bias towards applications that are of relevance for cosmological research.

To start with, I'll describe the hardware I've been using so far. I've assembled a custom system, containing:

On the software front, I'm using:

To program the card, you can use the high-level, C-like Brook+ language (.br files), the "Intermediate Language" pseudo-assembly (IL), or the low-level, device-dependent assembly language (GPUISA) itself. Since I'd like to see roughly what the card is actually doing, I decided to start by trying to write a non-trivial program in IL. (Brook+ seems conceptually very nice, and as features like double precision and integer support are added it could become very useful. However, it generates very complicated-looking GPUISA. This is perhaps not too surprising, as I believe a Brook+ program is first compiled into HLSL (another virtual language), then into IL, and only then finally into GPUISA!) Having already written a Cholesky factorization routine in CUDA for Nvidia hardware that performs pretty well, this seemed like a natural thing to try. I've also had a go at various versions of matrix multiply.

There are two types of IL programs one can write: the new "compute shaders" (4800 series only) and the regular "pixel shaders". The former expose features such as data sharing between subsets of threads, whereas the latter allow for streamed outputs. (Note that both the 3800 and 4800 series allow for the "global buffer", a memory buffer that permits read-write accesses to arbitrary locations within itself.) Threads are also grouped differently in the two cases: in a 1D manner for compute shaders and in a 2D manner for pixel shaders.

I currently have a number of codes in more or less working states in here, including:

For an example of what GPUISA looks like, see "compute.gpu", which is the GPUISA for the "compute" program.

Performance often leaves a lot to be desired; unfortunately, at the moment there is not enough concrete information about how the hardware works (in particular the memory system and caching) to really guide optimization efforts. In particular, reading from a global buffer is very slow. However, there is a trick by which a given piece of memory can be mapped both as a global buffer and as a regular buffer, improving performance (in general one has to watch out for cache-coherency issues when doing this).

I plan to develop a double-precision version of the code if the speed issues can be addressed. This will be slightly complicated by the fact that there does not appear to be an intrinsic "dsqrt" instruction in IL. Incidentally, "ddiv" maps to about 12 GPUISA instruction groups, so it should take about 12 cycles per VLIW processor. Seeing as composite division is implemented, I am hopeful that a composite dsqrt will be implemented shortly too. In the meantime, I've written my own (non-optimized and not overly tested...) "dsqrt" IL code.

Note that at present there are a couple of tricky issues in fully utilising the memory on AMD cards. On my system at least, it seems that the maximum global buffer size is 256 MB less than the amount of memory on the card, so a 1 GB card is highly recommended (especially for compute shaders, which can only output to a global buffer...). In addition, one can only easily access, from the CPU side, buffers that are less than 255 MB in size (though this can be worked around with some effort).

A useful source of information about AMD Stream is the Stream Forum at:

