AMD Stream Computing for Cosmology


Update:

Oct 2011: In trying to achieve close-to-optimal performance on a new HD6950 card, I've ended up writing my own assembler for 69XX-series cards. It is currently unfinished, buggy, and in need of refactoring, and of course should only be used at one's own risk, but I thought others might be interested in taking a look!

I've developed it on a Windows machine, but it should work on Linux too with minor modifications to the compile lines. Windows users will need MinGW installed.

The assembler supports labels and vectorized macros. Writing assembly for such a card is a rare chance to program a modern VLIW machine directly!

To have a look, extract the appropriate .tgz file below and follow the instructions.

Any feedback would be appreciated!

To download: Windows: asm69.tgz, Linux: asm69linux.tgz.


On these pages I'll try to document my experiences with using AMD/ATI graphics cards for general-purpose computing, with a particular bias towards applications that are of relevance for cosmological research.

To start with, I'll describe the hardware I've been using so far. I've assembled a custom system, containing:

On the software front, I'm using:

To program the card, you can use the high-level, C-like Brook+ language (.br files), the "Intermediate Language" pseudo-assembly (IL), or the low-level, device-dependent assembly language (GPUISA) itself. Since I'd like to see roughly what the card is actually doing, I decided to start by trying to write a non-trivial program in IL. (Brook+ seems conceptually very nice, and as features like double precision and integer support are added it could become very useful. However, it generates very complicated-looking GPUISA. This is perhaps not too surprising, as I believe a Brook+ program is first compiled into HLSL (another virtual language), then into IL, and only then finally into GPUISA!) Having already written a Cholesky factorization routine in CUDA for Nvidia hardware that performs pretty well, this seemed like a natural thing to try. I've also had a go at various versions of matrix multiply.

There are two types of IL programs one can write: the new "compute shaders" (4800 series only) and the regular "pixel shaders". The former expose features such as data sharing between subsets of threads, whereas the latter allow for streamed outputs. (Note that both the 3800 and 4800 series allow for the "global buffer", a memory buffer that permits read-write accesses to arbitrary locations within itself.) Threads are also grouped differently in the two cases: in a 1D manner for compute shaders and in a 2D manner for pixel shaders.

I currently have a number of codes in more or less working states in here, including:

For an example of what GPUISA looks like, see "compute.gpu", which is the GPUISA for the "compute" program.

Performance often leaves a lot to be desired; unfortunately, at the moment there is not enough concrete information about how the hardware works (in particular the memory system and caching) to really guide optimization efforts. In particular, reading from a global buffer is very slow. However, there is a trick by which a given piece of memory can be mapped both as a global buffer and as a regular buffer, improving performance (in general one has to watch out for cache-coherency issues when doing this).

I plan to develop a double-precision version of the code if the speed issues can be addressed. This will be slightly complicated by the fact that there does not appear to be an intrinsic "dsqrt" instruction in IL. Incidentally, "ddiv" maps to about 12 GPUISA instruction groups, so it should take about 12 cycles per VLIW processor. Seeing as composite division is implemented, I am hopeful that a composite dsqrt will be implemented shortly too. In the meantime, I've written my own (non-optimized and not overly tested...) "dsqrt" IL code.

Note that at present there are a couple of tricky issues in fully utilising the memory on AMD cards. On my system at least, it seems that the maximum global buffer size is 256 MB less than the amount of memory on the card, so a 1 GB card is highly recommended (especially for compute shaders, which can only output to a global buffer...). In addition, one can only easily access, from the CPU side, buffers that are less than 255 MB in size (though this can be worked around with some effort).

A useful source of information about AMD Stream is the Stream Forum at:

