For my simulations of eternal inflation I wanted to generate a chain of random numbers for each thread in a quick manner. It would not do to use the same random number generator with the same seed for each thread; then each chain would be the same: not very good for averaging over!

One option would be to use the same random number generator but with a different seed for each thread. I was worried though that the chains would be very correlated if the choice of seeds was unfortunate.

Another option would be to use a random number generator specifically developed with parallel applications in mind; see for example the Mersenne Twister example in the CUDA software development kit. I was concerned about complexity and speed for my application though.

What I decided on as a compromise was to use some simple fast algorithm for each thread, but with each thread having a different multiplier as well as starting with a random seed.

After some investigation I came across Marsaglia's Multiply-With-Carry (MWC) generator (see here, and here for the theory), which seems well-regarded and crucially has a simple prescription for finding "good" multipliers. The generator looks like:

y_(n+1) = (a*y_n + c) mod m

where a is the multiplier and m is a modulus (typically 2^p for
some integer p on digital computers). The c is the "carry", that is
the number of m's dropped in the previous step of forming y_n. The
initial carry is set by hand and should be chosen to be less than a in
order to make the chain strictly periodic and for all the theory to
apply. A "good" multiplier is one for
which a*m-1 is *safeprime* (a number q is safeprime if both q
and (q-1)/2 are prime).

Typical examples pick one suitable multiplier to use, perhaps including a commented list of several alternatives. However, for my application I needed potentially many many of them, a different one for each thread. Also, I considered using modulus 2^24 rather than the usual 2^32, since the G80 hardware is very quick at multiplying 24 bit integers but not so good at 32 bit ones. Somewhat surprisingly I was unable to find suitable lists elsewhere, So I decided to generate my own list of suitable multipliers for a selection of moduli. The files are available here. The file name indicates the modulus. The first column is the multiplier a, the second is the safeprime a*m-1, and the third is the prime (a*m-1)/2. Note that for even p, the period of each generator is the last number, while for odd p, the period is the middle number.

It turns out that the algorithm basically generates the base 2^p digits of the reciprocal of the prime a*m-1. I have no reason to expect these digits to be correlated for different a's, hence my trust in using this scheme for my application. Of course for other applications this would have to be investigated more carefully, and at some stage I'd like to run some tests to see whether there are any important correlations between the chains.

My application actually needed gaussianly-distributed random numbers, not uniformly distributed ones, so I used a standard Box-Muller transform to get two of the former from two of the latter. I used the simple version of the transform that doesn't reject any numbers but requires a logs, square roots, sines and cosines as opposed to logs, square roots and divides. This is because the G80 GPUs perform sines and cosines only twice as slowly as divides and I wanted to avoid any conditional code that could cause threads in a warp to diverge and slow everything down.