Misconceptions about rand().

© Copyright 2004-2008 by Paul Hsieh

I was recently alerted to a very sad state of affairs regarding the comp.lang.c FAQ (it has since been improved from when I first looked at it, no doubt in part thanks to me; however, it still misses the point and says inaccurate things). For those who care to know such things, rand() is just not very good, and alternatives should be used. Explaining this to a lazy programmer who just wants a reasonable solution is probably a bit much, though. For someone looking for an answer that is merely "good enough", we need something quick and dirty, but reasonable in practice.

The Travesty

The point is that

    x = rand () % RANGE; /* return a random number in [0,RANGE) */    

is not good enough. There are three major problems with using the above:

  1. rand() returns a value in [0, RAND_MAX]. So if RANGE does not divide evenly into RAND_MAX+1, the distribution is blatantly wrong. Specifically, the probability of choosing x in [((RAND_MAX+1) % RANGE), RANGE) is less than that of choosing x in [0, ((RAND_MAX+1) % RANGE)). (A short demonstration follows this list.)
  2. The quality of rand() on most systems is not that good; in particular the low bits can follow a short cyclical pattern, or there can be dependencies between the bits.
  3. What happens if RAND_MAX+1 < RANGE?
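
To make the first problem concrete, the following sketch (my own illustrative code, not from any FAQ) exhaustively tallies x = v % RANGE over every value v that rand() could possibly return. With RAND_MAX equal to 32767 and RANGE equal to 10, it prints a count of 3277 for each of the outputs 0 through 7, but only 3276 for 8 and 9:

#include <stdio.h>
#include <stdlib.h>

#define RANGE 10 /* an arbitrary example range */

int main (void) {
    long count[RANGE] = { 0 };
    long v;
    int i;

    /* Tally every possible rand() output through the x = v % RANGE mapping.
       (Assumes RAND_MAX < LONG_MAX so that the loop terminates.) */
    for (v = 0; v <= (long) RAND_MAX; v++) count[v % RANGE]++;
    for (i = 0; i < RANGE; i++) printf ("%d: %ld\n", i, count[i]);
    return 0;
}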

For the typical programmer concerned with robustness, the primary concern will be fixing the first condition above. Programmers who don't actually care about this are essentially happy with "loaded dice" and can stick with the solution given above. The comp.lang.c FAQ attempts to solve the second but ignores the first condition entirely. They recommend the following:

    /* WARNING!! Don't use this, it's a bad partial fix of rand () */
    x = rand() / (RAND_MAX / RANGE + 1); /* Bad advice from C.L.C. FAQ */    

Or in other forms:

    /* Using floating point like this doesn't help in any way. */
    #define ranrange(a, b) (int)((a) + rand()/(RAND_MAX + 1.0) * ((b) - (a) + 1))

Ok, the point is that this solution does nothing about the bias that's created by trying (and failing) to evenly divide the original number of potential results amongst the desired number of potential results. The FAQ then tries to excuse itself by suggesting that this solution is only good when RANGE is much smaller than RAND_MAX, since the bias will then be smaller. The problem is that the bias will be easily detected (the point where you are 99% sure that it's following its incorrect distribution, rather than the one you want) after about 1000 * (RAND_MAX / RANGE) samples regardless (the bias still exists for a smaller number of samples; it's just harder to detect). The exhaustive tally below makes the surviving bias concrete.
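
The same exhaustive tally (again, my own sketch) applied to the FAQ's expression shows that the bias moves rather than disappears. With RAND_MAX equal to 32767 and RANGE equal to 10, buckets 0 through 8 each receive 3277 of the possible rand() outputs, while bucket 9 receives only 3275:

#include <stdio.h>
#include <stdlib.h>

#define RANGE 10 /* same example range as before */

int main (void) {
    long count[RANGE] = { 0 };
    long v;
    int i;

    /* Tally every possible rand() output through the FAQ's mapping.
       (Assumes RAND_MAX < LONG_MAX so that the loop terminates.) */
    for (v = 0; v <= (long) RAND_MAX; v++) count[v / (RAND_MAX / RANGE + 1)]++;
    for (i = 0; i < RANGE; i++) printf ("%d: %ld\n", i, count[i]);
    return 0;
}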

A More Reasonable Answer

So does that mean that fixing rand() is infeasible and that we really should be trying to push everyone to use one of the more sophisticated generators from Marsaglia or the Mersenne Twister? Not necessarily -- first of all, those generators are not always portable, they are not always as fast as rand(), and they include details that steepen their learning curves. If you can get over these hurdles, I wholeheartedly recommend you study them and pick the one you find most appropriate. But the much simpler goal of getting an even distribution is actually very achievable, and the technique also applies to those more sophisticated random number generators:

  /* This actually works correctly. */
#define RAND_INV_RANGE(r) ((int) ((RAND_MAX + 1.0) / (r))) /* 1.0 avoids int overflow when RAND_MAX == INT_MAX */

do {
    x = rand (); /* retry until the sample falls below the rejection threshold */
} while (x >= RANGE * RAND_INV_RANGE (RANGE));
x /= RAND_INV_RANGE (RANGE); /* scale down into [0, RANGE) */

The idea is very simple: just reject the ((RAND_MAX+1) % RANGE) values at the top of the range of output of rand(), forcing a retry if such a value is encountered. The output will be an exactly equally distributed choice from the range [0, RANGE) for any value of RANGE that is less than or equal to RAND_MAX.
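
For convenience, the idiom can be wrapped in a function. The name rand_range below is mine, not a standard one; it assumes the RAND_INV_RANGE macro from above:

/* Returns an exactly uniform value in [0, range) */
int rand_range (int range) {
    int x;
    do {
        x = rand ();
    } while (x >= range * RAND_INV_RANGE (range));
    return x / RAND_INV_RANGE (range);
}

An unbiased six-sided die roll is then just 1 + rand_range (6).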

This does nothing about the third problem above; for the moment we are going to ignore it.

Now, of course, one might be concerned that this may have poor running time since it may have to call rand() multiple times before producing a single result. So let's do some calculations to see how bad this really can be in practice.

Let p = ((RAND_MAX+1) % RANGE) / (RAND_MAX + 1.0). This is the probability that any given call to rand() will require a retry. Note that this value is maximized when RANGE*2 = RAND_MAX+3, which yields a value of p roughly equal to 1/2. For typical values it is tiny; e.g., with RAND_MAX = 32767 and RANGE = 10, p = 8/32768, or about 0.00024.

  1. The average number of times that rand() will be called is 1/(1-p) (with a worst case of about 2).
  2. The probability that at least n calls to rand() will be required beyond the first one is p^n.

So for most values of p, which will be much less than 1/32 say, performance should not be a concern at all. The question, then, is what should be done about values of p which are too large? Well, if it's because RAND_MAX is too small, and in fact RAND_MAX * RAND_MAX < INT_MAX (a very typical, and sad, situation), then the simplest solution is to just construct a better rand() function:
#define XRAND_MAX (RAND_MAX*(RAND_MAX + 2)) /* == (RAND_MAX+1)*(RAND_MAX+1) - 1 */
unsigned int xrand (void) {
    /* Concatenate two rand() samples into one value in [0, XRAND_MAX] */
    return rand () * (RAND_MAX + 1) + rand ();
}
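
Assuming RAND_MAX * RAND_MAX < INT_MAX as above, the rejection idiom carries over to xrand() directly. XRAND_INV_RANGE and xrand_range below are my own analogues of the earlier macro and loop, not part of the original code:

#define XRAND_INV_RANGE(r) ((unsigned int) ((XRAND_MAX + 1.0) / (r)))

/* Returns an exactly uniform value in [0, range) for any range up to XRAND_MAX */
unsigned int xrand_range (unsigned int range) {
    unsigned int x;
    do {
        x = xrand ();
    } while (x >= range * XRAND_INV_RANGE (range));
    return x / XRAND_INV_RANGE (range);
}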

Otherwise, multiprecision arithmetic would be required (at which point one might as well pick up one of the alternative random number generators). We can see that this alternative helps with the third problem above by expanding the range. However, RANGE may still be too large.

Another Alternative

The other obvious alternative to fixing the rand() function is to construct a random floating point number from it, so that fixing its range is more straightforward:
#include <stdlib.h>

#define RS_SCALE (1.0 / (1.0 + RAND_MAX))

double drand (void) {
    double d;
    do {
       d = (((rand () * RS_SCALE) + rand ()) * RS_SCALE + rand ()) * RS_SCALE;
    } while (d >= 1); /* Retry in the rare case round off pushes d up to 1 */
    return d;
}

#define irand(x) ((unsigned int) ((x) * drand ()))

The result returned by drand() is a double precision floating point number in the range [0,1). The irand(r) macro will return a random integer in the range [0,r). This has the advantage that if RAND_MAX = 32767 (which is very typical) and your platform has a double with a 53 bit mantissa (also very typical), then this actually produces a bit-faithful random number with 45 bits of precision (three 15-bit samples). 45 bits will be large enough for pretty much any practical RANGE; however, it is still not perfect (so technically problem 3 is still not fully addressed).
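
As a usage example, irand() is exactly what is needed for an unbiased Fisher-Yates shuffle. This is a sketch of my own built on the definitions above:

/* Shuffles a[0..n-1] uniformly, given that irand(i) is uniform on [0,i) */
void shuffle (int a[], size_t n) {
    size_t i, j;
    int t;

    for (i = n; i > 1; i--) {
        j = irand (i);                       /* pick a position in [0, i) */
        t = a[i-1]; a[i-1] = a[j]; a[j] = t; /* swap it into the final slot */
    }
}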

The theoretical disadvantages are two-fold: 1) overflows in the mantissa (which result from RAND_MAX being too large, or the platform's double mantissa being too small) will create roundings that introduce a slight bias on the order of 1 ULP (Unit in the Last Place); notice that the extremely unlikely case of an erroneous overflow is shielded by the do-while() loop. 2) The conversion back to integer can introduce a bias of about 1 ULP. A bias of 1 ULP is typically so small that it is not even realistically feasible to test for its existence from a statistical point of view.

The more tangible disadvantages are that 1) rand () is definitely called 3 times (which is worse than the worst case of the average expected running time of the earlier solution), and that 2) rand () is usually actually quite terrible at guaranteeing that successive calls to it behave with apparent independence.

Generalizing to a real range

One of the advantages of the drand() function given above is that it extends to floating point probability ranges trivially. So if you want an indicator function on a random sample with a bias of x, then it is simply drand() < x.
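
For example, an event that should fire with probability 0.3 is just:

    if (drand () < 0.3) {
        /* taken roughly 30% of the time */
    }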

To get the probability precisely, we can use rand() to give us a ranged slot and see if it falls entirely below or entirely above x. If the slot straddles x, then we refine the choice within that slot (to a sub-slot) and repeat:
#include <stdlib.h>

#define RS_SCALE (1.0 / (1.0 + RAND_MAX))

int randbiased (double x) {
    for (;;) {
        double p = rand () * RS_SCALE;
        if (p >= x) return 0;
        if (p+RS_SCALE <= x) return 1;
        /* p < x < p+RS_SCALE */
        x = (x - p) * (1.0 + RAND_MAX);
    }
}
Although this function can also technically loop an unbounded number of times, this time around the probabilities of successive loops drop off extremely quickly; the expected number of iterations is 1 + 1.0/RAND_MAX. I.e., it's not worth pursuing a fix for this.
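
As a usage example (the weight 0.25 is an arbitrary choice of mine), a coin that lands heads with probability exactly 0.25 is just:

    int heads = randbiased (0.25); /* 1 with probability 0.25, else 0 */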

Sampling from an arbitrary discrete distribution

Continuing with the interval idea from above for a boolean distribution, we can now proceed to implement a generalized discrete distribution (i.e., one with a finite number of outcomes, but with arbitrary probabilities). We start by creating a sorted sequence which represents the cumulative distribution. I.e., we create the mapping i => [ slot[i-1], slot[i] ), where slot[-1] = 0, slot[n-1] = 1 and slot[i-1] <= slot[i]. So the probability of choosing the ith entry is slot[i] - slot[i-1]. This requires an array of only n-1 (not n) entries to be constructed, in which the -1 and n-1 indexes are omitted (their values being implicit). The function will return a value of 0 to n-1 inclusive:
#include <stdlib.h>

#define RS_SCALE (1.0 / (1.0 + RAND_MAX))

/* A non-overflowing average function */
#define average2scomplement(x,y) (((x) & (y)) + (((x) ^ (y))/2))

size_t randslot (const double slots[/* n-1 */], size_t n) {
    double xhi;

    /* Select a random range [x,x+RS_SCALE) */
    double x = rand () * RS_SCALE;

    /* Perform binary search to find the intersecting slot */
    size_t hi = n-2, lo = 0, mi;
    while (hi > lo) {
        mi = average2scomplement (lo, hi);
        if (x >= slots[mi]) lo = mi+1; else hi = mi;
    }

    /* Taking slots[-1]=0.0, this is now true: slots[lo-1] <= x < slots[lo] */

    /* If slots[lo-1] <= x < x+RS_SCALE <= slots[lo] then
       any point in [x,x+RS_SCALE) is in [slots[lo-1],slots[lo]) */

    if ((xhi = x+RS_SCALE) <= slots[lo]) return lo;

    /* Otherwise x < slots[lo] < x+RS_SCALE */

    for (;;) {
        /* x < slots[lo] < xhi */
        if (randbiased ((slots[lo] - x) / (xhi - x))) return lo;
        x = slots[lo];
        lo++;
        if (lo >= n-1) return n-1;

        if (xhi <= slots[lo]) {
            /* slots[lo-1] = x <= xhi <= slots[lo] */
            return lo;
        }
    }
}
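
As a usage sketch (the probabilities are arbitrary example values of mine): for outcomes 0, 1 and 2 with probabilities 0.5, 0.3 and 0.2, only the two interior boundaries of the cumulative distribution are stored:

    double slots[2] = { 0.5, 0.8 };       /* slot[-1] = 0 and slot[2] = 1 are implicit */
    size_t outcome = randslot (slots, 3); /* 0 with p 0.5, 1 with p 0.3, 2 with p 0.2 */
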
Now by this point, you might think that this sort of thing is excessive. However, we can immediately see the justification for this function by considering the distribution:

[0, 0.3 / RAND_MAX), [0.3 / RAND_MAX, 1.0 / RAND_MAX), [1.0 / RAND_MAX, 1.0)

It would be difficult to sample from this distribution accurately without a solution similar to the above.

Ok, and now we see we can finally address the third problem completely. If we set slot[i] above to (i+1)/(double)n, then we will finally have a solution for what we were originally looking for. But we can simplify this massively:
#include <stdlib.h>
#include <math.h>

#define RS_SCALE (1.0 / (1.0 + RAND_MAX))

size_t randrange (size_t n) {
    double xhi;
    double resolution = n * RS_SCALE;
    double x = resolution * rand (); /* x in [0,n) */
    size_t lo = (size_t) floor (x);

    xhi = x + resolution;

    for (;;) {
        lo++;
        if (lo >= xhi || randbiased ((lo - x) / (xhi - x))) return lo-1;
        x = lo;
    }
}
Finally!
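
Usage is now as simple as one could ask for; for example (my own illustrative snippet):

    size_t die = 1 + randrange (6); /* uniform in 1..6 */
    size_t card = randrange (52);   /* uniform in 0..51 */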

Seeding the random number generator

The typical method for seeding the random number generator is to do the following:
#include <stdlib.h>
#include <time.h>

srand (time (NULL));
This is fine so long as one does not perform this seeding operation more than once per second. So the question is, can we somehow increase the rate at which we can reseed and still expect variance in the actual seeds? The obvious idea is to try to get multiple sources of entropy and iterate through them slowly:
#include <stdlib.h>
#include <limits.h>
#include <time.h>

static struct {
    int which;
    time_t t;
    clock_t c;
    int counter;
} entropy = { 0, (time_t) 0, (clock_t) 0, 0 };

static unsigned char * p = (unsigned char *) (&entropy + 1); /* one past the end; forces a fetch on the first call */
static int accSeed = 0;

int reseed (void) {
    if (p == ((unsigned char *) (&entropy + 1))) { /* ran off the end; fetch the next entropy source */
        switch (entropy.which) {
            case 0:
                entropy.t += time (NULL);
                accSeed ^= entropy.t;
                break;
            case 1:
                entropy.c += clock();
                break;
            case 2:
                entropy.counter++;
                break;
        }
        entropy.which = (entropy.which + 1) % 3;
        p = (unsigned char *) &entropy.t; /* restart the byte walk at the t field */
    }
    accSeed = ((accSeed * (UCHAR_MAX+2U)) | 1) + (int) *p; /* fold the next entropy byte into the seed */
    p++;
    srand (accSeed);
    return accSeed;
}
So we are using time(NULL), clock() and an incrementing counter as the sources of entropy. Obviously there's a high degree of dependency amongst all these sources. But the point of an entropic sequence is not for it to be a pure source of random numbers by itself (that's what the PRNG is for). The only requirement we have is some degree of non-deterministic variability, which we should expect. The formula for accSeed will tend to create at least a sort of pseudo-random sequence, one which is perturbed by the non-deterministic entropic sources.

On a 32 bit system each entropy source is fetched, on average, once every 36 times that reseed() is called. If more entropy sources are added, this will increase the time between repeated fetches from each source. Examples of entropy sources that might be appropriate are 1) an on-disk counter, 2) the current process ID, and 3) processor clock counter values (the TSC on the x86, for example) sampled when certain events (such as network responses, mouse movements, keypresses, etc.) occur.


Going beyond rand()

The discussion so far has focused on trying to make sure that the aggregate count of each possible output from the random number selection is a single constant. In mathematical terms, we are trying to find a random number generator that satisfies E(|{X=x}_n|) = n*p (the C.L.C. FAQ solution fails simply by virtue of choosing a large enough n whenever 1/p is not a divisor of RAND_MAX+1), as well as hoping for a couple of other properties such as E(|{X_{k+a} - X_k = x (mod RAND_MAX+1)}_n|) = n*p (since just an incrementing counter satisfies the first test alone). Generators such as the various Marsaglia generators or the Mersenne Twister are primarily concerned with testing the following:

    P(X_k = (x_0, x_1, ... , x_{k-1})) = p^k.

In non-math speak, it means that every plausible sub-sequence of sequentially fetched random numbers is equally probable. Or that successively generated random numbers appear independent from the ones generated immediately before them. One might, very roughly, quantify how good these generators are by the highest value of k they can satisfy (how many outputs you can produce before you can, with sufficient computing resources, determine that not every possible sequence of that length is equally represented). For example, the Mersenne Twister (which is portable to systems with 32 bit integers) has a k-value of roughly 600 (corresponding to a cycle of length 2^19937-1) and Marsaglia's CMWC generator (which, unfortunately, requires assembly language support) has a k-value of roughly 1000 (corresponding to a cycle length of about 10^10007). Real world applications (even lotteries) would hardly ever need a k-value of more than about 16 or so and more typically can get away with values less than 5.

Cryptographers have entirely different needs for random numbers. For example, choosing very large prime numbers is necessary for public key cryptography. For cryptographers, the state of the random number generator must not be deducible from the examination of any number of outputs with any amount of computing power. The Fortuna random number generator by Schneier et al. (which is an improvement over the Yarrow random number generator) hides the output using non-reversible hash functions (like SHA-256) and uses multiple sources of entropy. The idea is that even if the complete state of the generator becomes known (itself highly unlikely), it will very quickly become unknown again. So the sequence of numbers output is not determinable with anything short of complete state knowledge and control over the entropy. As with anything having to do with cryptography, it pays to listen to what the experts say before attempting to roll a home grown solution.

One final note with regard to multithreaded systems. We must remember Knuth's warning from The Art of Computer Programming that randomly constructed algorithms don't necessarily lead to a good source of random numbers. The same comment applies to race conditions affecting random number servers. Although it might seem that allowing race conditions would be harmless for PRNGs, this is not the case. First of all, many random number generators are susceptible to "0-sticking"; that is, if their internal seeds all end up going to 0 simultaneously, they will proceed to issue constant-only output. Secondly, a race condition could cause the internal state to skip an update, meaning that the same number could be output more than once in a row solely because of the race condition. Third, the mechanisms of a PRNG carry it from state to state along a maximal path -- unpredictably perturbing the state cannot lengthen that path, only shorten it, thus decreasing the length of the PRNG's cycle. So in multithreaded systems, PRNGs should be properly mutexed, just like malloc() is.

