Parallel Random Numbers in MATLAB #1

July 13th, 2010 | Categories: Condor, matlab, parallel programming, programming, random numbers | Tags:

A bit of background to this post

I work in the IT department of the University of Manchester and we are currently developing a Condor Pool which is basically a method of linking together hundreds of desktop machines to produce a high-throughput computing resource.  A MATLAB user recently submitted some jobs to our pool and complained that all of them gave identical results which is stupid because his code used MATLAB’s rand command to mix things up a bit.

I was asked if I knew why this should happen to which I replied ‘yes.’  I was then asked to advise the user how to fix the problem and I did so.  The next request was for me to write some recommendations and tutorials on how users should use random numbers in MATLAB (and Mathematica and possibly Python while I was at it) along with our Condor Pool and I went uncharacteristically quiet for a while.

It turned out that I had a lot to learn about random numbers.  This is the first of a series of (probably 2) posts that will start off by telling you what I knew and move on to what I have learned.  It’s as much a vehicle for getting the concepts straight in my head as it is a tutorial.

Ask MATLAB for 10 Random Numbers

Before we go on, I’d like you to try something for me. You have to start on a system that doesn’t have MATLAB running at all so if MATLAB is running then close it before proceeding. Now, open up MATLAB and before you do anything else issue the following command

rand(10,1)

As many of you will know, the rand command produces random numbers from the uniform distribution between 0 and 1 and the command above is asking for 10 such numbers. You may reasonably expect that the 10 random numbers that you get will be different from the 10 random numbers that I get; after all, they are random right? Well, I got the following numbers when running the above command on MATLAB 2009b running on Linux.

ans =
0.8147
0.9058
0.1270
0.9134
0.6324
0.0975
0.2785
0.5469
0.9575
0.9649

Look familiar?

Now I’ve done this experiment with a number of people over the last few weeks and the responses can be roughly split into two different camps as follows:

1. Oh yeah, I know all about that – nothing to worry about. It’s pretty obvious why it happens isn’t it?
2. It’s a bug. How can the numbers be random if MATLAB always returns the same set?

What does random mean anyway?

If you are new to the computer generation of random numbers then there is something that you need to understand and that is that, strictly speaking, these numbers (like all software generated ‘random’ numbers) are not ‘truly’ random.  Instead they are pseudorandom – my personal working definition of which is “A sequence of numbers generated by some deterministic algorithm in such a way that they have the same statistical properties of ‘true’ random numbers”.  In other words, they are not random they just appear to be but the appearance is good enough most of the time.

Pseudorandom numbers are generated from deterministic algorithms with names like Mersenne Twister, L’Ecuyer’s mrg32k3a [1]  and Blum Blum Schub whereas ‘true’ random numbers come from physical processes such as radioactive decay or atmospheric noise (the website www.random.org uses atmospheric noise for example).

For many applications, the distinction between ‘truly random’ and ‘pseudorandom’ doesn’t really matter since pseudorandom numbers are ‘random enough’ for most purposes.  What does ‘random enough’ mean you might ask?  Well as far as I am concerned it means that the random number generator in question has passed a set of well defined tests for randomness – something like Marsaglia’s Diehard tests or, better still, L’Ecuyer and Simard’s TestU01 suite will do nicely for example.

The generation of random numbers is a complicated topic and I don’t know enough about it to do it real justice but I know a man who does.  So, if you want to know more about the theory behind random numbers then I suggest that you read Pierre L’Ecuyer’s paper simply called ‘Random Numbers’ (pdf file).

Back to MATLAB…

Always the same seed

So, which of my two groups are correct?  Is there a bug in MATLAB’s random number generator or not?

There is nothing wrong with MATLAB’s random number generator at all. The reason why the command rand(10,1) will always return the same 10 numbers if executed on startup is because MATLAB always uses the same seed for its pseudorandom number generator (which at the time of writing is a Mersenne Twister) unless you tell it to do otherwise.

Without going into details, a seed is (usually) an integer that determines the internal state of a random number generator.  So, if you initialize a random number generator with the same seed then you’ll always get the same sequence of numbers and that’s what we are seeing in the example above.

Sometimes, this behaviour isn’t what we want.  For example, say I am doing a Monte Carlo simulation and I want to run it several times to verify my results.  I’m going to want a different sequence of random numbers each time or the whole exercise is going to be pointless.

One way to do this is to initialize the random number generator with a different seed at startup and a common way of achieving this is via the system clock.  The following comes straight out of the current MATLAB documentation for example

RandStream.setDefaultStream(RandStream('mt19937ar','seed',sum(100*clock)));

If you are using MATLAB 2011a or above then you can use the following, much simpler syntax to do the same thing

rng shuffle

Do this once per MATLAB session and you should be good to go (there is usually no point in doing it more than once per session by the way….your numbers won’t be any ‘more random’ if you so.  In fact, there is a chance that they will become less so!).

Condor and ‘random’ random seeds

Sometimes the system clock approach isn’t good enough either.  For example, at my workplace, Manchester University, we have a Condor Pool of hundreds of desktop machines which is perfect for people doing Monte Carlo simulations.  Say a single simulation takes 5 hours and it needs to be run 100 times in order to get good results.  On one machine that’s going to take about 3 weeks but on our Condor Pool it can take just 5 hours since all 100 simulations run at the same time but on different machines.

If you don’t think about random seeding at all then you end up with 100 identical sets of results using MATLAB on Condor for the reasons I’ve explained above.  Of course you know all about this so you switch to using the clock seeding method, try again and….get 100 identical sets of results[2].

The reason for this is that the time on all 100 machines is synchronized using internet time servers.  So, when you start up 100 simultaneous jobs they’ll all have the same timestamp and, therefore, have the same random seed.

It seems that what we need to do is to guarantee (as far as possible) that every single one of our condor jobs gets a unique seed in order to provide a unique random number stream and one way to do this would be to incorporate the condor process ID into the seed generation in some way and there are many ways one could do this.  Here, however, I’m going to take a different route.

On Linux machines it is possible to obtain small numbers of random numbers using the special files /dev/random and /dev/urandom which are interfaces to the kernel’s random number generator.  According to the documentation ‘The random number generator gathers environmental noise from device drivers and other sources into an entropy pool. The generator also keeps an estimate of the number of bit of the noise in the entropy pool.  From this entropy pool random numbers are created.’

This kernel generator isn’t suitable for simulation purposes but it will do just fine for generating an initial seed for MATLAB’s pseudorandom number generator.  Here’re the MATLAB commands

[status seed] = system('od /dev/urandom --read-bytes=4 -tu | awk ''{print $2}''');
seed=str2double(seed);
RandStream.setDefaultStream(RandStream('mt19937ar','seed',seed));

In MATLAB 2011a or above you can change this to

[status seed] = system('od /dev/urandom --read-bytes=4 -tu | awk ''{print $2}''');
seed=str2double(seed);
rng(seed);

Put this at the beginning of the MATLAB script that defines your condor job and you should be good to go.  Don’t do it more than once per MATLAB session though – you won’t gain anything!

The end or the end of the beginning?

If you asked me the question ‘How do I generate a random seed for a pseudorandom number generator?’ then I think that the above solution answers it quite well.  If, however, you asked me ‘What is the best way to generate multiple independent random number streams that would be good for thousands of monte-carlo simulations?‘ then we need to rethink somewhat for the following reasons.

Seed collisions: The Mersenne twister algorithm currently used as the default random number generator for MATLAB uses a 32bit integer seed which means that it can take on 2^32 or 4,294,967,296 different values – which seems a lot!  However, by considering a generalisation of the birthday problem it can be seen that if you select such a seed at random then you have a 50% chance choosing two identical seeds after only 65,536 runs.  In other words, if you perform 65,536 simulations then there is a 50% chance that two such simulations will produce identical results.

Bad seeds: I have read about (but never experienced) the possibility of ‘bad seeds’; that is some seeds that produce some very strange, non-random results – for the first few thousand iterates at least.  This has led to some people advising that you should ‘warm-up’ your random number generator by asking for, and throwing away, a few thousand random numbers before you start using them. Does anyone know of any such bad seeds?

Overlapping or correlated sequences: All pseudorandom number generators are periodic (at least, all the ones that I know about are) – which means that after N iterations the sequence repeats itself.  If your generator is good then N is usually large enough that you don’t need to worry about this.  The Mersenne Twister used in MATLAB, for instance, has a huge period of (2^19937 – 1)/2 (half of the standard 32bit implementation).

The point I want to make is that you don’t get several different streams of random numbers, you get just one, albeit a very big one.  Now, when you choose a seed you are essentially choosing a random point in this stream and there is no guarantee how far apart these two points are.  They could be separated by a distance of trillions of points or they could be right next to each other – we simply do not know – and this leads to the possibility of overlapping sequences.

Now, one could argue that the possibility of overlap is very small in a generator such as the Mersenne Twister and I do not know of any situation where it has occurred in practice but that doesn’t mean that we shouldn’t worry about it.  If your work is based on the assumption that all of your simulations have used independent, uncorrelated random number streams then there is a possibility that your assumptions could be wrong which means that your conclusions could be wrong.  Unlikely maybe, but still no way to do science.

Next Time

Next time I’ll be looking at methods for generating guaranteed independent random number streams using MATLAB’s in-built functions as well as methods taken from the NAG Toolbox for MATLAB.  I’ll also be including explicit examples that use this stuff in Condor.

Ask the audience

I assume that some of you will be in the business of performing Monte-Carlo simulations and so you’ll probably know much more about all of this than me.  I have some questions

  • Has anyone come across any ‘bad seeds’ when dealing with MATLAB’s Mersenne Twister implementation?
  • Has anyone come across overlapping sequences when using MATLAB’s Mersenne Twister implementation?
  • How do YOU set up your random number generator(s).

I’m going to change my comment policy for this particular post in that I am not going to allow (most) anonymous comments.  This means that you will have to give me your email address (which, of course, I will not publish) which I will use once to verify that it really was you that sent the comment.

Notes and References

[1] L’Ecuyer P (1999) Good parameter sets for combined multiple recursive random number generators Operations Research 47:1 159–164

[2] More usually you’ll get several different groups of results.  For example you might get 3 sets of results, A B C, and get 30 sets of A, 50 sets of B and 20 sets of C.  This is due to the fact that all 100 jobs won’t hit the pool at precisely the same instant.

[3] Much of this stuff has already been discussed by The Mathworks and there is an excellent set of articles over at Loren Shure’s blog – Loren onThe Art of MATLAB.

  1. July 13th, 2010 at 09:14
    Reply | Quote | #1

    Good introduction.

    Speaking of “doing science”, one ought to print out the seed used to initialise each simulation. So that each run can, at least in principle, be recreated exactly.

  2. July 13th, 2010 at 10:52
    Reply | Quote | #2

    Hi David

    Completely agree! Annoyed that I didn’t mention it so thanks for doing so in the comments.

    Cheers,
    Mike

  3. LB
    July 15th, 2010 at 12:18
    Reply | Quote | #3

    For Mathematica, at least, this might be less of an issue. According to http://reference.wolfram.com/mathematica/tutorial/PseudorandomNumbers.html “When Mathematica starts up, it takes the time of day (measured in small fractions of a second) as the seed for the pseudorandom number generator. Two different Mathematica sessions will therefore almost always give different sequences of pseudorandom numbers.”
    Unless the jobs hit each node of the pool at EXACTLY the same time, the seed will be a different value. This strikes me as being a more sensible choice than using the same seed for every session.
    And according to http://reference.wolfram.com/mathematica/tutorial/RandomNumberGeneration.html the RandomReal[] function (ie uniform distribution) uses an extended cellular automaton algorithm that is supposedly superior to the Mersenne Twister, though this is available as an option. Prior to version 6.0, Mathematica used a Marsaglia-Zaman subtract-with-borrow generator.

  4. July 15th, 2010 at 15:37
    Reply | Quote | #4

    Hi LB

    We have got at least one Mathematica user on our Condor Pool and his simulations are stochastic in nature. Unless you take steps to prevent it, you DO get the same seed used in some cases because the jobs do hit the nodes at exactly the same time.

    As for the quality of the extended cellular automaton algorithm – I couldn’t comment. Do you have any references that demonstrate its superiority?

    There is also the issue of repeatability. It is often desirable to store the seed used as part of any random simulation to allow people your work to be repeated. As you point out, Mathematica uses a different seed in every session. So, how do you ask Mathematica what seed it used to initialise the generator?

    in MATLAB it would be

    stream = RandStream.getDefaultStream
    

    which gives

    stream = 
    mt19937ar random stream (current default)
                 Seed: 0
             RandnAlg: Ziggurat
    

    I’ve just had a quick look around the Mathematica documentation and can’t find how to do this.

    Cheers,
    Mike

  5. LB
    July 15th, 2010 at 23:46
    Reply | Quote | #5

    Hi again,
    Mathematica implements repeatability by telling it what seed to use, not asking what seed it did use. Just use the SeedRandom[x] function, where x is an integer or string, perhaps inside a BlockRandom[] function.

    As for whether the ExtendedCA algorithm is superior the Mersenne Twister, I’m not a number theorist and have no independent information. But it’s straightforward to use the Mersenne twister or any one of a number of other algorithms instead, using the Method option. There’s some commentary about the ExtendedCA default algorithm “passing stringent tests of randomness”, but because you can set other options for it, they don’t tell you what the period is. I’m guessing that if the Mersenne was really better than the extended CA, they wouldn’t set the latter as the default, even if Stephen Wolfram loves all things CA.

  6. July 16th, 2010 at 06:59
    Reply | Quote | #6

    Hi again LB

    In order to use SeedRandom[x] to repeat a calculation, you need to know which value of x was used in the calculation you want to repeat. So, if you rely on Mathematica’s automatic seeding method then you need some way to find out what seed it used.

    I’d be interested to know what random number experts think of Wolframs ExtendedCA algorithm. My gut feeling would be to want to avoid it unless its algorithm is fully written up in a peer reviewed journal along with results of all standard random number tests.

    If I were going to publish a set of results based on the output of a random number generator then I would want to use a generator with properties that were well understood. So, without any extra evidence, I’d probably switch to something more standard such as the Mersenne twister.

    Might blog about this….thanks for the inspiration.

    Cheers,
    Mike

  7. July 20th, 2010 at 12:10
    Reply | Quote | #7

    Excellent article, Mike, which nicely explains some tricky concepts about randomness. Perhaps I could just add another one to the mix, which is the distinction between pseudo and quasi random sequences. Whilst members of the former are – as you correctly point out – designed to have statistical properties (correlations, etc) which are as close as possible to numbers from a ‘true’ random sequence, a quasi random sequence is designed to be more evenly distributed in space. Thus, while members of a pseudo sequence are uncorrelated with each other (so there is a possibility that two of them could have the same, or a very close, value), members of a quasi sequence (also known as a low-discrepancy sequence) are designed to be maximally avoiding of each other.

    As a result of this, when used in a Monte Carlo simulation, quasi random numbers produce estimates that – for a given sequence length – are more accurate than those obtained from a Monte Carlo simulation using pseudo random numbers, because they give a more efficient sampling of space (strictly, the former type of simulation is usually referred to as Quasi-Monte Carlo in order to reflect the distinction between the two types of random numbers used). So – if the aim of the Monte Carlo simulation is the calculation of the estimate of some parameter – using a quasi random generator is to be preferred to a pseudo random generator (calculating the error in the estimate is more complicated, however).

    A little more detail – and a picture which highlights the psuedo/quasi distinction – can be found in this note: http://www.nag.co.uk/IndustryArticles/usingtoolboxmatlabpart3.asp#g05randomnumber

  8. July 20th, 2010 at 13:36
    Reply | Quote | #8

    Given that Matlab’s u = rand gives a dp fpn uniformly distributed over the open interval (0,1):

    Are the floating point numbers [2^(-400), 2^(-100), 2^(-1)]

    = 3.872591914849318e-121, 7.888609052210118e-031, 5.000000000000000e-001

    equally likely to be output by rand?

    Derek O’Connor

  9. July 20th, 2010 at 16:02
    Reply | Quote | #9

    @Jeremy you’ve stolen my thunder a bit. There will be more on Quasi random numbers in a future article but thanks for the early mention :)

  10. July 21st, 2010 at 02:36

    Hi Derek

    I believe that the smallest number returned by MATLABs implementation of the Mersenne twister is 2^-53 so the answer to your question would be no I guess.

  11. July 21st, 2010 at 15:35

    Dear Mike,

    The reason I asked the question is because many people I have asked (Engineers, scientists, mathematicians) have answered “yes”, and are surprised that their answer is wrong.

    Here is what Matlab’s >help rand gives:

    > 1. This method (‘twister’) generates double precision values in the closed interval [2^(-53), 1-2^(-53)], with a period of (2^19937-1)/2.

    This information often causes further puzzlement: if there are at most 2^64 double precision values in a 64-bit word, how can the generator have a period of (2^19937-1)/2?

    I raise these questions to point out the difficulties and dangers of using random numbers, pseudo or otherwise.

    Regards,

    Derek O’Connor

  12. July 21st, 2010 at 19:34

    Hi Derek

    Thanks for that…such questions are very useful I think.

    I’ve just been browsing your web site and see that you pose many similar questions on floating point numbers in general. Loved the slides from ‘you can’t always count on your computer’…would love to see you give that talk sometime. Ever thought of turning it into a book?

    Cheers,
    Mike

  13. LB
    July 22nd, 2010 at 13:16

    and if you do, will you remove the ten-year-old gripe about Mathematica version 2 which is no longer operative (at least in current version 7)?

  14. Super Ego
    August 31st, 2010 at 11:51

    I use the condor system, and Matlab to simulate Monte Carlo stuff.

    I create an initialisation script, for all my matrices and useful information, and a condor job specific file [essentially a number to label to job, and a random number]. I run the initialisation scrip once to create all my files and directories needed for the particular simulation. Within the initialisation script I seed the random number generator (using the clock like normal). Then when i iterate to create all the specific job files (10,000 files for 10,000 monte carlo runs), I call a random number and store it in the file. So now I have 10,000 random numbers, which will be different each time I initialise.

    Now, when condor starts up one of the 10,000 jobs, within the condor execution script, I seed my random number again (it needs to be done at some point within the single monte carlo run), this time I don’t seed from the clock, I seed from the random number created in the initialisation script.

    I should probably look more carefully at bad seeds (but not overlap, because I simulate conditional dynamics), but I remedy the possibility of the situation by using a try-catch when I compile the results, if the results are not what i expect a pseudo-random number to generate, I throw them away. I usually throw away a few in 10,000 for one reason or another (usually a string of NaNs).

  15. Jeremy Muhlich
    December 20th, 2011 at 16:03

    Super Ego has the right idea — generate a list of seeds ahead of time and pass one to each job. But Mike, surely your original code would be more cleanly written with fopen and fread instead of system!

        fid = fopen('/dev/urandom');
        seed = fread(fid, 1, 'uint32');
    
  16. December 20th, 2011 at 16:18

    Hi Jeremy

    Yep, your suggestion is much cleaner. I um…just didn’t think of it at the time :)

    Cheers,
    Mike

  17. July 11th, 2013 at 21:37

    Another simple alternative is to concatenate the machine’s MAC address(es) to the time-of-day (clock) as the seed. The reason it should work is that parallelization requires the machines to be on the same network, and this dictates unique machine MACs. The concatenation of the clock is needed in order for multiple calls to the same machine to get different seeds.