Archive for the ‘tutorials’ Category

June 14th, 2012

I recently got access to a shiny new (new to me at least) set of compilers, The Portland PGI compiler suite which comes with a great set of technologies to play with including AVX vector support, CUDA for x86 and GPU pragma-based acceleration.  So naturally, it wasn’t long before I wondered if I could use the PGI suite as compilers for MATLAB mex files.  The bad news is that The Mathworks don’t support the PGI Compilers out of the box but that leads to the good news…I get to dig down and figure out how to add support for unsupported compilers.

In what follows I made use of MATLAB 2012a on 64bit Windows 7 with Version 12.5 of the PGI Portland Compiler Suite.

In order to set up a C mex-compiler in MATLAB you execute the following

mex -setup

which causes MATLAB to execute a Perl script at C:\Program Files\MATLAB\R2012a\bin\mexsetup.pm.  This script scans the directory C:\Program Files\MATLAB\R2012a\bin\win64\mexopts looking for Perl scripts with the extension .stp and running whatever it finds. Each .stp file looks for a particular compiler.  After all .stp files have been executed, a list of compilers found gets returned to the user. When the user chooses a compiler, the corresponding .bat file gets copied to the directory returned by MATLAB’s prefdir function. This sets up the compiler for use.  All of this is nicely documented in the mexsetup.pm file itself.

So, I’ve had my first crack at this and the results are the following two files.

These are crude, and there’s probably lots missing/wrong but they seem to work.  Copy them to C:\Program Files\MATLAB\R2012a\bin\win64\mexopts. The location of the compiler is hard-coded in pgi.stp so you’ll need to change the following line if your compiler location differs from mine

my $default_location = "C:\\Program Files\\PGI\\win64\\12.5\\bin";

Now, when you do mex -setup, you should get an entry PGI Workstation 12.5 64bit 12.5 in C:\Program Files\PGI\win64\12.5\bin which you can select as normal.

An example compilation and some details.

Let’s compile the following very simple mex file, mex_sin.c, using the PGI compiler which does little more than take an elementwise sine of the input matrix.

#include <math.h>
#include "mex.h"

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    double *in,*out;
    double dist,a,b;
    int rows,cols,outsize;
    int i,j,k;

    /*Get pointers to input matrix*/
    in = mxGetPr(prhs[0]);
    /*Get rows and columns of input*/
    rows = mxGetM(prhs[0]);
    cols = mxGetN(prhs[0]);

    /* Create output matrix */
    outsize = rows*cols;
    plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL);
    /* Assign pointer to the output */
    out = mxGetPr(plhs[0]);

    for(i=0;i<outsize;i++){
        out[i] = sin(in[i]);
    }

}

Compile using the -v switch to get verbose information about the compilation

mex sin_mex.c -v

You’ll see that the compiled mex file is actually a renamed .dll file that was compiled and linked with the following flags

pgcc -c -Bdynamic  -Minfo -fast
pgcc --Mmakedll=export_all  -L"C:\Program Files\MATLAB\R2012a\extern\lib\win64\microsoft" libmx.lib libmex.lib libmat.lib

The switch –Mmakedll=export_all is actually not supported by PGI which makes this whole setup doubly unsupported! However, I couldn’t find a way to export the required symbols without modifying the mex source code so I lived with it.  Maybe I’ll figure out a better way in the future.  Let’s try the new function out

>> a=[1 2 3];
>> mex_sin(a)
Invalid MEX-file 'C:\Work\mex_sin.mexw64': The specified module could not be found.

The reason for the error message is that a required PGI .dll file, pgc.dll, is not on my system path so I need to do the following in MATLAB.

setenv('PATH', [getenv('PATH') ';C:\Program Files\PGI\win64\12.5\bin\']);

This fixes things

>> mex_sin(a)
ans =
    0.8415    0.9093    0.1411

Performance

I took a quick look at the performance of this mex function using my quad-core, Sandy Bridge laptop. I highly doubted that I was going to beat MATLAB’s built in sin function (which is highly optimised and multithreaded) with so little work and I was right:

>> a=rand(1,100000000);
>> tic;mex_sin(a);toc
Elapsed time is 1.320855 seconds.
>> tic;sin(a);toc
Elapsed time is 0.486369 seconds.

That’s not really a fair comparison though since I am purposely leaving mutithreading out of the PGI mex equation for now.  It’s a much fairer comparison to compare the exact same mex file using different compilers so let’s do that.  I created three different compiled mex routines from the source code above using the three compilers installed on my laptop and performed a very crude time test as follows

>> a=rand(1,100000000);
>> tic;mex_sin_pgi(a);toc              %PGI 12.5 run 1
Elapsed time is 1.317122 seconds.
>> tic;mex_sin_pgi(a);toc              %PGI 12.5 run 2
Elapsed time is 1.338271 seconds.

>> tic;mex_sin_vs(a);toc               %Visual Studio 2008 run 1
Elapsed time is 1.459463 seconds.
>> tic;mex_sin_vs(a);toc
Elapsed time is 1.446947 seconds.      %Visual Studio 2008 run 2

>> tic;mex_sin_intel(a);toc             %Intel Compiler 12.0 run 1
Elapsed time is 0.907018 seconds.
>> tic;mex_sin_intel(a);toc             %Intel Compiler 12.0 run 2
Elapsed time is 0.860218 seconds.

PGI did a little better than Visual Studio 2008 but was beaten by Intel. I’m hoping that I’ll be able to get more performance out of the PGI compiler as I learn more about the compilation flags.

Getting PGI to make use of SSE extensions

Part of the output of the mex sin_mex.c -v compilation command is the following notice

mexFunction:
     23, Loop not vectorized: data dependency

This notice is a result of the -Minfo compilation switch and indicates that the PGI compiler can’t determine if the in and out arrays overlap or not.  If they don’t overlap then it would be safe to unroll the loop and make use of SSE or AVX instructions to make better use of my Sandy Bridge processor.  This should hopefully speed things up a little.

As the programmer, I am sure that the two arrays don’t overlap so I need to give the compiler a hand.  One way to do this would be to modify the pgi.dat file to include the compilation switch -Msafeptr which tells the compiler that arrays never overlap anywhere.  This might not be a good idea since it may not always be true so I decided to be more cautious and make use of  the restrict keyword.  That is, I changed the mex source code so that

double *in,*out;

becomes

double * restrict in,* restrict out;

Now when I compile using the PGI compiler, the notice from -Mifno becomes

mexFunction:
     23, Generated 3 alternate versions of the loop
         Generated vector sse code for the loop
         Generated a prefetch instruction for the loop

which demonstrates that the compiler is much happier! So, what did this do for performance?

>> tic;mex_sin_pgi(a);toc
Elapsed time is 1.450002 seconds.
>> tic;mex_sin_pgi(a);toc
Elapsed time is 1.460536 seconds.

This is slower than when SSE instructions weren’t being used which isn’t what I was expecting at all! If anyone has any insight into what’s going on here, I’d love to hear from you.

Future Work

I’m happy that I’ve got this compiler working in MATLAB but there is a lot to do including:

  • Tidy up the pgi.dat and pgi.stp files so that they look and act more professionally.
  • Figure out the best set of compiler switches to use– it is almost certain that what I’m using now is sub-optimal since I am new to the PGI compiler.
  • Get OpenMP support working.  I tried using the -Mconcur compilation flag which auto-parallelised the loop but it crashed MATLAB when I ran it. This needs investigating
  • Get PGI accelerator support working so I can offload work to the GPU.
  • Figure out why the SSE version of this function is slower than the non-SSE version
  • Figure out how to determine whether or not the compiler is emitting AVX instructions.  The documentation suggests that if the compiler is called on a Sandy Bridge machine, and if vectorisation is possible then it will produce AVX instructions but AVX is not mentioned in the output of -Minfo.  Nothing changes if you explicity set the target to Sandy Bridge with the compiler switch -tp sandybridge-64.

Look out for more articles on this in the future.

Related WalkingRandomly Articles

My setup

  • Laptop model: Dell XPS L702X
  • CPU: Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
  • GPU: GeForce GT 555M with 144 CUDA Cores.  Graphics clock: 590Mhz.  Processor Clock:1180 Mhz. 3072 Mb DDR3 Memeory
  • RAM: 8 Gb
  • OS: Windows 7 Home Premium 64 bit.
  • MATLAB: 2012a
  • PGI Compiler: 12.5
October 24th, 2011

The NAG C Library is one of the largest commercial collections of numerical software currently available and I often find it very useful when writing MATLAB mex files.  “Why is that?” I hear you ask.

One of the main reasons for writing a mex file is to gain more speed over native MATLAB.  However, one of the main problems with writing mex files is that you have to do it in a low level language such as Fortran or C and so you lose much of the ease of use of MATLAB.

In particular, you lose straightforward access to most of the massive collections of MATLAB routines that you take for granted.  Technically speaking that’s a lie because you could use the mex function mexCallMATLAB to call a MATLAB routine from within your mex file but then you’ll be paying a time overhead every time you go in and out of the mex interface.  Since you are going down the mex route in order to gain speed, this doesn’t seem like the best idea in the world.  This is also the reason why you’d use the NAG C Library and not the NAG Toolbox for MATLAB when writing mex functions.

One way out that I use often is to take advantage of the NAG C library and it turns out that it is extremely easy to add the NAG C library to your mex projects on Windows.  Let’s look at a trivial example.  The following code, nag_normcdf.c, uses the NAG function nag_cumul_normal to produce a simplified version of MATLAB’s normcdf function (laziness is all that prevented me from implementing a full replacement).

/* A simplified version of normcdf that uses the NAG C library
 * Written to demonstrate how to compile MATLAB mex files that use the NAG C Library
 * Only returns a normcdf where mu=0 and sigma=1
 * October 2011 Michael Croucher
 * www.walkingrandomly.com
 */

#include <math.h>
#include "mex.h"
#include "nag.h"
#include "nags.h"

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    double *in,*out;
    int rows,cols,num_elements,i;

    if(nrhs>1)
    {
        mexErrMsgIdAndTxt("NAG:BadArgs","This simplified version of normcdf only takes 1 input argument");
    } 

    /*Get pointers to input matrix*/
    in = mxGetPr(prhs[0]);
    /*Get rows and columns of input matrix*/
    rows = mxGetM(prhs[0]);
    cols = mxGetN(prhs[0]);
    num_elements = rows*cols;

    /* Create output matrix */
    plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL);
    /* Assign pointer to the output */
    out = mxGetPr(plhs[0]);

    for(i=0; i<num_elements; i++){
          out[i] = nag_cumul_normal(in[i]);
         }
}

To compile this in MATLAB, just use the following command

 mex nag_normcdf.c CLW6I09DA_nag.lib

If your system is set up the same as mine then the above should ‘just work’ (see System Information at the bottom of this post).  The new function works just as you would expect it to

>> format long
>> format compact
>> nag_normcdf(1)
ans =
   0.841344746068543

Compare the result to normcdf from the statistics toolbox

>> normcdf(1)
ans =
   0.841344746068543

So far so good. I could stop the post here since all I really wanted to do was say ‘The NAG C library is useful for MATLAB mex functions and it’s a doddle to use – here’s a toy example and here’s the mex command to compile it’

However, out of curiosity, I looked to see if my toy version of normcdf was any faster than The Mathworks’ version.  Let there be 10 million numbers:

>> x=rand(1,10000000);

Let’s time how long it takes MATLAB to take the normcdf of those numbers

>> tic;y=normcdf(x);toc
Elapsed time is 0.445883 seconds.
>> tic;y=normcdf(x);toc
Elapsed time is 0.405764 seconds.
>> tic;y=normcdf(x);toc
Elapsed time is 0.366708 seconds.
>> tic;y=normcdf(x);toc
Elapsed time is 0.409375 seconds.

Now let’s look at my toy-version that uses NAG.

>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.544642 seconds.
>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.556883 seconds.
>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.553920 seconds.
>> tic;y=nag_normcdf(x);toc
Elapsed time is 0.540510 seconds.

So my version is slower!  Never mind, I’ll just make my version parallel using OpenMP – Here is the code: nag_normcdf_openmp.c

/* A simplified version of normcdf that uses the NAG C library
 * Written to demonstrate how to compile MATLAB mex files that use the NAG C Library
 * Only returns a normcdf where mu=0 and sigma=1
 * October 2011 Michael Croucher
 * www.walkingrandomly.com
 */

#include <math.h>
#include "mex.h"
#include "nag.h"
#include "nags.h"
#include <omp.h>

void do_calculation(double in[],double out[],int num_elements)
{
    int i,tid;

#pragma omp parallel for shared(in,out,num_elements) private(i,tid)
    for(i=0; i<num_elements; i++){
          out[i] = nag_cumul_normal(in[i]);
         }
}

void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    double *in,*out;
    int rows,cols,num_elements;

    if(nrhs>1)
    {
        mexErrMsgIdAndTxt("NAG_NORMCDF:BadArgs","This simplified version of normcdf only takes 1 input argument");
    } 

    /*Get pointers to input matrix*/
    in = mxGetPr(prhs[0]);
    /*Get rows and columns of input matrix*/
    rows = mxGetM(prhs[0]);
    cols = mxGetN(prhs[0]);
    num_elements = rows*cols;

    /* Create output matrix */
    plhs[0] = mxCreateDoubleMatrix(rows, cols, mxREAL);
    /* Assign pointer to the output */
    out = mxGetPr(plhs[0]);

    do_calculation(in,out,num_elements);
}

Compile that with

 mex  COMPFLAGS="$COMPFLAGS /openmp"  nag_normcdf_openmp.c CLW6I09DA_nag.lib

and on my quad-core machine I get the following timings

>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.237925 seconds.
>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.197531 seconds.
>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.206511 seconds.
>> tic;y=nag_normcdf_openmp(x);toc
Elapsed time is 0.211416 seconds.

This is faster than MATLAB and so normal service is resumed :)

System Information

  • 64bit Windows 7
  • MATLAB 2011b
  • NAG C Library Mark 9 – CLW6I09DAL
  • Visual Studio 2008
  • Intel Core i7-2630QM processor
July 27th, 2011

These days it seems that you can’t talk about scientific computing for more than 5 minutes without somone bringing up the topic of Graphics Processing Units (GPUs).  Originally designed to make computer games look pretty, GPUs are massively parallel processors that promise to  revolutionise the way we compute.

A brief glance at the specification of a typical laptop suggests why GPUs are the new hotness in numerical computing.  Take my new one for instance, a Dell XPS L702X, which comes with a Quad-Core Intel i7 Sandybridge processor running at up to 2.9Ghz and an NVidia GT 555M with a whopping 144 CUDA cores.  If you went back in time a few years and told a younger version of me that I’d soon own a 148 core laptop then young Mike would be stunned.  He’d also be wondering ‘What’s the catch?’

Of course the main catch is that all processor cores are not created equally.  Those 144 cores in my GPU are, individually, rather wimpy when compared to the ones in the Intel CPU.  It’s the sheer quantity of them that makes the difference.  The question at the forefront of my mind when I received my shiny new laptop was ‘Just how much of a difference?’

Now I’ve seen lots of articles that compare CPUs with GPUs and the GPUs always win…..by a lot!  Dig down into the meat of these articles,  however, and it turns out that things are not as simple as they seem.  Roughly speaking, the abstract of some them could be summed up as ‘We took  a serial algorithm written by a chimpanzee for an old, outdated CPU and spent 6 months parallelising and fine tuning it for a top of the line  GPU.  Our GPU version is up to 150 times faster!

Well it would be wouldn’t it?!  In other news, Lewis Hamilton can drive his F1 supercar around Silverstone faster than my dad can in his clapped out 12 year old van!  These articles are so prevalent that csgillespie.wordpress.com recently published an excellent article that summarised everything you should consider when evaluating them.  What you do is take the claimed speed-up, apply a set of common sense questions and thus determine a realistic speedup.  That factor of 150 can end up more like a factor of 8 once you think about it the right way.

That’s not to say that GPUs aren’t powerful or useful…it’s just that maybe they’ve been hyped up a bit too much!

So anyway, back to my laptop.  It doesn’t have a top of the range GPU custom built for scientific computing, instead it has what Notebookcheck.net refers to as a fast middle class graphics card for laptops.  It’s got all of the required bits though….144 cores and CUDA compute level 2.1 so surely it can whip the built in CPU even if it’s just by a little bit?

I decided to find out with a few randomly chosen tests.  I wasn’t aiming for the kind of rigor that would lead to a peer reviewed journal but I did want to follow some basic rules at least

  • I will only choose algorithms that have been optimised and parallelised for both the CPU and the GPU.
  • I will release the source code of the tests so that they can be critised and repeated by others.
  • I’ll do the whole thing in MATLAB using the new GPU functionality in the parallel computing toolbox.  So, to repeat my calculations all you need to do is copy and paste some code.  Using MATLAB also ensures that I’m using good quality code for both CPU and GPU.

The articles

This is the introduction to a set of articles about GPU computing on MATLAB using the parallel computing toolbox.  Links to the rest of them are below and more will be added in the future.

External links of interest to MATLABers with an interest in GPUs

October 30th, 2010

Some time ago now, Sam Shah of Continuous Everywhere but Differentiable Nowhere fame discussed the standard method of obtaining the square root of the imaginary unit, i, and in the ensuing discussion thread someone asked the question “What is i^i – that is what is i to the power i?”

Sam immediately came back with the answer e^(-pi/2) = 0.207879…. which is one of the answers but as pointed out by one of his readers, Adam Glesser, this is just one of the infinite number of potential answers that all have the form e^{-(2k+1) pi/2} where k is an integer. Sam’s answer is the principle value of i^i (incidentally this is the value returned by google calculator if you google i^i – It is also the value returned by Mathematica and MATLAB). Life gets a lot more complicated when you move to the complex plane but it also gets a lot more interesting too.

While on the train into work one morning I was thinking about Sam’s blog post and wondered what the principal value of i^i^i (i to the power i to the power i) was equal to. Mathematica quickly provided the answer:

N[I^I^I]
0.947159+0.320764 I

So i is imaginary, i^i is real and i^i^i is imaginary again. Would i^i^i^i be real I wondered – would be fun if it was. Let’s see:

N[I^I^I^I]
0.0500922+0.602117 I

gah – a conjecture bites the dust – although if I am being honest it wasn’t a very good one.  Still, since I have started making ‘power towers’ I may as well continue and see what I can see. Why am I calling them power towers? Well, the calculation above could be written as follows:

small power tower

As I add more and more powers, the left hand side of the equation will tower up the page….Power Towers.  We now have a sequence of the first four power towers of i:

i = i
i^i = 0.207879
i^i^i = 0.947159 + 0.32076 I
i^i^i^i = 0.0500922+0.602117 I

Sequences of power towers

“Will this sequence converge or diverge?”, I wondered.  I wasn’t in the mood to think about a rigorous mathematical proof, I just wanted to play so I turned back to Mathematica.  First things first, I needed to come up with a way of making an arbitrarily large power tower without having to do a lot of typing.  Mathematica’s Nest function came to the rescue and the following function allows you to create a power tower of any size for any number, not just i.

tower[base_, size_] := Nest[N[(base^#)] &, base, size]

Now I can find the first term of my series by doing

In[1]:= tower[I, 0]

Out[1]= I

Or the 5th term by doing

In[2]:= tower[I, 4] 

Out[2]= 0.387166 + 0.0305271 I

To investigate convergence I needed to create a table of these. Maybe the first 100 towers would do:

ColumnForm[
 Table[tower[I, n], {n, 1, 100}]
 ]

The last few values given by the command above are

0.438272+ 0.360595 I
0.438287+ 0.360583 I
0.438287+ 0.3606 I
0.438275+ 0.360591 I
0.438289+ 0.360588 I

Now this is interesting – As I increased the size of the power tower, the result seemed to be converging to around 0.438 + 0.361 i. Further investigation confirms that the sequence of power towers of i converges to 0.438283+ 0.360592 i.  If you were to ask me to guess what I thought would happen with large power towers like this then I would expect them to do one of three things – diverge to infinity, stay at 1 forever or quickly converge to 0 so this is unexpected behaviour (unexpected to me at least).

They converge, but how?

My next thought was ‘How does it converge to this value?  In other words, ‘What path through the complex plane does this sequence of power towers take?”  Time for a graph:

tower[base_, size_] := Nest[N[(base^#)] &, base, size];
complexSplit[x_] := {Re[x], Im[x]};
ListPlot[Map[complexSplit, Table[tower[I, n], {n, 0, 49, 1}]],
 PlotRange -> All]

Power Tower Spiral

Who would have thought you could get a spiral from power towers? Very nice! So the next question is ‘What would happen if I took a different complex number as my starting point?’ For example – would power towers of (0.5 + i) converge?’

The answer turns out to be yes – power towers of (0.5 + I) converge to 0.541199+ 0.40681 I but the resulting spiral looks rather different from the one above.

tower[base_, size_] := Nest[N[(base^#)] &, base, size];
complexSplit[x_] := {Re[x], Im[x]};
ListPlot[Map[complexSplit, Table[tower[0.5 + I, n], {n, 0, 49, 1}]],
 PlotRange -> All]

Another power tower spiralThe zoo of power tower spirals

So, taking power towers of two different complex numbers results in two qualitatively different ‘convergence spirals’. I wondered how many different spiral types I might find if I consider the entire complex plane?  I already have all of the machinery I need to perform such an investigation but investigation is much more fun if it is interactive.  Time for a Manipulate

complexSplit[x_] := {Re[x], Im[x]};

tower[base_, size_] := Nest[N[(base^#)] &, base, size];

generatePowerSpiral[p_, nmax_] :=
  Map[complexSplit, Table[tower[p, n], {n, 0, nmax-1, 1}]];

Manipulate[const = p[[1]] + p[[2]] I;
 ListPlot[generatePowerSpiral[const, n],
 PlotRange -> {{-2, 2}, {-2, 2}}, Axes -> ax,
 Epilog -> Inset[Framed[const], {-1.5, -1.5}]], {{n, 100,
 "Number of terms"}, 1, 200, 1,
 Appearance -> "Labeled"}, {{ax, True, "Show axis"}, {True,
 False}}, {{p, {0, 1.5}}, Locator}]

Power Tower Manipulate
After playing around with this Manipulate for a few seconds it became clear to me that there is quite a rich diversity of these convergence spirals. Here are a couple more

Power Tower Spirals

Some of them take a lot longer to converge than others and then there are those that don’t converge at all:

non convergent power

Optimising the code a little

Before I could investigate convergence any further, I had a problem to solve: Sometimes the Manipulate would completely freeze and a message eventually popped up saying “One or more dynamic objects are taking excessively long to finish evaluating……”  What was causing this I wondered?

Well, some values give overflow errors:

In[12]:= generatePowerSpiral[-1 + -0.5 I, 200]

General::ovfl: Overflow occurred in computation. >>
General::ovfl: Overflow occurred in computation. >>
General::ovfl: Overflow occurred in computation. >>
General::stop: Further output of General::ovfl will be suppressed during this calculation. >>

Could errors such as this be making my Manipulate unstable?  Let’s see how long it takes Mathematica to deal with the example above

AbsoluteTiming[ListPlot[generatePowerSpiral[-1  -0.5 I, 200]]]

On my machine, the above command typically takes around 0.08 seconds to complete compared to 0.04 seconds for a tower that converges nicely; it’s slower but not so slow that it should break Manipulate.  Still, let’s fix it anyway.

Look at the sequence of values that make up this problematic power tower

generatePowerSpiral[-0.8 + 0.1 I, 10]

{{-0.8, 0.1}, {-0.668442, -0.570216}, {-2.0495, -6.11826},
{2.47539*10^7,1.59867*10^8}, {2.068155430437682*10^-211800874,
-9.83350984373519*10^-211800875}, {Overflow[], 0}, {Indeterminate,
  Indeterminate}, {Indeterminate, Indeterminate}, {Indeterminate,
  Indeterminate}, {Indeterminate, Indeterminate}}

Everything is just fine until the term {Overflow[],0} is reached; after which we are just wasting time. Recall that the functions I am using to create these sequences are

complexSplit[x_] := {Re[x], Im[x]};
tower[base_, size_] := Nest[N[(base^#)] &, base, size];
generatePowerSpiral[p_, nmax_] :=
  Map[complexSplit, Table[tower[p, n], {n, 0, nmax-1, 1}]];

The first thing I need to do is break out of tower’s Nest function as soon as the result stops being a complex number and the NestWhile function allows me to do this.  So, I could redefine the tower function to be

tower[base_, size_] :=
 NestWhile[N[(base^#)] &, base, MatchQ[#, _Complex] &, 1, size]

However, I can do much better than that since my code so far is massively inefficient.  Say I already have the first n terms of a tower sequence; to get the (n+1)th term all I need to do is a single power operation but my code is starting from the beginning and doing n power operations instead.  So, to get the 5th term, for example, my code does this

I^I^I^I^I

instead of

(4th term)^I

The function I need to turn to is yet another variant of Nest – NestWhileList

fasttowerspiral[base_, size_] :=
 Quiet[Map[complexSplit,
 NestWhileList[N[(base^#)] &, base, MatchQ[#, _Complex] &, 1,
 size, -1]]];

The Quiet function is there to prevent Mathematica from warning me about the Overflow error.  I could probably do better than this and catch the Overflow error coming before it happens but since I’m only mucking around, I’ll leave that to an interested reader.  For now it’s enough for me to know that the code is much faster than before:

(*Original Function*)
AbsoluteTiming[generatePowerSpiral[I, 200];]

{0.036254, Null}
(*Improved Function*)
AbsoluteTiming[fasttowerspiral[I, 200];]

{0.001740, Null}

A factor of 20 will do nicely!

Making Mathematica faster by making it stupid

I’m still not done though. Even with these optimisations, it can take a massive amount of time to compute some of these power tower spirals. For example

spiral = fasttowerspiral[-0.77 - 0.11 I, 100];

takes 10 seconds on my machine which is thousands of times slower than most towers take to compute. What on earth is going on? Let’s look at the first few numbers to see if we can find any clues

In[34]:= spiral[[1 ;; 10]]

Out[34]= {{-0.77, -0.11}, {-0.605189, 0.62837}, {-0.66393,
  7.63862}, {1.05327*10^10,
  7.62636*10^8}, {1.716487392960862*10^-155829929,
  2.965988537183398*10^-155829929}, {1., \
-5.894184073663391*10^-155829929}, {-0.77, -0.11}, {-0.605189,
  0.62837}, {-0.66393, 7.63862}, {1.05327*10^10, 7.62636*10^8}}

The first pair that jumps out at me is {1.716487392960862*10^-155829929, 2.965988537183398*10^-155829929} which is so close to {0,0} that it’s not even funny! So close, in fact, that they are not even double precision numbers any more. Mathematica has realised that the calculation was going to underflow and so it caught it and returned the result in arbitrary precision.

Arbitrary precision calculations are MUCH slower than double precision ones and this is why this particular calculation takes so long. Mathematica is being very clever but its cleverness is costing me a great deal of time and not adding much to the calculation in this case. I reckon that I want Mathematica to be stupid this time and so I’ll turn off its underflow safety net.

SetSystemOptions["CatchMachineUnderflow" -> False]

Now our problematic calculation takes 0.000842 seconds rather than 10 seconds which is so much faster that it borders on the astonishing. The results seem just fine too!

When do the power towers converge?

We have seen that some towers converge while others do not. Let S be the set of complex numbers which lead to convergent power towers. What might S look like? To determine that I have to come up with a function that answers the question ‘For a given complex number z, does the infinite power tower converge?’ The following is a quick stab at such a function

convergeQ[base_, size_] :=
  If[Length[
     Quiet[NestWhileList[N[(base^#)] &, base, Abs[#1 - #2] > 0.01 &,
       2, size, -1]]] < size, 1, 0];

The tolerance I have chosen, 0.01, might be a little too large but these towers can take ages to converge and I’m more interested in speed than accuracy right now so 0.01 it is. convergeQ returns 1 when the tower seems to converge in at most size steps and 0 otherwise.:

In[3]:= convergeQ[I, 50]
Out[3]= 1

In[4]:= convergeQ[-1 + 2 I, 50]
Out[4]= 0

So, let’s apply this to a section of the complex plane.

towerFract[xmin_, xmax_, ymin_, ymax_, step_] :=
 ArrayPlot[
 Table[convergeQ[x + I y, 50], {y, ymin, ymax, step}, {x, xmin, xmax,step}]]
towerFract[-2, 2, -2, 2, 0.1]

Rough Power Tower Fractal

That looks like it might be interesting, possibly even fractal, behaviour but I need to increase the resolution and maybe widen the range to see what’s really going on. That’s going to take quite a bit of calculation time so I need to optimise some more.

Going Parallel

There is no point in having machines with two, four or more processor cores if you only ever use one and so it is time to see if we can get our other cores in on the act.

It turns out that this calculation is an example of a so-called embarrassingly parallel problem and so life is going to be particularly easy for us.  Basically, all we need to do is to give each core its own bit of the complex plane to work on, collect the results at the end and reap the increase in speed. Here’s the full parallel version of the power tower fractal code

(*Complete Parallel version of the power tower fractal code*)

convergeQ[base_, size_] :=
 If[Length[
 Quiet[NestWhileList[N[(base^#)] &, base, Abs[#1 - #2] > 0.01 &,
 2, size, -1]]] < size, 1, 0];

LaunchKernels[];
DistributeDefinitions[convergeQ];
ParallelEvaluate[SetSystemOptions["CatchMachineUnderflow" -> False]];

towerFractParallel[xmin_, xmax_, ymin_, ymax_, step_] :=
 ArrayPlot[
 ParallelTable[
 convergeQ[x + I y, 50], {y, ymin, ymax, step}, {x, xmin, xmax, step}
, Method -> "CoarsestGrained"]]

This code is pretty similar to the single processor version so let’s focus on the parallel modifications.  My convergeQ function is no different to the serial version so nothing new to talk about there.  So, the first new code is

LaunchKernels[];

This launches a set of parallel Mathematica kernels. The amount that actually get launched depends on the number of cores on your machine.  So, on my dual core laptop I get 2 and on my quad core desktop I get 4.

DistributeDefinitions[convergeQ];

All of those parallel kernels are completely clean in that they don’t know about my user defined convergeQ function. This line sends the definition of convergeQ to all of the freshly launched parallel kernels.

ParallelEvaluate[SetSystemOptions["CatchMachineUnderflow" -> False]];

Here we turn off Mathematica’s machine underflow safety net on all of our parallel kernels using the ParallelEvaluate function.

That’s all that is necessary to set up the parallel environment.  All that remains is to change Map to ParallelMap and to add the argument Method -> “CoarsestGrained” which basically says to Mathematica ‘Each sub-calculation will take a tiny amount of time to perform so you may as well send each core lots to do at once’ (click here for a blog post of mine where this is discussed further).

That’s all it took to take this embarrassingly parallel problem from a serial calculation to a parallel one.  Let’s see if it worked.  The test machine for what follows contains a T5800 Intel Core 2 Duo CPU running at 2Ghz on Ubuntu (if you want to repeat these timings then I suggest you read this blog post first or you may find the parallel version going slower than the serial one).  I’ve suppressed the output of the graphic since I only want to time calculation and not rendering time.

(*Serial version*)
In[3]= AbsoluteTiming[towerFract[-2, 2, -2, 2, 0.1];]
Out[3]= {0.672976, Null}

(*Parallel version*)
In[4]= AbsoluteTiming[towerFractParallel[-2, 2, -2, 2, 0.1];]
Out[4]= {0.532504, Null}

In[5]= speedup = 0.672976/0.532504
Out[5]= 1.2638

I was hoping for a bit more than a factor of 1.26 but that’s the way it goes with parallel programming sometimes. The speedup factor gets a bit higher if you increase the size of the problem though. Let’s increase the problem size by a factor of 100.

towerFractParallel[-2, 2, -2, 2, 0.01]

The above calculation took 41.99 seconds compared to 63.58 seconds for the serial version resulting in a speedup factor of around 1.5 (or about 34% depending on how you want to look at it).
Power Tower Fractal

Other optimisations

I guess if I were really serious about optimising this problem then I could take advantage of the symmetry along the x axis or maybe I could utilise the fact that if one point in a convergence spiral converges then it follows that they all do. Maybe there are more intelligent ways to test for convergence or maybe I’d get a big speed increase from programming in C or F#?  If anyone is interested in having a go at improving any of this and succeeds then let me know.

I’m not going to pursue any of these or any other optimisations, however, since the above exploration is what I achieved in a single train journey to work (The write-up took rather longer though). I didn’t know where I was going and I only worried about optimisation when I had to. At each step of the way the code was fast enough to ensure that I could interact with the problem at hand.

Being mostly ‘fast enough’ with minimal programming effort is one of the reasons I like playing with Mathematica when doing explorations such as this.

Treading where people have gone before

So, back to the power tower story. As I mentioned earlier, I did most of the above in a single train journey and I didn’t have access to the internet. I was quite excited that I had found a fractal from such a relatively simple system and very much felt like I had discovered something for myself. Would this lead to something that was publishable I wondered?

Sadly not!

It turns out that power towers have been thoroughly investigated and the act of forming a tower is called tetration. I learned that when a tower converges there is an analytical formula that gives what it will converge to:

h(z) = -\frac{W(-ln z)}{ln(z)

Where W is the Lambert W function (click here for a cool poster for this function).  I discovered that other people had already made Wolfram Demonstrations for power towers too

There is even a website called tetration.org that shows ‘my’ fractal in glorious technicolor.  Nothing new under the sun eh?

Parting shots

Well, I didn’t discover anything new but I had a bit of fun along the way. Here’s the final Manipulate I came up with

Manipulate[const = p[[1]] + p[[2]] I;
 If[hz,
  ListPlot[fasttowerspiral[const, n], PlotRange -> {{-2, 2}, {-2, 2}},
    Axes -> ax,
   Epilog -> {{PointSize[Large], Red,
      Point[complexSplit[N[h[const]]]]}, {Inset[
       Framed[N[h[const]]], {-1, -1.5}]}}]
  , ListPlot[fasttowerspiral[const, n],
   PlotRange -> {{-2, 2}, {-2, 2}}, Axes -> ax]
  ]
 , {{n, 100, "Number of terms"}, 1, 500, 1, Appearance -> "Labeled"}
 , {{ax, True, "Show axis"}, {True, False}}
 , {{hz, True, "Show h(z)"}, {True, False}}
 , {{p, {0, 1.5}}, Locator}
 , Initialization :> (
   SetSystemOptions["CatchMachineUnderflow" -> False];
   complexSplit[x_] := {Re[x], Im[x]};
   fasttowerspiral[base_, size_] :=
    Quiet[Map[complexSplit,
      NestWhileList[N[(base^#)] &, base, MatchQ[#, _Complex] &, 1,
       size, -1]]];
   h[z_] := -ProductLog[-Log[z]]/Log[z];
   )
 ]

Tetration Manipulate
and here’s a video of a zoom into the tetration fractal that I made using spare cycles on Manchester University’s condor pool.

If you liked this blog post then you may also enjoy:

May 21st, 2010

Reading comma separated value (csv) files into MATLAB is trivial as long as the csv file you are trying to import is trivial. For example, say you wanted to import the file very_clean.txt which contains the following data

1031,-948,-76
507,635,-1148
-1031,948,750
-507,-635,114

The following, very simple command, is all that you need

>> veryclean = csvread('very_clean.txt')

veryclean =

        1031        -948         -76
         507         635       -1148
       -1031         948         750
        -507        -635         114

In the real world, however, your data is rarely this nice and clean. One of the most common problems faced by MATLABing data importers is that of header lines. Take the file quite_clean.txt for instance. This is identical to the previous example apart from the fact that it contains two header lines

These are some data that I made using my hand-crafted code
Date:12th July 1996
1031,-948,-76
507,635,-1148
-1031,948,750
-507,-635,114

This is all too much for the csvread command

>> data=csvread('quite_clean.txt')
??? Error using ==> dlmread at 145
Mismatch between file and format string.
Trouble reading number from file (row 1, field 1) ==> This

Error in ==> csvread at 52
    m=dlmread(filename, ',', r, c);

Not to worry, we can just use the more capable importdata command instead

>> quiteclean = importdata('quite_clean.txt')

quiteclean = 

        data: [4x3 double]
    textdata: {2x1 cell}

The result above is a two element structure array and our numerical values are contained in a field called data. Here’s how you get at it.

>> quiteclean.data

ans =

        1031        -948         -76
         507         635       -1148
       -1031         948         750
        -507        -635         114

So far so good. How do we handle a file like messy_data.txt though?

header 1;
header 2;
1031,-948,-76, ,"12"
507,635,-1148, ,"34"
-1031,948,750, ,"45"
-507,-635,114, ,"67"

This is the kind of file encountered by Walking Randomly reader ‘reen’ and it contains exactly the same numerical values as the previous two examples. Unfortunately, it also contains some cruft that makes life more difficult for us. Let’s bring out the big-guns!

Using textscan to import csv files in MATLAB

When the going gets tough, the tough use textscan.  Here’s the incantation for importing messy_data.txt

fid=fopen('messy_data.txt');
data = textscan(fid,'%f %f %f %*s %*s','HeaderLines',2,'Delimiter',',','CollectOutput',1);
fclose(fid)

The result is a one-element cell array that contains an array of doubles.  Let’s get the array of doubles out of the cell

>> data=data{1}
data =
        1031        -948         -76
         507         635       -1148
       -1031         948         750
        -507        -635         114

If the importdata command is a chauffeur then textscan is a Ferrari and I don’t know about you but I’d much rather be driving my own Ferrari than being chauffeured around (John Cook over at The Endeavour has more to say on Ferraris and Chauffeurs).

Let’s de-construct the above set of commands.  The first thing to notice is that, unlike csvread and importdata, you have to explicitly open and close your file when using the textscan command.  So, you open your file using fopen and give it a file ID (which in this example is fid).

fid=fopen('messy_data.txt');

The first argument to textscan is just this file ID, fid. Next you need to supply a conversion specifier which in this case is

'%f %f %f %*s %*s'

The conversion specifier tells textscan what you want each row in your csv file to be converted to. %f means “64 bit double” and %s means “string” so ‘%f %f %f %s %s’ means “3 doubles followed by 2 strings” (we’ll get onto the asterisks in the original specifier later). You can use all sorts of data types in a conversion specifier such as integers, quoted strings and pattern matched strings among others. Check out the MATLAB documentation for textscan for the full list but an abbreviated list is shown below:

%d signed 32bit integer
%u unsigned 32bit integer
%f 64bit double (you'll want this most of the time when using MATLAB)
%s string

Now, in the command I used to import messy_data.txt the conversion specifier contained some asterisks such as %*s so what do these mean?  Quite simply, the asterisk just means ‘ignore’ so %*s means ‘ignore the string in this field’.  So, the full meaning of my conversion specifier ‘%f %f %f %*s %*s’ is “read 3 doubles and ignore 2 strings” and textscan will do this for every row.

The rest of the command is pretty self explanatory but I’ll explain it anyway for the sake of completeness

'HeaderLines',2

The file has 2 headerlines which should be ignored

'Delimiter',','

The fields are delimited (a posh word for separated) by a comma

'CollectOutput',1

If you supply a 1 (which stands for True) to the CollectOutput option then textscan will join consecutive output cells with the same data type into a single array. Since I want all of my doubles to be in a single array then this is the behaviour I went for.

Finally, once you have finished textscanning, don’t forget to close your file

fclose(fid)

That’s pretty much it for this mini-tutorial – I hope you find it useful.

TOP