## Faster GPU Random Number Generators in MATLAB 2012b

March 10th, 2013

Ever since I took a look at GPU accelerating simple Monte Carlo Simulations using MATLAB, I’ve been disappointed with the performance of its GPU random number generator. In MATLAB 2012a, for example, it’s not much faster than the CPU implementation on my GPU hardware.  Consider the following code

```function gpuRandTest2012a(n)

mydev=gpuDevice();
disp('CPU - Mersenne Twister');
tic
CPU = rand(n);
toc

sg = parallel.gpu.RandStream('mrg32k3a','Seed',1);
parallel.gpu.RandStream.setGlobalStream(sg);
disp('GPU - mrg32k3a');
tic
Rg = parallel.gpu.GPUArray.rand(n);
wait(mydev);
toc```

Running this on MATLAB 2012a on my laptop gives me the following typical times (If you try this out yourself, the first run will always be slower for various reasons I’ll not go into here)

```>> gpuRandTest2012a(10000)
CPU - Mersenne Twister
Elapsed time is 1.330505 seconds.
GPU - mrg32k3a
Elapsed time is 1.006842 seconds.```

Running the same code on MATLAB 2012b, however, gives a very pleasant surprise with typical run times looking like this

```CPU - Mersenne Twister
Elapsed time is 1.590764 seconds.
GPU - mrg32k3a
Elapsed time is 0.185686 seconds.```

So, generation of random numbers using the GPU is now over 7 times faster than CPU generation on my laptop hardware–a significant improvment on the previous implementation.

New generators in 2012b

The MATLAB developers went a little further in 2012b though.  Not only have they significantly improved performance of the mrg32k3a combined multiple recursive generator, they have also implemented two new GPU random number generators based on the Random123 library.  Here are the timings for the generation of 100 million random numbers in MATLAB 2012b

```CPU - Mersenne Twister
Elapsed time is 1.370252 seconds.
GPU - mrg32k3a
Elapsed time is 0.186152 seconds.
GPU - Threefry4x64-20
Elapsed time is 0.145144 seconds.
GPU - Philox4x32-10
Elapsed time is 0.129030 seconds.```

Bear in mind that I am running this on the relatively weak GPU of my laptop!  If anyone runs it on something stronger, I’d love to hear of your results.

• Laptop model: Dell XPS L702X
• CPU: Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
• GPU: GeForce GT 555M with 144 CUDA Cores.  Graphics clock: 590Mhz.  Processor Clock:1180 Mhz. 3072 Mb DDR3 Memeory
• RAM: 8 Gb
• OS: Windows 7 Home Premium 64 bit.
• MATLAB: 2012a/2012b

## Intel’s Xeon Phi – GPU level power without the hassle?

November 13th, 2012

Intel have finally released the Xeon Phi – an accelerator card based on 60 or so customised Intel cores to give around a Teraflop of double precision performance.  That’s comparable to the latest cards from NVIDIA (1.3 Teraflops according to http://www.theregister.co.uk/2012/11/12/nvidia_tesla_k20_k20x_gpu_coprocessors/) but with one key difference—you don’t need to learn any new languages or technologies to take advantage of it (although you can do so if you wish)!

The Xeon Phi uses good, old fashioned High Performance Computing technologies that we’ve been using for years such as OpenMP and MPI.  There’s no need to completely recode your algorithms in CUDA or OpenCL to get a performance boost…just a sprinkling of OpenMP pragmas might be enough in many cases.  Obviously it will take quite a bit of work to squeeze every last drop of performance out of the thing but this might just be the realisation of ‘personal supercomputer’ we’ve all been waiting for.

Here are some links I’ve found so far — would love to see what everyone else has come up with.  I’ll update as I find more

I also note that the Xeon Phi uses AVX extensions but with a wider vector width of 512 bytes so if you’ve been taking advantage of that technology in your code (using one of these techniques perhaps) you’ll reap the benefits there too.

I, for one, am very excited and can’t wait to get my hands on one!  Thoughts, comments and links gratefully received!

## Vectorising code to take advantage of modern CPUs (AVX and SSE)

August 24th, 2012

I’ve been playing with AVX vectorisation on Sandy Bridge CPUs off and on for a while now and thought that I’d write up a little of what I’ve discovered.  The basic idea of vectorisation is that each core in a modern CPU can operate on multiple values (i.e. a vector) simultaneously per instruction cycle.

Sandy bridge (and the newer Ivy Bridge) processors have 256bit wide vector units which means that each CORE can perform certain operations on up to eight 32-bit floats or four 64-bit doubles per clock cycle.  So, on a quad core you have 4 vector units (one per core) and could operate on up to 16 doubles or 32 floats per clock cycle.

This all sounds great so how does a programmer actually make use of this neat hardware trick?  There are many routes:-

Intrinsics

At the ‘close to the metal’ level you code for these vector units using instructions called AVX intrinsics.  This is relatively difficult and leads to none-portable code if you are not careful.

Auto-vectorisation in compilers

Since working with intrinsics is such hard work, why not let the compiler take the strain? Many modern compilers can automatically vectorize your C, C++ or Fortran code including gcc, PGI and Intel. Sometimes all you need to do is add an extra switch at compile time and reap the speed benefits. In truth, vectorization isn’t always automatic and the programmer needs to give the compiler some assistance but it is a lot easier than hand-coding intrinsics.

Intel SPMD Program Compiler (ispc)

There is a midway point between automagic vectorisation and having to use intrinsics. Intel have a free compiler called ispc (http://ispc.github.com/) that allows you to write compute kernels in a modified subset of C. These kernels are then compiled to make use of vectorised instruction sets. Programming using ispc feels a little like using OpenCL or CUDA. I figured out how to hook it up to MATLAB a few months ago and developed a version of the Square Root function that is almost twice as fast as MATLAB’s own version for sandy bridge i7 processors.

Vectorised Libraries

Vendors of numerical libraries are steadily applying vectorisation techniques in order to maximise performance.  If the execution speed of your application depends upon these library functions, you may get a significant speed boost simply by updating to the latest version of the library and recompiling with the relevant compiler flags.

CUDA for x86

Another route to vectorised code is to make use of the PGI Compiler’s support for x86 CUDA.  What you do is take CUDA kernels written for NVIDIA GPUs and use the PGI Compiler to compile these kernels for x86 processors.  The resulting executables take advantage of vectorisation.  In essence, the vector units of the CPU are acting like CUDA cores–which they sort of are anyway!

The PGI compilers also have technology which they call PGI Unified binary which allows you to use NVIDIA GPUs when present or default to using multi-core x86 if no GPU is present.

• PGI CUDA-x86 – PGI’s main page for their CUDA on x86 technologies

OpenCL for x86 processors

Yet another route to vectorisation would be to use Intel’s OpenCL implementation which takes OpenCL kernels and compiles them down to take advantage of vector units (http://software.intel.com/en-us/blogs/2011/09/26/autovectorization-in-intel-opencl-sdk-15/).  The AMD OpenCL implementation may also do this but I haven’t tried it and haven’t had chance to research it yet.

WalkingRandomly posts

I’ve written a couple of blog posts that made use of this technology.

Miscellaneous resources

There is other stuff out there but the above covers everything that I have used so far.  I’ll finish by saying that everyone interested in vectorisation should check out this website…It’s the bible!

Research Articles on SSE/AVX vectorisation

I found the following research articles useful/interesting.  I’ll add to this list over time as I dig out other articles.

April 26th, 2012

Intel have just released their OpenCL Software Development Kit (SDK) for Intel processors.  The good news is that this version targets the on-die GPU as well as the CPU allowing truly heterogeneous programming.  The bad news is that the GPU goodness is for 3rd Generation ‘Ivy Bridge‘ Processors only– us backward Sandy Bridge users have been left in the cold :(

A quick scan through the release notes reveals the following:-

• OpenCL access to the on-die GPU part is currently for Windows only. Linux users only have CPU support at the moment.
• No access to the GPU part of Sandy Bridge Processors via this implementation.
• The GPU part has single precision only (I guess we’ll see many more mixed-precision algorithms from now on)

I don’t have access to an Ivy Bridge processor and so can’t have a play but I’m looking forward to seeing how much performance OpenCL programmers can squeeze out of this new implementation.

Other WalkingRandomly posts on GPU computing

## A bug in Mathematica’s CUDADot in version 8.0.1

March 14th, 2012

After writing my recent article on GPU accelerated Matrix-Matrix multiplication using Maple, I thought that I’d try the same thing in Mathematica.  However, I instantly hit a problem on my 64bit Windows 7 machine running version 8.0.1 of Mathematica.

```In[1]:= a = RandomReal[1, {2, 2}]
Out[1]= {{0.363441, 0.528656}, {0.208881, 0.510232}}

In[2]:= b = RandomReal[1, {2, 2}]
Out[2]= {{0.33536, 0.77615}, {0.537533, 0.788522}}

In[3]:= Dot[a, b]
Out[3]= {{0.406054, 0.698942}, {0.344317, 0.564452}}

Out[5]= {{0.741414, 1.47509}, {0.881849, 1.35297}}```

In short, CUDADot gives the wrong result for floating point numbers (on my machine at least).  An upgrade to version 8.0.4 fixed the problem

## Optimising a correlated asset calculation on MATLAB #3 – Using the GPU via Jacket

February 15th, 2012

This article is the third part of a series where I look at rewriting a particular piece of MATLAB code using various techniques.  The introduction to the series is here and the introduction to the larger series of GPU articles for MATLAB is here.

Last time I used The Mathwork’s Parallel Computing Toolbox in order to modify a simple correlated asset calculation to run on my laptop’s GPU rather than its CPU.  Performance was not as good as I had hoped and I never managed to get my laptop’s GPU (an NVIDIA GT555M) to beat the CPU using functions from the parallel computing toolbox.  Transferring the code to a much more powerful Tesla GPU card resulted in a four times speed-up compared to the CPU in my laptop.

This time I’ll take a look at AccelerEyes’ Jacket, a third party GPU solution for MATLAB.

Attempt 1 – Make as few modifications as possible

I started off just as I did for the parallel computing toolbox GPU port; by taking the best CPU-only code from part 1 (optimised_corr2.m) and changing a load of data-types from double to gdouble in order to get the calculation to run on my laptop’s GPU using Jacket v1.8.1 and MATLAB 2010b.  I also switched to using the GPU versions of various functions such as grandn instead of randn for example.  Functions such as cumprod needed no  modifications at all since they are nicely overloaded; if the argument to cumprod is of type double then the calculation happens on the CPU whereas if it is gdouble then it happens on the GPU.

Now, one thing that I don’t like about Jacket’s implementation is that many of the functions return single precision numbers by default.  For example, if you do

`a=grand(1,10)`

then you end up with 10 numbers of type gsingle.  To get double precision numbers you have to do

`grandn(1,10,'double')`

Now you may be thinking ‘what’s the big deal? – it’s just a bit of syntax so get over it’ and I guess that would be a fair comment.  Supporting single precision also allows users of legacy GPUs to get in on the GPU-accelerated action which is a good thing. The problem as I see it is that almost everything else in MATLAB uses double precision numbers as the default and so it’s easy to get caught out.  I would much rather see functions such as grand return double precision by default with the option to use single precision if required–just like almost every other MATLAB function out there.

The result of my ‘port’ is GPU_jacket_corr1.m

One thing to note in this version, along with all subsequent Jacket versions, is the following command that appears at the very end of the program.

`gsync;`

This is very important if you want to get meaningful timing results since it ensures that all GPU computations have finished before execution continues. See the Jacket documentation on gysnc and this blog post on timing Jacket code for more details.

The other thing I’ll mention is that, in this version, I do this:

```Corr = [1 0.4; 0.4 1];
UpperTriangle=gdouble(chol(Corr));```

```Corr = gdouble([1 0.4; 0.4 1]);
UpperTriangle=chol(Corr);```

In other words, I do the cholesky decomposition on the CPU and move the results to the GPU rather than doing the calculation on the GPU.  This is mainly because I don’t have access to a Jacket DLA license but it’s probably the best thing to do anyway since such a small decomposition probably won’t happen any faster on the GPU.

So, how does it perform.  I ran it three times with the now usual parameters of 100000 simulations done in blocks of 125 (see the CPU version for how I came to choose 125)

```>> tic;GPU_jacket_corr1;toc
Elapsed time is 40.691888 seconds.
>> tic;GPU_jacket_corr1;toc
Elapsed time is 32.096796 seconds.
>> tic;GPU_jacket_corr1;toc
Elapsed time is 32.039982 seconds.```

Just like the Parallel Computing Toolbox, the first run is slower because of GPU warmup overheads. Also, just like the PCT, performance stinks! It’s clearly not enough, in this case at least, to blindly throw in a few gdoubles and hope for the best. It is worth noting, however, that this case is nowhere near as bad as the 900+ seconds we saw in the similar parallel computing toolbox version.

Jacket has punished me for being stupid (or optimistic if you want to be polite to me) but not as much as the PCT did.

Attempt 2 – Convert from a script to a function

When working with the Parallel Computing Toolbox I demonstrated that a switch from a script to a function yielded some speed-up.  This wasn’t the case with the Jacket version of the code.  The functional version, GPU_jacket_corr2.m, showed no discernable speed improvement compared to the script.

```%Warm up run performed previously
>> tic;GPU_jacket_corr2(100000,125);toc
Elapsed time is 32.230638 seconds.
>> tic;GPU_jacket_corr2(100000,125);toc
Elapsed time is 32.179734 seconds.
>> tic;GPU_jacket_corr2(100000,125);toc
Elapsed time is 32.114864 seconds.```

Attempt 3 – One big matrix multiply!

The original version of this calculation performs thousands of very small matrix multiplications and last time we saw that switching to a single, large matrix multiplication brought significant speed improvements on the GPU.  Modifying the code to do this with Jacket is a very similar process to doing it for the PCT so I’ll omit the details and just present the code, GPU_jacket_corr3.m

```%Warm up runs performed previously
>> tic;GPU_jacket_corr3(100000,125);toc
Elapsed time is 2.041111 seconds.
>> tic;GPU_jacket_corr3(100000,125);toc
Elapsed time is 2.025450 seconds.
>> tic;GPU_jacket_corr3(100000,125);toc
Elapsed time is 2.032390 seconds.```

Now that’s more like it! Finally we have a GPU version that runs faster than the CPU on my laptop. We can do better, however, since the block size of 125 was chosen especially for my CPU. With this Jacket version bigger is better and we get much more speed-up by switching to a block size of 25000 (I run out of memory on the GPU if I try even bigger block sizes).

```%Warm up runs performed previously
>> tic;GPU_jacket_corr3(100000,25000);toc
Elapsed time is 0.749945 seconds.
>> tic;GPU_jacket_corr3(100000,25000);toc
Elapsed time is 0.749333 seconds.
>> tic;GPU_jacket_corr3(100000,25000);toc
Elapsed time is 0.749556 seconds.```

Now this is exciting! My laptop GPU with Jacket 1.8.1 is faster than a high-end Tesla card using the parallel computing toolbox for this calculation. My excitement was short lived, however, when I looked at the resulting distribution. The random number generator in Jacket 1.8.1 gave a completely different distribution when compared to generators from other sources (I tried two CPU generators from The Mathworks and one from The Numerical Algorithms Group).  The only difference in the code that generated the results below is the random number generator used.

• The results shown in these plots were for only 20,000 simulations rather than the 100,000 I’ve been using elsewhere in this post. I found this bug in the development stage of these posts where I was using a smaller number of simulations.
• Jacket 1.8.1 is using Jackets old grandn function with the ‘double’ option set
• MATLAB #1 is using MATLAB’s randn using the Comb Recursive algorithm on the CPU
• MATLAB #2 is using MATLAB’s randn using the default Mersenne Twister on the CPU
• NAG is using a Wichman-Hill generator

I sent my code to AccelerEye’s customer support who confirmed that this seemed to be a bug in their random number generator (an in-house Mersenne Twister implementation).  Less than a week later they offered me a new preview of Jacket from their Nightly builds where the old RNG had been replaced with the Mersenne Twister implementation produced by NVidia and I’m happy to confirm that not only does this fix the results for my code but it goes even faster!  Superb customer service!

This new RNG is now the default in version 2.0 of Jacket.  Here’s the distribution I get for 20,000 simulations (to stay in line with the plots shown above).

Switching back to 100,000 simulations to stay in line with all of the other benchmarks in this series gives the following times on my laptop’s GPU

```%Warm up runs performed previously
>> tic;prices=GPU_jacket_corr3(100000,25000);toc
Elapsed time is 0.696891 seconds.
>> tic;prices=GPU_jacket_corr3(100000,25000);toc
Elapsed time is 0.697596 seconds.
>> tic;prices=GPU_jacket_corr3(100000,25000);toc
Elapsed time is 0.697312 seconds.```

This is almost 5 times faster than the 3.42 seconds managed by the best CPU version from part 1. I sent my code to AccelerEyes and asked them to run it on a more powerful GPU, a Tesla C2075, and these are the results they sent back

```Elapsed time is 5.246249 seconds. %warm up run
Elapsed time is 0.158165 seconds.
Elapsed time is 0.156529 seconds.
Elapsed time is 0.156522 seconds.
Elapsed time is 0.156501 seconds.```

So, the Tesla is 4.45 times faster than my laptop’s GPU for this application and a very useful 21.85 times faster than my laptop’s CPU.

Results Summary

• Best CPU Result on laptop (i7-2630GM)with pure MATLAB code – 3.42 seconds
• Best GPU Result with PCT on laptop (GT555M) – 4.04 seconds
• Best GPU Result with PCT on Tesla M2070 – 0.85 seconds
• Best GPU Result with Jacket on laptop (GT555M) – 0.7 seconds
• Best GPU Result with Jacket on Tesla M2075 – 0.16 seconds

Test System Specification

• Laptop model: Dell XPS L702X
• CPU:Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
• GPU:GeForce GT 555M with 144 CUDA Cores.  Graphics clock: 590Mhz.  Processor Clock:1180 Mhz. 3072 Mb DDR3 Memeory
• RAM: 8 Gb
• OS: Windows 7 Home Premium 64 bit.
• MATLAB: 2011b

Acknowledgements

Thanks to Yong Woong Lee of the Manchester Business School as well as various employees at AccelerEyes for useful discussions and advice.  Any mistakes that remain are all my own.

## A brief look at CUDA support in Maple 15

February 12th, 2012

Maple has had support for NVidia GPUs since version 14 but I’ve not played with it much until recently.  Essentially I was put off by the fact that Maple’s CUDA package seemed to have support for only one function – Matrix-Matrix Multiplication. However, a recent conversation with a Maple developer changed my mind.

It is true that only MatrixMatrixMultiply has been accelerated but when you flip the CUDA switch in Maple, every function in the LinearAlgebra package that calls MatrixMatrixMultiply also gets accelerated.  This leads to the possibility of a lot of speed-ups for very little work.

So, this morning I thought I would take a closer look using my laptop.  Let’s start by timing how long it takes the CPU to multiply two 4000 by 4000 double precision matrices

```with(LinearAlgebra):
CUDA:-Enable(false):
CUDA:-IsEnabled();
a := RandomMatrix(4000, datatype = float[8]):
b := RandomMatrix(4000, datatype = float[8]):
t := time[real]():
c := a.b:
time[real]()-t```

The exact time varied a little from run to run but 3.76 seconds is a typical result. I’m only feeling my way at this stage so not doing any proper benchmarking.

To do this calculation on the GPU, all I need to do is change the line

`CUDA:-Enable(false):`

to

`CUDA:-Enable(true):`

like so

```with(LinearAlgebra):
CUDA:-Enable(true):
CUDA:-IsEnabled();
a := RandomMatrix(4000, datatype = float[8]):
b := RandomMatrix(4000, datatype = float[8]):
t := time[real]():
c := a.b:
time[real]()-t```

Typical execution time was 8.37 seconds so the GPU version is more than 2 times slower than the CPU version on my machine.

Trying different matrix sizes

Not wanting to admit defeat after just a single trial, I timed the above code using different matrix sizes.  Here are the results

• 1000 by 1000: CPU=0.07 seconds GPU=0.17 seconds
• 2000 by 2000: CPU=0.53 seconds GPU=1.07 seconds
• 4000 by 4000: CPU=3.76 seconds GPU=8.37 seconds
• 5000 by 5000: CPU=7.44 seconds GPU=19.48 seconds

Switching to single precision

GPUs do much better with single precision numbers so I had a try with those too.  All you need to do is change

`datatype = float[8]`

to

`datatype = float[4]`

in the above code. The results are:

• 1000 by 1000: CPU=0.03 seconds GPU=0.07 seconds
• 2000 by 2000: CPU=0.35 seconds GPU=0.66 seconds
• 4000 by 4000: CPU=1.86 seconds GPU=2.37 seconds
• 5000 by 5000: CPU=3.81 seconds GPU=5.2 seconds

So the GPU loses in single precision mode too on my hardware.  If I can’t get a speedup with MatrixMatrixMultiply on my system then there is no point in exploring all of the other LinearAlgebra routines since all of them will be slower when moving to CUDA acceleration.

I guess that in this case, my CPU is too powerful and my GPU is too wimpy to see the acceleration I was hoping for.

Thanks to Maplesoft for providing me with a review copy of Maple 15.

Test System Specification

• Laptop model: Dell XPS L702X
• CPU: Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
• GPU: GeForce GT 555M with 144 CUDA Cores.  Graphics clock: 590Mhz.  Processor Clock:1180 Mhz. 3072 Mb DDR3 Memeory
• RAM: 8 Gb
• OS: Windows 7 Home Premium 64 bit.
• Maple 15

## Optimising a correlated asset calculation on MATLAB #2 – Using the GPU via the PCT

February 9th, 2012

This article is the second part of a series where I look at rewriting a particular piece of MATLAB code using various techniques.  The introduction to the series is here and the introduction to the larger series of GPU articles for MATLAB on WalkingRandomly is here.

Attempt 1 – Make as few modifications as possible

I took my best CPU-only code from last time (optimised_corr2.m) and changed a load of data-types from double to gpuArray in order to get the calculation to run on my laptop’s GPU using the parallel computing toolbox in MATLAB 2010b.  I also switched to using the GPU versions of various functions such as parallel.gpu.GPUArray.randn instead of randn for example.  Functions such as cumprod needed no  modifications at all since they are nicely overloaded; if the argument to cumprod is of type double then the calculation happens on the CPU whereas if it is gpuArray then it happens on the GPU.

The above work took about a minute to do which isn’t bad for a CUDA ‘porting’ effort!  The result, which I’ve called GPU_PCT_corr1.m is available for you to download and try out.

How about performance?  Let’s do a quick tic and toc using my laptop’s NVIDIA GT 555M GPU.

```>> tic;GPU_PCT_corr1;toc
Elapsed time is 950.573743 seconds.```

The CPU version of this code took only 3.42 seconds which means that this GPU version is over 277 times slower! Something has gone horribly, horribly wrong!

Attempt 2 – Switch from script to function

In general functions should be faster than scripts in MATLAB because more automatic optimisations are performed on functions.  I didn’t see any difference in the CPU version of this code (see optimised_corr3.m from part 1 for a CPU function version) and so left it as a script (partly so I had an excuse to discuss it here if I am honest).  This GPU-version, however, benefits noticeably from conversion to a function.  To do this, add the following line to the top of GPU_PCT_corr1.m

`function [SimulPrices] = GPU_PTC_corr2( n,sub_size)`

Next, you need to delete the following two lines

```n=100000;                       %Number of simulations
sub_size = 125;```

Finally, add the following to the end of our new function

`end`

That’s pretty much all I did to get GPU_PCT_corr2.m. Let’s see how that performs using the same parameters as our script (100,000 simulations in blocks of 125).  I used script_vs_func.m to run both twice after a quick warm-up iteration and the results were:

```Warm up
Elapsed time is 1.195806 seconds.
Main event
script
Elapsed time is 950.399920 seconds.
function
Elapsed time is 938.238956 seconds.
script
Elapsed time is 959.420186 seconds.
function
Elapsed time is 939.716443 seconds.```

So, switching to a function has saved us a few seconds but performance is still very bad!

Attempt 3 – One big matrix multiply!

So far all I have done is take a program that works OK on a CPU, and run it exactly as-is on the GPU in the hope that something magical would happen to make it go faster.  Of course, GPUs and CPUs are very different beasts with differing sets of strengths and weaknesses so it is rather naive to think that this might actually work.  What we need to do is to play to the GPUs strengths more and the way to do this is to focus on this piece of code.

```for i=1:sub_size
CorrWiener(:,:,i)=parallel.gpu.GPUArray.randn(T-1,2)*UpperTriangle;
end```

Here, we are performing lots of small matrix multiplications and, as mentioned in part 1, we might hope to get better performance by performing just one large matrix multiplication instead. To do this we can change the above code to

```%Generate correlated random numbers
%using one big multiplication
randoms = parallel.gpu.GPUArray.randn(sub_size*(T-1),2);
CorrWiener = randoms*UpperTriangle;
CorrWiener = reshape(CorrWiener,(T-1),sub_size,2);
%CorrWiener = permute(CorrWiener,[1 3 2]); %Can't do this on the GPU in 2011b or below

%poor man's permute since GPU permute if not available in 2011b
CorrWiener_final = parallel.gpu.GPUArray.zeros(T-1,2,sub_size);
for s = 1:2
CorrWiener_final(:, s, :) = CorrWiener(:, :, s);
end```

The reshape and permute are necessary to get the matrix in the form needed later on. Sadly, MATLAB 2011b doesn’t support permute on GPUArrays and so I had to use the ‘poor mans permute’ instead.

The result of the above is contained in GPU_PCT_corr3.m so let’s see how that does in a fresh instance of MATLAB.

```>> tic;GPU_PCT_corr3(100000,125);toc
Elapsed time is 16.666352 seconds.
>> tic;GPU_PCT_corr3(100000,125);toc
Elapsed time is 8.725997 seconds.
>> tic;GPU_PCT_corr3(100000,125);toc
Elapsed time is 8.778124 seconds.```

The first thing to note is that performance is MUCH better so we appear to be on the right track. The next thing to note is that the first evaluation is much slower than all subsequent ones. This is totally expected and is due to various start-up overheads.

Recall that 125 in the above function calls refers to the block size of our monte-carlo simulation. We are doing 100,000 simulations in blocks of 125– a number chosen because I determined empirically that this was the best choice on my CPU. It turns out we are better off using much larger block sizes on the GPU:

```>> tic;GPU_PCT_corr3(100000,250);toc
Elapsed time is 6.052939 seconds.
>> tic;GPU_PCT_corr3(100000,500);toc
Elapsed time is 4.916741 seconds.
>> tic;GPU_PCT_corr3(100000,1000);toc
Elapsed time is 4.404133 seconds.
>> tic;GPU_PCT_corr3(100000,2000);toc
Elapsed time is 4.223403 seconds.
>> tic;GPU_PCT_corr3(100000,5000);toc
Elapsed time is 4.069734 seconds.
>> tic;GPU_PCT_corr3(100000,10000);toc
Elapsed time is 4.039446 seconds.
>> tic;GPU_PCT_corr3(100000,20000);toc
Elapsed time is 4.068248 seconds.
>> tic;GPU_PCT_corr3(100000,25000);toc
Elapsed time is 4.099588 seconds.```

The above, rather crude, test suggests that block sizes of 10,000 are the best choice on my laptop’s GPU. Sadly, however, it’s STILL slower than the 3.42 seconds I managed on the i7 CPU and represents the best I’ve managed using pure MATLAB code. The profiler tells me that the vast majority of the GPU execution time is spent in the cumprod line and in random number generation (over 40% each).

Trying a better GPU
Of course now that I have code that runs on a GPU I could just throw it at a better GPU and see how that does. I have access to MATLAB 2011b on a Tesla M2070 hooked up to a Linux machine so I ran the code on that. I tried various block sizes and the best time was 0.8489 seconds with the call GPU_PCT_corr3(100000,20000) which is just over 4 times faster than my laptop’s CPU.

Can you do better using just the GPU functionality provided in the Parallel Computing Toolbox (so no bespoke CUDA kernels or Jacket just yet)? I’ll be looking at how AccelerEyes’ Jacket myself in the next post.

Results so far

• Best CPU Result on laptop (i7-2630GM)with pure MATLAB code – 3.42 seconds
• Best GPU Result with PCT on laptop (GT555M) – 4.04 seconds
• Best GPU Result with PCT on Tesla M2070 – 0.85 seconds

Test System Specification

• Laptop model: Dell XPS L702X
• CPU: Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
• GPU: GeForce GT 555M with 144 CUDA Cores.  Graphics clock: 590Mhz.  Processor Clock:1180 Mhz. 3072 Mb DDR3 Memeory
• RAM: 8 Gb
• OS: Windows 7 Home Premium 64 bit.
• MATLAB: 2011b

Acknowledgements

Thanks to Yong Woong Lee of the Manchester Business School as well as various employees at The Mathworks for useful discussions and advice.  Any mistakes that remain are all my own :)

## Will 2012 be the year when OpenCL adoption really takes off?

January 10th, 2012

From where I sit it seems that the majority of scientific GPU work is being done with NVIDIA’s proprietary CUDA platform.  All the signs point to the possibility of this changing, however, and I wonder if 2012 will be the year when OpenCL comes of age.  Let’s look at some recent and near future events….

Hardware

• AMD have recently released the AMD Radeon HD 7970, said to be the fastest single- GPU graphics card on the planet.  This new card supports both Microsoft’s DirectCompute along with OpenCL and is much faster than the previous generation of AMD card (see here for compute benchmarks) as well as being faster than NVIDIAs current top of the line GTX 580.
• Intel will release their next generation of CPUs – Ivy Bridge – which will include an increased number of built in GPU cores which should be OpenCL compatible.  Although the current Sandy Bridge processors also contain GPU cores, it is not currently possible to target them with Intel’s OpenCL implementation (version 1.5 is strictly for the CPU cores).  I would be very surprised if Intel didn’t update their OpenCL implementation to be able to target the GPUs in Ivy Bridge this year.
• AMDs latest Fusion processors also contain OpenCL compatible GPU cores directly integrated with the CPU which programmers can exploit using AMD’s Accelerated Parallel Processing (APP) SDK.

The practical upshot of the above is that if a software vendor uses OpenCL to accelerate their product then it could potentially benefit more of their customers than if they used CUDA.  Furthermore, if you want your code to run on the fastest GPU around then OpenCL is the way to go right now.

Software

Having the latest, fastest hardware is pointless if the software you run can’t take advantage of it.  Over the last 12 months I have had the opportunity to speak to developers of various commerical scientific and mathematical software products which support GPU acceleration.  With the exception of Wolfram’s Mathematica, all of them only supported CUDA.  When I asked why they don’t support OpenCL, the response of most of these developers could be paraphrased as ‘The mathematical libraries and vendor support for CUDA are far more developed than those of OpenCL so CUDA support is significantly easier to integerate into our product.‘  Despite this, however, OpenCL support is definitely growing in the world of mathematical and scientific software.

OpenCL in Mathematics software and libraries

• ViennaCL, a GPU-accelerated C++ open-source linear algebra library, was updated to version 1.2.0 on December 31st (just missing the deadline for December’s Month of Math Software).  Roughly speaking, ViennaCL is a mixture of Boost.ublas (high-level interface) and MAGMA (GPU-support), yet based on OpenCL rather than CUDA.
• AccelerEyes released a new major version of their GPU accelerated MATLAB toolbox, Jacket, in late December 2011.  The big news as far as this article is concerned is that it includes support for OpenCL; something that is currently missing from The Mathworks’ Parallel Computing Toolbox.
• Not content with bringing OpenCL support to MATLAB, AccelerEyes also realesed ArrayFire– a free (for the basic version at least) library for C, C++, Fortran, and Python that includes support for both CUDA and OpenCL.
• Although it’s not new news, it’s worth bearing in mind that Mathematica has supported OpenCL for a while now– since the relase of version 8 back in  November 2010.

Finite Element Modelling with Abaqus

• Back in May 2011, Version 6.11 of the finite element modelling package, Abaqus, was released and it included support for NVIDIA cards (see here for NVIDIA’s page on it).  In September, GPU support in Abaqus was broadened to include AMD Hardware with an OpenCL compliant release (see here).

Other projects

• In late December 2011, the first alpha version of FortranCL, an OpenCL interface for Fortran 90, was released.

What do you think?  Will OpenCL start to take the lead in scientific and mathematical software applications this year or will CUDA continue to dominate?  Are there any new OpenCL projects that I’ve missed?

## MATLAB GPU / CUDA experiences on my laptop – Elementwise operations on the GPU #2

August 16th, 2011

This is part 2 of an ongoing series of articles about MATLAB programming for GPUs using the Parallel Computing Toolbox.  The introduction and index to the series is at http://www.walkingrandomly.com/?p=3730.   All timings are performed on my laptop (hardware detailed at the end of this article) unless otherwise indicated.

Last time

In the previous article we saw that it is very easy to run MATLAB code on suitable GPUs using the parallel computing toolbox.  We also saw that the GPU has its own area of memory that is completely separate from main memory.  So, if you have a program that runs in main memory on the CPU and you want to accelerate part of it using the GPU then you’ll need to explicitly transfer data to the GPU and this takes time.  If your GPU calculation is particularly simple then the transfer times can completely swamp your performance gains and you may as well not bother.

So, the moral of the story is either ‘Keep data transfers between GPU and host down to a minimum‘ or ‘If you are going to transfer a load of data to or from the GPU then make sure you’ve got a ton of work for the GPU to do.‘  Today, I’m going to consider the latter case.

A more complicated elementwise function – CPU version
Last time we looked at a very simple elementwise calculation– taking the sine of a large array of numbers.  This time we are going to increase the complexity ever so slightly.  Consider trig_series.m which is defined as follows

```function out = myseries(t)
out = 2*sin(t)-sin(2*t)+2/3*sin(3*t)-1/2*sin(4*t)+2/5*sin(5*t)-1/3*sin(6*t)+2/7*sin(7*t);
end```

Let’s generate 75 million random numbers and see how long it takes to apply this function to all of them.

```cpu_x = rand(1,75*1e6)*10*pi;
tic;myseries(cpu_x);toc```

I did the above calculation 20 times using a for loop and took the median of the results to get 4.89 seconds1. Note that this calculation is fully parallelised…all 4 cores of my i7 CPU were working on the problem simultaneously (Many built in MATLAB functions are parallelised these days by the way…No extra toolbox necessary).

GPU version 1 (Or ‘How not to do it’)

One way of performing this calculation on the GPU would be to proceed as follows

```cpu_x =rand(1,75*1e6)*10*pi;
tic
%Transfer data to GPU
gpu_x = gpuArray(cpu_x);
%Do calculation using GPU
gpu_y = myseries(gpu_x);
%Transfer results back to main memory
cpu_y = gather(gpu_y)
toc```

The mean time of the above calculation on my laptop’s GPU is 3.27 seconds, faster than the CPU version but not by much. If you don’t include data transfer times in the calculation then you end up with 2.74 seconds which isn’t even a factor of 2 speed-up compared to the CPU.

GPU version 2 (arrayfun to the rescue)

We can get better performance out of the GPU simply by changing the line that does the calculation from

`gpu_y = myseries(gpu_x);`

to a version that uses MATLAB’s arrayfun command.

`gpu_y = arrayfun(@myseries,gpu_x);`

So, the full version of the code is now

```cpu_x =rand(1,75*1e6)*10*pi;
tic
%Transfer data to GPU
gpu_x = gpuArray(cpu_x);
%Do calculation using GPU via arrayfun
gpu_y = arrayfun(@myseries,gpu_x);
%Transfer results back to main memory
cpu_y = gather(gpu_y)
toc```

This improves the timings quite a bit. If we don’t include data transfer times then the arrayfun version completes in 1.42 seconds down from 2.74 seconds for the original GPU code. Including data transfer, the arrayfun version complete in 1.94 seconds compared to for the 3.27 seconds for the original.

Using arrayfun for the GPU is definitely the way to go! Giving the GPU every disadvantage I can think of (double precision, including transfer times, comparing against multi-thread CPU code etc) we still get a speed-up of just over 2.5 times on my laptop’s hardware. Pretty useful for hardware that was designed for energy-efficient gaming!

Note: Something that I learned while writing this post is that the first call to arrayfun will be slower than all of the rest.  This is because arrayfun compiles the MATLAB function you pass it down to PTX and this can take a while (seconds).  Subsequent calls will be much faster since arrayfun will use the compiled results.  The compiled PTX functions are not saved between MATLAB sessions.

arrayfun – Good for your memory too!
Using the arrayfun function is not only good for performance, it’s also good for memory management. Imagine if I had 100 million elements to operate on instead of only 75 million. On my 3Gb GPU, the following code fails:

```cpu_x = rand(1,100*1e6)*10*pi;
gpu_x = gpuArray(cpu_x);
gpu_y = myseries(gpu_x);
??? Error using ==> GPUArray.mtimes at 27
Out of memory on device. You requested: 762.94Mb, device has 724.21Mb free.

Error in ==> myseries at 3
out =
2*sin(t)-sin(2*t)+2/3*sin(3*t)-1/2*sin(4*t)+2/5*sin(5*t)-1/3*sin(6*t)+2/7*sin(7*t);```

If we use arrayfun, however, then we are in clover. The following executes without complaint.

```cpu_x = rand(1,100*1e6)*10*pi;
gpu_x = gpuArray(cpu_x);
gpu_y = arrayfun(@myseries,gpu_x);```

Some Graphs

Just like last time, I ran this calculation on a range of input arrays from 1 million to 100 million elements on both my laptop’s GT 555M GPU and a Tesla C2050 I have access to.  Unfortunately, the C2050 is running MATLAB 2010b rather than 2011a so it’s not as fair a test as I’d like.  I could only get up to 84 million elements on the Tesla before it exited due to memory issues.  I’m not sure if this is down to the hardware itself or the fact that it was running an older version of MATLAB.

Next, I looked at the actual speed-up shown by the GPUs compared to my laptop’s i7 CPU.  Again, this includes data transfer times, is in full double precision and the CPU version was multi-threaded (No dodgy ‘GPUs are awesome’ techniques used here).  No matter what array size is used I get almost a factor of 3 speed-up using the laptop GPU and more than a factor of 7 speed-up when using the Tesla.  Not too shabby considering that the programming effort to achieve this speed-up was minimal.

Conclusions

• It is VERY easy to modify simple, element-wise functions to take advantage of the GPU in MATLAB using the Parallel Computing Toolbox.
• arrayfun is the most efficient way of dealing with such functions.
• My laptop’s GPU demonstrated almost a 3 times speed-up compared to its CPU.

• Laptop model: Dell XPS L702X
• CPU: Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
• GPU: GeForce GT 555M with 144 CUDA Cores.  Graphics clock: 590Mhz.  Processor Clock:1180 Mhz. 3072 Mb DDR3 Memeory
• RAM: 8 Gb
• OS: Windows 7 Home Premium 64 bit.  I’m not using Linux because of the lack of official support for Optimus.
• MATLAB: 2011a with the parallel computing toolbox

Code and sample timings (Will grow over time)

You need myseries.m and trigseries_test.m and the Parallel Computing Toolbox.  These are the times given by the following function call for various systems (transfer included and using arrayfun).

`[cpu,~,~,~,gpu] = trigseries_test(10,50*1e6,'mean')`

GPUs

• Tesla C2050, Linux, 2010b – 0.4751 seconds
• NVIDIA GT 555M – 144 CUDA Cores, 3Gb RAM, Windows 7, 2011a – 1.2986 seconds

CPUs

• Intel Core i7-2630QM, Windows 7, 2011a (My laptop’s CPU) – 3.33 seconds
• Intel Core 2 Quad Q9650 @ 3.00GHz, Linux, 2011a – 3.9452 seconds

Footnotes
1 – This is the method used in all subsequent timings and is similar to that used by the File Exchange function, timeit (timeit takes a median, I took a mean). If you prefer to use timeit then the function call would be timeit(@()myseries(cpu_x)). I stick to tic and toc in the article because it makes it clear exactly where timing starts and stops using syntax well known to most MATLABers.

Thanks to various people at The Mathworks for some useful discussions, advice and tutorials while creating this series of articles.