Accelerated SIMD Mersenne Twister random number generator in MATLAB

November 20th, 2018 | Categories: Making MATLAB faster, matlab, random numbers, Scientific Software | Tags:

In a recent blog post, Daniel Lemire explicitly demonstrated that vectorising random number generators using SIMD instructions could give useful speed-ups.  This reminded me of the one of the first times I played with the Julia language where I learned that Julia’s random number generator used a SIMD-accelerated implementation of Mersenne Twister called dSFMT to generate random numbers much faster than MATLAB’s Mersenne Twister implementation.

Just recently, I learned that MATLAB now has its own SIMD accelerated version of Mersenne Twister which can be activated like so:

```seed=1;
rng(seed,'simdTwister')
```

This new Mersenne Twister implementation gives different random variates to the original implementation (which I demonstrated is the same as Numpy’s implementation in an older post) as you might expect

```>> rng(1,'Twister')
>> rand(1,5)
ans =
0.4170 0.7203 0.0001 0.3023 0.1468
>> rng(1,'simdTwister')
>> rand(1,5)
ans =
0.1194 0.9124 0.5032 0.8713 0.5324
```

So it’s clearly a different algorithm and, on CPUs that support the relevant instructions, it’s about twice as fast!  Using my very unscientific test code:

```format compact
number_of_randoms = 10000000
disp('Timing standard twister')
rng(1,'Twister')
tic;rand(1,number_of_randoms);toc
disp('Timing SIMD twister')
rng(1,'simdTwister')
tic;rand(1,number_of_randoms);toc
```

gives the following results for a typical run on my Dell XPS 15 9560 which supports AVX instructions

```number_of_randoms =
10000000
Timing standard twister
Elapsed time is 0.101307 seconds.
Timing SIMD twister
Elapsed time is 0.057441 seconds
```

The MATLAB documentation does not tell us which algorithm their implementation is based on but it seems to be different from Julia’s. In Julia, if we set the seed to 1 as we did for MATLAB and ask for 5 random numbers, we get something different from MATLAB:

```julia> using Random
julia> Random.seed!(1);
julia> rand(1,5)
1×5 Array{Float64,2}:
0.236033  0.346517  0.312707  0.00790928  0.48861
```

The performance of MATLAB’s new generator is on-par with Julia’s although I’ll repeat that these timing tests are far far from rigorous.

```julia> Random.seed!(1);

julia> @time rand(1,10000000);
0.052981 seconds (6 allocations: 76.294 MiB, 29.40% gc time)
```

‘Do your buttons do what you think they do?’ One interface designer’s response to ‘Is your Research Software Correct?’

October 1st, 2018 | Categories: programming, RSE | Tags:

A guest blog-post by Catherine Smith of University of Birmingham

In early 2017 I was in the audience at one of Mike Croucher’s ‘Is your research software correct?’ presentations. One of the first questions posed in the talk is ‘how reproducible is a mouse click?’. The answer, of course, is that it isn’t and therefore research processes should be automated and not involve anyone pressing any buttons. This posed something of a challenge to my own work which is primarily about making buttons for researchers to press (and select, drag and drop etc.) in order to present their data in the appropriate scholarly way. This software, for making electronic editions of texts preserved in multiple sources, assists with the alignment and editing of material. Even so, the editor is always in control and that is the way it should be. The lack of automation means reproducibility is a problem for my software but as Peter Shillingsburg, one of the pioneers of digital editing, says ‘editing is an art not a science’: maybe art can therefore be excused, to an extent, from the constraints of automation and, despite their introduction of human decisions, the buttons may be permitted to stay. Nevertheless I still want to know that my software doing what I think it is doing even if I can’t automate what editors choose to do with it. In the discussion that followed the paper I was talking about the complication of testing my interface-heavy software. Mike agreed that it was a complex situation but concluded by saying “if you go away from here and write one test you will have made the world a better place”.

I did just that. In fact I did very little else for the next three months. What started with one Python unit test has so far led to 65 Python unit tests, 82 Javascript unit tests and 54 functional tests using Selenium. The timing of all of this was perfect in that I had just begun a project to migrate all of our web applications to Django. I had one application partially migrated and so I tested that one and even did some test-driven development on the sections that were not yet complete.

The tests themselves are great to have. This was my first project using Django and I made lots of mistakes in the first application. The tests have been invaluable in ensuring that, as I learned more and made improvements, the older code kept pace with those changes. Now that I have tests for some things I want tests for everything and I have developed a healthy fear of editing code that is not yet tested. There are other advantages as well. When I sat down to write my first test it very quickly became clear that the code I had written was not easily testable. I had to break down the large Django views into smaller chunks of code that could each be unit tested. I now write better structured code because of that time I invested in testing just some of it. I also learned a lot about how to approach migrating all of the remaining applications while writing the detailed tests for every aspect of the first one.T

Django has an integrated test framework based on the python unittest module but with the additional benefit of automatically creating a test database using the models from the project to which test data can be added. It was very straightforward to set up and run (see the Django docs https://docs.djangoproject.com/en/2.1/topics/testing/). I found Javascript unit testing less straight forward. There was not much Javascript in this first application so I used the qunit test framework and sinon.js for mocking. I have never automated the running of these tests and instead just open them in the browser. It’s not ideal but it works for now. I have other applications which are far more Javascript heavy and for those I will need to have automated tests, there are plenty of frameworks around to choose from so I will investigate those when I start writing the tests.

Probably the most important tests I have are the functional tests which are written in Selenium. I had already heard of Selenium, having attended a Test Driven Development workshop several years ago by Harry Percival. I used his book, Test-Driven Development with Python, as a tutorial for all of the Selenium tests and some of the Django and Javascript tests too. Selenium tests are automated browser tests which really do allow you to test what happens when a user presses a button, types text into a text box, selects an item from a list, moves an element by dragging it etc.. The result of every interaction in an interface can be tested with Selenium. The content of each page can also be checked. It is generally not necessary to test static html but I did test the contents of several dynamic pages which loaded different content depending on the permissions granted to a user. Selenium is also integrated within Django using the LiveServerTestCase which means it has access to a copy of the database just like the Django unit tests. Selenium tests can be complex and there are several things to watch out for. Selenium doesn’t automatically wait for a page to load before executing the test statements against it, at every point data is loaded Selenium must be told to wait until a given condition is fulfilled up to a maximum time limit before continuing. I still have tests which occasionally fail because, on that particular run, a page or an ajax call is taking longer to load than I have allowed for. Run it another five times and it may well pass on every one. It is also important to make sure the browser is told to scroll to a point where an element can be seen before the instruction to interact with that element is given. It’s not difficult to do and is more predictable that waiting for a page to load but it still has to be remembered every time.

The functional tests are by far the most complex of all the tests I wrote in my three month testing marathon but they are the most important. I can’t automate the entire creation of a digital edition but with tests I can make sure my interface is presenting the correct data in the right way to the editors and that when they interact with that data everything behaves as it should. I really can say that the buttons and other interactive elements I have tested do exactly what I think they do. Now I just need to test all the rest of the buttons – one test at a time!

Research Computing Times #1 – August 2018

August 29th, 2018 | Categories: Research Computing Times | Tags:

Over the years I’ve been blogging, I have run a few recurring series of blogposts.  In the early days, there was the shortlived Problem of The Week.  Sometime later I inherited The Carnival of Maths which I looked after for a couple of years before passing it over to Aperiodical.com who have looked after it ever since.  I also ran a series called A Month of Math Software for 2 and a half years before my enthusiasm for the topic ran out.

I am currently the Head of Research Computing at The University of Leeds — a senior management role that puts me in the fortunate position of being reasonably well-informed about the world of research computing.  Software, hardware, cloud, data science, dealing with sensitive data…everyone, it seems, has something to tell me.  I’m also continuing with my EPSRC fellowship part-time which means that I’m rather more hands on than a typical member of an executive leadership team.

While at JuliaCon 2018, I had the extremely flattering experience of a few people telling me that they had been long time readers of WalkingRandomly and that they were disappointed that I didn’t post as often

All of this has led to the desire to start a new regular series.  One where I look at all aspects of research computing and compile it into a series of monthly posts.  If you have anything you’d like including in next months’ post — contact me via the usual channels.

Botched code causes seven-year scientific argument

For the last couple of years, I have given a talk around the UK and Europe called ‘Is your Research Software Correct‘ (unlike this other talk of mine, it has not yet been recorded but I’ll soon remedy that! Let me know if you can offer a venue with good recording facilities).

I start off by asking the audience to Imagine….imagine that the results of your latest simulation or data analysis are in and they are amazing.  Your heart beats faster, this result changes everything and you know it. This is why you entered science, this is what you always hoped for. Papers in journals like Nature or Science — no problem. Huge grant to follow up this work…professorship….maybe, you dare to dream, this could lead to a nobel prize.  Only one minor problem — the code is completely wrong and you just haven’t figured it out yet.

In the talk (based originally on an old blog post here) I go on to suggest and discuss some simple practices that might help the situation.  Scripting and coding instead of pointy-clicky solutions, version control, testing, open source your software, software citation etc.  None of it is mind blowing but I firmly believe that if all of the advice was taken, we’d have fewer situations like this one…..

Long story short, Two groups were investigating what happens when you super-freeze water.  They disagreed and much shouting happened for 7 years.  There was a bug in the code of one group.  A great article discussing the saga is available over at Physics Today:

https://physicstoday.scitation.org/do/10.1063/PT.6.1.20180822a/full/

Standout quotes that may well end up in a future version of Is Your Research Software Correct include

“One of the real travesties,” he says, is that “there’s no way you could have reproduced [the Berkeley team’s] algorithm—the way they had implemented their code—from reading their paper.” Presumably, he adds, “if this had been disclosed, this saga might not have gone on for seven years”

and

Limmer maintains that he and his mentor weren’t trying to hide anything. “I had and was very willing to share the code,” he says. What he didn’t have, he says, was the time or personnel to prepare the code in a form that could be useful to an outsider. “When Debenedetti’s group was making their code available,” Limmer explains, “he brought people in to clean up the code and document and run tests on it. He had resources available to him to be able to do that.” At Berkeley, “it was just me trying to get that done myself.”

Which is a case study for asking for Research Software Engineer support in a grant if ever I saw one.

Julia gets all grown up — version 1.0 released at JuliaCon 2018

One of the highlights of the JuliaCon 2018 conference was the release of Julia version 1.0 — a milestone that signifies that the new-language-on-the block has reached a certain level of maturity.   We celebrated the release at University of Leeds by installing it on our most recent HPC system – ARC3.

In case you don’t know, Julia is a relatively new free and open source language for technical computing.  It works on everything from Raspberry Pi up to HPC systems with thousands of Cores.  It’s the reason for the letters Ju in Project Jupyter and aims to be an easy to use language (along the lines of Python, R or MATLAB) with the performance of languages like Fortran or C.

UK Research Software Engineering Association Webinar series

The UK Research Software Engineering Association is starting a new webinar series this month with a series of planned topics including an Introduction to Object-Oriented Design for Scientists, Interfacing to/from Python with C, FORTRAN or C++ and Meltdown for Dummies.

These webinars are free to join, and you do not need to register in advance. Full details including the link to join the webinar are available below.

For more information on the RSE webinar series, including information on how to propose a webinar and information on upcoming webinars, please see:

https://rse.ac.uk/events/rse-webinar-series

Verification and Modernisation of Fortran Codes using the NAG Fortran Compiler

There is still a huge amount of research software written in Fortran. Indeed, software written in Fortran are, by far, the most popular codes run on the UK’s national supercomputer service, Archer (See http://www.archer.ac.uk/status/codes/ for up to date stats).

Fortran compilers are not created equally and many professional Fortran developers will suggest that you develop against more than one.  Gfortran is probably essential if you want your code to be usable by all but the Intel Fortran Compiler can often produce the fastest executables on x86 hardware and the NAG Compiler is useful for ensuring correctness.

This webinar by NAG’s Wadud Miah promises to show what the NAG Fortran Compiler can do for your Fortran code.

New Macbook Pro has 6 CPU cores but….

Apple’s new Macbook Pro laptops have a fantastic looking CPUs in them with the top of the line boasting 6 cores and turbo boost up to 4.8Ghz.  Sounds amazing for simulation work but it seems that there are some thermal issues that prevent it from running at top speed for long.

Contact me to get your news included next month

That’s all I have for this first article in the series.  If you have any research computing news that you’d like included in the next edition, contact me.

Video of my talk: Rise of the Research Software Engineer

August 24th, 2018 | Categories: Julia, RSE, Scientific Software, walking randomly | Tags:

Audiences can be brutal

I still have nightmares about the first talk I ever gave as a PhD student. I was not a strong presenter, my grasp of the subject matter was still very tenuous and I was as nervous as hell. Six months or so into my studentship, I was to give a survey of the field I was studying to a bunch of very experienced researchers.  I spent weeks preparing…practicing…honing my slides…hoping that it would all be good enough.

The audience was not kind to me! Even though it was only a small group of around 12 people, they were brutal! I felt like they leaped upon every mistake I made, relished in pointing out every misunderstanding I had and all-round gave me a very hard time.  I had nothing like the robustness I have now and very nearly quit my PhD the very next day. I can only thank my office mates and enough beer to kill a pony for collectively talking me out of quitting.

I remember stopping three quarters of the way through saying ‘That’s all I want to say on the subject’ only for one of the senior members of the audience to point out that ‘You have not talked about all the topics you promised’.  He made me go back to the slide that said something like ‘Things I will talk about’ or ‘Agenda’ or whatever else I called the stupid thing and say ‘Look….you’ve not mentioned points X,Y and Z’ [1].

Everyone agreed and so my torture continued for another 15 minutes or so.

Practice makes you tougher

Since that horrible day, I have given hundreds of talks to audiences that range in size from 5 up to 300+ and this amount of practice has very much changed how I view these events.  I always enjoy them…always!  Even when they go badly!

In the worst case scenario, the most that can happen is that I get given a mildly bad time for an hour or so of my life but I know I’ll get over it. I’ve gotten over it before. No big deal! Academic presentations on topics such as research computing rarely lead to life threatening outcomes.

But what if it was recorded?!

Anyone who has worked with me for an appreciable amount of time will know of my pathological fear of having one of my talks recorded. Point a camera at me and the confident, experienced speaker vanishes and is replaced by someone much closer to the terrified PhD student of my youth.

I struggle to put into words what I’m so afraid of but I wonder if it ultimately comes down to the fact that if that PhD talk had been recorded and put online, I would never have been able to get away from it. My humiliation would be there for all to see…forever.

JuliaCon 2018 and Rise of the Research Software Engineer

When the organizers of JuliaCon 2018 invited me to be a keynote speaker on the topic of Research Software Engineering, my answer was an enthusiastic ‘Yes’. As soon as I learned that they would be live streaming and recording all talks, however, my enthusiasm was greatly dampened.

‘Would you mind if my talk wasn’t live streamed and recorded’ I asked them.  ‘Sure, no problem’ was the answer….

Problem averted. No need to face my fears this week!

A fellow delegate of the conference pointed out to me that my talk would be the only one that wouldn’t be on the live stream. That would look weird and not in a good way.

‘Can I just be live streamed but not recorded’ I asked the organisers.  ‘Sure, no problem’ [2] was the reply….

Later on the technician told me that I could have it recorded but it would be instantly hidden from the world until I had watched it and agreed it wasn’t too terrible.  Maybe this would be a nice first step in my record-a-talk-a-phobia therapy he suggested.

So…on I went and it turned out not to be as terrible as I had imagined it might be.  So we published it. I learned that I say ‘err’ and ‘um’ a lot [3] which I find a little embarrassing but perhaps now that I know I have that problem, it’s something I can work on.

Rise of the Research Software Engineer

Anyway, here’s the video of the talk. It’s about some of the history of The Research Software Engineering movement and how I worked with some awesome people at The University of Sheffield to create a RSE group. If you are the computer-person in your research group who likes software more than papers, you may be one of us. Come join the tribe!

Slide deck at mikecroucher.github.io/juliacon2018/

Thanks to the infinitely patient and wonderful organisers of JuliaCon 2018 for the opportunity to beat one of my long standing fears.

Footnotes

[1] Pro-Tip: Never do one of these ‘Agenda’ slides…give yourself leeway to alter the course of your presentation midway through depending on how well it is going.

[2] So patient! Such a lovely team!

[3] Like A LOT! My mum watched the video and said ‘No idea what you were talking about but OMG can you cut out the ummms and ahhs’

Who are Research Software Engineers?

August 9th, 2018 | Categories: RSE | Tags:

Technological development in software is more like a cliff-face than a ladder – there are many routes to the top, to a solution. Further, the cliff face is dynamic – constantly and quickly changing as new technologies emerge and decline. Determining which technologies to deploy and how best to deploy them is in itself a specialist domain, with many features of traditional research.

Researchers need empowerment and training to give them confidence with the available equipment and the challenges they face. This role, akin to that of an Alpine guide, involves support, guidance, and load carrying. When optimally performed it results in a researcher who knows what challenges they can attack alone, and where they need appropriate support. Guides can help decide whether to exploit well-trodden paths or explore new possibilities as they navigate through this dynamic environment.

These guides are highly trained, technology-centric, research-aware individuals who have a curiosity driven nature dedicated to supporting researchers by forging a research software support career. Such Research Software Engineers (RSEs) guide researchers through the technological landscape and form a human interface between scientist and computer. A well-functioning RSE group will not just add to an organisation’s effectiveness, it will have a multiplicative effect since it will make every individual researcher more effective. It has the potential to improve the quality of research done across all University departments and faculties.

String sorting in R appears to use different ordering from everyone else

April 12th, 2018 | Categories: C/C++, matlab, programming, python, R | Tags:

Update
A discussion on twitter determined that this was an issue with Locales. The practical upshot is that we can make R act the same way as the others by doing

``Sys.setlocale("LC_COLLATE", "C")``
``` ```

which may or may not be what you should do!

Original post

While working on a project that involves using multiple languages, I noticed some tests failing in one language and not the other. Further investigation revealed that this was essentially because R's default sort order for strings is different from everyone else's.

I have no idea how to say to R 'Use the sort order that everyone else is using'. Suggestions welcomed.

R 3.3.2

``````sort(c("#b","-b","-a","#a","a","b"))

[1] "-a" "-b" "#a" "#b" "a" "b"
``````

Python 3.6

``````sorted({"#b","-b","-a","#a","a","b"})

['#a', '#b', '-a', '-b', 'a', 'b']
``````
``` ```

MATLAB 2018a

``````sort([{'#b'},{'-b'},{'-a'},{'#a'},{'a'},{'b'}])

ans =
1×6 cell array
{'#a'} {'#b'} {'-a'} {'-b'} {'a'} {'b'}
``````
``` ```

C++

``````int main(){

std::string mystrs[] = {"#b","-b","-a","#a","a","b"};
std::vector<std::string> stringarray(mystrs,mystrs+6);
std::vector<std::string>::iterator it;

std::sort(stringarray.begin(),stringarray.end());

for(it=stringarray.begin(); it!=stringarray.end();++it) {
std::cout << *it << " ";
}

return 0;
}
``````

Result:

``````#a #b -a -b a b
``````

Research Software Engineer: A New Career Track?

March 2nd, 2018 | Categories: RSE | Tags:

Along with fellow Fellow Chris Richardson, we wrote an article over at Siam News about the emerging Research Software Engineering profession.  Head over to Research Software Engineer: A New Career Track? to check it out.

If this has whetted your appetite for learning more about Research Software Engineering then feel free to read the RSE 2017 State of the Nation report from last year.  Finally, I urge you to join the UK RSE association if you have any interest in this area.

Strange MATLAB performance issue on Microsoft Azure F72s_v2 instances

March 1st, 2018 | Categories: Cloud Computing, HPC, Making MATLAB faster, matlab | Tags:

I’m working on some MATLAB code at the moment that I’ve managed to reduce down to a bunch of implicitly parallel functions. This is nice because the data that we’ll eventually throw at it will be represented as a lot of huge matrices.  As such, I’m expecting that if we throw a lot of cores at it, we’ll get a lot of speed-up.  Preliminary testing on local HPC nodes shows that I’m probably right.

During testing and profiling on a smaller data set I thought that it would be fun to run the code on the most powerful single node I can lay my hands on.  In my case that’s an Azure F72s_v2 which I currently get for free thanks to a Microsoft Azure for Research grant I won.

These Azure F72s_v2 machines are NICE! Running Intel Xeon Platinum 8168 CPUs with 72 virtual cores and 144GB of RAM, they put my Macbook Pro to shame! Theoretically, they should be more powerful than any of the nodes I can access on my University HPC system.

So, you can imagine my surprise when the production code ran almost 3 times slower than on my Macbook Pro!

Here’s a Microbenchmark, extracted from the production code, running on MATLAB 2017b on a few machines to show the kind of slowdown I experienced on these super powerful virtual machines.

``` test_t = rand(8755,1); test_c = rand(5799,1); tic;test_res = bsxfun(@times,test_t,test_c');toc tic;test_res = bsxfun(@times,test_t,test_c');toc ```

I ran the bsxfun twice and report the fastest since the first call to any function in MATLAB is often slower than subsequent ones for various reasons. This quick and dirty benchmark isn’t exactly rigorous but its good enough to show the issue.

• Azure F72s_v2 (72 vcpus, 144 GB memory) running Windows Server 2016: 0.3 seconds
• Azure F32s_v2 (32 vcpus, 64 GB memory) running Windows Server 2016: 0.29 seconds
• 2014 Macbook Pro running OS X: 0.11 seconds
• Dell XPS 15 9560 laptop running Windows 10: 0.11 seconds
• 8 cores on a node of Sheffield University’s Linux HPC cluster: 0.03 seconds
• 16 cores on a node of Sheffield University’s Linux HPC cluster: 0.015 seconds

After a conversation on twitter, I ran it on Azure twice — once on a 72 vCPU instance and once on a 32 vCPU instance. This was to test if the issue was related to having 2 physical CPUs. The results were pretty much identical.

The results from the University HPC cluster are more in line with what I expected to see — faster than a laptop and good scaling with respect to number of cores.  I tried running it on 32 cores but the benchmark is still in the queue ;)

What’s going on?

I have no idea! I’m stumped to be honest.  Here are some thoughts that occur to me in no particular order

• Maybe it’s an issue with Windows Server 2016. Is there some environment variable I should have set or security option I could have changed? Maybe the Windows version of MATLAB doesn’t get on well with large core counts? I can only test up to 4 on my own hardware and that’s using Windows 10 rather than Windows server.  I need to repeat the experiment using a Linux guest OS.
• Is it an issue related to the fact that there isn’t a 1:1 mapping between physical hardware and virtual cores? Intel Xeon Platinum 8168 CPUs have 24 cores giving 48 hyperthreads so two of them would give me 48 cores and 96 hyperthreads.  They appear to the virtualised OS as 2 x 18 core CPUs with 72 hyperthreads in total.   Does this matter in any way?

Creating a temporary, customised, multi-user HPC cluster for teaching using Amazon AWS and Alces Flight

February 21st, 2018 | Categories: Cloud Computing, HPC | Tags:

In a previous blog post, I told the story of how I used Amazon AWS and AlcesFlight to create a temporary multi-user HPC cluster for use in a training course.  Here are the details of how I actually did it.

Note that I have only ever used this configuration as a training cluster.  I am not suggesting that the customisations are suitable for real work.

Before you start

Before attempting to use AlcesFlight on AWS, I suggest that you ensure that you have the following things working

Customizing the HPC cluster on AWS

AlcesFlight provides a CloudFormation template for launching cluster instances on Amazon AWS.  The practical upshot of this is that you answer a bunch of questions on a web form to customise your cluster and then you launch it.

We are going to use this CloudFormation template along with some bash scripts that provide additional customisation.

Get the customisation scripts

The first step is to get some customization scripts in an S3 bucket. You could use your own or you could use the ones I created.

If you use mine, make sure you take a good look at them first to make sure you are happy with what I’ve done!  It’s probably worth using your own fork of my repo so you can customise your cluster further.

It’s the bash scripts that allow the creation of a bunch of user accounts for trainees with randomized passwords.  My scripts do some other things too and I’ve listed everything in the github README.md.
``` git clone https://github.com/mikecroucher/alces_flight_customisation cd alces_flight_customisation ```

Now you need to upload these to an s3 bucket. I called mine walkingrandomly-aws-cluster
``` aws s3api create-bucket --bucket walkingrandomly-aws-cluster --region eu-west-2 --create-bucket-configuration LocationConstraint=eu-west-2 aws s3 sync . s3://walkingrandomly-aws-cluster --delete ```

Set up the CloudFormation template

• Head over to Alces Flight Solo (Community Edition) and click on continue to subscribe
• Choose the region you want to create the cluster in, select Personal HPC compute cluster and click on Launch with CloudFormationConsole

• Go through the CloudFormation template screens, creating the cluster as you want it until you get to the S3 bubcket for customization profiles box where you fill in the name of the S3 bucket you created earlier.
• Enable the default profile

• Continue answering the questions asked by the web form.  For this simple training cluster, I just accepted all of the defaults and it worked fine

When the CloudFormation stack has been fully created, you can log into your new cluster as an administrator.  To get the connection details of the headnode, go to the EC2 management console in your web-browser, select the headnode and click on Connect.

This is probably not great sysadmin practice but worked on the day.  If anyone can come up with a better way, Pull Requests are welcomed!

Try a training account

At this point, I suggest that you try logging in as one of the training user accounts and make sure you can successfully submit a job.  When I first tried all of this, the default scheduler on the created cluster was SunGridEngine and my first attempt at customisation left me with user accounts that couldn’t submit jobs.

The current scripts have been battle tested with Sun Grid Engine, including MPI job submission and I’ve also done a very basic test with Slurm. However, you really should check that a user account can submit all of the types of job you expect to use in class.

Troubleshooting

When I first tried to do this, things didn’t go completely smoothly.  Here are some things I learned to help diagnose the problems

Full documentation is available at http://docs.alces-flight.com/en/stable/customisation/customisation.html

On the cluster, we can see where its looking for its customisation scripts with the alces about command

``` alces about customizer Customizer bucket prefix: s3://walkingrandomly-aws-cluster/customizer ```

The log file at /var/log/clusterware/instance.log on both the head node and worker nodes is very useful.

Once, I did all of this using a Windows CMD bash prompt and the customisation scripts failed to run.  The logs showed this error
``` /bin/bash^M: bad interpreter: No such file or directory```

This is a classic dos2unix error and could be avoided, for example, by using the Windows Subsystem for linux instead of CMD.exe.

Bespoke High Performance Computing Clusters in the Cloud with Alces Flight

February 21st, 2018 | Categories: Cloud Computing, HPC | Tags:

I needed a supercomputer…..quickly!

One of the things that we do in Sheffield’s Research Software Engineering Group is host training courses delivered by external providers.  One such course is on parallel programming using MPI for which we turn to the experts at NAG (Numerical Algorithms Group).  A few days before turning up to deliver the course, the trainer got in touch with me to ask for details about our HPC cluster.

Because Croucher’s law, I had forgotten to let our HPC sysadmin know that I’d need a bunch of training accounts and around 128 cores set-aside for us to play around with for a couple of days.

In other words, I was hosting a supercomputing course and had forgotten the supercomputer.

Building a HPC cluster in the cloud

AlcesFlight is a relatively new product that allows you to spin up a traditional-looking High Performance Computing cluster on cloud computing substrates such as Microsoft Azure or Amazon AWS.  You get a head node, a bunch of worker nodes and a job scheduler such as Slurm or Sun Grid Engine. It looks just the systems that The University of Sheffield provides for its researchers!

You also get lots of nice features such as the ability to scale the number of worker nodes according to demand, a metric ton of available applications and the ability to customise the cluster at start up.

The supercomputing budget was less than the coffee budget

…and I only bought coffee for myself and the two trainers over the two days!  The attendees had to buy their own (In my defence…the course was free for attendees!).

I used the following

• A head node of:  t2.large (2 vCPUs, 8Gb RAM)
• Initial worker nodes: 4 of c4.4xlarge (16 vCPUs and 30GB RAM each)
• Maximum worker nodes: 8 of c4.4xlarge (16 vCPUs and 30GB RAM each)

This gave me a cluster with between 64 and 128 virtual cores depending on the amount that the class were using it.  Much of the time, only 4 nodes were up and running – the others spun up automatically when the class needed them and vanished when they hadn’t been used for a while.

I was using the EU (Ireland) region and the prices at the time were

• Head node: On demand pricing of \$0.101 per hour
• Worker nodes: \$0.24 (ish) using spot pricing. Each one about twice as powerful as a 2014 Macbook Pro according to this benchmark.

HPC cost: As such, the maximum cost of this cluster was \$2.73 per hour when all nodes were up and running. The class ran from 10am to 5pm for two days so we needed it for 14 hours.  Maximum cost would have been \$38.22.

Coffee cost: 2 instructors and me needed coffee twice a day. So that’s 12 coffees in total.  Around £2.50 or \$3.37 per coffee so \$40.44

The HPC cost was probably less than that since we didn’t use 128 cores all the time and the coffee probably cost a little more.

Setting up the cluster

Technical details of how I configured the cluster can be found in the follow up post at http://www.walkingrandomly.com/?p=6431