January 23rd, 2019 | Categories: C/C++, Linux, programming | Tags:

I have been an advocate of the Windows Subsystem for Linux ever since it was released (see Bash on Windows: The scripting game just changed) since it allows me to use the best of Linux from my Windows laptop.  I no longer dual boot on my personal machines and rarely need to use Linux VMs on them either, thanks to this technology.  I still use full-blown Linux a lot of course, but these days it tends to be only on servers and HPC systems.

I recently needed to compile and play with some code that was based on the GNU Scientific Library. Using the Ubuntu 18.04 version of the WSL, this is very easy. Install the GSL with

sudo apt-get install libgsl-dev

A simple code that evaluates Dawson’s integral over a range of x values is shown below. Call this dawson.cpp

#include<iostream>
#include<vector>
#include<gsl/gsl_sf.h>

int main(){

    double range = 6;   // Evaluate from -range to +range
    int N = 100000;     // Number of evaluations
    double step = 2 * range / N;
    std::vector<double> x(N);
    std::vector<double> result(N);

    // Evaluate Dawson's integral at N equally spaced points
    for (int i = 0; i < N; i++){
        x[i] = -range + i * step;
        result[i] = gsl_sf_dawson(x[i]);
    }

    // Output as comma-separated values
    for (int i = 0; i < N; i++){
        std::cout << x[i] << "," << result[i] << std::endl;
    }

    return 0;
}

Compile with

g++ -std=c++11 dawson.cpp -o ./dawson -lgsl -lgslcblas -lm

Run it and generate some results

./dawson > results.txt

If we plot results.txt we get

[Plot of Dawson's integral produced from results.txt]
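One way to plot it is with a few lines of Python and matplotlib. This is just one option (any plotting tool will do) and the only assumption is the comma-separated results.txt produced above.

import numpy as np
import matplotlib.pyplot as plt

# results.txt has two comma-separated columns: x and dawson(x)
data = np.loadtxt("results.txt", delimiter=",")
plt.plot(data[:, 0], data[:, 1])
plt.xlabel("x")
plt.ylabel("dawson(x)")
plt.show()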

This code is also available on GitHub: https://github.com/mikecroucher/GSL_example

November 20th, 2018 | Categories: Making MATLAB faster, matlab, random numbers, Scientific Software | Tags:

In a recent blog post, Daniel Lemire explicitly demonstrated that vectorising random number generators using SIMD instructions could give useful speed-ups.  This reminded me of one of the first times I played with the Julia language, where I learned that Julia's random number generator used a SIMD-accelerated implementation of the Mersenne Twister called dSFMT to generate random numbers much faster than MATLAB's Mersenne Twister implementation.

Just recently, I learned that MATLAB now has its own SIMD accelerated version of Mersenne Twister which can be activated like so:

seed=1;
rng(seed,'simdTwister')

This new Mersenne Twister implementation gives different random variates from the original implementation (which, as I demonstrated in an older post, is the same as Numpy's implementation), as you might expect:

>> rng(1,'Twister')
>> rand(1,5)
ans =
0.4170 0.7203 0.0001 0.3023 0.1468
>> rng(1,'simdTwister')
>> rand(1,5)
ans =
0.1194 0.9124 0.5032 0.8713 0.5324
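As an aside, the NumPy claim is easy to check for yourself because NumPy's legacy generator is also Mersenne Twister. A quick Python sketch using the global legacy interface:

import numpy as np

np.random.seed(1)            # same seed as the MATLAB examples above
print(np.random.rand(1, 5))  # agrees with the 'Twister' values above to the 4 decimal places MATLAB displays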

So it’s clearly a different algorithm and, on CPUs that support the relevant instructions, it’s about twice as fast!  Using my very unscientific test code:

format compact
number_of_randoms = 10000000
disp('Timing standard twister')
rng(1,'Twister')
tic;rand(1,number_of_randoms);toc
disp('Timing SIMD twister')
rng(1,'simdTwister')
tic;rand(1,number_of_randoms);toc

gives the following results for a typical run on my Dell XPS 15 9560, which supports AVX instructions:

number_of_randoms =
    10000000
Timing standard twister
Elapsed time is 0.101307 seconds.
Timing SIMD twister
Elapsed time is 0.057441 seconds.

The MATLAB documentation does not tell us which algorithm their implementation is based on but it seems to be different from Julia’s. In Julia, if we set the seed to 1 as we did for MATLAB and ask for 5 random numbers, we get something different from MATLAB:

julia> using Random
julia> Random.seed!(1);
julia> rand(1,5)
1×5 Array{Float64,2}:
 0.236033  0.346517  0.312707  0.00790928  0.48861

The performance of MATLAB's new generator is on a par with Julia's, although I'll repeat that these timing tests are far, far from rigorous.

julia> Random.seed!(1);

julia> @time rand(1,10000000);
  0.052981 seconds (6 allocations: 76.294 MiB, 29.40% gc time)

October 1st, 2018 | Categories: programming, RSE | Tags:

A guest blog post by Catherine Smith of the University of Birmingham

In early 2017 I was in the audience at one of Mike Croucher's 'Is your research software correct?' presentations. One of the first questions posed in the talk is 'how reproducible is a mouse click?'. The answer, of course, is that it isn't and therefore research processes should be automated and not involve anyone pressing any buttons. This posed something of a challenge to my own work, which is primarily about making buttons for researchers to press (and select, drag and drop etc.) in order to present their data in the appropriate scholarly way. This software, for making electronic editions of texts preserved in multiple sources, assists with the alignment and editing of material. Even so, the editor is always in control and that is the way it should be. The lack of automation means reproducibility is a problem for my software, but as Peter Shillingsburg, one of the pioneers of digital editing, says, 'editing is an art not a science': maybe art can therefore be excused, to an extent, from the constraints of automation and, despite their introduction of human decisions, the buttons may be permitted to stay. Nevertheless, I still want to know that my software is doing what I think it is doing, even if I can't automate what editors choose to do with it. In the discussion that followed the talk, I mentioned the complication of testing my interface-heavy software. Mike agreed that it was a complex situation but concluded by saying "if you go away from here and write one test you will have made the world a better place".

I did just that. In fact I did very little else for the next three months. What started with one Python unit test has so far led to 65 Python unit tests, 82 Javascript unit tests and 54 functional tests using Selenium. The timing of all of this was perfect in that I had just begun a project to migrate all of our web applications to Django. I had one application partially migrated and so I tested that one and even did some test-driven development on the sections that were not yet complete.

The tests themselves are great to have. This was my first project using Django and I made lots of mistakes in the first application. The tests have been invaluable in ensuring that, as I learned more and made improvements, the older code kept pace with those changes. Now that I have tests for some things I want tests for everything, and I have developed a healthy fear of editing code that is not yet tested. There are other advantages as well. When I sat down to write my first test it very quickly became clear that the code I had written was not easily testable. I had to break down the large Django views into smaller chunks of code that could each be unit tested. I now write better-structured code because of that time I invested in testing just some of it. I also learned a lot about how to approach migrating all of the remaining applications while writing the detailed tests for every aspect of the first one.

Django has an integrated test framework based on the Python unittest module, but with the additional benefit of automatically creating a test database, using the models from the project, to which test data can be added. It was very straightforward to set up and run (see the Django docs https://docs.djangoproject.com/en/2.1/topics/testing/). I found Javascript unit testing less straightforward. There was not much Javascript in this first application so I used the qunit test framework and sinon.js for mocking. I have never automated the running of these tests and instead just open them in the browser. It's not ideal but it works for now. I have other applications which are far more Javascript-heavy and for those I will need automated tests; there are plenty of frameworks around to choose from, so I will investigate them when I start writing the tests.
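For readers who have not seen one, a Django unit test is roughly this shape. The app and model names below are invented purely for illustration and are not taken from the project described here:

from django.test import TestCase
from collation.models import Witness  # hypothetical app and model, for illustration only

class WitnessModelTests(TestCase):

    def setUp(self):
        # Django creates a throwaway test database automatically;
        # populate it with some test data
        Witness.objects.create(siglum="W1", description="First witness")

    def test_lookup_by_siglum(self):
        witness = Witness.objects.get(siglum="W1")
        self.assertEqual(witness.description, "First witness")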

Probably the most important tests I have are the functional tests, which are written in Selenium. I had already heard of Selenium, having attended a Test-Driven Development workshop several years ago given by Harry Percival. I used his book, Test-Driven Development with Python, as a tutorial for all of the Selenium tests and some of the Django and Javascript tests too. Selenium tests are automated browser tests which really do allow you to test what happens when a user presses a button, types text into a text box, selects an item from a list, moves an element by dragging it and so on. The result of every interaction in an interface can be tested with Selenium. The content of each page can also be checked. It is generally not necessary to test static html, but I did test the contents of several dynamic pages which loaded different content depending on the permissions granted to a user. Selenium is also integrated with Django using the LiveServerTestCase, which means it has access to a copy of the database just like the Django unit tests. Selenium tests can be complex and there are several things to watch out for. Selenium doesn't automatically wait for a page to load before executing the test statements against it: at every point where data is loaded, Selenium must be told to wait until a given condition is fulfilled, up to a maximum time limit, before continuing. I still have tests which occasionally fail because, on that particular run, a page or an ajax call is taking longer to load than I have allowed for. Run it another five times and it may well pass on every one. It is also important to make sure the browser is told to scroll to a point where an element can be seen before the instruction to interact with that element is given. It's not difficult to do and is more predictable than waiting for a page to load, but it still has to be remembered every time.
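As an example, an explicit wait followed by a scroll in a Selenium (Python) test looks something like the sketch below; the URL and element id are invented for illustration:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://localhost:8000/collation/")  # hypothetical page, for illustration only

# Wait up to 10 seconds for the dynamically loaded element to appear;
# a TimeoutException is raised if it never does
panel = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "apparatus_panel"))
)

# Scroll the element into view before interacting with it
driver.execute_script("arguments[0].scrollIntoView();", panel)
panel.click()

driver.quit()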

The functional tests are by far the most complex of all the tests I wrote in my three month testing marathon but they are the most important. I can’t automate the entire creation of a digital edition but with tests I can make sure my interface is presenting the correct data in the right way to the editors and that when they interact with that data everything behaves as it should. I really can say that the buttons and other interactive elements I have tested do exactly what I think they do. Now I just need to test all the rest of the buttons – one test at a time!

August 29th, 2018 | Categories: Research Computing Times | Tags:

Over the years I've been blogging, I have run a few recurring series of blog posts.  In the early days, there was the short-lived Problem of The Week.  Sometime later I inherited The Carnival of Maths, which I looked after for a couple of years before passing it over to Aperiodical.com, who have looked after it ever since.  I also ran a series called A Month of Math Software for two and a half years before my enthusiasm for the topic ran out.

I am currently the Head of Research Computing at The University of Leeds — a senior management role that puts me in the fortunate position of being reasonably well-informed about the world of research computing.  Software, hardware, cloud, data science, dealing with sensitive data…everyone, it seems, has something to tell me.  I’m also continuing with my EPSRC fellowship part-time which means that I’m rather more hands on than a typical member of an executive leadership team.

While at JuliaCon 2018, I had the extremely flattering experience of a few people telling me that they had been long-time readers of WalkingRandomly and that they were disappointed that I didn't post as often as I used to.

All of this has led to the desire to start a new regular series.  One where I look at all aspects of research computing and compile it into a series of monthly posts.  If you have anything you'd like included in next month's post, contact me via the usual channels.

Botched code causes seven-year scientific argument

For the last couple of years, I have given a talk around the UK and Europe called ‘Is your Research Software Correct‘ (unlike this other talk of mine, it has not yet been recorded but I’ll soon remedy that! Let me know if you can offer a venue with good recording facilities).

I start off by asking the audience to imagine….imagine that the results of your latest simulation or data analysis are in and they are amazing.  Your heart beats faster; this result changes everything and you know it. This is why you entered science, this is what you always hoped for. Papers in journals like Nature or Science — no problem. Huge grant to follow up this work…professorship….maybe, you dare to dream, this could lead to a Nobel Prize.  Only one minor problem — the code is completely wrong and you just haven't figured it out yet.

In the talk (based originally on an old blog post here) I go on to suggest and discuss some simple practices that might help the situation: scripting and coding instead of pointy-clicky solutions, version control, testing, open-sourcing your software, software citation and so on.  None of it is mind-blowing but I firmly believe that if all of the advice was taken, we'd have fewer situations like this one…..

Long story short: two groups were investigating what happens when you super-freeze water.  They disagreed, and much shouting happened for seven years.  There was a bug in the code of one group.  A great article discussing the saga is available over at Physics Today:

https://physicstoday.scitation.org/do/10.1063/PT.6.1.20180822a/full/

Standout quotes that may well end up in a future version of Is Your Research Software Correct include

“One of the real travesties,” he says, is that “there’s no way you could have reproduced [the Berkeley team’s] algorithm—the way they had implemented their code—from reading their paper.” Presumably, he adds, “if this had been disclosed, this saga might not have gone on for seven years”

and

Limmer maintains that he and his mentor weren’t trying to hide anything. “I had and was very willing to share the code,” he says. What he didn’t have, he says, was the time or personnel to prepare the code in a form that could be useful to an outsider. “When Debenedetti’s group was making their code available,” Limmer explains, “he brought people in to clean up the code and document and run tests on it. He had resources available to him to be able to do that.” At Berkeley, “it was just me trying to get that done myself.”

Which is a case study for asking for Research Software Engineer support in a grant if ever I saw one.

Julia gets all grown up — version 1.0 released at JuliaCon 2018

One of the highlights of the JuliaCon 2018 conference was the release of Julia version 1.0 — a milestone that signifies that the new-language-on-the-block has reached a certain level of maturity.   We celebrated the release at the University of Leeds by installing it on our most recent HPC system – ARC3.

In case you don't know, Julia is a relatively new free and open source language for technical computing.  It works on everything from a Raspberry Pi up to HPC systems with thousands of cores.  It's the reason for the letters 'Ju' in Project Jupyter and it aims to be an easy-to-use language (along the lines of Python, R or MATLAB) with the performance of languages like Fortran or C.

UK Research Software Engineering Association Webinar series

The UK Research Software Engineering Association is starting a new webinar series this month with a series of planned topics including an Introduction to Object-Oriented Design for Scientists, Interfacing to/from Python with C, FORTRAN or C++ and Meltdown for Dummies.

These webinars are free to join, and you do not need to register in advance. Full details including the link to join the webinar are available below.

For more information on the RSE webinar series, including information on how to propose a webinar and information on upcoming webinars, please see:

https://rse.ac.uk/events/rse-webinar-series

This page will also have links to recordings of past webinars when they become available.

Verification and Modernisation of Fortran Codes using the NAG Fortran Compiler

There is still a huge amount of research software written in Fortran. Indeed, codes written in Fortran are, by far, the most popular codes run on the UK's national supercomputer service, Archer (see http://www.archer.ac.uk/status/codes/ for up-to-date stats).

Fortran compilers are not created equal and many professional Fortran developers will suggest that you develop against more than one.  Gfortran is probably essential if you want your code to be usable by all, but the Intel Fortran Compiler can often produce the fastest executables on x86 hardware and the NAG Compiler is useful for ensuring correctness.

This webinar by NAG’s Wadud Miah promises to show what the NAG Fortran Compiler can do for your Fortran code.

New Macbook Pro has 6 CPU cores but….

Apple's new Macbook Pro laptops have fantastic-looking CPUs in them, with the top of the line boasting 6 cores and turbo boost up to 4.8GHz.  Sounds amazing for simulation work, but it seems that there are some thermal issues that prevent it from running at top speed for long.

Contact me to get your news included next month

That’s all I have for this first article in the series.  If you have any research computing news that you’d like included in the next edition, contact me.

August 24th, 2018 | Categories: Julia, RSE, Scientific Software, walking randomly | Tags:

Audiences can be brutal

I still have nightmares about the first talk I ever gave as a PhD student. I was not a strong presenter, my grasp of the subject matter was still very tenuous and I was as nervous as hell. Six months or so into my studentship, I was to give a survey of the field I was studying to a bunch of very experienced researchers.  I spent weeks preparing…practicing…honing my slides…hoping that it would all be good enough.

The audience was not kind to me! Even though it was only a small group of around 12 people, they were brutal! I felt like they leaped upon every mistake I made, relished pointing out every misunderstanding I had and all-round gave me a very hard time.  I had nothing like the robustness I have now and very nearly quit my PhD the very next day. I can only thank my office mates and enough beer to kill a pony for collectively talking me out of quitting.

I remember stopping three quarters of the way through saying ‘That’s all I want to say on the subject’ only for one of the senior members of the audience to point out that ‘You have not talked about all the topics you promised’.  He made me go back to the slide that said something like ‘Things I will talk about’ or ‘Agenda’ or whatever else I called the stupid thing and say ‘Look….you’ve not mentioned points X,Y and Z’ [1].

Everyone agreed and so my torture continued for another 15 minutes or so.

Practice makes you tougher

Since that horrible day, I have given hundreds of talks to audiences that range in size from 5 up to 300+ and this amount of practice has very much changed how I view these events.  I always enjoy them…always!  Even when they go badly!

In the worst case scenario, the most that can happen is that I get given a mildly bad time for an hour or so of my life but I know I’ll get over it. I’ve gotten over it before. No big deal! Academic presentations on topics such as research computing rarely lead to life threatening outcomes.

But what if it was recorded?!

Anyone who has worked with me for an appreciable amount of time will know of my pathological fear of having one of my talks recorded. Point a camera at me and the confident, experienced speaker vanishes and is replaced by someone much closer to the terrified PhD student of my youth.

I struggle to put into words what I’m so afraid of but I wonder if it ultimately comes down to the fact that if that PhD talk had been recorded and put online, I would never have been able to get away from it. My humiliation would be there for all to see…forever.

JuliaCon 2018 and Rise of the Research Software Engineer

When the organizers of JuliaCon 2018 invited me to be a keynote speaker on the topic of Research Software Engineering, my answer was an enthusiastic ‘Yes’. As soon as I learned that they would be live streaming and recording all talks, however, my enthusiasm was greatly dampened.

‘Would you mind if my talk wasn’t live streamed and recorded’ I asked them.  ‘Sure, no problem’ was the answer….

Problem averted. No need to face my fears this week!

A fellow delegate of the conference pointed out to me that my talk would be the only one that wouldn’t be on the live stream. That would look weird and not in a good way.

‘Can I just be live streamed but not recorded’ I asked the organisers.  ‘Sure, no problem’ [2] was the reply….

Later on the technician told me that I could have it recorded but it would be instantly hidden from the world until I had watched it and agreed it wasn’t too terrible.  Maybe this would be a nice first step in my record-a-talk-a-phobia therapy he suggested.

So…on I went and it turned out not to be as terrible as I had imagined it might be.  So we published it. I learned that I say ‘err’ and ‘um’ a lot [3] which I find a little embarrassing but perhaps now that I know I have that problem, it’s something I can work on.

Rise of the Research Software Engineer

Anyway, here's the video of the talk. It's about some of the history of the Research Software Engineering movement and how I worked with some awesome people at The University of Sheffield to create an RSE group. If you are the computer-person in your research group who likes software more than papers, you may be one of us. Come join the tribe!

Slide deck at mikecroucher.github.io/juliacon2018/

Feel free to talk to me on twitter about it: @walkingrandomly

Thanks to the infinitely patient and wonderful organisers of JuliaCon 2018 for the opportunity to beat one of my long standing fears.

Footnotes

[1] Pro-Tip: Never do one of these ‘Agenda’ slides…give yourself leeway to alter the course of your presentation midway through depending on how well it is going.

[2] So patient! Such a lovely team!

[3] Like A LOT! My mum watched the video and said ‘No idea what you were talking about but OMG can you cut out the ummms and ahhs’

August 9th, 2018 | Categories: RSE | Tags:

Technological development in software is more like a cliff-face than a ladder – there are many routes to the top, to a solution. Further, the cliff face is dynamic – constantly and quickly changing as new technologies emerge and decline. Determining which technologies to deploy and how best to deploy them is in itself a specialist domain, with many features of traditional research.

Researchers need empowerment and training to give them confidence with the available equipment and the challenges they face. This role, akin to that of an Alpine guide, involves support, guidance, and load carrying. When optimally performed it results in a researcher who knows what challenges they can attack alone, and where they need appropriate support. Guides can help decide whether to exploit well-trodden paths or explore new possibilities as they navigate through this dynamic environment.

These guides are highly trained, technology-centric, research-aware individuals who have a curiosity driven nature dedicated to supporting researchers by forging a research software support career. Such Research Software Engineers (RSEs) guide researchers through the technological landscape and form a human interface between scientist and computer. A well-functioning RSE group will not just add to an organisation’s effectiveness, it will have a multiplicative effect since it will make every individual researcher more effective. It has the potential to improve the quality of research done across all University departments and faculties.


Further Reading

April 12th, 2018 | Categories: C/C++, matlab, programming, python, R | Tags:

Update
A discussion on twitter determined that this was an issue with locales. The practical upshot is that we can make R act the same way as the others by doing

Sys.setlocale("LC_COLLATE", "C")

which may or may not be what you should do!
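Going the other way, you can see the same locale effect from Python by sorting with locale-aware collation instead of raw code points. This is only a sketch: it assumes the en_GB.UTF-8 locale is installed on your system, and the exact order you get will depend on your locale's collation rules.

import locale

strings = ["#b", "-b", "-a", "#a", "a", "b"]

# Default: plain code-point ordering (as in the Python/MATLAB/C++ results below)
print(sorted(strings))

# Locale-aware collation, which weights punctuation differently
locale.setlocale(locale.LC_COLLATE, "en_GB.UTF-8")
print(sorted(strings, key=locale.strxfrm))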

Original post

While working on a project that involves using multiple languages, I noticed some tests failing in one language and not the other. Further investigation revealed that this was essentially because R's default sort order for strings is different from everyone else's.

I have no idea how to say to R 'Use the sort order that everyone else is using'. Suggestions welcomed.

R 3.3.2

sort(c("#b","-b","-a","#a","a","b"))

[1] "-a" "-b" "#a" "#b" "a" "b"

Python 3.6

sorted({"#b","-b","-a","#a","a","b"})

['#a', '#b', '-a', '-b', 'a', 'b']


MATLAB 2018a

sort([{'#b'},{'-b'},{'-a'},{'#a'},{'a'},{'b'}])

ans =
1×6 cell array
{'#a'} {'#b'} {'-a'} {'-b'} {'a'} {'b'}

C++

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

int main(){

    std::string mystrs[] = {"#b","-b","-a","#a","a","b"};
    std::vector<std::string> stringarray(mystrs, mystrs + 6);
    std::vector<std::string>::iterator it;

    // Default std::sort compares strings by character code points
    std::sort(stringarray.begin(), stringarray.end());

    for(it = stringarray.begin(); it != stringarray.end(); ++it){
        std::cout << *it << " ";
    }

    return 0;
}

Result:

#a #b -a -b a b

March 2nd, 2018 | Categories: RSE | Tags:

Together with fellow Fellow Chris Richardson, I wrote an article over at SIAM News about the emerging Research Software Engineering profession.  Head over to Research Software Engineer: A New Career Track? to check it out.

If this has whetted your appetite for learning more about Research Software Engineering then feel free to read the RSE 2017 State of the Nation report from last year.  Finally, I urge you to join the UK RSE association if you have any interest in this area.

 

March 1st, 2018 | Categories: Cloud Computing, HPC, Making MATLAB faster, matlab | Tags:

I’m working on some MATLAB code at the moment that I’ve managed to reduce down to a bunch of implicitly parallel functions. This is nice because the data that we’ll eventually throw at it will be represented as a lot of huge matrices.  As such, I’m expecting that if we throw a lot of cores at it, we’ll get a lot of speed-up.  Preliminary testing on local HPC nodes shows that I’m probably right.

During testing and profiling on a smaller data set I thought that it would be fun to run the code on the most powerful single node I can lay my hands on.  In my case that’s an Azure F72s_v2 which I currently get for free thanks to a Microsoft Azure for Research grant I won.

These Azure F72s_v2 machines are NICE! Running Intel Xeon Platinum 8168 CPUs with 72 virtual cores and 144GB of RAM, they put my Macbook Pro to shame! Theoretically, they should be more powerful than any of the nodes I can access on my University HPC system.

So, you can imagine my surprise when the production code ran almost 3 times slower than on my Macbook Pro!

Here's a microbenchmark, extracted from the production code, running on MATLAB 2017b on a few machines to show the kind of slowdown I experienced on these super-powerful virtual machines.


test_t = rand(8755,1);
test_c = rand(5799,1);
tic;test_res = bsxfun(@times,test_t,test_c');toc
tic;test_res = bsxfun(@times,test_t,test_c');toc

I ran the bsxfun twice and report the fastest time since the first call to any function in MATLAB is often slower than subsequent ones for various reasons. This quick and dirty benchmark isn't exactly rigorous but it's good enough to show the issue.

  • Azure F72s_v2 (72 vcpus, 144 GB memory) running Windows Server 2016: 0.3 seconds
  • Azure F32s_v2 (32 vcpus, 64 GB memory) running Windows Server 2016: 0.29 seconds
  • 2014 Macbook Pro running OS X: 0.11 seconds
  • Dell XPS 15 9560 laptop running Windows 10: 0.11 seconds
  • 8 cores on a node of Sheffield University’s Linux HPC cluster: 0.03 seconds
  • 16 cores on a node of Sheffield University’s Linux HPC cluster: 0.015 seconds

After a conversation on twitter, I ran it on Azure twice — once on a 72 vCPU instance and once on a 32 vCPU instance. This was to test if the issue was related to having 2 physical CPUs. The results were pretty much identical.

The results from the University HPC cluster are more in line with what I expected to see — faster than a laptop and good scaling with respect to number of cores.  I tried running it on 32 cores but the benchmark is still in the queue ;)

What’s going on?

I have no idea! I’m stumped to be honest.  Here are some thoughts that occur to me in no particular order

  • Maybe it’s an issue with Windows Server 2016. Is there some environment variable I should have set or security option I could have changed? Maybe the Windows version of MATLAB doesn’t get on well with large core counts? I can only test up to 4 on my own hardware and that’s using Windows 10 rather than Windows server.  I need to repeat the experiment using a Linux guest OS.
  • Is it an issue related to the fact that there isn’t a 1:1 mapping between physical hardware and virtual cores? Intel Xeon Platinum 8168 CPUs have 24 cores giving 48 hyperthreads so two of them would give me 48 cores and 96 hyperthreads.  They appear to the virtualised OS as 2 x 18 core CPUs with 72 hyperthreads in total.   Does this matter in any way?

 

February 21st, 2018 | Categories: Cloud Computing, HPC | Tags:

In a previous blog post, I told the story of how I used Amazon AWS and AlcesFlight to create a temporary multi-user HPC cluster for use in a training course.  Here are the details of how I actually did it.

Note that I have only ever used this configuration as a training cluster.  I am not suggesting that the customisations are suitable for real work.

Before you start

Before attempting to use AlcesFlight on AWS, I suggest that you ensure that you have the basics in place: at the very least, an AWS account, git, and the AWS command line interface installed and configured, since the steps below make use of them.

Customizing the HPC cluster on AWS

AlcesFlight provides a CloudFormation template for launching cluster instances on Amazon AWS.  The practical upshot of this is that you answer a bunch of questions on a web form to customise your cluster and then you launch it.

We are going to use this CloudFormation template along with some bash scripts that provide additional customisation.

Get the customisation scripts

The first step is to get some customization scripts in an S3 bucket. You could use your own or you could use the ones I created.

If you use mine, make sure you take a good look at them first to make sure you are happy with what I’ve done!  It’s probably worth using your own fork of my repo so you can customise your cluster further.

It’s the bash scripts that allow the creation of a bunch of user accounts for trainees with randomized passwords.  My scripts do some other things too and I’ve listed everything in the github README.md.

git clone https://github.com/mikecroucher/alces_flight_customisation
cd alces_flight_customisation

Now you need to upload these to an S3 bucket. I called mine walkingrandomly-aws-cluster

aws s3api create-bucket --bucket walkingrandomly-aws-cluster --region eu-west-2 --create-bucket-configuration LocationConstraint=eu-west-2
aws s3 sync . s3://walkingrandomly-aws-cluster --delete

Set up the CloudFormation template

  • Head over to Alces Flight Solo (Community Edition) and click on continue to subscribe
  • Choose the region you want to create the cluster in, select Personal HPC compute cluster and click on Launch with CloudFormationConsole

[Screenshot: selecting the Personal HPC compute cluster option]

  • Go through the CloudFormation template screens, creating the cluster as you want it, until you get to the S3 bucket for customization profiles box where you fill in the name of the S3 bucket you created earlier.
  • Enable the default profile

[Screenshot: the Flight customisation settings with the default profile enabled]

  • Continue answering the questions asked by the web form.  For this simple training cluster, I just accepted all of the defaults and it worked fine

When the CloudFormation stack has been fully created, you can log into your new cluster as an administrator.  To get the connection details of the headnode, go to the EC2 management console in your web-browser, select the headnode and click on Connect.

When you log in to the cluster as administrator, the usernames and passwords for your training cohort will be in the file specified by the password_file variable in the configure.d/run_me.sh script. I set my administrator account to be called walkingrandomly and so put the password file in /home/walkingrandomly/users.txt.  I could then print this out and distribute the usernames and passwords to each training delegate.

This is probably not great sysadmin practice but worked on the day.  If anyone can come up with a better way, Pull Requests are welcomed!

[Screenshot: logging in to the Alces Flight cluster]

Try a training account

At this point, I suggest that you try logging in as one of the training user accounts and make sure you can successfully submit a job.  When I first tried all of this, the default scheduler on the created cluster was SunGridEngine and my first attempt at customisation left me with user accounts that couldn’t submit jobs.

The current scripts have been battle tested with Sun Grid Engine, including MPI job submission and I’ve also done a very basic test with Slurm. However, you really should check that a user account can submit all of the types of job you expect to use in class.

Troubleshooting

When I first tried to do this, things didn’t go completely smoothly.  Here are some things I learned to help diagnose the problems

Full documentation is available at http://docs.alces-flight.com/en/stable/customisation/customisation.html

On the cluster, we can see where it's looking for its customisation scripts with the alces about command


alces about customizer
Customizer bucket prefix: s3://walkingrandomly-aws-cluster/customizer

The log file at /var/log/clusterware/instance.log on both the head node and worker nodes is very useful.

Once, I did all of this using a bash prompt launched from Windows CMD and the customisation scripts failed to run.  The logs showed this error

/bin/bash^M: bad interpreter: No such file or directory

This is a classic dos2unix error and could be avoided, for example, by using the Windows Subsystem for Linux instead of CMD.exe.
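If you hit this and don't have dos2unix to hand, a couple of lines of Python will strip the Windows line endings from the scripts before you sync them to S3. This is a quick sketch that assumes the offending scripts are the .sh files under the current directory:

# Rewrite the customisation scripts with Unix line endings (a stand-in for dos2unix)
import pathlib

for script in pathlib.Path(".").glob("**/*.sh"):
    script.write_bytes(script.read_bytes().replace(b"\r\n", b"\n"))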