Archive for the ‘programming’ Category

May 23rd, 2017

I’m working on optimising some R code written by a researcher at University of Sheffield and its very much a war of attrition! There’s no easily optimisable hotspot and there’s no obvious way to leverage parallelism. Progress is being made by steadily identifying places here and there where we can do a little better. 10% here and 20% there can eventually add up to something worth shouting about.

One such micro-optimisation we discovered involved multiplying two matrices together where one of them needed to be transposed. Here’s a minimal example.

#Set random seed for reproducibility
set.seed(3)

# Generate two random n by n matrices
n = 10
a = matrix(runif(n*n,0,1),n,n)
b = matrix(runif(n*n,0,1),n,n)

# Multiply the matrix a by the transpose of b
c = a %*% t(b)

When the speed of linear algebra computations are an issue in R, it makes sense to use a version that is linked to a fast implementation of BLAS and LAPACK and we are already doing that on our HPC system.

Here, I am using version 3.3.3 of Microsoft R Open which links to Intel’s MKL (an implementation of BLAS and LAPACK) on a Windows laptop.

In R, there is another way to do the computation c = a %*% t(b)  — we can make use of the tcrossprod function (There is also a crossprod function for when you want to do t(a) %*% b)

 c_new = tcrossprod(a,b)

Let’s check for equality

c_new == c
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Sometimes, when comparing the two methods you may find that some of those entries are FALSE which may worry you!
If that happens, computing the difference between the two results should convince you that all is OK and that the differences are just because of numerical noise. This happens sometimes when dealing with floating point arithmetic (For example, see http://www.walkingrandomly.com/?p=5380).

Let’s time the two methods using the microbenchmark package.

install.packages('microbenchmark')
library(microbenchmark)

We time just the matrix multiplication part of the code above:

microbenchmark(
original = a %*% t(b),
tcrossprod = tcrossprod(a,b)
)


Unit: nanoseconds
expr min lq mean median uq max neval
original 2918 3283 3491.312 3283 3647 18599 1000
tcrossprod 365 730 756.278 730 730 10576 1000

We are only saving microseconds here but that’s more than a factor of 4 speed-up in this small matrix case. If that computation is being performed a lot in a tight loop (and for our real application, it was), it can add up to quite a difference.

As the matrices get bigger, the speed-benefit in percentage terms gets lower but tcrossprod always seems to be the faster method. For example, here are the results for 1000 x 1000 matrices

#Set random seed for reproducibility
set.seed(3)

# Generate two random n by n matrices
n = 1000
a = matrix(runif(n*n,0,1),n,n)
b = matrix(runif(n*n,0,1),n,n)

microbenchmark(
original = a %*% t(b),
tcrossprod = tcrossprod(a,b)
)

Unit: milliseconds
expr min lq mean median uq max neval
original 18.93015 26.65027 31.55521 29.17599 31.90593 71.95318 100
tcrossprod 13.27372 18.76386 24.12531 21.68015 23.71739 61.65373 100

The cost of not using an optimised version of BLAS and LAPACK

While writing this blog post, I accidentally used the CRAN version of R.  The recently released version 3.4. Unlike Microsoft R Open, this is not linked to the Intel MKL and so matrix multiplication is rather slower.

For our original 10 x 10 matrix example we have:

library(microbenchmark)
#Set random seed for reproducibility
set.seed(3)

# Generate two random n by n matrices
n = 10
a = matrix(runif(n*n,0,1),n,n)
b = matrix(runif(n*n,0,1),n,n)

microbenchmark(
original = a %*% t(b),
tcrossprod = tcrossprod(a,b)
)

Unit: microseconds
       expr   min    lq    mean median     uq    max neval
   original 3.647 3.648 4.22727  4.012 4.1945 22.611   100
 tcrossprod 1.094 1.459 1.52494  1.459 1.4600  3.282   100

Everything is a little slower as you might expect and the conclusion of this article — tcrossprod(a,b) is faster than a %*% t(b) — seems to still be valid.

However, when we move to 1000 x 1000 matrices, this changes

library(microbenchmark)
#Set random seed for reproducibility
set.seed(3)

# Generate two random n by n matrices
n = 1000
a = matrix(runif(n*n,0,1),n,n)
b = matrix(runif(n*n,0,1),n,n)

microbenchmark(
original = a %*% t(b),
tcrossprod = tcrossprod(a,b)
)

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
   original 546.6008 587.1680 634.7154 602.6745 658.2387  957.5995   100
 tcrossprod 560.4784 614.9787 658.3069 634.7664 685.8005 1013.2289   100

As expected, both results are much slower than when using the Intel MKL-lined version of R (~600 milliseconds vs ~31 milliseconds) — nothing new there.  More disappointingly, however, is that now tcrossprod is slightly slower than explicitly taking the transpose.

As such, this particular micro-optimisation might not be as effective as we might like for all versions of R.

May 15th, 2017

For a while now, Microsoft have provided a free Jupyter Notebook service on Microsoft Azure. At the moment they provide compute kernels for Python, R and F# providing up to 4Gb of memory per session. Anyone with a Microsoft account can upload their own notebooks, share notebooks with others and start computing or doing data science for free.

They University of Cambridge uses them for teaching, and they’ve also been used by the LIGO people  (gravitational waves) for dissemination purposes.

This got me wondering. How much power does Microsoft provide for free within these notebooks?  Computing is pretty cheap these days what with the Raspberry Pi and so on but what do you get for nothing? The memory limit is 4GB but how about the computational power?

To find out, I created a simple benchmark notebook that finds out how quickly a computer multiplies matrices together of various sizes.

Matrix-Matrix multiplication is often used as a benchmark because it’s a common operation in many scientific domains and it has been optimised to within an inch of it’s life.  I have lost count of the number of times where my contribution to a researcher’s computational workflow has amounted to little more than ‘don’t multiply matrices together like that, do it like this…it’s much faster’

So how do Azure notebooks perform when doing this important operation? It turns out that they max out at 263 Gigaflops! azure_free_notebook

For context, here are some other results:

As you can see, we are getting quite a lot of compute power for nothing from Azure Notebooks. Of course, one of the limiting factors of the free notebook service is that we are limited to 4GB of RAM but that was more than I had on my own laptops until 2011 and I got along just fine.

Another fun fact is that according to https://www.top500.org/statistics/perfdevel/, 263 Gigaflops would have made it the fastest computer in the world until 1994. It would have stayed in the top 500 supercomputers of the world until June 2003 [1].

Not bad for free!

[1] The top 500 list is compiled using a different benchmark called LINPACK  so a direct comparison isn’t strictly valid…I’m using a little poetic license here.

March 29th, 2017

UK to launch 6 major HPC centres

Tomorrow, I’ll be attending the launch event for the UK’s new HPC centres and have been asked to deliver a short talk as part of the program. As someone who paddles in the shallow-end of the HPC pool I find this both flattering and more than a little terrifying. What can little-ole-me say to the national HPC glitterati that might be useful?

This blog post is an attempt at gathering my thoughts together for that talk.

The technology gap in research computing

Broadly speaking, my role in academia is to hang out with researchers, compute providers (cloud and HPC) and software vendors in an attempt to be vaguely useful in the area of research software. I’m a Research Software Engineer with a focus on Long Tail Science: The large number of very small research groups who do a huge amount of modern research.

Time and again, what I see can be summarized in this quote by Greg Wilsongwilson

This is very true in the world of High Performance Computing.

Geek Top Gear

I love technology and I love HPC in particular. I love to geek out on Flops, Ghz, SIMD instructions, GPUs, FPGAs…..all that stuff. I help support The University of Sheffield’s local HPC service and worked in Research IT at The University of Manchester for around a decade before moving to Sheffield.

I’ve given and seen many a HPC-related talk in my time and have certainly been guilty of delivering what I now refer to as the ‘Geek Top Gear’ speech.  For maximum effect, you need to do it in a Jeremy Clarkson voice and, if you’re feeling really macho, kiss your bicep at the point where you tell the audience how many Petaflops your system can do in Linpack.

*Begin Jeremy Clarkson Impression*

Our new HPC system has got 100,000 of the latest Intel Kaby Lake cores...which is a lot!

Usually running at 2.6Ghz, these cores can turbo-boost to 3.2Ghz for those moments when we need that extra boost of power. Obviously, being Kaby Lake, these cores have all the instruction extensions you’d expect with AVX2, FMA, SSE, ABM and many many other TLAs for all your SIMD needs. Of course every HPC system needs accelerators…..and we have the best of them: Xeon Phis with 68 cores each and NVIDIA GPUs with thousands of tiny little cores will handle every massively parallel job you can throw at them….Easily. We connect these many many cores together with high-speed interconnect fashioned from threads of pure unicorn hair and cool the whole thing with the tears of virigin nerds.

YEEEEEES! Our new HPC system is the best one since the last one and, achieving over a Gajillion Petaflops in the Linpack test (kiss bicep), it will change your life forever.

more_power

Any questions?

Audience member 1: What’s a core?
Audience member 2: Why does it run my R script slower than my laptop?
Audience member 3: Do you have Excel installed on it?

There is a huge gap between what many HPC providers like to focus on and what the typical long-tail researcher wants or needs. I propose that the best bridge for this gap is the Research Software Engineer (RSE).

Research Software Engineer as Alpine guide

In my fellowship proposal, I compared the role of a Research Software Engineer to that of an alpine guide:

Technological development in software is more like a cliff-face than a ladder – there are many routes to the top, to a solution. Further, the cliff face is dynamic – constantly and quickly changing as new technologies emerge and decline. Determining which technologies to deploy and how best to deploy them is in itself a specialist domain, with many features of traditional research.

Researchers need empowerment and training to give them confidence with the available equipment and the challenges they face. This role, akin to that of an Alpine guide, involves support, guidance, and load carrying. When optimally performed it results in a researcher who knows what challenges they can attack alone, and where they need appropriate support. Guides can help decide whether to exploit well-trodden paths or explore new possibilities as they navigate through this dynamic environment.

At Sheffield, we have built a pool of these Research Software Engineers to provide exactly this kind of support and it’s working extremely well so far. Not only are we helping individual research groups but we are also using our experiences in the field to help shape the University HPC environment in collaboration with the IT department.

Supercomputing: Irrelevant to many?

“Never bring an anecdote to a data-fight” so the saying goes and all I have from my own experiences are a bucket load of anecdotes, case studies and cursory log-mining experiments that indicate that even those who DO use HPC are not doing so effectively. Fortunately, others have stepped up to the plate and we have survey and interview data on how researchers are using compute resources.

How Do Scientists Develop and Use Scientific Software? is a report on a 2009 survey of 1972 researchers from around the world. They found that “79.9% of the scientists never use scientific software on a supercomputer

When I first learned of this number, I found it faintly depressing. This technology that I love so much and for which University IT departments dedicate special days to seems to be pretty much irrelevant to the majority of researchers. Could it be that even in an era of big data, machine learning and research software engineering that most people only need a laptop?

Only ever needing a laptop certainly doesn’t fit with what I’ve seen while working in the trenches. Almost every researcher I’ve met who does computational research wishes it was faster or that they had more memory to allow them to do larger problems. Speed is the easiest thing to sell to researchers in the world of RSE. They come for faster execution and leave with a side-order of version control, testing and documentation. A combination of software development and migration to even a small HPC system can easily result in 100x or even 1000x speed-ups for many researchers.

In my experience, it’s not that researchers don’t need HPC, it’s that the jump from their laptop-based workflow to one that makes good use of a HPC system is too large for them to bridge without a little help. Providing that help can result in some great partnerships such as the recent one between the Sheffield RSE group and the Sheffield Faculty of Arts and Humanities.

language

Want to know how that partnership started? I compiled an experimental R/Rcpp package that they were struggling with and then took them for coffee and said ‘That code took a while to run. Here’s how we can make it go faster….Now…what exactly are you doing because it looks cool?’ Fast forward a year or so and we are on the cusp of starting a great new project that will include traditional HPC and cloud computing as part of their R-based workflow.

My experiences seem to be reflected in the data. In  their 2011 article, A Survey of the Practice of Computational Science, Prabhu et al interviewed 114 randomly selected researchers from Princeton University. Princeton have a very strong, well supported HPC centre which provides both resources and the expertise to use them. Even at such a well equipped institution, the authors write that  ‘Despite enormous wait times, many scientists run their programs only on desktops’ although they did report much higher HPC usage compared to the Hanny et al survey.

Other salient quotes from the Prabhu interviews include

“only 18% of researchers who optimize code leveraged profiling tools to inform their optimization plans”

“only 7% of researchers leveraged any form of thread based shared memory CPU parallelism”

“Only 11% of researchers utilized loop based parallelism”

“Currently, many researchers fit their scientific models to only a subset of available parameters for faster program runs.”

“Across disciplines, an order of magnitude performance improvement was cited as a requirement for significant changes in research quality”

HPC: There’s plenty of room at the bottom

Potential users of HPC look different to those of 20 years ago. The popularity explosion of languages such as MATLAB, Python and R have democratized programming and the world is awash with inefficient research software. Most of the time, this lack of efficiency is not a problem (see ‘In defense of inefficient scientific code‘) but if a researcher needs to scale up what they are doing, it can become limiting. Researchers might wait for days for the results to come in and limit the scope of their investigations to fit the hardware they have access to — their laptop usually.

The paper of Prabu et al said that an order of magnitude (10x) speed up was cited by researchers as a requirement for significant changes in research quality. For an experienced Research Software Engineer with access to cloud or HPC facilities, a 10x speed-up is usually pretty easy to achieve for this new audience. 100x or even 1000x can be achieved fairly frequently if you employ multiple hardware and software techniques. Compared to squeezing out a few percent more performance from HPC-centric code such as LAPACK or CASTEP, it’s not even all that difficult. I recently sped up one researcher’s MATLAB code by a factor of 800x in a couple of days and I’m a fairly middling developer if I’m brutally honest.

The whole point of High Performance Computing is to accelerate science and right now there is more computational science around than there has ever been before. Furthermore, it’s easier than ever to accelerate! There’s plenty of room at the bottom.

Closing the computational gap with people, training and compute power

The UK’s 6 new HPC centers represent the cutting edge of hardware technology. They provide a crucial component of our national hardware infrastructure, will contribute to research in HPC itself and will doubtless be of huge benefit to computational science. Furthermore, all of the funded proposals include significant engagement with the national Research Software Engineering community – the vital bridge between many researchers and HPC.

Co-development of research software with effort from both RSEs and researchers can be an extremely powerful model. Combine this with further collaboration between RSEs and compute providers and we have an environment that I think is both very exciting and capable of helping to close the rich/poor compute divide.

As an RSE who works with both researchers and University-level HPC providers, I ask for 3 things to be considered by these new regional centres.

  • Enjoy your new compute-ferraris. I look forward to seeing how hard you can push them.
  • You will be learning new good practice in how to provide HPC services. Disseminate this to those of us running smaller services.
  • There’s plenty of room at the bottom! Help us to support the new wave of computational researchers.

Thanks to languages such as MATLAB, Python and R, general purpose programming has been fully democratized. I look forward to working with these new centres to help democratize high performance computing.

 

January 19th, 2017

There are lots of Widgets in ipywidgets. Here’s how to list them

from ipywidgets import *
widget.Widget.widget_types

At the time of writing, this gave me

{'Jupyter.Accordion': ipywidgets.widgets.widget_selectioncontainer.Accordion,
 'Jupyter.BoundedFloatText': ipywidgets.widgets.widget_float.BoundedFloatText,
 'Jupyter.BoundedIntText': ipywidgets.widgets.widget_int.BoundedIntText,
 'Jupyter.Box': ipywidgets.widgets.widget_box.Box,
 'Jupyter.Button': ipywidgets.widgets.widget_button.Button,
 'Jupyter.Checkbox': ipywidgets.widgets.widget_bool.Checkbox,
 'Jupyter.ColorPicker': ipywidgets.widgets.widget_color.ColorPicker,
 'Jupyter.Controller': ipywidgets.widgets.widget_controller.Controller,
 'Jupyter.ControllerAxis': ipywidgets.widgets.widget_controller.Axis,
 'Jupyter.ControllerButton': ipywidgets.widgets.widget_controller.Button,
 'Jupyter.Dropdown': ipywidgets.widgets.widget_selection.Dropdown,
 'Jupyter.FlexBox': ipywidgets.widgets.widget_box.FlexBox,
 'Jupyter.FloatProgress': ipywidgets.widgets.widget_float.FloatProgress,
 'Jupyter.FloatRangeSlider': ipywidgets.widgets.widget_float.FloatRangeSlider,
 'Jupyter.FloatSlider': ipywidgets.widgets.widget_float.FloatSlider,
 'Jupyter.FloatText': ipywidgets.widgets.widget_float.FloatText,
 'Jupyter.HTML': ipywidgets.widgets.widget_string.HTML,
 'Jupyter.Image': ipywidgets.widgets.widget_image.Image,
 'Jupyter.IntProgress': ipywidgets.widgets.widget_int.IntProgress,
 'Jupyter.IntRangeSlider': ipywidgets.widgets.widget_int.IntRangeSlider,
 'Jupyter.IntSlider': ipywidgets.widgets.widget_int.IntSlider,
 'Jupyter.IntText': ipywidgets.widgets.widget_int.IntText,
 'Jupyter.Label': ipywidgets.widgets.widget_string.Label,
 'Jupyter.PlaceProxy': ipywidgets.widgets.widget_box.PlaceProxy,
 'Jupyter.Play': ipywidgets.widgets.widget_int.Play,
 'Jupyter.Proxy': ipywidgets.widgets.widget_box.Proxy,
 'Jupyter.RadioButtons': ipywidgets.widgets.widget_selection.RadioButtons,
 'Jupyter.Select': ipywidgets.widgets.widget_selection.Select,
 'Jupyter.SelectMultiple': ipywidgets.widgets.widget_selection.SelectMultiple,
 'Jupyter.SelectionSlider': ipywidgets.widgets.widget_selection.SelectionSlider,
 'Jupyter.Tab': ipywidgets.widgets.widget_selectioncontainer.Tab,
 'Jupyter.Text': ipywidgets.widgets.widget_string.Text,
 'Jupyter.Textarea': ipywidgets.widgets.widget_string.Textarea,
 'Jupyter.ToggleButton': ipywidgets.widgets.widget_bool.ToggleButton,
 'Jupyter.ToggleButtons': ipywidgets.widgets.widget_selection.ToggleButtons,
 'Jupyter.Valid': ipywidgets.widgets.widget_bool.Valid,
 'jupyter.DirectionalLink': ipywidgets.widgets.widget_link.DirectionalLink,
 'jupyter.Link': ipywidgets.widgets.widget_link.Link}
January 12th, 2017

If you are a researcher and are currently writing scripts or developing code then I have a suggestion for you. If you haven’t done it already, get yourself a willing volunteer and send them your code/analysis/simulation/voodoo and ask them to run it on their machine to see what happens. Bonus points are awarded for choosing someone who uses a different operating system from you!

This simple act is one of the things I recommend in my talk Is Your Research Software Correct and it can often help improve both code and workflow.

It quickly exposes patterns that are not good practice. For example, scattered references to ‘/home/walkingrandomly/mydata.dat’ suddenly don’t seem like a great idea when your code buddy is running windows. The ‘minimal tweaking’ required to move your analysis from your machine to theirs starts to feel a lot less minimal as you get to the bottom of the second page of instructions.

Crashy McCrashFace

When I start working with someone new, the first thing I ask them to do is to provide access to their code and simple script called runme or similar that will build and run their code and spit out an answer that we agree is OK. Many projects stumble at this hurdle! Perhaps my compiler is different to theirs and objects to their abuse (or otherwise) of the standards or maybe they’ve forgotten to include vital dependencies or input data.

Email ping-pong ensues as we attempt to get the latest version…zip files with names like PhD_code_ver1b_ForMike_withdata_fixed.zip get thrown about while everyone wonders where Bob is because he totally got it working on Windows back in 2009.

git clone

‘Hey Mike, just clone the git repo and run the test suite. It should be fine because the latest continuous integration run didn’t throw up any issues. The benchmark code and data we’d like you to optimise is in the benchmarks folder along with the timings and results from our most recent tests. Ignore the papers folder, that just reproduces all of the results from our recent papers and links to Zenodo DOIs’

‘…………’

‘Are you OK Mike?’

‘I’m…..fine. Just have something in my eye’

 

August 24th, 2016

I sometimes give a talk called Is Your Research Software correct (github repo, slide deck) where I attempt to give a (hopefully) entertaining overview of some of the basic issues in modern research software practice and what can be done to make the world a little better.

One section of this talk is a look at some case studies where software errors caused problems in research. Ideally, I try to concentrate on simple errors that led to profound scientific screw-ups. I want the audience to think ‘Damn! *I* could have made that mistake in my code‘.

Curating this talk has turned me into an interested collector of such stories. This is not an exercise in naming and shaming (after all, the odds are that its only a matter of time before I, or one of my collaborators, makes it into the list — why set myself up for a beating?). Instead, it is an exercise in observing the problems that other people have had and using them to enhance our own working practices.

Thus begins a new recurring WalkingRandomly feature.

Excel corrupts genetics data

Today’s entry comes courtesy of a recent paper by Mark Ziemann, Yotam Eren and Assam El-OstaEmail – ‘Gene name errors are widespread in the scientific literature‘ where they demonstrate that the supplementary data files for hundreds of papers in genetics have been corrupted by Microsoft Excel which has helpfully turned gene symbols into dates and floating point numbers.

The paper gives advice to reviewers on how to spot this particular error and the authors have also published the code used for the analysis. I’ve not run it myself so can only attest to its existence, not it’s accuracy.

I’ve not dealt with genetic data directly myself so ask you — what would you have used instead of Excel? (my gut tells me R or Python but I have no details to offer).

Do you have a story to contribute?

If you are interested in contributing a story where a software glitch caused problems in research, please contact me to discuss details.

Update (31st August 2016)

One of the authors of the paper, Mark Ziemann, has written a follow up of the Excel work on his blog: http://genomespot.blogspot.co.uk/2016/08/my-personal-thoughts-on-gene-name-errors.html

August 11th, 2016

This is my rant on import *. There are many like it, but this one is mine.

I tend to work with scientists so I’ll use something from mathematics as my example.  What is the result of executing the following line of Python code?

result = sqrt(-1)

Of course, you have no idea if you don’t know which module sqrt came from. Let’s look at a few possibilities. Perhaps you’ll get an exception:

In [1]: import math
In [2]: math.sqrt(-1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
----> 1 math.sqrt(-1)

ValueError: math domain error

Or maybe you’ll just get a warning and a nan

In [3]: import numpy
In [4]: numpy.sqrt(-1)
/Users/walkingrandomly/anaconda/bin/ipython:1: RuntimeWarning: invalid value encountered in sqrt
#!/bin/bash /Users/walkingrandomly/anaconda/bin/python.app
Out[4]: nan

You might get an answer but the datatype of your answer could be all sorts of strange and wonderful stuff.

In [5]: import cmath
In [6]: cmath.sqrt(-1)
Out[6]: 1j
In [7]: type(cmath.sqrt(-1))
Out[7]: complex

In [8]: import scipy
In [9]: scipy.sqrt(-1)
Out[9]: 1j
In [10]: type(scipy.sqrt(-1))
Out[10]: numpy.complex128

In [11]: import sympy
In [12]: sympy.sqrt(-1)
Out[12]: I
In [13]: type(sympy.sqrt(-1))
Out[13]: sympy.core.numbers.ImaginaryUnit

Even the humble square root function behaves very differently when imported from different modules! There are probably other sqrt functions, with yet more behaviours that I’ve missed.

Sometimes, they seem to behave in very similar ways:-

In [16]: math.sqrt(2)
Out[16]: 1.4142135623730951

In [17]: numpy.sqrt(2)
Out[17]: 1.4142135623730951

In [18]: scipy.sqrt(2)
Out[18]: 1.4142135623730951

Let’s invent some trivial code.

from scipy import sqrt

x = float(input('enter a number\n'))
y = sqrt(x)

# important things happen after here. Complex numbers are fine!

I can input -1 just fine. Then, someone comes along and decides that they need a function from math in the ‘important bit’. They use import *

from scipy import sqrt
from math import *

x = float(input('enter a number\n'))
y = sqrt(x)

# important things happen after here. Complex numbers are fine!

They test using inputs like 2 and 4 and everything works (we don’t have automated tests — we suck!). Of course it breaks for -1 now though. This is easy to diagnose when you’ve got a few lines of code but it causes a lot of grief when there’s hundreds…or, horror of horrors, if the ‘from math import *’ was done somewhere in the middle of the source file!

I’m sometimes accused of being obsessive and maybe I’m labouring the point a little but I see this stuff, in various guises, all the time!

So, yeah, don’t use import *.

April 18th, 2016

The Engineering and Physical Sciences Research Council (EPSRC) is the UK’s main agency for funding research in engineering and the physical sciences. In 2015, they made a very unusual type of fellowship call – one that was targeted specifically at Research Software Engineers. This was the first fellowship of its kind in the world  and I believe it represents a strong commitment by EPSRC to the improvement of research software.

Research Software Engineers are the people behind research software. They make a huge contribution to science but often lack reward and recognition for the work that they do. This fellowship is a huge step in the right direction to providing some of that recognition. Quoting from the call document:

This call will support Research Software Engineer (RSE) Fellowships for a period of up to five years. The RSE Fellowship describes exceptional individuals with combined expertise in programming and a solid knowledge of the research environment. The Research Software Engineer works with researchers to gain an understanding of the problems they face, and then develops, maintains and extends software to provide the answers.

201 people responded to the call with an ‘Intent to submit’ outline application. Of these, 7 were successful. As part of my work with the EPSRC funded Research Software Engineering Network (RSE-N), I got in touch with the new cohort of RSE fellows and interviewed them about their projects and careers.

Follow the links below to see what they had to say.

April 18th, 2016

This interview with The University of Bristol’s Chrys Woods is part of my series of interviews on the new cohort of EPSRC Research Software Engineering Fellows.

Could you tell us a little about yourself and how you became a Research Software Engineer?

I have been coding since preschool when my Dad bought me a Texas Instruments TI-99/4A. This had a simple BASIC, but no tape or disk storage, meaning that all of the code was lost when the computer was switched off. After that, I had an Amiga as a teenager, and had fun coding little games in my spare time. I grew up in a seaside town on the East Coast, and the industry there was just fishing and making frozen food, so it didn’t occur to me that I could do programming as a job. It was just for fun. It was only when I went to University (Southampton) that I saw that programming could be useful for science. I undertook a 3rd-year computational chemistry research project with Jon Essex at Southampton, and from there I was hooked and wanted to become a computational chemist. In Jon’s group in the late 1990’s I helped to build Beowulf compute clusters from scratch (assembling shelves, doing all the cabling, building the cluster installer disks, job schedulers etc.), as well as developing lots of software in first Fortran 77 and then C++ and Python. From there, I moved to Bristol, and wrote lots of grant applications and managed to work for about 10 years on a series of EPSRC and BBSRC funded software development projects (sincere thanks to both funders for the grants). These all culminated in a framework for molecular simulation, called Sire (http://siremol.org), around which a reasonable community has formed (about 20+ people have developed the code over the years).

The experience of working with this community made me realise that software engineering was about helping other people develop and play with code. It showed me the importance of leading by example, e.g. adding in tests, using clean designs and APIs, and writing clear documentation (although I readily admit that I am a really bad documenter).

About 2 years ago I was offered a job in BrisSynBio (a BBSRC/EPSRC Synthetic Biology Research Centre), as a technical lead, systems administrator and RSE. I have really enjoyed this position, as it made me step back from my research and really work in “services” to support other researchers. This gave me a completely new perspective on research, as I saw the world from the view of e.g. administrators, finance, procurement and technical services. This showed me that successful research depends on a whole team of energised, committed and dedicated people, and that research software engineers can play an important role as part of the research development team.

What do you think is the role of a Research Software Engineer? Is it different from a ‘normal’ researcher?

For me, research software engineering is about helping junior researchers develop their code right the first time. Helping them to structure their code so that it is easier for them to write flexible, trustably correct and performance portable software without them having to be overburdened with learning a lot of computer science. Following this, good RSE work is helping researchers build flexible frameworks that allow researchers to play with their scientific ideas. The code should allow them to prototype and play with new ideas, to get them running quickly and efficiently first time without having to have people come and re-engineer everything later.

You’ve recently won an EPSRC RSE Fellowship – congratulations! Can you give a brief overview of your project?

My project is about providing a new member of a research team – the research software engineer. This will enable a new way of doing research. Research is team based, and I want to help change the culture so that RSEs are seen as a member of the research team and not a service. The EPSRC RSE project provides funding for me as an RSE to be embedded within research groups to work with them to develop new research. Twenty projects will initially be supported; 10 in the first wave that have been allocated, and then 10 that will be allocated in response to a call. The projects cover everything from modelling chemical reactions, designing new optical machines, creating great visualisations in an interactive 3D planetarium, modelling bacterial factories and engineering new scaffolds for future vaccines.

cwoods

How long did it take you to write your Fellowship application (Any other thoughts/advice on the application process?)

I found out about the call when on holiday in Switzerland from a friend. That got me started thinking about a model for how RSEs could be added to a research team. Once I’d worked out the model, I found the proposal to be very easy to write. Indeed, it was the opportunity to write the proposal that I had always wanted to write – to put software development and good software engineering front and centre in the proposal itself. When I got back from from holiday I alerted HPC users at Bristol about what I was planning, and then met up with researchers from across the University in 10 minute quick flash talks set out my proposal for RSE projects. People contributed projects quickly, and I was soon oversubscribed. Then, writing the proposal was just about putting it all down on paper.

The strangest part was that I had consciously left an academic role when I moved into BrisSynBio, and had accepted that I was never going to become “an academic”. The hardest part was talking with my wife and persuading her that I should go back into that world.

Where do you want to be in 5 years?

I want to be running a large and successful RSE group and contributing to the development of computational science/engineering as a complete discipline, i.e. being on the path to having departments with faculty, teaching of undergraduates in good software engineering best practice, researchers in software engineering, collaborating with scientists as parts of teams to develop the next generation of well-engineered code to support 21st century science. I also want to help inspire the next generation of potential RSEs and help (1) raise awareness that programming a computer can help you leave a seaside town and travel the world, and (2) maths, physics and programming are useful skills, that there is stable career pathway for scientific software developers and RSEs, that this is an exciting and dynamic career choice, it does let you work with intelligent and energetic people, and most importantly, it puts you in a position to shape how the technology of the future will be designed and developed.

Who are your project partners?

Cresset, a company that writes software for the pharmaceutical industry, and the Software Sustainability Institute. Also, all of the researchers who will be supported by the RSE projects.

Tell me about your RSE group.

We are now building the RSE group at Bristol. Currently it is me, a new junior RSE to be appointed, and some graduates on our new Graduate Accelerator Programme (GAP) who will be appointed later this year. We anticipate growing further over the next few years.

Which programming languages and technologies do you regularly use?

C++ is my favourite, especially the functional coding support in C++11/14, closely followed by Python. I find the combination of C++ and Python is extremely powerful. It allows easy writing of fast, performant parallel code in the C++ layer, yet retains flexibility in the scripting layer which treats C++ as a library of building blocks. All unit testing can be via Python scripts that stress these building blocks.

I teach a lot of python and strongly recommend it to newcomers, e.g. see http://chryswoods.com/main/courses.html

Are there any languages/technologies that you used to use a lot but have now moved away from? Why?

Fortran. I have a soft spot for F77, but it is very 20th century. It is missing modern containers, generics, templates, virtual functions, task based parallelism, easy wrapping with scripting languages, integration with unit testing suites, etc. etc. It is also missing easy handling of low-level memory, while also providing a high level memory interface.

Perl. I loved Perl. I teach Perl, but no-one comes any more. Python is better, and it is difficult to argue against. And then the Perl community turned in on itself in going from Perl 5 to Perl 6.

Is there anything on your ‘to-learn’ list?

Management. How to Teach. Any new programming paradigm. LLVM and stuff that bridges the gap between scripted and compiled languages. Anything else in the programming world that is cool.

And, MATLAB, R, etc., as I need to learn how to interface with the communities that use those tools.

Humility. We don’t always know what is best (even if we do think we are right).

Do you have any advice for anyone who wants to become a Research Software Engineer?

My advice applies to anyone who wants to work in a university as an academic. Never forget that each grant and each award is a gift from the public. You are not given this gift so that you are employed there forever. It is given so that, in some way, you can make a difference to society. Be in research because you want to make a difference. The counter to this, is that there is a life outside academia, and it is not a failure to move on to other roles. Sometimes, like my BrisSynBio position, they can make you stronger

For research software engineering, I would say learn to communicate with people. Being able to talk with people is just as important as being able to talk with the computer. Also, learn the paradigms of programming (structural, object, functional etc.), as once you get these, different computer languages are just different syntax. Finally, learn some maths and science. They may be harder to learn, but they are fundamental, and without understanding these, it is very hard to really appreciate the complexities of research code, or to see the potential optimisations or approximations that may be available.

April 18th, 2016

This interview with the University of Sheffield’s Paul Richmond is part of my series of interviews on the new cohort of EPSRC Research Software Engineering Fellows.

Could you tell us a little about yourself and how you became a Research Software Engineer?

I have been working as a Research Associate (early career research) since I completed my PhD in 2010. During this time I have been working on the fringe of both novel computer science research and the application of emerging parallel computing architectures to various areas of science and engineering. Whilst carving a reasonably successful career as an early career researcher it became clear that in order to progress within the academic environment it would require me to become more specialised in novel research than to participate in applying my skills of parallel computing to broader research domains. The role of the research software engineer is one that means different things to different people. For me it is the role of applying my specialist skills of parallel computing to a wide range of domains. It is a position which encourages the development of my novel research software (FLAME GPU) giving me the flexibility to embed it as broadly as possible.

What do you think is the role of a Research Software Engineer? Is it different from a ‘normal’ researcher?

To me the role of RSE is one which is about facilitating research. This can be through hands on help or the provision of software, skills or a community which provides a specific researching computing need. Having worked both as a researcher and in my new role as a self recognised RSE my view is that it is important that people are able to transcend the boundary between the two. Many RSEs come from support backgrounds rather than research however there are countless researchers who work on providing research software or multi-disciplinary research computing skills. I feel that researchers should be encouraged to move into the roles of RSEs where appropriate but also that this shouldn’t be seen as a career limiting move. RSEs should be free to transition back into academia as a when the research requests it. I hope to demonstrate throughout my position as a RSE Fellow that it is possible to exist alongside this boundary delivering typical academic outputs whilst working collaboratively in a facilitation role.

You’ve recently won an EPSRC RSE Fellowship – congratulations! Can you give a brief overview of your project?

My RSE Fellowship is all about changing the way people think about coding and the way in which they use workstation and HPC computing. In the future computers will be highly parallel with hugh numbers of cores, we are already seeing this pattern emerge today through accelerated computing in the form of GPUs. The traditional “serial” was of thinking needs to change and parallelism needs to be incorporated into computational research from the ground up to enable researchers to target future computing systems. To ensure that this happens my fellowship proposed 1) a combination of software and tools targeting many core architectures, 2) upskilling of researchers on a national scale and embedding of parallel programming techniques within the undergraduate, and postgraduate curriculums and 3) a local and national community in which researchers can receive software consultancy and work collaboratively to embed accelerated computing into their research. As a result of this fellowship researchers will gain access to unprecedented levels of compute performance enabling them utilise scalable computational approaches to solve scientific chand challenges.

prichmond

How long did it take you to write your Fellowship application (Any other thoughts/advice on the application process?)

The turnaround for the fellowship application was actually very quick. From consultation with colleagues it is normally expected that an EPSRC fellowship application should take about 6 months to complete, undergoing many rounds of internal review before submission. Following notification to continue to full submission (after expressions of interest) there was only just over a month and I had a weeks holiday booked the week before the final deadline… Trying to pick-up internet on holiday in Ireland, actually on Craggy Island (where Father Ted was filmed) was somewhat of a challenge. Fortunately every other applicant was in the same boat and I already had a good amount of material prepared on my ambitions for accelerated computing. Once through to the interview stage having a number of mock interviews helped tremendously in calming my nerves and polishing my pitch.

Who are your project partners?

NVIDIA, ARCHER, EPCC, OCR, ACRC, Oxford e-Science, WRG and N8, TRansport Systems Catapult, DNV.GL.

Bradford University, EPCC (Edinbugh), UCL, Oxford University

Tell me about your RSE group.

Sheffield has two EPSRC RSE fellows and we’ve teamed up to form the Sheffield Research Software Engineering group. We’ve only existed for a month! At the moment its just us but we have funds to recruit a few more people so watch this space.

Which programming languages and technologies do you regularly use?

It’s probably easier to list those that I don’t regularly use! My GOTO: languages are (get it? but seriously I don’t program with GOTO’s);

  • C
  • C++ (note not the same as above and should not be referred to as C/C++ #endrant)
  • CUDA
  • OpenMP/OpenACC/OpenGL
  • Assembly (ARM and PTX)
  • C++ extensions: e.g. Qt (for UI dev), Boost and C++11.

Other languages I use slightly less regularly are;

  • Python,
  • Java (if I have to, which I do especially for Eclipse plugins)
  • Fortran,
  • Javascript

Are there any languages/technologies that you used to use a lot but have now moved away from? Why?

I use java less and less now. It was taught on my undergraduate curriculum very heavily and at one point was the future for OpenGL on the web. There is just no need for java applets any more….

Is there anything on your ‘to-learn’ list?

Ohh yes. Vulkan is at the top of my list. As the successor to OpenGL for graphics with multi core support I look forward to integrating it with accelerated simulation models.

Do you have any advice for anyone who wants to become a Research Software Engineer?

Software underpins everything and is embedded within almost every research domain. Don’t let anyone tell you that there is no career progression for RSEs in academia. They (the research community) need us just as much as we need them and it’s up to you (and the collective us) to show the world how vital RSEs are in the academic environment.