Yesterday saw the biggest highlight in the technical calendar so far for me… Microsoft has brought the Linux command line to Windows.
Excited by the command line
OK, so I’ll admit it….improvements to command line tools seriously excite me! Windows 10 brought in a couple of minor improvements to the command line last year and I acted like it was a second birthday. Imagine then, the excitement I felt while watching Microsoft’s announcement of Linux integration into Windows (Spin to 2:24 to see it). I could barely form sentences! When my wife came into the kitchen to see what the commotion was, all I could manage was “Bash! On Windows!….GIT! OMG! Not a VM! vi..sed..awk…gcc! OMG!”
So why so much excitement?
Shell scripts are more than mere automation, they are repositories of knowledge – they do some task for you and also explain how that task is done. They are wonderful things that can educate, take the drudgery out of a researcher’s life and lead to more reproducible research. The problem is that Linux and Mac users speak a different scripting language (Bash) to Windows users (Windows batch or, more recently, PowerShell).
In short, a script written on a Linux or Mac machine wouldn’t run on a Windows machine unless you jumped through some hoops.
Bash just became a cross-platform scripting solution
Various solutions for running Bash scripts on Windows have existed for a while. Cygwin, for example, compiles Linux tools to run on Windows, which works very effectively for many situations. Additionally, the Windows version of git comes with an emulated Bash mode that’s good enough to teach the scripting lesson from Software Carpentry. Neither of these solutions is perfect, however, and to me they’ve always felt like slightly awkward patches. The resulting binaries in projects such as Cygwin are necessarily different from those used in Linux Land.
This new collaboration between Canonical and Microsoft changes the game! Now, Linux tools appear like first-class citizens in the Windows world. When you run Bash on Windows, it will be the exact same Bash that’s run on Linux. An automated research analysis developed on a Linux machine will work exactly the same way on Windows.
The same skills apply, from tablet to supercomputer
Furthermore, when we teach introductory shell scripting to researchers we will be teaching them tools that allow them to work on all operating systems and all hardware types.
The same skills will apply to Mac, Linux and Windows from tablets to High Performance Computing clusters and that’s a wonderful thing.
I recently tried to use the XE 2015 Update 6 version of the Intel C++ compiler with Visual Studio Community Edition 2015 Update 1. Even a simple Hello World console application didn’t work. I got lots of compilation errors that looked like this:
1>C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\exception(248): error : expected an attribute name
1> [[noreturn]] _CRTIMP2_PURE void __CLRCALL_PURE_OR_CDECL __ExceptionPtrRethrow(_In_ const void*);
1> ^
There are two solutions to this problem.
- Solution 1. Downgrade Visual Studio Community Edition 2015 to the RTM version. The way I did this was to uninstall the Update 1 version and then install the RTM version using the .iso file at https://www.microsoft.com/en-us/download/details.aspx?id=48146 (Thanks to the Visual Studio twitter team for this link!)
- Solution 2. Upgrade the Intel C++ Compiler to XE 2016.
Getting academic credit for impactful software
The topic of this year’s Software Sustainability Institute Collaborations Workshop is Software and Credit, where luminaries from the Research Software Engineering world will get together and examine the problem of getting academic credit for the development of software. An example of the problem at hand is the story of Michael Doube and BoneJ. Quoting from the SSI’s Software and Credit article:
Michael Doube of the Royal Veterinary College, London (RVC) is the lead on the BoneJ project, his 2010 paper describing BoneJ is the most highly cited paper at the RVC gaining 2 new citations per week on average(!). However it was not deemed the right shape to submit to the UK Research Evaluation Framework committee by the RVC management even though they admitted it was highly impactful.
It’s a huge problem! Software is essential for modern research but developing it isn’t often recognised as a valid research output. If developers of highly impactful software such as BoneJ struggle, what hope is there for those of us who are rather lower in the research software food chain?
Embarrassing rash? The code doctor will see you now!
When I attend these annual collaboration workshops I find that my imposter syndrome, a feeling that’s always lurking under the surface of my psyche, starts to reach a kind of fever pitch. I feel like a country-bumpkin doctor who suddenly finds himself thrust into an advanced medical research conference. Surrounded by specialist developers of new drugs, surgical techniques and high-tech scanning machines, I’m just the guy who applies ointment to patients’ embarrassing rashes.
My path to Research Software Engineer has been via the IT support route where I’ve had many job titles throughout a career defined by restructure after painful restructure. Whatever my job title was, my role has always been the same — Researchers come to me with their code problems and I do my best to solve them (or convince them that they really want a different but equivalent problem solved).
These problems include things like:-
- My code is too slow! Like 10,000 times too slow. Can you help?
- My code sometimes explodes spectacularly, can you help? It’s 10,000 lines of VBA…..with some Fortran thrown in for luck.
- How do I get my code to run on the supercomputer? Why isn’t it faster when it’s finally on it? What’s Linux?
- Um…Our paper says the answer is 0.435 but the current version of our code, the one on Bob’s pendrive, says it’s 3.6 billion. Can you help us reproduce our own results?
- How the hell do you do <insert task here> in <insert inappropriate technology here>
- People keep talking about code but all I’ve ever needed is this spreadsheet. How do I make my spreadsheet do <insert thing that REALLY needs to be done some other way>. I’m a Prof who’s best friends with your boss — your answer had better be spreadsheet-y!
- I can’t code, can you help?
- We’ve got code written in <old technology> but now we want it in <new technology>, can you help?
- We do our research in <insert expensively licensed software> and now need to run it 1000s of times simultaneously. This will cost more than the GDP of China. Can you help?
- Our research is based on this thing that’s a Fortran 77 kernel wrapped in MATLAB that’s been wrapped in Perl that we call from Python. It’s been in development for over 10 years and the computer it worked on has died. We are struggling to get it working on our new machine…can you help?
- We wrote some experimental code and it worked. REALLY well! We’ve now got lots of users and suggestions for improvements but have no process to deal with all of this. Can you help?
…and so it goes on. I work with researchers from almost every field of study and at every career stage — from Undergraduate project student through to professor and everything in between. The role is something like a mix of IT support, sysadmin, software developer, teacher, consultant, alpine guide and therapist.
It’s hard, dirty work and I love it!
Doing the job that’s put in front of you
Like many of my Research Software Engineer colleagues I have Opinions (capitalisation intended! They are strong but weakly held!) on the way things should ideally be done in research software development. These opinions have been formed from years of working in the trenches, observing the trouble people get themselves into and what’s required to get them out of it. They’ve also been informed by listening to the latest research on good practice from masters of the field.
Just as your local doctor might prescribe a healthy diet, exercise and cutting down on alcohol, I prescribe things such as version control, automation and making your code open. Despite knowing all of this advice, of course, many choose to ignore it for one reason or another and get themselves in a bit of a pickle.
One of the reasons I am so proud of the United Kingdom’s National Health Service is that no matter what you’ve done to yourself, no matter how rich or poor you are, no matter how much advice you’ve ignored, their fantastic doctors and nurses will fix you. Sure, they wish that the world was a better place and that people would take better care of themselves but, ultimately, they’ll do the job that’s put in front of them — not the one they wish they had!
I guess that when I do my job, I try to emulate this behaviour.
But where’s the impact?
You’ll not find my name on any research papers and I don’t have a big project such as BoneJ to stand behind. It is exceedingly difficult to demonstrate the impact, in the accepted academic sense of the word, of roles like this but I am convinced that they are a vital part of the research community.
One ‘solution’ would be to only offer my services to those who have already sipped from the Research Software Engineering Kool-Aid. Now that I have my RSE fellowship to stand behind, I could easily take this route and only focus on helping projects that smell a certain way…the right way. I could work on projects that have RSE time costed in from the beginning, using only the finest, freshest ingredients and released in ways that make it easy to demonstrate impact.
This route is very tempting! I could use only the technologies I love and work in an environment where my contributions were recognised at every level, by funding bodies and promotion panels in particular. Thanks to the efforts of organisations such as the Software Sustainability Institute, I believe that developers of quality projects such as BoneJ will eventually get the recognition they deserve. By focusing on such quality, high profile projects, my career would be assured!
To my mind, however, this is akin to the NHS only providing its services to rich, well-informed patients who take heed of all the good advice leaving the rest of us to suffer.
So…I’m probably not going to do that
The Accident and Emergency of Research Software Engineering
Over time, I have come to think of my particular style of work as the accident and emergency of Research Software Engineering. It’s usually unglamorous work with little hope of formal recognition and the threat of cuts (or being reassigned to printer support!) hangs over your head every day. As anyone who’s desperately needed their services at 3am can attest, however, an A+E department is exceedingly impactful!
Despite the success of the Research Software Engineering idea, I believe that the need for the A+E type of work will increase over time along with the percentage of researchers who need to write at least a little code. This tweet from Southampton University’s Ian Hawke, quoting Software Carpentry’s Greg Wilson, summarises my thoughts on this matter perfectly.
Here we go with @gvwilson : the gulf between the computing scientific “elite” and those emailing spreadsheets is growing, and that’s bad.
— Ian Hawke (@IanHawke) November 10, 2015
This observation closely matches my own experiences from the front-line of RSE support. Part of the role of practitioners such as me is to help close that gap by raising the game of those who email spreadsheets. I feel that I’m making progress in this area although I often wish I had more staff!
Another part of the role is to figure out how to get credit and demonstrate impact for this kind of work or we’ll risk losing it in the next round of cuts. I’m struggling with that aspect to be honest!
If you feel that your work has elements of an Accident + Emergency Research Software Engineer in it, feel free to speak up in the comments.
A couple of weeks ago, a small group of us hit on the idea of running research programming tutorials in a cafe. The ‘plan’ was that we’d develop some self-paced programming tutorial material, take over a section of the main campus Cafe (Coffee Revolution) for a couple of hours in the evening and invite some researchers to come and learn something new for free.
For our first session, we chose to do a very gentle introduction to R. The students worked through the material, which started right at the beginning with installing R and RStudio, while a group of volunteer facilitators walked the room answering questions, solving problems and forming collaborations.
I can’t stress the importance of the facilitators enough! There is no chance that this format would have worked well without a group of skilled facilitators. On the day, I was joined by
I also had the support of a few other people in the development stages. Thanks so much to all!
It was a lot of fun to do and student feedback has been fantastic! My favourite comment came from a medical doctor who said ‘I had no idea about computer programming and I don’t think I would be brave enough to try it on my own. Yesterday, I realised that R can be something useful and not really hard to learn.’
I find interactions like this to be hugely motivational!
There was a real buzz in the room, everyone seemed to learn something useful and I walked away from the evening with a couple of interesting follow-up collaborations in the bag. There were lots of calls for future sessions on topics ranging from more advanced R through to Python, MATLAB, Mathematica and High Performance Computing. The self-paced, flipped-classroom style of teaching was also a great hit!
So, that’s what went right. What about what went wrong?
We deliberately allowed time for the installation of R in the session. Ensuring that the attendees had a working install of R and RStudio on their own kit was part of the point. Before the session, I did trial installs on Windows and Mac and everything went without a hitch. Other members of the team tried fresh installs on Linux.
“Installation’s going to be a doddle…no worries” I thought.
The very first attendee who called me over for help couldn’t get RStudio started on her Mac. It crapped out with an error message I’d never seen before. A bit of googling determined that it was because she had several old versions of R already installed and RStudio took exception to this.
We also had Linux users of various flavours and most of them had problems. A user of Arch Linux gave up on trying to install RStudio and used the command line instead. One Linux user called me over after he started installing the ggplot2 package asking ‘This has been compiling for ages, is that normal?’ Fortunately, we were in a cafe so he could go get himself a brew while waiting.
Some people already had versions of R and RStudio installed from waaaaay back and so didn’t feel it necessary to upgrade to the latest versions. These people discovered that they couldn’t install packages because ‘foo isn’t available for R version whatever’.
It was all rather painful to be honest! We were in full technical-support mode…but at least people left the session with working, up-to-date versions of R and RStudio….mostly!
There weren’t many power sockets. We didn’t think much about this in advance. Ball dropped!
For a feeble attempt at a defence I’ll mention that the battery on my laptop is superb and I spend hours working in the host cafe without worrying about power. Since I’ve been so spoiled, I’ve forgotten how important a mains socket is when your battery sucks.
This session was an experiment — something quickly spun up to see if it might work. I’m happy to report that it did!
Our main problem is that we’ve now created demand. Demand for repeats of this session for new audiences, demand for new material and demand for further consultancy. How fortunate for us at Sheffield that we have a newly created Research Software Engineering group to help meet this demand.
Say you have two vectors in R (These are taken from my tutorial Simple nonlinear least squares curve fitting in R)
xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)
We put these in a data frame with
data = data.frame(xdata=xdata,ydata=ydata)
This looks like this in R
   xdata     ydata
1  -2.00  0.699369
2  -1.64  0.700462
3  -1.33  0.695354
4  -0.70  1.039050
5   0.00  1.973890
6   0.45  2.411430
7   1.20  1.910910
8   1.64  0.919576
9   2.32 -0.730975
10  2.90 -1.420010
Exporting to a .csv file is done using the standard R function, write.csv
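A minimal sketch of that call, assuming the output file is named example_data.csv (the name that appears in the warning message further down) and rebuilding the data frame so the snippet runs on its own:

```r
# Data from the top of this post
xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,
          2.41143,1.91091,0.919576,-0.730975,-1.42001)
data = data.frame(xdata=xdata,ydata=ydata)

# Default export: row names are written as an unnamed first column
# (the file name is an assumption, see above)
write.csv(data,file="example_data.csv")
```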
The resulting .csv file looks like this:
"","xdata","ydata"
"1",-2,0.699369
"2",-1.64,0.700462
"3",-1.33,0.695354
"4",-0.7,1.03905
"5",0,1.97389
"6",0.45,2.41143
"7",1.2,1.91091
"8",1.64,0.919576
"9",2.32,-0.730975
"10",2.9,-1.42001
I don’t want to include the row numbers in my output. To achieve this, we do
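Something along these lines should do it. This is a sketch that assumes the same example_data.csv file name as the warning message later in the post, with row.names=FALSE being the argument that drops the row numbers:

```r
# Data from the top of this post
xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,
          2.41143,1.91091,0.919576,-0.730975,-1.42001)
data = data.frame(xdata=xdata,ydata=ydata)

# row.names=FALSE suppresses the leading column of row numbers
write.csv(data,file="example_data.csv",row.names=FALSE)
```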
This gets us a file that looks like this:
"xdata","ydata"
-2,0.699369
-1.64,0.700462
-1.33,0.695354
-0.7,1.03905
0,1.97389
0.45,2.41143
1.2,1.91091
1.64,0.919576
2.32,-0.730975
2.9,-1.42001
I can also remove the quotes around xdata and ydata with quote=FALSE
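As a sketch (again assuming the example_data.csv file name, and keeping row.names=FALSE from the previous step), the call would look something like this:

```r
# Data from the top of this post
xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,
          2.41143,1.91091,0.919576,-0.730975,-1.42001)
data = data.frame(xdata=xdata,ydata=ydata)

# quote=FALSE stops write.csv wrapping the column names in double quotes
write.csv(data,file="example_data.csv",row.names=FALSE,quote=FALSE)
```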
giving the file below
xdata,ydata
-2,0.699369
-1.64,0.700462
-1.33,0.695354
-0.7,1.03905
0,1.97389
0.45,2.41143
1.2,1.91091
1.64,0.919576
2.32,-0.730975
2.9,-1.42001
Changing the separator
Despite the fact that they are asking R to write a comma separated file, some people try to change the separator. Perhaps you’d like to try changing it to a tab for example. The following looks reasonable:
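A sketch of that kind of attempt, with the argument list inferred from the warning message quoted in this post (the exact original call is an assumption):

```r
# Data from the top of this post
xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,
          2.41143,1.91091,0.919576,-0.730975,-1.42001)
data = data.frame(xdata=xdata,ydata=ydata)

# Asking write.csv for a tab separator: this runs, but emits a warning
# and the file is still written with commas
write.csv(data,file="example_data.csv",row.names=FALSE,sep="\t")
```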
Although it understands what you are trying to do, R will completely ignore your request!
Warning message:
In write.csv(data, file = "example_data.csv", row.names = FALSE, :
  attempt to set 'sep' ignored
This is because write.csv is designed to ensure that some standard .csv conventions are followed. It’s trying to protect you against yourself!
In the UK, the convention for .csv files is to use . for a decimal point and , as a separator and that’s the convention that write.csv sticks to. Other countries have a different convention – they use a , for the decimal point and a ; for the separator. The function write.csv2 takes care of that for you.
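For illustration, here is a small sketch of write.csv2 on the same data. The file name and the row.names=FALSE argument are my choices for the example rather than anything prescribed:

```r
# Data from the top of this post
xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,
          2.41143,1.91091,0.919576,-0.730975,-1.42001)
data = data.frame(xdata=xdata,ydata=ydata)

# write.csv2 uses ; as the separator and , as the decimal mark
write.csv2(data,file="example_data.csv",row.names=FALSE)
```

With this call, the header comes out as "xdata";"ydata" and the first data row as -2;0,699369.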
If you absolutely must change the separator to something else, make use of write.table instead:
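A sketch of a tab-separated export with write.table. The file name, row.names=FALSE and quote=FALSE are assumptions chosen so that the result has no row numbers and no quotes:

```r
# Data from the top of this post
xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,
          2.41143,1.91091,0.919576,-0.730975,-1.42001)
data = data.frame(xdata=xdata,ydata=ydata)

# write.table honours sep, so a tab separator works here
write.table(data,file="example_data.csv",sep="\t",row.names=FALSE,quote=FALSE)
```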
Now, the file will come out like this:
xdata	ydata
-2	0.699369
-1.64	0.700462
-1.33	0.695354
-0.7	1.03905
0	1.97389
0.45	2.41143
1.2	1.91091
1.64	0.919576
2.32	-0.730975
2.9	-1.42001
Further reading: Official write.table documentation in R