Archive for the ‘Open Data Science’ Category

December 12th, 2016

I was in Stockholm last week to give an invited talk at the Workshop on Nordic Big Biomedical Data for Action. I was representing the Software Sustainability Institute and delivered the latest version of my talk Is Your Research Software Correct?

It was a great event which introduced me to some nice initiatives going on waaaay up north, such as Code Refinery, whose aims align well with those of the UK’s Software Sustainability Institute. Code Refinery was introduced by Radovan Bast. Slide deck at


Other talks included the introduction of a scalable, parallel version of BLAST, Big Data Processing for Genomics and Delivering Bioinformatics Software as Virtual Machine images. I also got the chance to geek out with some High Performance Computing and Bioinformatics people over interesting Swedish food.

Slides from most of the talks are available at

September 5th, 2016

One of the great things about being a Research Software Engineer is the diversity of work you can get involved with. I specialise in smaller interventions, which means that I can be working with physicists on Monday, engineers on Tuesday, geneticists on Wednesday… you get the idea.

Last month, I got to work with some ecologists along with Anna Krystalli. We undertook the arduous journey from Sheffield down to Exeter to deliver talks and workshops at a post-conference symposium on reproducibility in science, organised by Malika Ihle and Isabel Winney, at the International Symposium on Behavioural Ecology.

I gave my talk, Is your research software correct?, and also delivered a workshop on using projects and version control using R and RStudio in the Code Cafe style. For the full write up of the day, see the excellent blog post by Anna over at the Mozilla Science Lab blog.

Updates : More resources

May 10th, 2016

I learned about entropy as part of my undergraduate Physics education, but it turns out that the concept turns up in many fields including linguistics, thermodynamics, information theory, chemistry and artificial intelligence.

As part of Sheffield’s Open Data Science Initiative, computer scientist Neil Lawrence has teamed up with linguist Dagmar Divjak to organise a cross-faculty discussion meeting on the subject of entropy.

For more details on the day’s events, and to register, see


I wasted a little time producing the above logo for the event using Mathematica.

Here’s the source code:-

(*Consider a column one pixel at a time. Invert the pixel if a random number is below some threshold*)
flipbit[col_, prob_] := Module[{result, x},
  result = col;
  Do[
   If[RandomReal[] <= prob,
    If[result[[x]] == 1, result[[x]] = 0, result[[x]] = 1]
    ],
   {x, 1, Length[col]}
   ];
  result
  ]

text = "Entropy";
image = Rasterize[Text[Style[text, White, Italic, 190]], 
   Background -> Black];
imageData = ImageData[Binarize[image]];
const = 1/Dimensions[imageData][[2]]*0.42;
(*Apply flipbit to all columns. Increase probability of flipping as you move along the x-axis*)
(*Transpose back to row order and convert the result to an image*)
logo = Image[Transpose[
   MapIndexed[flipbit[#1, const*#2[[1]]] &, Transpose[imageData]]]];
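
As a rough sketch of what the code above is doing, the flip probability for each column can be written down explicitly. The symbols W, i, p_i and H are notation introduced purely for this illustration, and the link to the binary entropy function is just one way of reading the picture, not something the code itself computes.

```latex
% W = image width in pixels (Dimensions[imageData][[2]]), i = column index; assumed notation
p_i = \frac{0.42\, i}{W}, \qquad i = 1, \dots, W

% Binary entropy, in bits, of a single flip decision made with probability p
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)
```

Since H(p) increases for p between 0 and 1/2, the randomness injected into each pixel grows from left to right, reaching roughly 0.98 bits per pixel at the right-hand edge where p = 0.42.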

Finally, I found this quote about entropy that I quite like:

You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.

John von Neumann to Claude Shannon, suggesting a name for his new uncertainty function. Source: Wikiquote

December 8th, 2015

Back in October, I wrote about the Open Data Science events we’ve started running at the University of Sheffield. These evening events, held at Sheffield’s The Hide, are attended by researchers, students and the occasional random who have an interest in data science (collectively referred to as Data Hipsters by some).

At their core, these events are just an excuse for researchers from many disciplines to get together and explore common interests in an informal way and they’ve been a great success.

This month, we’ve gone big and not just in the ‘big data’ sense. We have two free data science events:

October 13th, 2015

The Sheffield Open Data Science Initiative

The University of Sheffield Open Data Science Initiative (ODSI) is really starting to take off. So what is it?

From the website, the aims of the ODSI are:

  1. Make new analysis methodologies available as widely and rapidly as possible with as few conditions on their use as possible (see the ML@SITraN group software pages and the local software page).
  2. Educate our commercial, scientific and medical partners in the use of these latest methodologies (see
  3. Act to achieve a balance between data sharing for societal benefit and the right of an individual to own their data. (see our summary of our efforts in public understanding and debate)

My role within this initiative is to work on various aspects of research software throughout the University of Sheffield (and beyond!). I am a fellow of the Software Sustainability Institute and you could sum up everything I try to do with their motto Better Software, Better Research.

Join us on October 20th 2015

We have just started a programme of events which aims to bring together a wide variety of people interested in data, machine learning and research software (my favourite part!). The first such event is at The Data Hide on October 20th 2015 at the University of Sheffield.

There will be a talk on Research Data Management for Computational Science by @ctjacobs_uk as well as lightning talks: What Kind of AI are we Creating? by @lawrennd, Machine Learning for Chemical Simulations by Chris Handley and a demonstration of how great Reveal.js is by me.

This will be followed by food, beer and an opportunity to chat and geek out.

We would be honoured if you would join us.

Would you like to present at a future event?

Contact me to see what we can do together.

April 27th, 2015

Back in December 2014, I learned that I’d be moving from The University of Manchester to The University of Sheffield to do the type of thing I’ve always done, which is a combination of research software engineering and research software support.

I’ve been in Sheffield for two months now and am having a blast! There’s so much cool stuff going on here that it makes my head spin a little and the community at Sheffield have welcomed me with open arms. It truly is a wonderful place in which to work.

One of the departments I’ve started working with is The Sheffield Institute for Translational Neuroscience (SITraN). My contributions have been relatively minor so far: a bit of Python coding for a machine learning project called GPy and some code speed-up work in R for Winston Hide and his collaborators. When I hang out in SITraN, I usually sit with the machine learning people and listen in on their conversations about Python, MATLAB, GPUs, C++ and R. It’s essentially Nerdvana for someone like me.

On to the point of this blog post. SITraN now have their own blog called SITraNsmissions where they’ll be discussing various aspects of their work and how it applies the principles of neuroscience to help treat diseases such as Motor Neurone Disease (MND). In the video below, taken from SITraNsmissions’ first blog post, Professor Pamela Shaw gives an overview of the work that SITraN does.

January 15th, 2015

I recently had the good fortune to be involved in the creation of a European H2020 grant proposal called OpenDreamKit along with an international team from 15 institutions. My own contributions to this proposal were extremely modest and it was my first ever experience of being directly involved in an academic grant proposal. It’s the very first thing I’ve been involved with as part of my new appointment at The University of Sheffield.

Quoting from the proposal:

OpenDreamKit will deliver a flexible toolkit enabling research groups to set up Virtual Research Environments, customised to meet the varied needs of research projects in pure mathematics and applications and supporting the full research life-cycle from exploration, through proof and publication, to archival and sharing of data and code.

One of the many things that’s so great about this proposal is how it was written. Co-ordinated by Nicolas Thiéry, 33 contributors wrote it in LaTeX with version control provided by git and GitHub. The video below, produced using gource, is a visualisation of the GitHub repo over time and shows how we all danced around and with each other. My new manager, Neil Lawrence, who was much more deeply involved than I, has good things to say about the process too.

The proposal was submitted yesterday after a lot of hard work and, as Nicolas Thiéry commented in one of his emails to the group, is “Open from start to end :-)”

The Sage Facebook page summed up my thoughts about this project perfectly: “See the collaboration behind the *proposal*, and imagine the collaboration in the software!”