In defense of inefficient scientific code
One part of my job that I really enjoy is the optimisation of researchers’ code. Typically, the code comes to me in a language such as MATLAB or Mathematica and may take anywhere from a couple of hours to several weeks to run. I’ve had some nice successes recently in areas as diverse as finance, computer science, applied math and chemical engineering, among others. The size of the speed-up can vary from 10% right up to 5000% (yes, 50 times faster!) and that’s before I break out the big guns such as Manchester’s Condor pool or turn the code over to our HPC specialists for some SERIOUS (yet more time-consuming in terms of developer time) optimisations.
Reporting these speed-ups to colleagues (along with the techniques I used) gets various responses such as ‘Well, they shouldn’t do time-consuming computing using high-level languages. They should rewrite the whole thing in Fortran’, or words to that effect. I disagree!
In my opinion, high-level programming languages such as Mathematica, MATLAB and Python have democratised scientific programming. Now, almost anyone who can think logically can turn their scientific ideas into working code. I’ve seen people who have had no formal programming training at all whip up models, get results and move on with their research. Let’s be clear here: it’s results that matter, not how you coded them.
It comes down to this. CPU time is cheap. Very cheap. Human time, particularly specialised human time, is expensive.
Here’s an example: Earlier this year I was working with a biologist who had put together some MATLAB code to analyse her data. She had written the code in less than a day and it gave the correct results but it ran too slowly for her tastes. Her sole programming experience came from reading the MATLAB manual and yet she could cook up useful code in next to no time. Sure, it was slow and (to my eyes) badly written but give the gal a break…she’s a professional biologist and not a professional programmer. Her programming is a lot better than my biology!
In less than two hours I gave her a crash course in MATLAB code optimisation: how to use the profiler, vectorisation and so on. We identified the hotspot in the code and, between us, recoded it so that it was an order of magnitude faster. This was more than fast enough for her needs; she could now analyse data significantly faster than she could collect it. I realised that I could make it even faster by using parallelised MEX functions but it would probably take a few more hours’ work. She declined my offer…the code was fast enough.
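For readers who haven’t tried this workflow, here is a minimal sketch of the profile-then-fix loop, written in Python rather than MATLAB. Everything in it is hypothetical: the `analyse` pipeline, the data and the function names are invented for illustration and are not the biologist’s code. The fix shown is also a swapped-in one (replacing a list lookup with a set lookup) rather than the vectorisation we actually used, simply because that keeps the example within the standard library.

```python
import cProfile
import io
import pstats


def count_matches_slow(samples, reference):
    """Count samples that also appear in reference (a list)."""
    total = 0
    for s in samples:
        if s in reference:  # linear scan of the list on every iteration
            total += 1
    return total


def count_matches_fast(samples, reference):
    """Same result, but a set gives constant-time lookups on average."""
    reference_set = set(reference)
    return sum(1 for s in samples if s in reference_set)


def analyse(samples, reference):
    """Hypothetical analysis pipeline: a cheap filter, then the hotspot."""
    cleaned = [s for s in samples if s >= 0]
    return count_matches_slow(cleaned, reference)


def profile_report(func, *args):
    """Run func under cProfile; return its result and a text report
    sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


# Made-up data: even numbers checked against multiples of three.
samples = list(range(0, 4000, 2))
reference = list(range(0, 4000, 3))

result, report = profile_report(analyse, samples, reference)
print(report)  # count_matches_slow should account for most of the run time

# The rewritten hotspot must give the same answer as the original.
assert count_matches_fast(samples, reference) == result
```

The point is the order of operations: measure first, change only the function the profiler points at, then check that the answer is unchanged. How much faster the second version is will depend on the data and the machine, so I haven’t quoted a figure.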
In my opinion, this is an optimal use of resources. I spend my days obsessing about mathematical software and she spends her days obsessing about experimental biology. She doesn’t need a formal course in how to write uber-efficient code because her code runs as fast as she needs it to (with a little help from her friends). The solution we eventually reached might not be the most CPU-efficient one but it is a good trade-off between CPU efficiency and developer efficiency.
It was easy…trivial even…for someone like me to take her inefficient code and turn it into something that was efficient enough. However, the whole endeavour relied on her producing working code in the first place. Say high-level languages such as MATLAB didn’t exist…then her only options would be to hire a professional programmer (cash expensive) or spend a load of time learning how to code in a low-level language such as Fortran or C (time expensive).
Also, because she is a beginner programmer, her C or Fortran code would almost certainly be crappy, and one thing I am sure of is ‘Crappy MATLAB/Python/Mathematica/R code is a heck of a lot easier to debug and optimise than crappy C code.’ Segfault, anyone?