## Archive for the ‘programming’ Category

A couple of weeks ago, a small group of us hit on the idea of running research programming tutorials in a cafe. The ‘plan’ was that we’d develop some self-paced programming tutorial material, take over a section of the main campus Cafe (Coffee Revolution) for a couple of hours in the evening and invite some researchers to come and learn something new for free.

For our first session, we chose to do a **very** gentle introduction to R. The students worked through the material, which started right at the beginning with installing R and RStudio, while a group of volunteer facilitators walked the room answering questions, solving problems and forming collaborations.

I can’t stress the importance of the facilitators enough! There is no chance that this format would have worked well without a group of skilled facilitators. On the day, I was joined by

- Research Software Engineer, David Jones
- PhD Student, Claire Green
- Post Doc, Will Furnass

I also had support of a few other people in the development stages. Thanks so much to all!

It was a lot of fun to do and student feedback has been fantastic! My favourite comment came from a medical doctor who said ‘**I had no idea about computer programming and I don’t think I would be brave enough to try it on my own. Yesterday, I realised that R can be something useful and not really hard to learn.’ **

I find interactions like this to be hugely motivational!

There was a real buzz in the room, everyone seemed to learn something useful and I walked away from the evening with a couple of interesting follow-up collaborations in the bag. There were lots of calls for future sessions on topics ranging from more advanced R through to Python, MATLAB, Mathematica and High Performance Computing. The self-paced, flipped-classroom style of teaching was also a great hit!

So, that’s what went right. What about what went wrong?

**Installation Problems**

We deliberately allowed time for the installation of R in the session. Ensuring that the attendees had a working install of R and RStudio on their own kit was part of the point. Before the session, I did trial installs on Windows and Mac and everything went without a hitch. Other members of the team tried fresh installs on Linux.

*“Installation’s going to be a doddle…no worries” I thought.*

The very first attendee who called me over for help couldn’t get RStudio started on her Mac. It crapped out with an error message I’d never seen before. A bit of googling determined that it was because she had several old versions of R already installed and RStudio took exception to this.

We also had Linux users of various flavours and most of them had problems. A user of Arch Linux gave up on trying to install RStudio and used the command line instead. One linux user called me over after he started installing the ggplot2 package asking ‘This has been compiling for **ages**, is that normal?’ Fortunately, we were in a cafe so he could go get himself a brew while waiting.

Some people already had versions of R and RStudio installed from waaaaay back and so didn’t feel it necessary to upgrade to the latest versions. These people discovered that they couldn’t install packages because ‘foo isn’t available for R version whatever’.

It was all rather painful to be honest! We were in full technical-support mode…but at least people left the session with working, up to date versions R and R Studio….mostly!

**Power!**

There wen’t many power sockets. We didn’t think much about this in advance. Ball dropped!

For a feeble attempt at a defence I’ll mention that the battery on my laptop is superb and I spend hours working in the host cafe without worrying about power. Since I’ve been so spoiled, I’ve forgotten how important a mains socket is when your battery sucks.

**Next Steps**

This session was an experiment — something quickly spun up to see if it might work. I’m happy to report that it did!

Our main problem is that we’ve now created demand. Demand for repeats of this session for new audiences, demand for new material and demand for further consultancy. How fortunate for us at Sheffield that we have a newly created Research Software Engineering group to help meet this demand.

Say you have two vectors in R (These are taken from my tutorial Simple nonlinear least squares curve fitting in R)

xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9) ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)

We put these in a data frame with

data = data.frame(xdata=xdata,ydata=ydata)

This looks like this in R

xdata ydata 1 -2.00 0.699369 2 -1.64 0.700462 3 -1.33 0.695354 4 -0.70 1.039050 5 0.00 1.973890 6 0.45 2.411430 7 1.20 1.910910 8 1.64 0.919576 9 2.32 -0.730975 10 2.90 -1.420010

Exporting to a .csv file is done using the standard R function, **write.csv**

write.csv(data,file='example_data.csv')

The resulting .csv file looks like this:

"","xdata","ydata" "1",-2,0.699369 "2",-1.64,0.700462 "3",-1.33,0.695354 "4",-0.7,1.03905 "5",0,1.97389 "6",0.45,2.41143 "7",1.2,1.91091 "8",1.64,0.919576 "9",2.32,-0.730975 "10",2.9,-1.42001

I don’t want to include the row numbers in my output. To achieve this, we do

write.csv(data,file='example_data.csv',row.names=FALSE)

This gets us a file that looks like this:

"xdata","ydata" -2,0.699369 -1.64,0.700462 -1.33,0.695354 -0.7,1.03905 0,1.97389 0.45,2.41143 1.2,1.91091 1.64,0.919576 2.32,-0.730975 2.9,-1.42001

I can also remove the quotes around xdata and ydata with **quote=FALSE**

write.csv(data,file='example_data.csv',row.names=FALSE,quote=FALSE)

giving the file below

xdata,ydata -2,0.699369 -1.64,0.700462 -1.33,0.695354 -0.7,1.03905 0,1.97389 0.45,2.41143 1.2,1.91091 1.64,0.919576 2.32,-0.730975 2.9,-1.42001

**Changing the separator**

Despite the fact that they are asking R to write a **comma** separated file, some people try to change the separator. Perhaps you’d like to try changing it to a tab for example. The following looks reasonable:

write.csv(data,file='example_data.csv',row.names=FALSE,quote=FALSE,sep="\t")

Although it understands what you are trying to do, R will completely ignore your request!

Warning message: In write.csv(data, file = "example_data.csv", row.names = FALSE, : attempt to set 'sep' ignored

This is because **write.csv** is designed to ensure that some standard .csv conventions are followed. It’s trying to protect you against yourself!

In the UK, the convention for .csv files is to use . for a decimal point and , as a separator and that’s the convention that **write.csv** sticks to. Other countries have a different convention – they use a , for the decimal point and a ; for the separator. The function **write.csv2** takes care of that for you.

If you absolutely must change the separator to something else, make use of **write.table** instead:

write.table(data,file='example_data.csv',row.names=FALSE,quote=FALSE,sep="\t")

Now, the file will come out like this:

xdata ydata -2 0.699369 -1.64 0.700462 -1.33 0.695354 -0.7 1.03905 0 1.97389 0.45 2.41143 1.2 1.91091 1.64 0.919576 2.32 -0.730975 2.9 -1.42001

Further reading: Official write.table documentation in R

I sometimes give a talk on basic research software engineering called **‘Is your research correct?’** (slides here). Near the beginning of this talk I refer to what I’ve modestly named ‘Croucher’s Law’

CROUCHER’S LAW

I CAN BE AN IDIOT AND WILL MAKE MISTAKES.

Croucher’s law has a corollary:

YOU ARE NO DIFFERENT!

The idea is that once you accept this aspect of yourself, you can start to adopt working practices to mitigate against it. In the context of programming, it includes things such as automation, version control, adopting testing and so on.

For me, this isn’t just a law for programming — it’s a law that can be applied to every aspect of life. Unlike my parents, for example, I automate the payment of my bills by using direct debit because I know I’ll eventually forget to pay something otherwise.

The genesis of Croucher’s law demonstrate’s its truth. While sat in a talk given by Jos Martin of The Mathworks, he suddenly stopped and said ‘Mike. We need to talk about Croucher’s law!’ before moving to his next slide which had the title ‘Martin’s Law’. It was very similar to ‘mine’ and it turns out that I had seen his talk years before and had subconsciously ripped him off!

The fact that I had forgotten this demonstrates to me that Croucher’s law is the stronger result :)

**Other relevant posts from WalkingRandomly**

- On failure – I fail all the time…and that’s OK.
- Is your research software correct?

While waiting for the rain to stop before heading home, I started messing around with the heart equation described in an old WalkingRandomly post. Playing code golf with myself, I worked to get the code tweetable. In Python:

from pylab import *

x=r_[-2:2:0.001]

show(plot((sqrt(cos(x))*cos(200*x)+sqrt(abs(x))-0.7)*(4-x*x)**0.01)) pic.twitter.com/gbOTbYSaIG— Mike Croucher (@walkingrandomly) February 8, 2016

In R:

x=seq(-2,2,0.001)

y=Re((sqrt(cos(x))*cos(200*x)+sqrt(abs(x))-0.7)*(4-x*x)^0.01)

plot(x,y)#rstats pic.twitter.com/trpgEnNna4— Mike Croucher (@walkingrandomly) February 8, 2016

I liked the look of the default plot in R so animated it by turning 200 into a parameter that ranged from 1 to 200. The result was this animation:

Finding this animation based on previous tweets oddly mesmerising #rstats pic.twitter.com/e3q6lZqWcP

— Mike Croucher (@walkingrandomly) February 8, 2016

The code for the above isn’t quite tweetable:

options(warn=-1) for(num in seq(1,200,1)) { filename = paste("rplot" ,sprintf("%03d", num),'.jpg',sep='') jpeg(filename) x=seq(-2,2,0.001) y=Re((sqrt(cos(x))*cos(num*x)+sqrt(abs(x))-0.7)*(4-x*x)^0.01) plot(x,y,axes=FALSE,ann=FALSE) dev.off() }

This produces a lot of .jpg files which I turned into the animated gif with ImageMagick:

convert -delay 12 -layers OptimizeTransparency -colors 8 -loop 0 *.jpg animated.gif

John D Cook published a great article on automation recently. He discusses the commonly-held idea that the primary reason to automate things is to save time. As anyone who’s actually gone through this process will tell you, this strategy can often backfire and John points to a comic from the ever-wonderful xkcd that illustrates this perfectly.

John suggests that another reason to automate is to save mental energy rather than time and I completely agree! This is a great reason to automate. When you are under pressure to complete a task that has to be done right first time, being able to simply push the big red button and KNOW that it will work is worth a great deal.

**Automation as knowledge storage and transfer**

Another use of automation is as a way to store and transfer the knowledge of how to get things done.

I work with a huge array of technologies, spending a large part of my working day poring through manuals, documentation, textbooks and google searches figuring out how to do some task, foo. By the end of the project, I’ll be an expert at doing foo but I know that this expertise won’t last. I’ll soon be moving onto the next project, the next set of technologies and my hard-won knowledge will leak from my brain-cache as quickly as it was filled.

I often find that the fastest way to distill my knowledge of how to do something is to write a script that automates it. It’s often more concise and quicker to write than documentation and is usually useful to me and possibly others. It also serves as a great launching point for relearning the material if ever I revisit this particular set of technologies and tasks.

**Automate to improve your processes**

Having an automated script also allows others to easily reproduce what I have done. You want what I have? Run this thing and it’s yours. A favour from me to you!

Initially, this looks and feels like an act of pure altruism. I put in a large amount of hard work and someone else benefits. In my experience, however, payback always comes my way when those who use my work give me feedback on how to do it better.

Way back in 2008, I wrote a few blog posts about using mathematical software to generate christmas cards:

- Mathematical Christmas Cards – Walking Randomly Christmas Challenge
- A MATLAB Christmas card
- Christmas geetings – SAGE style

I’ve started moving the code from these to a github repository. If you’ve never contributed to an open source project before and want some practice using git or github, feel free to write some code for a christmas message along similar lines and submit a Pull Request.

A recent trend on Facebook is to create a wordcloud of all of your posts using an external service. I chose not to use it because I tend to use Facebook for personal interactions among close friends and I didn’t want to send all of my data to another external company.

Twitter is a different matter, however! All of the data is open and it’s very easy to write a computer program to generate Twitter world clouds without the need for an external service.

I wrote a simple script in R that generates a wordcloud from the most recent 3200 tweets and outputs the top 200 words (get the code on github). The script removes many of the uninteresting words such as the, of, and that would otherwise dominate the cloud. These stopwords come from the Top100Words list of the R package qdap but I also added a few more such as ‘just’ and ‘me’ that I seem to use a lot.

This is the current wordcloud for my twitter account, walkingrandomly. Click on the image to see a bigger version. My main interests are very clear – Python programming, research software, data and anything that’s new!

Once I had seen my wordcloud, I wondered how things would look for other twitter users who I pay a lot of attention to. This is how it looks for Manchester University’s Nick Higham. Clearly he’s big on SIAM, Manchester, and Matrix Analysis!

I then looked at my manager at Sheffield University, Neil Lawrence. Neil finds data and the city of Sheffield very important and also writes about workshops, science, blog posts and machine learning a lot.

The R code that generated these wordclouds is available on github but it won’t work out of the box. You’ll need to register with twitter for app development (It’s free and fairly straightforward) and get various access keys before you can use the code.

It is possible to write quick, interactive demonstrations in a variety of languages these days. Functions such as Mathematica’s Manipulate, Sage Math’s interact and IPython’s interact allow programmers to write functional graphical user interfaces with just a few lines of code.

Earlier this week, I hosted a session in the Faculty of Engineering at The University of Sheffield where Maplesoft showed us, among other things, their version of this technology. This blog post is an extension of my notes from this part of the session.

- The Maple Worksheet for this blog post is available on github.

The series command expands a function as a power series around a point. For example, let’s expand sin(x) as a power series around the point x=0.

series(sin(x), x = 0, 10)

If we try to plot this, we get an error message

plot(series(sin(x), x = 0, 10), x = -2*Pi .. 2*Pi, y = -3 .. 3) Warning, unable to evaluate the function to numeric values in the region; see the plotting command's help page to ensure the calling sequence is correct

This is because the output of the series command is a series data structure — something that the plot function cannot handle. We can, however, convert this to a polynomial which is something that the plot function can handle

convert(series(sin(x), x = 0, 10), polynom)

Wrapping the above with plot gives:

plot(convert(series(sin(x), x = 0, 10), polynom), x = -2*Pi .. 2*Pi, y = -3 .. 3);

Let’s see how close this is to the sin(x) curve by plotting them both together

plot([sin(x), convert(series(sin(x), x = 0, 10), polynom)], x = -2*Pi .. 2*Pi, y = -3 .. 3);

It would be nice if we could see how the approximation varies as we vary the number of terms in the expansion. Change the value 10 to a parameter a, pass the whole thing to the Explore function and we get an interactive widget.

Explore(plot([sin(x), convert(series(sin(x), x = 0, a), polynom)], x = -2*Pi .. 2*Pi, y = -3 .. 3), parameters = [a = 2 .. 20]);

Here’s a screenshot of it:

**Adding extra parameters**

It would also be nice to vary the point we expand around. Change the value 0 to b and add an extra parameter to Explore to get two sliders instead of one:

Explore(plot([sin(x), convert(series(sin(x), x = b, a), polynom)], x = -2*Pi .. 2*Pi, y = -3 .. 3), parameters = [a = 2 .. 20, b = -2*Pi .. 2*Pi]);

To see what this looks like, open the companion worksheet in Maple.

**Adding labels to the sliders**

We can change the labels on the sliders as follows

Explore(plot([sin(x), convert(series(sin(x), x = b, a), polynom)], x = -2*Pi .. 2*Pi, y = -3 .. 3), parameters = [[a = 2 .. 20, label = `Number Of Terms`], [b = -2*Pi .. 2*Pi, label = `Expansion location`]]);

To see what this looks like, open the companion worksheet in Maple.

**Adding initial values**

Finally, let’s set some starting values for each slider

Explore(plot([sin(x), convert(series(sin(x), x = b, a), polynom)], x = -2*Pi .. 2*Pi, y = -3 .. 3), parameters = [[a = 2 .. 20, label = `Number Of Terms`], [b = -2*Pi .. 2*Pi, label = `Expansion location`]], initialvalues = [a = 2, b = 1]);

The resulting interactive widget looks like this:

Not bad for one line of code!

**Upload to the Maple Cloud**

At The University of Sheffield, we are lucky because all of our staff and students have access to Maple on both university-owned and personally-owned equipment. If your audience isn’t as fortunate, they can access the resulting worksheet on the Maple Cloud.

The ever-superb John D. Cook recently found this lovely looking curve in a book he’s currently reading

John posted some Python code that reproduced this curve. I ~~stole~~ borrowed his code, put it in a Jupyter notebook and wrapped it in an interactive widget to allow me to play with the parameters and see what other curves I could come up with. The result looks like this.

If you’d like something where those sliders work, you need to run the notebook I’ve created in Project Jupyter. Here are 2 ways to do that.

- Download the notebook from github
- Method 1: Upload this notebook to Try Jupyter.
- Method 2: Install Anaconda Python on your machine. Launch the notebook and open the file downloaded above.

Once you have the notebook open, click on **Cell**->**Run All **and play with the sliders that pop up.

Other posts about these curves:

- Random cyclic curves – Includes code written in Haskell
- A geogebra applet

**Update: 2nd July 2015 **The code in github has moved on a little since this post was written so I changed the link in the text below to the exact commit that produced the results discussed here.

Imagine that you are a very new MATLAB programmer and you have to create an N x N matrix called A where A(i,j) = i+j

Your first attempt at a solution might look like this

N=2000 % Generate a N-by-N matrix where A(i,j) = i + j; for ii = 1:N for jj = 1:N A(ii,jj) = ii + jj; end end

On my current machine (Macbook Pro bought in early 2015), the above loop takes **2.03 seconds.** You might think that this is a long time for something so simple and complain that MATLAB is slow. The person you complain to points out that you should preallocate your array before assigning to it.

N=2000 % Generate a N-by-N matrix where A(i,j) = i + j; A=zeros(N,N); for ii = 1:N for jj = 1:N A(ii,jj) = ii + jj; end end

This now takes **0.049 seconds** on my machine – a speed up of over 41 times! MATLAB suddenly doesn’t seem so slow after all.

Word gets around about your problem, however, and seasoned MATLAB-ers see that nested loop, make a funny face twitch and start muttering ‘vectorise, vectorise, vectorise’. Emails soon pile in with vectorised solutions

% Method 1: MESHGRID. [X, Y] = meshgrid(1:N, 1:N); A = X + Y;

This takes **0.025 seconds** on my machine — a healthy speed-up on the loop-with-preallocation solution. You have to understand the meshgrid command, however, in order to understand what’s going on here. It’s still clear (to me at least) what its doing but not as clear as the nice,obvious double loop. Call me old fashioned but I like loops…I understand them.

% Method 2: Matrix multiplication. A = (1:N).' * ones(1, N) + ones(N, 1) * (1:N);

This one is MUCH harder to read but you don’t worry about it too much because at **0.032 seconds** it’s slower than meshgrid.

% Method 3: REPMAT. A = repmat(1:N, N, 1) + repmat((1:N).', 1, N);

This one appears to be interesting! **At 0.009 seconds**, it’s the fastest so far – by a healthy amount!

% Method 4: CUMSUM. A = cumsum(ones(N)) + cumsum(ones(N), 2);

Coming in at **0.052 seconds**, this cumsum solution is slower than the preallocated loop.

% Method 5: BSXFUN. A = bsxfun(@plus, 1:N, (1:N).');

Ahhh, bsxfun or ‘The Widow-maker function’ as I sometimes refer to it. Responsible for some of the fastest and most unreadable vectorised MATLAB code I’ve ever written. In this case, it brings execution time down to **0.0045 seconds**.

Whenever I see something that can be vectorised with a repmat, I try to figure out if I can rewrite it as a bsxfun. The result is usually horrible to look at so I tend to keep the original loop commented out above it as an explanation! This particular example isn’t too bad but bsxfun can quickly get hairy.

**Conclusion**

Loops in MATLAB aren’t anywhere near as bad as they used to be thanks to advances in JIT compilation but it can often pay, speed-wise, to switch to vectorisation. The price you often pay for this speed-up is that vectorised code can become very difficult to read.

If you’d like the code I ran to get the timings above, it’s on github (link refers to the exact commit used for this post) . Here’s the output from the run I referred to in this post.

Original loop time is 2.025441 Preallocate and loop is 0.048643 Meshgrid time is 0.025277 Matmul time is 0.032069 Repmat time is 0.009030 Cumsum time is 0.051966 bsxfun time is 0.004527

- MATLAB Version: 2015a
- Early 2015 Macbook Pro with 16Gb RAM
- CPU: 2.8Ghz quad core i7 Haswell

This post is based on a demonstration given by Mathwork’s Ken Deeley during a recent session at The University of Sheffield.