Archive for the ‘programming’ Category

November 12th, 2014

I follow a lot of software developers on Twitter, so I get to see a lot of opinions and comments about git. Here are a few of my recent favourites.

Opinions on git
I find that git is a technology that polarizes people. They either love it or they hate it…often both at the same time.

Git Tutorials
I find that I learn something new every time I read a different tutorial.

Git tips and tricks
No matter how long you’ve used git, you can always learn a new trick or two.

November 11th, 2014

RLink is Mathematica’s interface to the R language – a feature that has been extremely popular since its debut in Mathematica version 9. It’s a great package but has one or two issues. For example, RLink makes use of a built-in version of R which is currently stuck at the rather old version 2.14. Official support for the use of external versions of R, and for adding third-party libraries, varies by operating system and version of Mathematica. Windows support is great; OS X support, not so much.

Expert Mathematica user Szabolcs Horvát has written the definitive guide on how to get RLink up and running with the latest version of R on all three major operating systems, building on earlier work by Leonid Shifrin and members of the Mathematica Stack Exchange community. Thanks to this work, we can now enjoy any version of R we like with Mathematica!

June 16th, 2014

When porting code between MATLAB and Python, it is sometimes useful to produce the exact same set of random numbers for testing purposes. Both Python and MATLAB currently use the Mersenne Twister generator by default, so one assumes this should be easy…and it is, provided you use the generator in Numpy and avoid the seed 0.

Generate some random numbers in MATLAB

Here, we generate the first 5 numbers for 3 different seeds in MATLAB. Our aim is to reproduce these in Python.

>> format long
>> rng(0)
>> rand(1,5)'

ans =

   0.814723686393179
   0.905791937075619
   0.126986816293506
   0.913375856139019
   0.632359246225410

>> rng(1)
>> rand(1,5)'

ans =

   0.417022004702574
   0.720324493442158
   0.000114374817345
   0.302332572631840
   0.146755890817113

>> rng(2)
>> rand(1,5)'

ans =

   0.435994902142004
   0.025926231827891
   0.549662477878709
   0.435322392618277
   0.420367802087489

Python’s default random module

According to the documentation, Python’s random module uses the Mersenne Twister algorithm, but the implementation seems to be different from MATLAB’s since the results are different. Here’s the output from a fresh ipython session:

In [1]: import random

In [2]: random.seed(0)

In [3]: [random.random() for _ in range(5)]
Out[3]: 
[0.8444218515250481,
 0.7579544029403025,
 0.420571580830845,
 0.25891675029296335,
 0.5112747213686085]

In [4]: random.seed(1)

In [5]: [random.random() for _ in range(5)]
Out[5]: 
[0.13436424411240122,
 0.8474337369372327,
 0.763774618976614,
 0.2550690257394217,
 0.49543508709194095]

In [6]: random.seed(2)

In [7]: [random.random() for _ in range(5)]
Out[7]: 
[0.9560342718892494,
 0.9478274870593494,
 0.05655136772680869,
 0.08487199515892163,
 0.8354988781294496]

The Numpy random module

Numpy’s random module, on the other hand, seems to use an implementation identical to MATLAB’s for seeds other than 0. Notice below that for seeds 1 and 2 the results are identical to MATLAB’s, while for a seed of zero they are different.

In [1]: import numpy as np

In [2]: np.set_printoptions(suppress=True)

In [3]: np.set_printoptions(precision=15)

In [4]: np.random.seed(0)

In [5]: np.random.random((5,1))
Out[5]: 
array([[ 0.548813503927325],
       [ 0.715189366372419],
       [ 0.602763376071644],
       [ 0.544883182996897],
       [ 0.423654799338905]])

In [6]: np.random.seed(1)

In [7]: np.random.random((5,1))
Out[7]: 
array([[ 0.417022004702574],
       [ 0.720324493442158],
       [ 0.000114374817345],
       [ 0.30233257263184 ],
       [ 0.146755890817113]])

In [8]: np.random.seed(2)

In [9]: np.random.random((5,1))
Out[9]: 
array([[ 0.435994902142004],
       [ 0.025926231827891],
       [ 0.549662477878709],
       [ 0.435322392618277],
       [ 0.420367802087489]])

Checking a lot more seeds

Although the above interactive experiments look convincing, I wanted to check a few more seeds. All seeds from 0 to 1 million seemed like a good start, so I wrote a MATLAB script that generated 10 random numbers for each seed from 0 to 1 million and saved the results as a .mat file.

A subsequent Python script loads the .mat file and ensures that numpy generates the same set of numbers for each seed. It outputs every seed for which Python and MATLAB differ.
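Here’s a minimal sketch of what that Python check might look like. The file name matlab_randoms.mat and the variable name samples are assumptions; the original scripts aren’t reproduced here.

import numpy as np
import scipy.io

# Load the MATLAB-generated numbers (file and variable names are assumptions)
mat = scipy.io.loadmat('matlab_randoms.mat')
samples = mat['samples']

num_seeds, samples_per_seed = samples.shape
print('MATLAB file contains %d seeds and %d samples per seed'
      % (num_seeds, samples_per_seed))

# Regenerate with Numpy and report any seed whose numbers don't match
for seed in range(num_seeds):
    np.random.seed(seed)
    python_samples = np.random.random(samples_per_seed)
    if not np.array_equal(python_samples, samples[seed]):
        print('Random numbers for seed %d differ between MATLAB and Numpy' % seed)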

On my Mac, I opened a bash prompt and ran the two scripts as follows:

matlab -nodisplay -nodesktop -r "generate_matlab_randoms"
python python_randoms.py

The output was:

MATLAB file contains 1000001 seeds and 10 samples per seed
Random numbers for seed 0 differ between MATLAB and Numpy

System details

  • Late 2013 MacBook Air
  • MATLAB 2014a
  • Python 2.7.7
  • Numpy 1.8.1

May 16th, 2014

One of my favourite MATLAB books is The MATLAB Guide by Desmond and Nicholas Higham. The first chapter, called ‘A Brief Tutorial’, shows how various mathematical problems can be easily explored with MATLAB: things like continued fractions, random Fibonacci sequences, fractals and Collatz iterations.

Over at the SIAM blog, Don MacMillen demonstrates how it’s now possible, trivial even, to rewrite the entire chapter as an IPython notebook with all the MATLAB code replaced with Python.

The notebook is available as a gist and can be viewed statically on nbviewer.

What other examples of successful MATLAB->Python conversions have you found?

MATLAB guide in IPython

May 12th, 2014

The IPython project started as a procrastination task for Fernando Perez during his PhD and is now one of the most exciting and important pieces of software in computational science. Last month, Fernando joined us at The University of Manchester after being invited by Nick Higham of the Department of Mathematics, under the auspices of the EPSRC Network Numerical Algorithms and High Performance Computing.

While at Manchester, Fernando gave a couple of talks and we captured one of them using the University of Manchester Lecture Podcasting Service (itself based on a Python project). Check it out below.

Video: Fernando Perez speaks about IPython in Manchester – April 2014
May 8th, 2014

This is a guest article written by friend and colleague, Ian Cottam. For part 1, see http://www.walkingrandomly.com/?p=5435

So why do computer scientists use (i != N) rather than the more common (i < N)?

When I said the former distinguishes “computer scientists” from others, I meant programmers who have been trained in the use of non-operational formal reasoning about their programs. It’s hard to describe that in a sentence or two, but it is the use of formal logic to construct programs by design and to argue their correctness. It is non-operational because the meaning of a program fragment is derived (only) from the logical meaning of the statements of the programming language: formal predicate logic is extended by extra rules that say what assignments, while loops, etc., mean in terms of logical proof rules for them.

A simple, and far from complete, example is the role played by the guard of a while/for loop in C.

for (i= 0; i != N; ++i) {
    /* do stuff with a[i] */
}

Without further thought (i.e. I just use the formal rule that says that, on loop termination, the negation of the loop guard holds), I can now write:

for (i= 0; i != N; ++i) {
    /* do stuff with a[i] */
}
/* Here: i == N */

which may well be key to reasoning that all N elements of the array have been processed (and no more). (As I said, lots of further formal details omitted.)

Consider the common form:

for (i= 0; i < N; ++i) {
    /* do stuff with a[i] */
}

Without further thought, I can now (only) assert:

for (i= 0; i < N; ++i) {
    /* do stuff with a[i] */
}
/* Here: i >= N */

That is, I have to examine the loop to conclude the condition I really need in my reasoning: i==N.

Anyway, enough of logic! Let’s get operational again. Some programmers argue that i<N is more “robust” – in some, to me, strange sense – against errors. This belief is a nonsense and yet is widely held.

Let’s make a slip-up in our code (for an example where the constant N is 9) in the initialisation of the loop variable i:

for (i= 10; i != N; ++i) {
    /* do stuff with a[i] */
}

Clearly the body of my loop is entered, executed many, many times, and will quite likely crash the program. (In C we can’t really say what will happen, as “undefined behaviour” means exactly that, but you get the picture.)

My program fragment breaks as close as possible to where I made the slip, greatly aiding me in finding it and making the fix required.

Now…the popular:

for (i= 10; i < N; ++i) {
    /* do stuff with a[i] */
}

Here, my slip-up in starting i at 10 instead of 0 goes (locally) undetected, as the loop body is never executed. Millions of further statements might be executed before something goes wrong with the program. Worse, it may not even crash later but instead produce an answer you believe to be correct.

I find it fascinating that if you search the web for articles about this, the i<N form is often strongly argued for on the grounds that it avoids a crash or undefined behaviour. Perhaps, like much of probability theory, this is one of those bits of programming theory that is far from intuitive.

Giants of programming theory, such as David Gries and Edsger Dijkstra, wrote all this up long ago. The most readable account (to me) came from Gries, building on Dijkstra’s work. I remember paying a lot of money for his book – The Science of Programming – back in 1981. It is now freely available online. See page 181 for his wording of the explanation above. The Science of Programming is an amazing book. It contains both challenging formal logic and also such pragmatic advice as: in languages that use the equal sign for assignment, use asymmetric layout to make such assignments stand out. In C we would write

var= expr;

rather than

var = expr; /* as most people again still do */

The visible signal I get from writing var= expr has stopped me from ever making the = for == mistake in C-like languages.

May 6th, 2014

This is a guest article written by friend and colleague, Ian Cottam.

This brief guest piece for Walking Randomly was inspired by reading about some of the Hackday outputs at the recent SSI collaborative workshop CW14 held in Oxford. I wasn’t there, but I gather that some of the outputs from the day examined source code for various properties (perhaps a little tongue-in-cheek in some cases).

So, my also slightly tongue-in-cheek question is: given a piece of source code written in a language with while loops, how do you know if the author is a computer scientist by education/training?

I’ll use C as my language and note that “for loops” in C are basically syntactic sugar for while loops (allowing one to gather the initialisation, guard and increment parts neatly together). In other languages “for loops” are closer to Fortran’s original iterative “do loop”. Also, I will work with that subset of code fragments that obey traditional structured (one-entry, one-exit) programming constructs. If I didn’t, perhaps one could argue, as famously Dijkstra originally did, that the density of “goto” statements, even when spelt “break” or “continue”, etc., might be a deciding quality factor.

(Purely as an aside, I note that Linux (and related free/open source) contributors seem to use goto fairly freely as an exception case mechanism; and they might well have a justification. The density of gotos in Apple’s SSL code was illustrated recently by the so-called “goto fail” bug. See also Knuth’s famous article on this subject.)

In my own programming, I know from experience that if I use a goto, I find it so much more difficult to reason logically (and non-operationally) about my code that I avoid them. Whenever I have used a programming language without the goto statement, I have never missed it.

Now, finally, to the point at hand: suppose one is processing the elements of a one-dimensional array of length N. The C convention is that the index goes from 0 to N-1. Code fragment A below is written by a non-computer scientist, whereas B is written by a computer scientist.

/* Code fragment A */
for (i= 0; i < N; ++i) {
    /* do stuff with a[i] */
}

/* Code fragment B */
for (i= 0; i != N; ++i) {
    /* do stuff with a[i] */
}

The only difference is the loop’s guard: i<N versus i!=N.

As a computer scientist by training I would always write B; which would you write?

I would – and will in a follow-up – argue that B is better, even though I am not saying that code fragment A is incorrect. In that follow-up I will also acknowledge the computer scientist who first pointed this out – at least to me – some 33 years ago.

April 4th, 2014

A colleague recently sent me the following code snippet in R

> a=c(1,2,3,40)
> b=a[1:10]
> b
[1]  1  2  3 40 NA NA NA NA NA NA

The fact that R didn’t issue a warning upset him, since exceeding array bounds, as we did when we created b, is usually a programming error.

I’m less concerned and simply file the above away in an area of my memory entitled ‘Odd things to remember about R’ — I find that most programming languages have things that look odd when you encounter them for the first time. With that said, I am curious as to why the designers of R thought that the above behaviour was a good idea.
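For comparison, here’s a quick sketch (not from the original exchange) of how Python with Numpy handles the same situation: a scalar index out of bounds raises an error immediately, although a slice past the end is silently truncated.

import numpy as np

a = np.array([1, 2, 3, 40])

# A scalar index beyond the end raises an error straight away
try:
    print(a[9])
except IndexError as e:
    print('IndexError:', e)

# A slice beyond the end, however, is silently truncated: no warning, no padding
print(a[0:10])   # [ 1  2  3 40]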

Does anyone have any insights here?

March 18th, 2014

Research Software Engineers (RSEs) are the people who develop software in academia: the ones who write code, but not papers. The Software Sustainability Institute (a group of which I am a Fellow) believes that Research Software Engineers lack the recognition and reward they deserve for their contribution to research. A campaign website – with more information – launched last week:

http://www.rse.ac.uk

The campaign has had some early successes and has been generating publicity for the cause, but nothing will change unless the Institute can show that a significant number of Research Software Engineers exist.

Hence this post. If you agree with the issues and objectives on the website, please sign up to the mailing list. If you know of any other Research Software Engineers, please pass this post on to them.

February 28th, 2014

A lot of people don’t seem to know this…and they should. When working with floating point arithmetic, it is not necessarily true that a+(b+c) = (a+b)+c, because every floating point operation rounds its result to the nearest representable number, so the grouping of the operations matters. Here is a demo using MATLAB:

>> x=0.1+(0.2+0.3);
>> y=(0.1+0.2)+0.3;
>> % are they equal?
>> x==y

ans =
     0

>> % let's look
>> sprintf('%.17f',x)
ans =
0.59999999999999998

>> sprintf('%.17f',y)
ans =
0.60000000000000009

These results have nothing to do with the fact that I am using MATLAB. Here’s the same thing in Python:

>>> x=(0.1+0.2)+0.3
>>> y=0.1+(0.2+0.3)
>>> x==y
False
>>> print('%.17f' %x)
0.60000000000000009
>>> print('%.17f' %y)
0.59999999999999998
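
One way to see why is to look at the exact values that the literals 0.1, 0.2 and 0.3 actually store as binary doubles. Here’s a quick sketch using Python’s standard decimal module (not from the original post):

from decimal import Decimal

# None of these literals is exactly representable as a binary double
for v in (0.1, 0.2, 0.3):
    print(Decimal(v))

# Each addition rounds to the nearest double, so the grouping matters
print(Decimal((0.1 + 0.2) + 0.3))
print(Decimal(0.1 + (0.2 + 0.3)))

The exact expansions show that the two groupings round differently on the final addition, which is where the discrepancy in the last digits comes from.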

If this upsets you, or if you don’t understand why, I suggest you read the following

Does anyone else out there have suggestions for similar resources on this topic?