## Archive for the ‘R’ Category

A recent trend on Facebook is to create a wordcloud of all of your posts using an external service. I chose not to use it because I tend to use Facebook for personal interactions among close friends and I didn’t want to send all of my data to another external company.

Twitter is a different matter, however! All of the data is open and it’s very easy to write a computer program to generate Twitter word clouds without the need for an external service.

I wrote a simple script in R that generates a wordcloud from the most recent 3200 tweets and outputs the top 200 words (get the code on github). The script removes many of the uninteresting words such as the, of, and that would otherwise dominate the cloud. These stopwords come from the Top100Words list of the R package qdap but I also added a few more such as ‘just’ and ‘me’ that I seem to use a lot.
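The counting step at the heart of such a script can be sketched in base R alone. This is only an illustration of the idea and not the code from the github repository: the `tweets` vector is a made-up stand-in for downloaded tweets, and the stopword list is a tiny hypothetical one rather than qdap’s Top100Words.

```r
# Hypothetical stand-in for downloaded tweets (the real script fetches them from Twitter)
tweets <- c(
  "Just released the new version of the Python code",
  "Research software and data: the new Python workshop",
  "Me and the data again"
)

# A tiny illustrative stopword list (the post uses qdap's Top100Words plus extras)
stopwords <- c("the", "of", "and", "just", "me")

# Lower-case everything, split on anything that is not a letter, drop empty tokens
words <- unlist(strsplit(tolower(tweets), "[^a-z]+"))
words <- words[nzchar(words)]

# Remove the stopwords, count what is left and keep the 200 most frequent words
words <- words[!(words %in% stopwords)]
freq <- sort(table(words), decreasing = TRUE)
top_words <- head(freq, 200)
print(top_words)
```

The resulting frequency table is the kind of input a wordcloud plotting function would need: one word per entry along with its count.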

This is the current wordcloud for my twitter account, walkingrandomly. Click on the image to see a bigger version. My main interests are very clear – Python programming, research software, data and anything that’s new!

Once I had seen my wordcloud, I wondered how things would look for other twitter users who I pay a lot of attention to. This is how it looks for Manchester University’s Nick Higham. Clearly he’s big on SIAM, Manchester, and Matrix Analysis!

I then looked at my manager at Sheffield University, Neil Lawrence. Neil finds data and the city of Sheffield very important and also writes about workshops, science, blog posts and machine learning a lot.

The R code that generated these wordclouds is available on github but it won’t work out of the box. You’ll need to register with twitter for app development (it’s free and fairly straightforward) and get various access keys before you can use the code.

I recently found myself in need of a portable install of the Jupyter notebook which made use of a portable install of R as the compute kernel. When you work in institutions that have locked-down managed Windows desktops, such portable installs can be a life-saver! This is particularly true when you are working with rapidly developing projects such as Jupyter and IRKernel.

It’s not perfect but it works for the fairly modest requirements I had for it. Here are the steps I took to get it working.

**Download and install Portable Python**

I downloaded Portable Python 2.7.6.1 from http://portablepython.com/ and installed into a directory called **Portable Python 2.7.6.1**

**Update IPython and install the extra modules we need**

This version of Portable Python comes with a portable IPython instance but it is too old to support alternative kernels. As such, we need to install a newer version.

Open a **cmd.exe** command prompt and navigate to **Portable Python 2.7.6.1\App\Scripts**.

Enter the command

    easy_install ipython

You’ll now find that you can launch the ipython.exe terminal from within this directory:

    C:\Users\walkingrandomly\Desktop\Portable Python 2.7.6.1\App\Scripts>ipython
    Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
    Type "copyright", "credits" or "license" for more information.

    IPython 3.1.0 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]: exit()

If you try to launch the notebook, however, you’ll get error messages. This is because we haven’t taken care of all the dependencies. Let’s do that now. Ensuring you are still in the **Portable Python 2.7.6.1\App\Scripts** folder, execute the following commands.

    easy_install pyzmq
    easy_install jinja2
    easy_install tornado
    easy_install jsonschema

You should now be able to launch the notebook using

ipython notebook

**Install portable R and IRKernel**

- I downloaded Portable R 3.2 from http://sourceforge.net/projects/rportable/files/ and installed into a directory called **R-Portable**
- Move this directory into the Portable Python directory. It needs to go inside **Portable Python 2.7.6.1\App** (see this discussion to learn how I discovered that this location was the correct one)
- Launch the Portable R executable which should be at **Portable Python 2.7.6.1\App\R-Portable\R-portable.exe** and install the IRKernel packages by doing

install.packages(c("rzmq","repr","IRkernel","IRdisplay"), repos="http://irkernel.github.io/")

**Install additional R packages**

The version of Portable R I used didn’t include various necessary packages. Here’s how I fixed that.

Launch the Portable R executable, which should be at **Portable Python 2.7.6.1\App\R-Portable\R-portable.exe**, and install the following packages

    install.packages('digest')
    install.packages('uuid')
    install.packages('base64enc')
    install.packages('evaluate')
    install.packages('jsonlite')
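If you repeat this on several machines, it can be handy to install only what is actually missing. Here is a minimal sketch of that idea; `missing_packages` is a hypothetical helper name of my own, and the `interactive()` guard is an assumption on my part so that nothing gets downloaded when the code is run as a batch script.

```r
# The packages that this version of Portable R lacked
needed <- c("digest", "uuid", "base64enc", "evaluate", "jsonlite")

# Hypothetical helper: which of the needed packages are not yet installed?
missing_packages <- function(needed, installed = rownames(installed.packages())) {
  setdiff(needed, installed)
}

to_install <- missing_packages(needed)
print(to_install)

# Only install when sitting at an R prompt and something is actually missing
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```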

**Install the R kernel file**

Create the directory structure **Portable Python 2.7.6.1\App\share\jupyter\kernels\R_kernel**

Create a file called **kernel.json** that contains the following

    {"argv": ["R-Portable/App/R-Portable/bin/i386/R.exe", "-e", "IRkernel::main()", "--args", "{connection_file}"],
     "display_name": "Portable R"
    }

This file needs to go in the **R_kernel** directory created earlier. Note that the kernel location specified in kernel.json uses Linux style forward slashes in the path rather than the backslashes that Windows users are used to. I found that this was necessary for the kernel to work – it was ignored by the notebook otherwise.
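Since the forward slashes turned out to matter, here is a small sketch of how the file could be generated from R rather than typed by hand. It is not part of the procedure I followed: `to_forward_slashes` is a hypothetical helper, and the file is written to a temporary directory purely for illustration, so you would swap in the real **R_kernel** directory.

```r
# Hypothetical helper: turn Windows-style backslashes into forward slashes
to_forward_slashes <- function(path) gsub("\\", "/", path, fixed = TRUE)

# The kernel location used in this post, written Windows-style and then converted
r_exe <- to_forward_slashes("R-Portable\\App\\R-Portable\\bin\\i386\\R.exe")

# Assemble the kernel specification line by line
kernel_spec <- c(
  "{",
  sprintf('  "argv": ["%s", "-e", "IRkernel::main()", "--args", "{connection_file}"],', r_exe),
  '  "display_name": "Portable R"',
  "}"
)

# Illustration only: write into a temporary folder instead of the real R_kernel directory
kernel_dir <- file.path(tempdir(), "R_kernel")
dir.create(kernel_dir, showWarnings = FALSE, recursive = TRUE)
kernel_file <- file.path(kernel_dir, "kernel.json")
writeLines(kernel_spec, kernel_file)
cat(readLines(kernel_file), sep = "\n")
```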

**Finishing off**

Everything created so far, including R, is in the folder **Portable Python 2.7.6.1**

I created a folder called **PortableJupyter** and put the **Portable Python 2.7.6.1** folder inside it. I also created the folder **PortableJupyter\notebooks** to allow me to carry my notebooks around with the software that runs them.

There is a bug in Portable Python 2.7.6.1 relating to scripts like IPython.exe that have been installed using easy_install. In short, they stop working if you move the directory they’re installed in – breaking portability somewhat! (Details here)

The workaround is to launch IPython by running the script **Portable Python 2.7.6.1\App\Scripts\ipython-script.py**

I didn’t want to bother with that so I created a shortcut in my **PortableJupyter** folder called **Launch notebook**. The target of this shortcut was the following line

%windir%\system32\cmd.exe /c "cd notebooks && "%CD%/Portable Python 2.7.6.1/App\python.exe" "%CD%/Portable Python 2.7.6.1\App\Scripts\ipython-script.py" notebook"

This starts the notebook using the default web browser and puts you in the notebooks directory.

**The pay off**

My folder looks like this:

If I click on the Launch Notebook shortcut, I get a Jupyter session with 2 kernel options

I can choose the Portable R kernel and start using R in the notebook!

RLink is Mathematica’s interface to the R language – a feature that has been extremely popular since its debut in Mathematica version 9. It’s a great package but has one or two issues. For example, RLink makes use of a built in version of R which is currently stuck at the rather old version 2.14. Official support for the use of external versions of R and adding third-party libraries varies by operating system and version of Mathematica. Windows support is great — OS X support, not so much.

Expert Mathematica user Szabolcs Horvát has written the definitive guide on how to get RLink up and running with the latest version of R on all three major operating systems, building on earlier work by Leonid Shifrin and members of the Mathematica Stack Exchange community. Thanks to this work, we can now enjoy any version of R we like with Mathematica!

A colleague recently sent me the following code snippet in R

    > a=c(1,2,3,40)
    > b=a[1:10]
    > b
     [1]  1  2  3 40 NA NA NA NA NA NA

The fact that R didn’t issue a warning upset him since exceeding array bounds, as we did when we created b, is usually a programming error.

I’m less concerned and simply file the above away in an area of my memory entitled ‘Odd things to remember about R’ — I find that most programming languages have things that look odd when you encounter them for the first time. With that said, I am curious as to why the designers of R thought that the above behaviour was a good idea.

Does anyone have any insights here?
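For anyone who would rather have an error than a vector padded with NA, a check is easy enough to write yourself. The following is a minimal sketch; `strict_index` is a hypothetical helper of my own and not something that ships with R, and it only handles positive integer indices.

```r
a <- c(1, 2, 3, 40)

# R's default behaviour: positions beyond the end of the vector come back as NA
b <- a[1:10]
print(b)

# Hypothetical helper that refuses positive indices outside the vector
strict_index <- function(x, i) {
  if (any(i < 1 | i > length(x))) {
    stop("index out of bounds")
  }
  x[i]
}

# In-range indexing behaves as usual
print(strict_index(a, 2:3))

# Out-of-range indexing now raises an error, captured here so the script carries on
result <- tryCatch(strict_index(a, 1:10), error = function(e) conditionMessage(e))
print(result)
```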

The R code used for this example comes from Barry Rowlingson, so huge thanks to him.

A question I get asked a lot is ‘How can I do nonlinear least squares curve fitting in X?’ where X might be MATLAB, Mathematica or a whole host of alternatives. Since this is such a common query, I thought I’d write up how to do it for a very simple problem in several systems that I’m interested in.

This is the R version. For other versions, see the list below.

- Simple nonlinear least squares curve fitting in Julia
- Simple nonlinear least squares curve fitting in Maple
- Simple nonlinear least squares curve fitting in Mathematica
- Simple nonlinear least squares curve fitting in MATLAB
- Simple nonlinear least squares curve fitting in Python

**The problem**

You have the following 10 data points

    xdata = -2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9
    ydata = 0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001

and you’d like to fit the function

$$y = p_1 \cos(p_2 x) + p_2 \sin(p_1 x)$$

using nonlinear least squares. Your starting guesses for the parameters are p1=1 and p2=0.2

For now, we are primarily interested in the following results:

- The fit parameters
- Sum of squared residuals
- Parameter confidence intervals

Future updates of these posts will show how to get other results. Let me know what you are most interested in.

**Solution in R**

    # construct the data vectors using c()
    xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
    ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)

    # look at it
    plot(xdata,ydata)

    # some starting values
    p1 = 1
    p2 = 0.2

    # do the fit
    fit = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata), start=list(p1=p1,p2=p2))

    # summarise
    summary(fit)

This gives

    Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)

    Parameters:
       Estimate Std. Error t value Pr(>|t|)
    p1 1.881851   0.027430   68.61 2.27e-12 ***
    p2 0.700230   0.009153   76.51 9.50e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.08202 on 8 degrees of freedom

    Number of iterations to convergence: 7
    Achieved convergence tolerance: 2.189e-06

Draw the fit on the plot by getting the prediction from the fit at 200 x-coordinates across the range of xdata

    new = data.frame(xdata = seq(min(xdata),max(xdata),len=200))
    lines(new$xdata,predict(fit,newdata=new))

Getting the sum of squared residuals is easy enough:

sum(resid(fit)^2)

Which gives

[1] 0.0538127

Finally, let’s get the parameter confidence intervals.

confint(fit)

Which gives

    Waiting for profiling to be done...
            2.5%     97.5%
    p1 1.8206081 1.9442365
    p2 0.6794193 0.7209843
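As a sanity check, the numbers above hang together: the residual standard error reported by summary() should be the square root of the sum of squared residuals divided by the residual degrees of freedom, that is sqrt(0.0538127/8) ≈ 0.08202. The sketch below repeats the fit and checks this. The variable names `rss`, `dof` and `sigma_manual` are my own, and I have used `deviance()` here on the understanding that it returns the residual sum of squares for an nls fit.

```r
# Same data and model as above, repeated so that this snippet runs on its own
xdata <- c(-2, -1.64, -1.33, -0.7, 0, 0.45, 1.2, 1.64, 2.32, 2.9)
ydata <- c(0.699369, 0.700462, 0.695354, 1.03905, 1.97389,
           2.41143, 1.91091, 0.919576, -0.730975, -1.42001)
fit <- nls(ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata),
           start = list(p1 = 1, p2 = 0.2))

# Sum of squared residuals, computed by hand and via deviance()
rss <- sum(resid(fit)^2)
rss_check <- deviance(fit)

# 10 data points minus 2 fitted parameters
dof <- df.residual(fit)

# Residual standard error reconstructed from the pieces
sigma_manual <- sqrt(rss / dof)

print(c(rss = rss, deviance = rss_check,
        sigma_manual = sigma_manual, sigma_summary = summary(fit)$sigma))
```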

I was recently working with someone who was running thousands of R jobs on our Condor pool and some of them were failing. As part of the diagnostic process, we needed to determine which machines in the pool were causing the problem. My solution was to add the line

hostname

to the Bash script that eventually called R and the program he was running. This gives output that looks like this

badmachine.ourdomain.ac.uk

His solution was to do exactly the same thing in R. To do this, he added the following line to the beginning of his R script

(Sys.info()["nodename"])

Which produces output that looks like

                        nodename
    "badmachine.ourdomain.ac.uk"

Either way, the job was done.
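If you want the bare machine name without the `nodename` label, for example to write a single tidy line into a job log, double square brackets do the trick. This is a small sketch of my own rather than what my colleague used, and the log message format is just an assumption.

```r
# [[ ]] extracts the value and drops the "nodename" name
host <- Sys.info()[["nodename"]]

# Write one line to standard output so it lands in the job's output file
cat("Running on host:", host, "\n")
```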

R has a **citation()** command that recommends how to cite the use of R in your publications, information that is also included in R’s Frequently Asked Questions document.

    To cite R in publications use:

      R Core Team (2012). R: A language and environment for statistical
      computing. R Foundation for Statistical Computing, Vienna, Austria.
      ISBN 3-900051-07-0, URL http://www.R-project.org/.

    A BibTeX entry for LaTeX users is

      @Manual{,
        title = {R: A Language and Environment for Statistical Computing},
        author = {{R Core Team}},
        organization = {R Foundation for Statistical Computing},
        address = {Vienna, Austria},
        year = {2012},
        note = {{ISBN} 3-900051-07-0},
        url = {http://www.R-project.org/},
      }

    We have invested a lot of time and effort in creating R, please cite it
    when using it for data analysis. See also ‘citation("pkgname")’ for
    citing R packages

This led me to wonder how often people cite the software they use. For example, if you publish the results of a simulation written in MATLAB do you cite MATLAB in any way? How about if you used Origin or Excel to produce a curve fit, would you cite that? Would you cite your plotting software, numerical libraries or even compiler?

The Software Sustainability Institute (SSI), of which I recently became a fellow, has guidelines on how to cite software.
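If you only want the BibTeX part of the citation() output, say to paste into a .bib file, R can hand it over directly. A minimal sketch follows; the entry you get will reflect whichever version of R you are running, so the year and other details will probably differ from the 2012 output shown above.

```r
# The citation object for R itself
r_cite <- citation()

# Convert it to BibTeX; the result is a character vector, one element per line
bib <- toBibtex(r_cite)
print(bib)

# To save it, something like writeLines(bib, "R.bib") would do (file name is just an example)
```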

xkcd is a popular webcomic that sometimes includes hand drawn graphs in a distinctive style. Here’s a typical example

In a recent Mathematica StackExchange question, someone asked how such graphs could be automatically produced in Mathematica and code was quickly whipped up by the community. Since then, various individuals and communities have developed code to do the same thing in a range of languages. Here’s the list of examples I’ve found so far

- xkcd style graphs in Mathematica. There is also a Wolfram blog post on this subject.
- xkcd style graphs in R. A follow up blog post at vistat.
- xkcd style graphs in LaTeX
- xkcd style graphs in Python using matplotlib
- xkcd style graphs in MATLAB. There is now some code on the File Exchange that does this with your own plots.
- xkcd style graphs in javascript using D3
- xkcd style graphs in Euler
- xkcd style graphs in Fortran

Any I’ve missed?

My attention was recently drawn to a Google+ post by JerWei Zhang where he evaluates 2^3^4 in various packages and notes that they don’t always agree. For example, in MATLAB 2010a we have 2^3^4 = 4096 which is equivalent to putting (2^3)^4 whereas Mathematica 8 gives 2^3^4 = 2417851639229258349412352 which is the same as putting 2^(3^4). JerWei’s post gives many more examples including Excel, Python and Google and the result is always one of these two (although to varying degrees of precision).

What surprised me was the fact that they disagreed at all since I thought that the operator precedence rules were an agreed standard across all software packages. In this case I’d always use brackets since _I_ am not sure what the correct interpretation of 2^3^4 should be but I would have taken it for granted that there is a standard somewhere and that all of the big hitters in the numerical world would adhere to it.

Looks like I was wrong!
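Since this post is filed under R, here is how R behaves. Its `^` operator groups from right to left, so R sides with Mathematica here. The variable names in this quick check are my own.

```r
# The two possible readings of 2^3^4
right_assoc <- 2^(3^4)   # 2^81
left_assoc  <- (2^3)^4   # 8^4

# What R does when there are no brackets
unbracketed <- 2^3^4

print(c(unbracketed = unbracketed, right_assoc = right_assoc, left_assoc = left_assoc))

# R groups ^ from right to left, so the unbracketed form matches 2^(3^4)
print(unbracketed == right_assoc)
```

Brackets remain the safest option if the code might ever be ported to another system.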

I was recently given the task of converting a small piece of code written in R, the free open-source programming language heavily used by statisticians, into MATLAB. This was an interesting exercise since I had never coded a single line of R in my life! Fortunately for me, the code was rather simple and I didn’t have too much trouble with it, but other people may not be so lucky since both MATLAB and R can be rather complicated, to say the least. Wouldn’t it be nice if there was a sort of Rosetta Stone that helped you to translate between the two systems?

Happily, it turns out that there is, in the form of The MATLAB / R Reference by David Hiebeler, which gives both the R and MATLAB commands for hundreds of common (and some not so common) operations.

While flicking through the 47-page document I noted that there are a few MATLAB commands for which David hasn’t found an R equivalent (possibly because there simply isn’t one of course). For example, at number 161 of David’s document he lists the MATLAB command

**yy=spline(x,y,xx)**

which he describes as

*‘Fit cubic spline with “not-a-knot” conditions (the first two piecewise cubics coincide, as do the last two), to points (xi, yi) whose coordinates are in vectors x and y; evaluate at points whose x coordinates are in vector xx, storing corresponding y’s in yy.’*

At the moment David doesn’t know of an R equivalent so if you are an R master then maybe you could help out with this extremely useful document?
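As a possible starting point for anyone who wants to take this up: base R does have a spline() function in the stats package, with a choice of methods such as "fmm" and "natural". I am not claiming that any of these reproduces MATLAB’s not-a-knot end conditions in general, which is exactly the open question. The sketch below only shows the calling pattern, using made-up data sampled from a single cubic, which the "fmm" method should recover.

```r
# Made-up data sampled from the cubic y = x^3 - 2x
x <- 0:5
y <- x^3 - 2 * x

# Points at which to evaluate the interpolating spline
xx <- c(0.5, 2.5, 4.5)

# Cubic spline interpolation; "fmm" takes its end conditions from
# cubics fitted through the four points at each end of the data
yy <- spline(x, y, xout = xx, method = "fmm")$y

# Compare with the underlying cubic at the same points
print(cbind(xx = xx, spline = yy, exact = xx^3 - 2 * xx))
```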