This post is also published over at the Software Sustainability Institute.
William Stein, lead developer of the computer algebra system, Sage, and its cloud-based spin-off, SageMathCloud, recently announced that he was quitting academia to go and form a company. In his talk, William says ‘I can’t figure out how to create Sage in academia. The money isn’t there. The mathematical community doesn’t care enough. The only option left is for me to build a company.’
His talk is below and slides are at http://wstein.org/talks/2016-06-sage-bp/bp.pdf
“Every great open source math library is built on the ashes of someone’s academic career.”
William’s departure is not unique. Here’s a tweet from Wes McKinney, creator of pandas, one of the essential data science tools for Python.
One of the major reasons I dropped out of my PhD was because I didn’t believe academia could properly value software contributions
— Wes McKinney (@wesmckinn) June 11, 2016
We are looking for similar stories: good research software people who felt that they had to leave academia because there wasn’t enough support, recognition or funding. Equally, we want to hear from you if you think academia is a rewarding environment for software development. Either way, please contact us at email@example.com
The High Performance Computing system at the University of Sheffield has several different filesystems available to it. We have:
- /fastdata – A Lustre-based, shared filesystem with hundreds of terabytes of space. No backup. No quota.
- /data – An NFS filesystem where each user has access to 100GB of storage. Backups go back 7 days.
- /home – An NFS filesystem where each user has 10GB. Backed up over 28 days. Mirrored.
- /scratch – Local disk on each worker node. No backup. Uses ext4.
Lots of options with differing amounts of space, backup policies and, as I’m about to demonstrate, performance characteristics. I suspect that many other HPC systems have a similar setup.
On our system, it’s very tempting to do everything in /fastdata. There’s lots of space, no quota, and it is readable from all worker nodes simultaneously. Good times! I try to encourage people to think about what they are doing, however. Bad things can happen if the Lustre filesystem is hammered too much. Also, there can be a huge difference in performance for some operations across different filesystems.
Let’s take an example. I want to download and untar gcc 4.9.2. How long does that take on three of these filesystems?
On the scratch directory of a worker node
    cd /scratch
    mkdir testing123
    cd testing123
    wget ftp://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-4.9.2/gcc-4.9.2.tar.gz
    time tar xfz ./gcc-4.9.2.tar.gz

    real 0m6.237s
    user 0m5.302s
    sys 0m3.033s
On the Lustre filesystem
    cd /fastdata/
    mkdir testing123
    cd testing123
    wget ftp://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-4.9.2/gcc-4.9.2.tar.gz
    time tar xfz ./gcc-4.9.2.tar.gz

    real 7m18.170s
    user 0m6.751s
    sys 0m56.802s
On the NFS filesystem
    cd /data/myusername
    mkdir testing123
    cd testing123
    wget ftp://ftp.mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-4.9.2/gcc-4.9.2.tar.gz
    time tar xfz ./gcc-4.9.2.tar.gz

    real 16m37.343s
    user 0m6.052s
    sys 0m23.438s
For this particular operation, there is a difference of roughly two orders of magnitude between the worst option (NFS) and the best (local scratch).
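To make the comparison concrete, the three `real` times reported above can be converted to seconds and divided. This is only a small sketch of the arithmetic; the helper names `to_seconds` and `ratio` are my own and not part of any tool.

```shell
# Convert a duration printed by `time` (for example "16m37.343s") to seconds.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times larger the first number of seconds is than the second.
ratio() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.0f\n", slow / fast }'
}

# The `real` times measured above.
scratch=$(to_seconds 0m6.237s)
lustre=$(to_seconds 7m18.170s)
nfs=$(to_seconds 16m37.343s)

echo "Lustre vs scratch: $(ratio "$lustre" "$scratch")x"   # 70x
echo "NFS vs scratch:    $(ratio "$nfs" "$scratch")x"      # 160x
```

So Lustre was about 70 times slower than local scratch for this job, and NFS about 160 times slower.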
I’m not an expert in filesystems and I have no idea what’s causing these differences or if I’d see a similar speed difference given a different file operation. I currently have no interest in doing a robust set of benchmarks. The point I’m making is that if you are using a system that has multiple filesystems it may be worth checking if there’s an advantage to using one over the other for your particular use case.
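If you want to run the same quick check on your own system, something like the following would do. It is a rough sketch rather than a benchmark: the function name `time_untar` is made up for this post, the directories you pass in are whatever your site provides, and timings have whole-second resolution and are affected by caching and by whatever else the system is doing at the time.

```shell
# time_untar TARBALL DIR...
# Extract TARBALL into a temporary directory under each DIR, print the
# elapsed wall-clock seconds for each one, then remove the extracted files.
# The body runs in a subshell so its variables do not leak into the caller.
time_untar() (
    tarball=$1
    shift
    for dir in "$@"; do
        # Skip any directory we cannot create a working area in.
        workdir=$(mktemp -d "$dir/untar-test.XXXXXX") || continue
        start=$(date +%s)
        tar xzf "$tarball" -C "$workdir"
        end=$(date +%s)
        echo "$dir $((end - start))s"
        rm -rf "$workdir"
    done
)

# Example call (the paths are assumptions, adjust them for your system):
# time_untar "$PWD/gcc-4.9.2.tar.gz" /scratch /fastdata /data/myusername
```

Run it a few times, and at different times of day, before drawing any conclusions from the numbers.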
I was recently invited to a Schloss Dagstuhl Workshop on ‘Engineering Academic Software’ by organisers Carole Goble, James Howison, Claude Kirchner and Oscar M. Nierstrasz. One week of geeking-out with research software people from all over the world in lovely surroundings with as much beer and cheese as you can eat — sounds good to me!
I gave a presentation about life on the frontline of Research Software Engineering support, or the RSE Accident and Emergency department as I sometimes think of it. I spent some time discussing Sheffield’s new Research Software Engineering group, formed by Paul Richmond and me off the back of our EPSRC Research Software Engineering Fellowships. I also discussed a worrying trend I’ve noticed in research software: top people are leaving academia for industry, not because they want to but because of a lack of support! Slides for my talk are at https://mikecroucher.github.io/dagstuhl_RSE_Sheffield/#/.
I love attending seminars like this because I get to learn about all of the wonderful things that the community is up to. Personal highlights included:
Effective Computation in Physics
Meeting Katy Huff, co-author of my favourite Python book, Effective Computation in Physics. The only problem with this book is the word ‘physics’ in the title since it suggests that it’s only useful if you are a physicist. Totally not the case! If you are doing science in Python, get this book! Fellow blogger John D Cook interviewed both authors of the book back in 2015 – see the write-up at http://www.johndcook.com/blog/2015/08/08/effective-computation-in-physics/.
Learning about the Software Heritage project that launched very recently. The project harvests and archives projects from various locations: GitHub, Debian and the GNU Project for now. They say that ‘we preserve software, because it contains our technical and scientific knowledge.’ It’s shaping up to be a ‘Library of Alexandria of Software’. The full mission statement is over at https://www.softwareheritage.org/mission/
Software citation and credit
There was a lot of discussion about considering software as a first-class scientific output and several projects were mentioned that help the situation. The FORCE11 software citation principles address how software should be cited and depsy.org is ‘an open-source webapp that tracks research software impact’. Dan Katz’s blog post ‘How should we add citations inside software’ is also worth a read.
What did we talk about?
Many of the participants are active on Twitter so there was a lot of live tweeting. The Twitter hashtag for the workshop was #dagstuhleas. It’s been hijacked by spammers recently but there is a lot of great content there – https://twitter.com/search?q=dagstuhleas&src=typd
I’m not the only attendee to write about this workshop:
Alice Allen of the Astrophysics Source Code Library has written a day-by-day account of the workshop that perfectly captures what it’s like to attend a Dagstuhl seminar.
- Engineering Academic Software at Schloss Dagstuhl – Introduction
- Engineering Academic Software, Schloss Dagstuhl Day 1
- Engineering Academic Software, Schloss Dagstuhl Day 2
- Engineering Academic Software, Schloss Dagstuhl Day 3
- Engineering Academic Software, Schloss Dagstuhl Day 4
The Software Sustainability Institute was also present in force. See what they had to say over at http://www.software.ac.uk/blog/2016-07-06-dagstuhl-perspectives-workshop-engineering-academic-software
Slides for all presentations can be found at http://materials.dagstuhl.de/index.php?semnr=16252