Natural Scientists: their very big output files – and a tale of diffs

April 7th, 2011 | Categories: Guest posts, programming

A couple of weeks ago my friend and colleague, Ian Cottam, wrote a guest post here at Walking Randomly about his work on interfacing Dropbox with the high throughput computing software, Condor.  Ian’s post was very well received and so he’s back; this time writing about a very different kind of project.  Over to Ian.

Natural Scientists: their very big output files – and a tale of diffs by Ian Cottam

I’ve noticed that natural scientists (as opposed to computer scientists) often write, or use, programs that produce masses of output, usually just numbers. It might be the final results file or, often, some intermediate test results.

Am I being a little cynical in thinking that many users glance at (OK, carefully check) a few of these thousands (millions?) of numbers and then delete the file? Let’s assume such files are really needed. A little automation is often added to the process by keeping a baseline file and running a program that shows you the differences between it and your latest output (assuming wholesale changes are not the norm).

One of the popular difference programs is the Unix diff command. (I’ve been using it since the 1970s.) It popularised the idea of showing minimal differences between two text files (and, later, binary ones too). There are actually a few algorithms for doing minimal difference computation, and GNU diff, for example, uses a different one from the original Bell Labs version. They all have one thing in common: to achieve minimal difference computation the two files being compared must be read into main memory (aka RAM). Back in the 1970s, and on my then department’s PDP-11/40, this restricted the size of the files considerably; not that we noticed much as everything was 16-bit and “small was beautiful”. GNU diff, on modern machines, can cope with very big files, but still chokes if your files, in aggregate, approach the gigabyte range.
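To make the memory requirement concrete, here is a minimal sketch of a minimal-difference computation in Python. It uses the textbook longest-common-subsequence table, which is an assumption made for illustration: it is not the algorithm used by the Bell Labs diff or by GNU diff, and the name lcs_diff is made up. Note that both inputs are complete lists of lines, and the table grows with the product of their lengths.

```python
def lcs_diff(old, new):
    """Return a minimal edit script as a list of (tag, line) pairs.

    Both `old` and `new` must be complete lists of lines held in memory.
    """
    n, m = len(old), len(new)
    # table[i][j] = length of the longest common subsequence
    # of old[i:] and new[j:]; it has (n + 1) * (m + 1) entries.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left corner, emitting deletions ("-")
    # and insertions ("+") and skipping lines common to both files.
    script = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            script.append(("-", old[i]))
            i += 1
        else:
            script.append(("+", new[j]))
            j += 1
    script.extend(("-", line) for line in old[i:])
    script.extend(("+", line) for line in new[j:])
    return script


baseline = ["1.0", "2.0", "3.0", "4.0"]
latest = ["1.0", "2.5", "3.0", "4.0", "5.0"]
print(lcs_diff(baseline, latest))
# prints [('-', '2.0'), ('+', '2.5'), ('+', '5.0')]
```

Real diff implementations are far more economical than this table, but they share the property that matters here: the whole of both files has to be available at once.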

(As a bit of an aside: a major contribution of all the GNU tools, over what we had been used to from the Unix pioneers at Bell Labs, was that they removed all arbitrary size restrictions; the most common and frustrating one being the length of text lines.)

Back to the 1970s again: Bell Labs did know that some files were too big to compare with diff, so they also shipped diffh. The h stands for halfhearted. Diffh does not promise minimal differences, but can be very good for big files with occasional differences. It makes a pass over the files using a simple ‘sliding window’ heuristic (heuristic being the other word that the h is sometimes said to stand for). You can still find an old version of diffh on the Internet. However, it ‘gives up’ rather easily and you may have to spend some time modifying it for, e.g., 64-bit ints and pointers, as well as for compiling with modern C compilers. Other tools exist that can compare enormous files, but they don’t produce readable output in diff’s pleasant format.
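The sliding-window idea can be sketched in a few lines of Python. This is an illustration of the general technique only, not a transcription of diffh: the function name halfhearted_diff, the window parameter, and the choice to carry on rather than give up when no match is found are all assumptions. For brevity it takes lists of lines; the point is that after a mismatch it never looks more than `window` lines ahead in either file.

```python
def halfhearted_diff(old, new, window=3):
    """Compare two sequences of lines with a bounded lookahead.

    On a mismatch, search at most `window` lines ahead in each input
    for the nearest pair of equal lines and resynchronise there.
    The result is not guaranteed to be a minimal difference.
    """
    out = []
    i = j = 0
    while i < len(old) and j < len(new):
        if old[i] == new[j]:
            i += 1
            j += 1
            continue
        # Try resync points in order of increasing total lines skipped.
        resync = None
        for total in range(1, 2 * window + 1):
            for di in range(max(0, total - window), min(total, window) + 1):
                dj = total - di
                if (i + di < len(old) and j + dj < len(new)
                        and old[i + di] == new[j + dj]):
                    resync = (di, dj)
                    break
            if resync is not None:
                break
        if resync is None:
            # Nothing matches inside the window: report one changed
            # line on each side and keep going instead of giving up.
            resync = (1, 1)
        di, dj = resync
        out.extend(("-", line) for line in old[i:i + di])
        out.extend(("+", line) for line in new[j:j + dj])
        i += di
        j += dj
    # Whatever is left over in either input is a pure deletion or insertion.
    out.extend(("-", line) for line in old[i:])
    out.extend(("+", line) for line in new[j:])
    return out


baseline = ["1.0", "2.0", "3.0", "4.0"]
latest = ["1.0", "2.5", "3.0", "4.0", "5.0"]
print(halfhearted_diff(baseline, latest))
# prints [('-', '2.0'), ('+', '2.5'), ('+', '5.0')]
```

With occasional, isolated differences this gives the same answer a minimal diff would. When a run of differences is longer than the window, the output is still correct but no longer minimal, which is the trade-off the halfhearted name admits to.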

A few years back, when a user at the University of Manchester asked for help with the ‘diff – files too big / out of memory’ problem, I wrote a modern version that I called idiffh (for Ian’s diffh). My ground rules were:

  • Work on any text files on any operating system with a C compiler
  • Have no limits on, e.g., line lengths or file size
  • Never ‘give up’ if the going gets tough (i.e. when the files are very different)

You won’t find this with the GNU collection as they like to write code to the Unix interface and I like to write to the C standard I/O interface (see the first bullet point above).

An interesting implementation detail is that it uses random access to the files directly, relying on the operating system’s cache of file blocks to make this tolerably efficient. Waiting a minute or two to compare gigabyte sized files is usually acceptable.
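As a rough illustration of that random-access style, the Python sketch below reads a few lines ahead and then rewinds, so nothing has to be kept in the program's own memory between comparisons. It is not taken from idiffh: the helper name peek_lines is hypothetical, and Python's tell() and seek() merely stand in for whatever positioning calls the C code uses.

```python
import tempfile


def peek_lines(f, count):
    """Read up to `count` lines ahead, then rewind.

    The file position is restored before returning, so the caller
    carries on from where it was. Repeated reads of the same region
    are expected to be served from the operating system's block cache.
    """
    start = f.tell()
    lines = []
    for _ in range(count):
        line = f.readline()
        if not line:  # end of file
            break
        lines.append(line)
    f.seek(start)
    return lines


# Small demonstration on a temporary file (binary mode by default).
with tempfile.TemporaryFile() as f:
    f.write(b"1.0\n2.0\n3.0\n")
    f.seek(0)
    first = f.readline()      # consume the first line
    ahead = peek_lines(f, 5)  # look ahead without consuming
    following = f.readline()  # still the second line
    print(first, ahead, following)
```

The design choice is to trade some repeated reading for a memory footprint that does not grow with file size, which is what makes gigabyte-sized inputs feasible.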

As the comments in the code say, you can get some improvements by conditional compilation on Unix systems and especially on BSD Systems (including Apple Macs), but you can compile it straight, without such, on Windows and other platforms.

If you would like to try idiffh out you can download it here.

  1. April 8th, 2011 at 11:23

    Very interesting indeed.
    Now, as a Natural scientist (well, some people consider Physics an exact science…bwahahaha), I would like to make an observation. The textual difference of a stream of numbers is not usually what gives some hint about the process under study.
    We usually look for spatiotemporal properties of the stream of data, and usually we plot (in one of several possible visualizations) two ‘different’ streams to observe their differences.
    Unless it is absolutely core to the point, I never look at the particular numbers at all. One would do it if there is solid ground for comparison of the particular numeric value; otherwise it is the behavior of the stream that we are interested in! And one can always plot the ‘true’ values and the stream we are talking about and see if the plots match. That is usually much faster and provides a lot more information about what is wrong, what is good and what is missing in the process that produced the stream.

    Thanks for your post

  2. April 8th, 2011 at 14:11

    diffh.c is no longer available from the location you point to. It is, however, available from here: http://dop221.astron.nl/unix/PDP-11/Trees/V7/usr/src/cmd/diffh.c

  3. Ian Cottam
    April 8th, 2011 at 14:54

    @JuanPi
    Good point! Many thanks -Ian

  4. Ian Cottam
    April 8th, 2011 at 15:07

    @adamo
    A better link too, as that is the original Bell Labs version.
    Many thanks -Ian

  5. April 8th, 2011 at 15:16

    @adamo @Ian
    I’ve updated the link in the main text,
    Mike

  6. July 28th, 2011 at 20:53

    I downloaded the code, put it into a git repo, added a makefile, and enabled system getln on Debian. See git://github.com/barak/idiffh