Walking Randomly

MATLAB Tutorial – Reading csv files

May 21st, 2010 | Categories: matlab, programming, tutorials | Tags:

Reading comma separated value (csv) files into MATLAB is trivial as long as the csv file you are trying to import is trivial. For example, say you wanted to import the file very_clean.txt which contains the following data

1031,-948,-76
507,635,-1148
-1031,948,750
-507,-635,114

The following, very simple command, is all that you need

>> veryclean = csvread('very_clean.txt')

veryclean =

        1031        -948         -76
         507         635       -1148
       -1031         948         750
        -507        -635         114

In the real world, however, your data is rarely this nice and clean. One of the most common problems faced by MATLABing data importers is that of header lines. Take the file quite_clean.txt for instance. This is identical to the previous example apart from the fact that it contains two header lines

These are some data that I made using my hand-crafted code
Date:12th July 1996
1031,-948,-76
507,635,-1148
-1031,948,750
-507,-635,114

This is all too much for the csvread command

>> data=csvread('quite_clean.txt')
??? Error using ==> dlmread at 145
Mismatch between file and format string.
Trouble reading number from file (row 1, field 1) ==> This

Error in ==> csvread at 52
    m=dlmread(filename, ',', r, c);

Not to worry, we can just use the more capable importdata command instead

>> quiteclean = importdata('quite_clean.txt')

quiteclean = 

        data: [4x3 double]
    textdata: {2x1 cell}

The result above is a two element structure array and our numerical values are contained in a field called data. Here’s how you get at it.

>> quiteclean.data

ans =

        1031        -948         -76
         507         635       -1148
       -1031         948         750
        -507        -635         114

So far so good. How do we handle a file like messy_data.txt though?

header 1;
header 2;
1031,-948,-76, ,"12"
507,635,-1148, ,"34"
-1031,948,750, ,"45"
-507,-635,114, ,"67"

This is the kind of file encountered by Walking Randomly reader ‘reen’ and it contains exactly the same numerical values as the previous two examples. Unfortunately, it also contains some cruft that makes life more difficult for us. Let’s bring out the big-guns!

Using textscan to import csv files in MATLAB

When the going gets tough, the tough use textscan. Here’s the incantation for importing messy_data.txt

fid=fopen('messy_data.txt');
data = textscan(fid,'%f %f %f %*s %*s','HeaderLines',2,'Delimiter',',','CollectOutput',1);
fclose(fid)

The result is a one-element cell array that contains an array of doubles. Let’s get the array of doubles out of the cell

>> data=data{1}
data =
        1031        -948         -76
         507         635       -1148
       -1031         948         750
        -507        -635         114

If the importdata command is a chauffeur then textscan is a Ferrari and I don’t know about you but I’d much rather be driving my own Ferrari than being chauffeured around (John Cook over at The Endeavour has more to say on Ferraris and Chauffeurs).

Let’s de-construct the above set of commands. The first thing to notice is that, unlike csvread and importdata, you have to explicitly open and close your file when using the textscan command. So, you open your file using fopen and give it a file ID (which in this example is fid).

fid=fopen('messy_data.txt');

The first argument to textscan is just this file ID, fid. Next you need to supply a conversion specifier which in this case is

'%f %f %f %*s %*s'

The conversion specifier tells textscan what you want each row in your csv file to be converted to. %f means “64 bit double” and %s means “string” so ‘%f %f %f %s %s’ means “3 doubles followed by 2 strings” (we’ll get onto the asterisks in the original specifier later). You can use all sorts of data types in a conversion specifier such as integers, quoted strings and pattern matched strings among others. Check out the MATLAB documentation for textscan for the full list but an abbreviated list is shown below:

%d signed 32bit integer
%u unsigned 32bit integer
%f 64bit double (you'll want this most of the time when using MATLAB)
%s string

Now, in the command I used to import messy_data.txt the conversion specifier contained some asterisks such as %*s so what do these mean? Quite simply, the asterisk just means ‘ignore’ so %*s means ‘ignore the string in this field’. So, the full meaning of my conversion specifier ‘%f %f %f %*s %*s’ is “read 3 doubles and ignore 2 strings” and textscan will do this for every row.

The rest of the command is pretty self explanatory but I’ll explain it anyway for the sake of completeness

'HeaderLines',2

The file has 2 headerlines which should be ignored

'Delimiter',','

The fields are delimited (a posh word for separated) by a comma

'CollectOutput',1

If you supply a 1 (which stands for True) to the CollectOutput option then textscan will join consecutive output cells with the same data type into a single array. Since I want all of my doubles to be in a single array then this is the behaviour I went for.

Finally, once you have finished textscanning, don’t forget to close your file

fclose(fid)

That’s pretty much it for this mini-tutorial – I hope you find it useful.

Leave a comment | Trackback

biao

May 21st, 2010 at 15:49

Reply | Quote | #1

thanks for your tip, Michael, I haven’t tested, is this file suitable for this case? http://www.mathworks.com/matlabcentral/fileexchange/16075
Paul Brabban

May 23rd, 2010 at 21:17

Reply | Quote | #2

If only more developers stopped to look for some documentation or a specification before they assume they know it all and start coding.

Although there’s no spec, there are well documented rules around the csv format (for example http://en.wikipedia.org/wiki/Comma-separated_values http://tools.ietf.org/html/rfc4180#page-2), the use of quotation to escape special characters being relevant to your examples. I don’t think there are any rules about putting random prologue into a csv file mind you.

Anyway, perhaps they should call the first tool csvishread…

Have a good one!
Genevieve

January 26th, 2011 at 04:21

Reply | Quote | #3

This was really helpful. I was trying just to use the csvread cmd, but not getting the results I wanted. This was a perfect solution!
Andrew

March 19th, 2011 at 10:27

Reply | Quote | #4

Thank you.
Now I use importdata for file.csv:
=====
header;
‘bla1’, ‘bla2’, 12, 13, 14;
‘bla1’, ‘bla2’, 15, 16, 17;
…
=====
>importdata(‘file.csv’)
ans=
data: [739×3 double]
textdata: {740×5 cell}

Ok!
KC

April 21st, 2011 at 23:22

Reply | Quote | #5

Okay this tutorial has been REALLY helpful… but… I think I’ve got one that will challenge the best of you… take a look at this csv file… warning it’s quite large:
http://dl.dropbox.com/u/24099041/CMStudy.csv

It exceeds the max # of columns for excel (256) by 9, but the last several columns aren’t important so you can open it in excel to have a look-see.

Scroll down a ways (or better yet CTRL-downarrow to jump to the next blank cell) and you’ll see there are two groups of data after all the headerlines, called “run#1” and “run#2”.

I have dozens of csv files like this. Some of them have only one “run” some have up to 4 or 5 runs. Runs are NOT the same number of rows – each is different. They all have 265 columns though.

My goal is to have some code that will open a csv file that contains m runs and extract each run into m separate (k x 256) arrays class “double” (where k is different for each m).

I found a function on File Exchange called “readtext.m” that will successfully read the entire csv into a cell array (painfully slowly though). I’m wondering if there’s a better way. I also can’t figure out how to extract each “run” into it’s own standard numerical/double array.

Oh and just to confound it even further, the # of headerlines varies depending on how many runs there are, so data for run#1 starts anywhere from row 21… to 21+m!

Thanks in advance for any help… :D
KC

April 22nd, 2011 at 19:05

Reply | Quote | #6
Okay, using the m-file “readtext.m” from the Matlab FileExchange, I came up with this method to bring my data into numerical arrays:
```
K = readtext(csvfilename);
tf = strncmp(K(:,1),'----',4);
runi = find(tf); %run index numbers
for n = 1:length(runi);
    if n==length(runi);
    eval(['run' num2str(n) '=K(runi(n,1)+1:end,:);']);
    else eval(['run' num2str(n) '=K(runi(n,1)+1:runi(n+1,1)-1,:);']);
    end
    eval(['run' num2str(n) '=cell2mat(run' num2str(n) ');']);
end
```
This works, but the “readtext” m-file is still very slow at importing the csv file to a cell. If anyone knows a more efficient way I’d love to hear it. Thanks!

Mike Croucher

April 23rd, 2011 at 01:51

Reply | Quote | #7

Hi KC

I haven’t tested your original but how does the function below do?
I tested it on the data set you provided as follows

>> data=import_KC('CMStudy.csv');
>> data

data = 
    [2664x265 double]    [2108x265 double]

to get run1 you would do

run1 = data{1}

Hope this helps,
Mike

function data = import_KC(filename)

fid = fopen(filename);

run_num = 1;

%all runs contain n x m numbers and n is different for each run.
%So we cannot use a multidimensional array.
%We use a cell instead.
data = {}; 
while ~feof(fid)
    %Skip the header lines.  We don't know in advance how many there are
    %so cant use the header functionality in textscan.
    current_line = fgetl(fid);
    while ischar(current_line)
        match = strfind(current_line, 'Run #');
        if(match)
            break;
        end 
        current_line = fgetl(fid);      
    end
    
    %Load data.  We don't know dimensions in advance
    %res will initially contain all data from this run in a single column
    res=textscan(fid,'%f','Delimiter',',');
    res=res{1};
    %find number of rows.  Columns are fixed at 265
    rows = length(res)/265;
    %reshape into a matrix
    res = reshape(res,265,rows)';
    %Add this run to the set of runs
    data{run_num} = res;
    run_num=run_num+1;
end

fclose(fid);

end

KC

April 23rd, 2011 at 08:35

Reply | Quote | #8

Wow thanks. That looks pretty sweet. You used several functions I didn’t even know existed. I will say though that I also eventually decided to store the runs in a cell.

I’d like to try out your code, but I’m getting an error:
>> data=import_KC(‘CMStudy.csv’);
??? Undefined function or method ‘import_KC’ for input arguments of type ‘char’.

Can’t seem to figure that out… but it’s really late so… zzzzzzzzzzz
Mike Croucher

April 23rd, 2011 at 10:40

Reply | Quote | #9

You have to save the function I wrote into an .m file called import_KC.m and then make sure that it is in your MATLAB path.
KC

April 25th, 2011 at 17:37

Reply | Quote | #10

Yeah… I had accidentally saved it in the wrong directory! “which” saves the day.

It works! Now I just have to figure out why it works… ;-)

Thanks! I think I will learn a lot from your little script.
nemanja

January 14th, 2012 at 21:20

Reply | Quote | #11

Hello,
I have a problem while using importdata to open a large (100000×17) .csv file. When I use this function as test=importdata(‘training – Copy.csv’, ‘,’, 1); it puts it in a struct test, which is
test =
data: [99999×1 double]
textdata: {100000×17 cell}

the data part is the last column of the original file, while textdata is all the remaining data, but as strings, where the 17th column is consisted of ‘ ‘.

Can anybody tell me why isn’t the file loaded normally as in the examples above, and secondly, if this cannot be done, how can I convert it from string into a matrix?

Thank you!
F_Ahmed

February 26th, 2012 at 07:12

Reply | Quote | #12

Thanks, man!
I wish I had found this two months back…would have saved a lot of my time.
Chris

February 12th, 2013 at 22:48

Reply | Quote | #13

Hi! I would also like to know how to read this type of csv file:

2/9/2013 0:01,
Price: 15,
Quantity: 3,
1,3
3,5
6,7
2/9/2013 0:03,
Price: 18,
Quantity: 9,
8,4
6,2
5,3

I’d like to know if it’s possible can analyze the 2 sets of data separately.

Thank you!
Algirdas

May 8th, 2013 at 16:05

Reply | Quote | #14

hi guys, what you have done where is alright but still my problem is not solved. I have a bit more difficult csv file from IFM 3D sensor. I only need values. Values are actual meters to object, but it has a different formatting. It is not a comma separated values but semicolon separated values and instead of a dot we have a comma, after which are the centimeter part. https://www.dropbox.com/s/oc1jd6i7o7iqceq/hard.csv could someone please help me out with this one. MATLAB does not import this into workspace as values separated by semicolons.
And I need to get everything out in program way not user hand writing and not by find and replace in word :)
Right now i get one cell with all the numbers and a lot of blank cells, and now i am trying to recreate a csv matrix from a string object.
Theo

May 19th, 2013 at 02:56

Reply | Quote | #15

Very often fields are empty, like this
45, 67, , , , 45,12, , ,

how do you deal with those????

Thanks

Theo
S. Aich

January 3rd, 2017 at 20:38

Reply | Quote | #16

Dear Sir/Madam,

I was trying to import a csv data file that contains 2500 rows and 5 columns. But, I am getting back a storing only first row instead of full file. Can anyone help in this regard, kindly?

>> a = importdata(‘F0006CH1.CSV’)

a =

data: [2500 0 -2.5000e-004 0.0200 0]
textdata: {‘Record Length,’}
rowheaders: {‘Record Length,’}