Matlab: How to read in numbers with a comma as decimal separator?
05-03-2021
Question
I have a whole lot (hundreds of thousands) of rather large (>0.5MB) files, where data are numerical, but with a comma as decimal separator.
It's impractical for me to use an external tool like sed "s/,/./g".
When the separator is a dot, I just use textscan(fid, '%f%f%f'), but I see no option to change the decimal separator.
How can I read such a file in an efficient manner?
Sample line from a file:
5,040000 18,040000 -0,030000
Note: There is a similar question for R, but I use Matlab.
Solution
With a test script I've found a factor of less than 1.5 between txt2mat reading comma-separated decimals and textscan reading dot-separated decimals. My code would look like:
tmco = {'NumHeaderLines', 1 , ...
'NumColumns' , 5 , ...
'ConvString' , '%f' , ...
'InfoLevel' , 0 , ...
'ReadMode' , 'block', ...
'ReplaceChar' , {',.'} } ;
A = txt2mat(filename, tmco{:});
Note the different 'ReplaceChar' value (the cell array {',.'}) and the 'ReadMode' set to 'block'.
I get the following results (average time per read in seconds, measured with tic/toc) for a ~5MB file on my (not too new) machine:
- txt2mat test comma avg. time: 0.63231
- txt2mat test dot avg. time: 0.45715
- textscan test dot avg. time: 0.4787
The full code of my test script:
%% generate sample files
fdot = 'C:\temp\cDot.txt';
fcom = 'C:\temp\cCom.txt';
c = 5; % # columns
r = 100000; % # rows
test = round(1e8*rand(r,c))/1e6;
tdot = sprintf([repmat('%f ', 1,c), '\r\n'], test.');
tdot = ['a header line', char([13,10]), tdot];
tcom = strrep(tdot,'.',',');
% write dot file
fid = fopen(fdot,'w');
fprintf(fid, '%s', tdot);
fclose(fid);
% write comma file
fid = fopen(fcom,'w');
fprintf(fid, '%s', tcom);
fclose(fid);
disp('-----')
%% read back sample files with txt2mat and textscan
% txt2mat-options with comma decimal sep.
tmco = {'NumHeaderLines', 1 , ...
'NumColumns' , 5 , ...
'ConvString' , '%f' , ...
'InfoLevel' , 0 , ...
'ReadMode' , 'block', ...
'ReplaceChar' , {',.'} } ;
% txt2mat-options with dot decimal sep.
tmdo = {'NumHeaderLines', 1 , ...
'NumColumns' , 5 , ...
'ConvString' , '%f' , ...
'InfoLevel' , 0 , ...
'ReadMode' , 'block'} ;
% textscan-options
tsco = {'HeaderLines' , 1 , ...
'CollectOutput' , true } ;
A = txt2mat(fcom, tmco{:});
B = txt2mat(fdot, tmdo{:});
fid = fopen(fdot);
C = textscan(fid, repmat('%f',1,c) , tsco{:} );
fclose(fid);
C = C{1};
disp(['txt2mat test comma (1=Ok): ' num2str(isequal(A,test)) ])
disp(['txt2mat test dot (1=Ok): ' num2str(isequal(B,test)) ])
disp(['textscan test dot (1=Ok): ' num2str(isequal(C,test)) ])
disp('-----')
%% speed test
numTest = 20;
% A) txt2mat with comma
tic
for k = 1:numTest
    A = txt2mat(fcom, tmco{:});
    clear A
end
ttmc = toc;
disp(['txt2mat test comma avg. time: ' num2str(ttmc/numTest) ])
% B) txt2mat with dot
tic
for k = 1:numTest
    B = txt2mat(fdot, tmdo{:});
    clear B
end
ttmd = toc;
disp(['txt2mat test dot avg. time: ' num2str(ttmd/numTest) ])
% C) textscan with dot
tic
for k = 1:numTest
    fid = fopen(fdot);
    C = textscan(fid, repmat('%f',1,c) , tsco{:} );
    fclose(fid);
    C = C{1};
    clear C
end
ttsc = toc;
disp(['textscan test dot avg. time: ' num2str(ttsc/numTest) ])
disp('-----')
OTHER TIPS
You may use txt2mat.
A = txt2mat('data.txt');
It will handle the data automatically. But you can explicitly say:
A = txt2mat('data.txt','ReplaceChar',',.');
P.S. txt2mat may not be the most efficient option, but you can copy the relevant part of its source code if you only need it for your specific data format.
You may try to speed up txt2mat by also passing the number of header lines and, if possible, the number of columns as inputs, which bypasses its file analysis. With those inputs there shouldn't be a factor of 25 compared to a textscan import with dot-separated decimals. (You may also contact me using the author page on the MathWorks site.) Please let us know if you find a more efficient way to handle comma-separated decimals in MATLAB.
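A minimal sketch of that suggestion, reusing the option names that appear in the solution above ('NumHeaderLines', 'NumColumns', 'ReplaceChar', 'ReadMode'). It assumes txt2mat (a File Exchange function, not part of MATLAB itself) is on the path. The temporary file and the header and column counts are placeholders chosen to match the sample line from the question.

```matlab
% Assumption: txt2mat.m from the MATLAB File Exchange is on the path.
% Write one sample line (taken from the question) to a temporary file.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '5,040000 18,040000 -0,030000\r\n');
fclose(fid);

% Passing the header and column counts lets txt2mat skip its file analysis.
opts = {'NumHeaderLines', 0      , ...   % no header line in this sample file
        'NumColumns'    , 3      , ...   % three values per line
        'ReplaceChar'   , {',.'} , ...   % replace ',' with '.' before conversion
        'ReadMode'      , 'block'};
A = txt2mat(fname, opts{:});

delete(fname);
```

For your own files, set 'NumHeaderLines' and 'NumColumns' to the actual layout of the data.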
My solution (assumes commas are only used as decimal separators and that white space delimits columns):
fid = fopen('FILENAME');
indat = fread(fid, '*char').'; % Read all characters; transpose to a row vector
fclose(fid);
indat = strrep(indat, ',', '.');
[colA, colB] = strread(indat, '%f %f');
If you need to remove a single header line, as I did, then this should work:
fid = fopen('FILENAME'); % Open file
indat = fread(fid, '*char').'; % Read the entire file as a row vector of characters
fclose(fid); % Close file
indat = strrep(indat, ',', '.'); % Replace commas with periods
endheader = find(indat == 10 | indat == 13, 1); % Find the first line break (LF or CR)
indat = indat(endheader+1:end); % Keep all characters after the first line break
[colA, colB] = strread(indat, '%f %f'); % Convert string to numerical data
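The same read-replace-parse idea can also be sketched with textscan instead of strread, since textscan accepts a character vector in place of a file identifier. This is only an illustration under the same assumption (commas occur only as decimal separators); the temporary file and its contents, which repeat the sample line from the question, are placeholders.

```matlab
% Write a small sample file (the question's sample line, repeated twice).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '5,040000 18,040000 -0,030000\r\n');
fprintf(fid, '5,040000 18,040000 -0,030000\r\n');
fclose(fid);

% Read the whole file as one character row vector.
txt = fileread(fname);

% Replace decimal commas with dots (assumes commas are decimal separators only).
txt = strrep(txt, ',', '.');

% Parse the text directly from memory instead of from a file identifier.
C = textscan(txt, '%f%f%f', 'CollectOutput', true);
A = C{1};   % numeric matrix, one row per line of the file

delete(fname);
```

The whole file is held in memory as text before parsing, which should be unproblematic for files of roughly 0.5 MB each.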