質問

I have a dataset of 1911518 values. I have used the textscan function. But the function returns only 1424458 values. I again created a new variable to get the remaining values but this time it gave me around 9000 values. Does anyone know what the reason for this can be?

n=9
N=1911518

file_id=fopen('CRSP.csv');

C=textscan(file_id,'%s',n,'delimiter', ','); %To get the headers
C_text=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ','); 

%Returns 1424458

C_text1=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ','); 

%Returns only 9000 values    

fclose(file_id);

Sample Data

DATE,COMNAM,TICKER,PERMNO,PERMCO,SHROUT,VOL,RET,sprtrn

01/02/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 23700,  0.000000,  0.007793
01/03/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 20800,  0.020000,  0.000950
01/04/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 65300, -0.026144, -0.005826
01/05/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 340600, 0.000000, -0.001587
01/08/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 3400,   0.000000,  0.002821
01/09/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 27200, -0.006711, -0.014568
01/10/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 25400, -0.033784, -0.018000
01/11/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 14000,  0.000000,  0.007034
役に立ちましたか?

解決

I would assume, that there is some error in the data, or format pattern does not match the data. Try to extract these lines:

file_id=fopen('CRSP.csv');
for idx=1:1424456
    fgetl(file_id); %dump data
end
for idx=1:10
    fprintf('%s\n',fgetl(file_id));
end

If there is an error, it should be at the 2rd or 3nd printed line. Anything special there? Maybe a COMNAM with some special character?

To read the file, i would use the following code to read line by line:

file_id=fopen('CRSP.csv');
line=fgetl(file_id);
data={};
int ix=1;
while(ischar(line))
    [parsed,sindex,eindex] = regexpi(line,'(\d\d/\d\d/\d\d\d\d)\s*, ([\w ]+), ([\w ]+), ([\d]+), ([\d]+), ([\d]+), ([\d]+), ([\d \.]+), ([\d \.]+)','tokens')
    if ~isempty(sindex)&&numel(sindex)==1&&(sindex==1)&&(eindex==numel(x))
        data{end+1}=parsed{1};
    else
        fprintf('Unable to parse line %d with content: %S',ix,line);
    end
    line=fgetl(file_id);
    ix=ix+1;
end

Short summary of regular expressions:

'(...)' Everything between is a "token" which is returned

'([\d .]+)' Numbers, white space and "."

'([\d .]+)' Numbers and white space

'([\w ]+)' Word, including white space

'(\d\d/\d\d/\d\d\d\d)' date

This expression is a bit "lazy". It not only accepts "0.000" as a number but also "0.0 00." or some other combinations, but it should be enough to detect all errors. If not, the expression has to be improved.

他のヒント

Daniel R's answer is basically correct. To elaborate (I would have posted this as a comment but I lack the reputation), textscan in MATLAB is very finicky and it basically bails out whenever it hits something that does not PRECISELY match the format you specify.

If you have a datafile that is likely to contain some errors or inconsistencies, your main options are to pre-process the file somehow to prune out those errors, or (as Daniel suggests) read the file in yourself line-by-line and parse it however you want. The former is probably about as much work as the latter unless you plan to do it manually and there aren't many errors to fix, so in most cases it might just be easier to write your own parser.

The only other thing you could potentially do -- if the only errors are errors of type (e.g. a column is supposed to be an integer but sometimes a floating-point number slips in), you could still use textscan and replace format specifiers with more generic ones. E.g., in that example, you could replace %d (integer) with %f (floating-point number). Since all integers are floating-point numbers, that should work OK. In the most extreme case, you could read in all columns as strings (%s), but then you'd need to parse them all anyway and you're probably better off just doing that without textscan.

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top