Cannot import all values at once in Matlab using TextScan

Question 1

I would assume that there is an error in the data, or that the format pattern does not match the data. Try extracting the lines in question:

file_id=fopen('CRSP.csv');
for idx=1:1424456
    fgetl(file_id); % read and discard this line
end
for idx=1:10
    fprintf('%s\n',fgetl(file_id)); % print the next 10 lines
end
fclose(file_id);

If there is an error, it should be on the 2nd or 3rd printed line. Is there anything unusual there? Maybe a COMNAM containing a special character?

To read the file, I would use the following code, which parses it line by line:

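file_id=fopen('CRSP.csv');
line=fgetl(file_id);
data={};
ix=1; % current line number, used in the error message
while(ischar(line)) % fgetl returns -1 (not a char) at end of file
    % request the outputs explicitly so that their order is unambiguous
    [parsed,sindex,eindex] = regexpi(line,'(\d\d/\d\d/\d\d\d\d)\s*, ([\w ]+), ([\w ]+), ([\d]+), ([\d]+), ([\d]+), ([\d]+), ([\d \.]+), ([\d \.]+)','tokens','start','end');
    % accept only a single match that spans the whole line
    if ~isempty(sindex)&&numel(sindex)==1&&(sindex==1)&&(eindex==numel(line))
        data{end+1}=parsed{1};
    else
        fprintf('Unable to parse line %d with content: %s\n',ix,line);
    end
    line=fgetl(file_id);
    ix=ix+1;
end
fclose(file_id);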

Short summary of regular expressions:

'(...)' Everything between is a "token" which is returned

'([\d .]+)' Numbers, white space and "."

'([\d]+)' Numbers only (digits)

'([\w ]+)' Word characters (letters, digits, underscore), including white space

'(\d\d/\d\d/\d\d\d\d)' date

This expression is a bit lenient. It accepts not only "0.000" as a number but also "0.0 00." and similar combinations. It should still be enough to detect all errors; if not, the expression has to be tightened.
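As a quick check of the expression above, the following sketch applies it to two made-up lines. The sample lines are hypothetical (the real columns of CRSP.csv may look different); the second one contains an "&" in the name field, which `[\w ]` does not accept.

```matlab
% Hypothetical sample lines; the actual layout of CRSP.csv is an assumption.
expr = ['(\d\d/\d\d/\d\d\d\d)\s*, ([\w ]+), ([\w ]+), ([\d]+), ([\d]+), ' ...
        '([\d]+), ([\d]+), ([\d \.]+), ([\d \.]+)'];
good = '01/31/2000, ABC, ACME CORP, 1, 2, 3, 4, 10.5, 0.25';
bad  = '01/31/2000, ABC, ACME & CO, 1, 2, 3, 4, 10.5, 0.25';

% A line counts as parsed only if a single match covers it from start to end.
[tok, s, e] = regexpi(good, expr, 'tokens', 'start', 'end');
goodOk = ~isempty(s) && numel(s) == 1 && s == 1 && e == numel(good);

[~, s2, e2] = regexpi(bad, expr, 'tokens', 'start', 'end');
badOk = ~isempty(s2) && numel(s2) == 1 && s2 == 1 && e2 == numel(bad);

fprintf('good line parsed: %d, bad line parsed: %d\n', goodOk, badOk);
fprintf('third token of good line: %s\n', tok{1}{3});
```

Each parsed line yields one cell with nine string tokens; numeric columns would still need a conversion such as str2double afterwards.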

Question 2

Daniel R's answer is basically correct. To elaborate (I would have posted this as a comment but I lack the reputation), textscan in MATLAB is very finicky and it basically bails out whenever it hits something that does not PRECISELY match the format you specify.

If you have a datafile that is likely to contain some errors or inconsistencies, your main options are to pre-process the file somehow to prune out those errors, or (as Daniel suggests) read the file in yourself line-by-line and parse it however you want. The former is probably about as much work as the latter unless you plan to do it manually and there aren't many errors to fix, so in most cases it might just be easier to write your own parser.

The only other thing you could potentially do applies if the only errors are errors of type (e.g. a column is supposed to be an integer but sometimes a floating-point number slips in): you could still use textscan and replace format specifiers with more generic ones. In that example, you could replace %d (integer) with %f (floating-point number). Since integers of ordinary size (up to 2^53) are represented exactly as double-precision floating-point numbers, that should work OK. In the most extreme case, you could read in all columns as strings (%s), but then you'd need to parse them all anyway and you're probably better off just doing that without textscan.
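A minimal sketch of the %f idea, using a made-up two-column sample passed to textscan as text instead of a file. The data and the variable names are hypothetical; the second column is meant to hold integers, but one value is 2.5.

```matlab
% Hypothetical data: a name column and a column that should be integer.
raw = sprintf('ABC,1\nDEF,2.5\nGHI,3\n');

% Read the numeric column with %f so the stray 2.5 does not stop the import.
C = textscan(raw, '%s %f', 'Delimiter', ',');
names  = C{1};
values = C{2};

% Flag the rows whose value is not a whole number, for later inspection.
isWhole = (values == round(values));
fprintf('rows with non-integer values: %s\n', mat2str(find(~isWhole)'));
```

Reading with %f and checking afterwards keeps the import in one pass while still showing which rows violate the expected type.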