Parse a PC-Axis (.px) file in Matlab

Question 1

Usually textscan and regexp are the way to go when parsing string fields (as shown here):

Read the input lines as strings with textscan:

```
fid = fopen('input.px', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
```

Parse the header field names and values using regexp. Picking the right regular expression should do the trick!

```
X = regexp(C{:}, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
X = [X{:}];                          %// Flatten the cell array
X = reshape([X{:}], 2, []);          %// Reshape into name-value pairs
```

The "VALUES" fields may span multiple lines, so their lines need to be concatenated first:

```
idx_data = find(~cellfun('isempty', regexp(C{:}, '^\s*Data')), 1);
idx_values = find(~cellfun('isempty', regexp(C{:}, '^\s*VALUES')));
Y = arrayfun(@(m, n){[C{:}{m:m + n - 1}]}, ...
   idx_values(idx_values < idx_data), diff([idx_values; idx_data]));
```

... and then tokenized:

```
Y = regexp(Y, '"([^,"]+)"', 'tokens');  %// Tokenize values
Y = cellfun(@(x){{x{1}{1}, {[x{2:end}]}}}, Y); %// Group values in one array
Y = reshape([Y{:}], 2, []);             %// Reshape into name-value pairs
```
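To make the tokenizing step easier to follow, here is a minimal sketch of what it does to a single "VALUES" entry. The sample string and the region names are made up for illustration and are not taken from the actual input file; the regular expression is the same one used above.

```matlab
% Hypothetical VALUES entry, already joined into a single string.
y = 'VALUES("region")="North","South","East";';

% Every quoted substring becomes one token.
tok = regexp(y, '"([^,"]+)"', 'tokens');

name = tok{1}{1};        % first token is the variable name
vals = [tok{2:end}];     % remaining tokens, merged into one cell array

% Wrap vals in {} so that struct() stores it as a single field value
% instead of creating a struct array with one element per cell.
S = struct(name, {vals});
```

The extra pair of braces matters: without it, struct would expand the cell array into a 1x3 struct array.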

Make sure the field names are legal (I've decided to convert everything to lowercase and replace hyphens and any whitespace with underscores), and plug them into a struct:
```
X = [X, Y];                             %// Store all fields in one array
X(1, :) = lower(regexprep(X(1, :), '-+|\s+', '_')); 
S = struct(X{:});
```

Here's what I get for your input file (only the header fields):

```
S =
          charset: 'ANSI'
           matrix: 'BE001'
     subject_code: 'BE'
     subject_area: 'Population'
            title: 'Population by region, time, marital status and sex.'
            month: {1x12 cell}
           region: {1x5 cell}
```

As for the data itself, it needs to be handled separately:

Extract data lines after the "Data" field and replace all ".." values with default values (say, NaN):
```
D = strrep(C{:}(idx_data + 1:end), '".."', 'NaN');
```
Obviously this assumes that there is only numerical data after the "Data" field. However, this can easily be modified if that is not the case.

Convert the data to a numerical matrix and add it to the structure:

```
D = cellfun(@str2num, D, 'UniformOutput', false);
S.data = vertcat(D{:})
```

And here's S.data for your input file:

```
S.data =

        NaN        NaN        NaN        NaN        NaN
        NaN        NaN        NaN        NaN        NaN
        NaN   24.80000   34.20000   52.00000   23.00000
        NaN   32.10000   40.30000   50.70000    1.00000
        NaN   31.60000   35.00000   49.10000    2.30000
   41.20000   43.00000   50.80000   60.10000    0.00000
   50.90000   52.00000   53.90000   65.90000    0.00000
```
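A side note on the conversion step: str2num evaluates its input with eval, so it will execute whatever text happens to be in the file. If that is a concern, sscanf is one possible replacement. The sketch below assumes each data line holds whitespace-separated numbers and that every line has the same number of values; the two sample lines are made up.

```matlab
% Hypothetical data lines, after ".." has been replaced with NaN.
D = {'NaN 24.8 34.2'; '41.2 43.0 50.8'};

% sscanf returns a column vector, so transpose each result into a row.
rows = cellfun(@(s) sscanf(s, '%f').', D, 'UniformOutput', false);

% Stack the rows into a numeric matrix (requires equal row lengths).
M = vertcat(rows{:});
```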

Hope this helps!

Question 2

I'm not personally familiar with PC-Axis files, but here are my thoughts.

Parse the header first. If the header is of fixed size, you can read in that many lines and parse out the values you want. The regexp method may be useful for this.

The data appear to be both string and numeric. I would change the ".." values to NaN (back up the original file first, of course) and then read the matrix with textscan. Textscan can be tricky, so make sure the file parses completely: if textscan encounters a field that does not match the format string, it stops parsing. You can compare the current file position (from ftell) with the position of the end of the file (fseek to the end, then call ftell again) to see whether everything was read. The cell arrays returned by textscan should also all have the same length. If they do not, the lengths tell you which line the parse failed on, and you can inspect that line in a text editor to see what violated the format.
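The completeness check described above could look roughly like this. It is only a sketch: the file name, the three-column format string and the sample contents (with a deliberately bad field on the last line) are all assumptions made for illustration.

```matlab
% Hypothetical file name; the sample file is written here so that the
% sketch is self-contained.
fname = 'data_only.txt';
fid = fopen(fname, 'w');
fprintf(fid, '1 2 3\n4 5 6\n7 oops 9\n');
fclose(fid);

fid = fopen(fname, 'r');
cols = textscan(fid, '%f %f %f');   % assumed format: three numeric columns
pos_stop = ftell(fid);              % position where textscan stopped
fseek(fid, 0, 'eof');               % jump to the end of the file
pos_end = ftell(fid);               % position of the end of the file
fclose(fid);
delete(fname);                      % remove the sample file again

parsed_all = (pos_stop == pos_end); % true only if the whole file was read
n_rows = cellfun(@numel, cols);     % number of values read per column
```

If parsed_all is false, or the entries of n_rows differ, the shortest column length points at the line where parsing stopped.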

You can assign and access fields in Matlab structs using string arguments. For example:

```
foo.('a') = 1;
foo.a
ans = 
     1
```

So, the workflow I suggest is to parse the header lines, assigning each attribute/value pair as a field/value pair in a struct. Then parse the matrix (after some brief text preprocessing to make sure all the data are numeric).
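A minimal sketch of the header half of that workflow, using dynamic field names, might look like the following. The header lines are hypothetical examples in KEYWORD="value"; form, and the way names are cleaned up (non-word characters replaced with underscores, then lowercased) is just one possible choice. It assumes every keyword starts with a letter, since struct field names must.

```matlab
% Hypothetical header lines; in practice these would be read from the file.
lines = {'CHARSET="ANSI";', 'MATRIX="BE001";', 'SUBJECT-CODE="BE";'};

S = struct();
for k = 1:numel(lines)
    % Capture the keyword and the quoted value.
    tok = regexp(lines{k}, '^\s*([^=]+?)\s*=\s*"([^"]*)"', 'tokens', 'once');
    if isempty(tok)
        continue;   % skip lines that are not of the form name="value"
    end
    % Turn the keyword into a legal field name, e.g. SUBJECT-CODE -> subject_code.
    name = lower(regexprep(tok{1}, '\W+', '_'));
    S.(name) = tok{2};   % dynamic field assignment
end
```

After the loop, S.charset, S.matrix and S.subject_code hold the three values as strings.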