Question

Background: PC-Axis is a file format used for the dissemination of statistical information. It is used by a number of national statistical organisations to publish official statistics.

A PC-Axis file looks a little like this, although they're usually a lot longer:

CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
Data=
".." ".." ".." ".." ".." 
".." ".." ".." ".." ".." 
".." 24.80 34.20 52.00 23.00 
".." 32.10 40.30 50.70 1.00 
".." 31.60 35.00 49.10 2.30 
41.20 43.00 50.80 60.10 0.00 
50.90 52.00 53.90 65.90 0.00 
28.90 31.80 39.60 51.00 0.00;

More details about PC-Axis files can be found at the Statistics Sweden website, but the basic gist is that the metadata is positioned at the top of the file and after "DATA=" is the actual data itself. It's also worth noting that the data is organized more like a data-table rather than in columns.

The Problem: I'd like to parse a PC-Axis file using Matlab, but I'm a little stumped as to how to go about doing it. Does anyone know how to parse one of these files in Matlab? Would it be easier to parse this type of file using some other language, like Perl, and then import the data into Matlab, or, would Matlab be a suitable enough tool for the job? Note that the plan would be to analyze the data in Matlab after the text processing stage.

I've tried using Matlab's text processing tools such as fgetl, textscan, fscanf, and a few others, but it's terribly tricky. Does anyone have any pointers on how to go about doing it?

Essentially, I'd like to store each of the keywords (CHARSET, MATRIX, etc.) and their corresponding values (ANSI, BE001, etc.) as metadata in Matlab - as a structure, perhaps. I'd like to have the data stored in Matlab also - as a matrix, for example.

Note: I'm aware of the pxR package (CRAN) in R, which works a treat for reading .px files into the workspace as a data.frame object. There's also a Perl module called Data::PcAxis (CPAN) that is also very good, but I'm specifically wanting to know how to parse a .px file using Matlab.

UPDATE: I should have mentioned that in addition to metadata and data, there are also variables. This is best explained by an example. The example PC-Axis file below is the same as the one above except I've added two variables. They're named VALUES("Month") and VALUES("region") and are positioned after the metadata and before the data.

CHARSET="ANSI";
MATRIX="BE001";
SUBJECT-CODE="BE";
SUBJECT-AREA="Population";
TITLE="Population by region, time, marital status and sex.";
VALUES("Month")="1976M01","1976M02","1976M03","1976M04",
"1976M05","1976M06","1976M07","1976M08",
"1976M09","1976M10","1976M11","1976M12";
VALUES("region")="Sweden","Germany","France",
"Ireland","Finland";
Data=
".." ".." ".." ".." ".." 
".." ".." ".." ".." ".." 
".." 24.80 34.20 52.00 23.00 
".." 32.10 40.30 50.70 1.00 
".." 31.60 35.00 49.10 2.30 
41.20 43.00 50.80 60.10 0.00 
50.90 52.00 53.90 65.90 0.00 
28.90 31.80 39.60 51.00 0.00;

Textscan works a treat when reading in each line of the text file as a string (in a cell array). However, the elements after the "=" sign for both of the variables (i.e. VALUES("Month") and VALUES("region")) span more than one line. It seems that using textscan in this case means that some strings would have to be concatenated, say, for example, in order to collect the list of months (1976M01 to 1976M12).

Question: What's the best way to collect the variables data? Read the text file as a single string and then use strtok twice to extract the substring of dates? Perhaps, there's a better (more systematic) way?


Solution

Usually textscan and regexp are the way to go when parsing string fields:

  1. Read the input lines as strings with textscan:

    fid = fopen('input.px', 'r');
    C = textscan(fid, '%s', 'Delimiter', '\n');
    fclose(fid);
    
  2. Parse the header field names and values using regexp. Picking the right regular expression should do the trick!

    X = regexp(C{:}, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
    X = [X{:}];                          %// Flatten the cell array
    X = reshape([X{:}], 2, []);          %// Reshape into name-value pairs
    
  3. The "VALUES" fields may span multiple lines, so each one needs to be concatenated into a single string first:

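    %// Index of the line where the data block starts
    idx_data = find(~cellfun('isempty', regexp(C{:}, '^\s*Data')), 1);
    %// Indices of the lines where each VALUES field starts
    idx_values = find(~cellfun('isempty', regexp(C{:}, '^\s*VALUES')));
    %// Join each VALUES line with its continuation lines
    Y = arrayfun(@(m, n){[C{:}{m:m + n - 1}]}, ...
       idx_values(idx_values < idx_data), diff([idx_values; idx_data]));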
    

    ... and then tokenized:

    Y = regexp(Y, '"([^,"]+)"', 'tokens');  %// Tokenize values
    Y = cellfun(@(x){{x{1}{1}, {[x{2:end}]}}}, Y); %// Group values in one array
    Y = reshape([Y{:}], 2, []);             %// Reshape into name-value pairs
    
  4. Make sure the field names are legal (I've decided to convert everything to lowercase and replace hyphens and any whitespace with underscores), and plug them into a struct:

    X = [X, Y];                             %// Store all fields in one array
    X(1, :) = lower(regexprep(X(1, :), '-+|\s+', '_')); 
    S = struct(X{:});
    

Here's what I get for your input file (only the header fields):

S =
          charset: 'ANSI'
           matrix: 'BE001'
     subject_code: 'BE'
     subject_area: 'Population'
            title: 'Population by region, time, marital status and sex.'
            month: {1x12 cell}
           region: {1x5 cell}
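
If joining the continuation lines in step 3 feels fiddly, another option is to treat the header as a list of statements terminated by ";" rather than as a list of lines. The following is only a rough sketch of that idea, not a full PC-Axis parser. It assumes that no ";" appears inside a quoted value and that every keyword turns into a legal field name once non-word characters are replaced; the variable names (txt, stmts, meta and so on) are just for illustration.

```matlab
% Sketch only: split the header into statements at ';' instead of by line.
% Assumes no ';' occurs inside a quoted value.
% For a real file, use txt = fileread('input.px'); a sample is inlined here.
txt = sprintf('%s\n', ...
    'MATRIX="BE001";', ...
    'SUBJECT-CODE="BE";', ...
    'VALUES("region")="Sweden","Germany","France",', ...
    '"Ireland","Finland";', ...
    'Data=', ...
    '".." 24.80 34.20;');

stmts = strtrim(strsplit(txt, ';'));          % one cell per statement
stmts = stmts(~cellfun('isempty', stmts));    % drop the empty remainder after the last ';'

meta = struct();
for k = 1:numel(stmts)
    pos = find(stmts{k} == '=', 1);           % first '=' separates keyword and value
    if isempty(pos), continue; end
    key = strtrim(stmts{k}(1:pos - 1));
    val = stmts{k}(pos + 1:end);
    if strcmpi(key, 'DATA'), break; end       % header ends where the data block begins

    % VALUES("name") statements are stored under the quoted name
    sub = regexp(key, '^VALUES\("([^"]+)"\)$', 'tokens', 'once');
    if ~isempty(sub), key = sub{1}; end

    tok = regexp(val, '"([^"]*)"', 'tokens'); % every quoted item, across line breaks
    items = [tok{:}];
    if isempty(items)
        items = strtrim(val);                 % unquoted value: keep the raw text
    elseif numel(items) == 1
        items = items{1};                     % single value: store as a plain string
    end
    field = lower(regexprep(key, '\W+', '_')); % assumes the result is a legal field name
    meta.(field) = items;
end
```

Because the text is split at ";" and not at line breaks, a VALUES list that wraps over several lines arrives as one statement, so no line indices have to be tracked.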

As for the data itself, it needs to be handled separately:

  1. Extract data lines after the "Data" field and replace all ".." values with default values (say, NaN):

    D = strrep(C{:}(idx_data + 1:end), '".."', 'NaN');
    

    Obviously this assumes that there is only numerical data after the "Data" field. However, this can easily be modified if that is not the case.

  2. Convert the data to a numerical matrix and add it to the structure:

    D = cellfun(@str2num, D, 'UniformOutput', false);
    S.data = vertcat(D{:})
    

And here's S.data for your input file:

S.data =

        NaN        NaN        NaN        NaN        NaN
        NaN        NaN        NaN        NaN        NaN
        NaN   24.80000   34.20000   52.00000   23.00000
        NaN   32.10000   40.30000   50.70000    1.00000
        NaN   31.60000   35.00000   49.10000    2.30000
   41.20000   43.00000   50.80000   60.10000    0.00000
   50.90000   52.00000   53.90000   65.90000    0.00000
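
As a variation on the two data steps above, the rows can also be converted with sscanf, which avoids str2num evaluating the text as code. This is a minimal sketch under a few assumptions: every data row has the same number of values, ".." is the only non-numeric token, and the block ends with a single ";". The sample rows are copied from the question, and rowText, rows and data are illustrative names.

```matlab
% Sketch only: convert data lines to a numeric matrix without str2num.
rowText = {'".." 24.80 34.20 52.00 23.00 '; ...
           '41.20 43.00 50.80 60.10 0.00 '; ...
           '28.90 31.80 39.60 51.00 0.00;'};  % sample rows from the question

rowText = strrep(rowText, '".."', 'NaN');     % missing values become NaN
rowText = regexprep(rowText, ';\s*$', '');    % strip the ';' that closes the data block

% sscanf returns a column vector, so transpose each result into a row
rows = cellfun(@(s) sscanf(s, '%f').', rowText, 'UniformOutput', false);
data = vertcat(rows{:});                      % fails if the rows differ in length
```

If vertcat raises a dimension error here, that is a useful signal that one of the rows did not parse completely.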

Hope this helps!

OTHER TIPS

I'm not personally familiar with PC-Axis files, but here are my thoughts.

Parse the header first. If the header is of fixed size, you can read in that many lines and parse out the values you want. The regexp method may be useful for this.

The data appear to be both string and numeric. I would change the ".." values to NaN (make an original backup first, of course), and then scan the matrix using textscan. Textscan can be tricky, so make sure the file parses completely. If textscan encounters a line that does not match the format string, it will stop parsing. You can check the position of the file handle (using ftell) to see if it matches the end of the file (you can fseek to the end of the file to find what that value should be). The length of the cell arrays returned by textscan should all be the same. If not, the length will tell you what line they failed on - you can check this line with a text editor to see what violated the format.
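
The completeness check described above could look roughly like this. It is a sketch that writes a small scratch file with one deliberately bad line, so the file name and its contents are made up for the demonstration; the exact lengths textscan returns on a failed parse may depend on the MATLAB version.

```matlab
% Sketch only: detect whether textscan consumed the whole file.
fname = [tempname '.txt'];                 % hypothetical scratch file
fid = fopen(fname, 'w');
fprintf(fid, '1 2 3\n4 5 6\n7 x 9\n');     % third line violates the numeric format
fclose(fid);

fid = fopen(fname, 'r');
C = textscan(fid, '%f %f %f');
stopPos = ftell(fid);                      % where textscan stopped reading
fseek(fid, 0, 'eof');
endPos = ftell(fid);                       % position of the end of the file
fclose(fid);
delete(fname);

complete = (stopPos == endPos);
if ~complete
    lens = cellfun(@numel, C);             % unequal lengths hint at the failing line
    fprintf('Parsing stopped at byte %d of %d; column lengths: %s\n', ...
        stopPos, endPos, mat2str(lens));
end
```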

You can assign and access fields in Matlab structs using string arguments. For example:

foo.('a') = 1;
foo.a
ans = 
     1
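
Applied to the header keywords from the question, that might look like the sketch below. The cleanup rule (lowercase, anything other than letters, digits and underscores replaced by an underscore, and an x_ prefix for names that are still not valid) is my own assumption and not something the PC-Axis format prescribes.

```matlab
% Sketch only: build a struct from keyword/value pairs using dynamic field names.
names  = {'CHARSET', 'SUBJECT-CODE', 'SUBJECT-AREA'};
values = {'ANSI', 'BE', 'Population'};

meta = struct();
for k = 1:numel(names)
    % Field names may only contain letters, digits and underscores
    field = lower(regexprep(names{k}, '[^A-Za-z0-9_]+', '_'));
    if ~isvarname(field)
        field = ['x_' field];              % e.g. a name that starts with a digit
    end
    meta.(field) = values{k};
end
```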

So, the workflow I suggest is to parse the header lines, assigning each attribute/value pair as a field/value pair in a struct. Then parse the matrix (after some brief text preprocessing to make sure all the data are numeric).

Licensed under: CC-BY-SA with attribution