Question

I am trying to set up a standard workflow to efficiently import data from the Dutch National Bureau of Statistics (http://statline.cbs.nl), delivered as SPSS syntax, into R and/or Python so I can run analyses, load it into our database, and so on.

The good news is that they have standardized many output formats, among them an .sps syntax file. In essence, this is a space-delimited data file (with fixed column positions) plus extra information in the header and footer. The file looks like the sample below. I prefer this format to plain .csv because it carries more metadata and should make it easier to import large amounts of data consistently.

The bad news is that I can't find a working library in Python and/or R that can deal with .sps SPSS syntax files. Most libraries only work with the binary .sav or .por formats.

I am not looking for a full working SPSS clone, but for something that parses the data correctly using the metadata under the keywords 'DATA LIST' (the position and width of each column), 'VAR LABELS' (the column headers) and 'VALUE LABELS' (extra data that should be joined or substituted during the import).

I'm sure a Python/R library could be written to parse and process all this information efficiently, but I am not fluent or experienced enough in either language to write it myself.

Any suggestions or hints would be helpful.

SET            DECIMAL = DOT.
TITLE          "Gezondheidsmonitor; regio, 2012, bevolking van 19 jaar of ouder".
DATA LIST      RECORDS = 1
 /1            Key0         1 -    5 (A)
               Key1         7 -    7 (A)
               Key2         9 -   14 (A)
               Key3        16 -   23 (A)
               Key4        25 -   28 (A)
               Key5        30 -   33 (A)
               Key6        35 -   38 (A)
               Key7        40 -   43 (A).

BEGIN DATA
80200 1 GM1680 2012JJ00 .    .    .    .   
80200 1 GM0738 2012JJ00 13.2 .    .    21.2
80200 1 GM0358 2012JJ00 .    .    .    .   
80200 1 GM0197 2012JJ00 13.7 .    .    10.8
80200 1 GM0059 2012JJ00 12.4 .    .    16.5
80200 1 GM0482 2012JJ00 13.3 .    .    14.1
80200 1 GM0613 2012JJ00 11.6 .    .    16.2
80200 1 GM0361 2012JJ00 17.0 9.6  17.1 14.9
80200 1 GM0141 2012JJ00 .    .    .    .   
80200 1 GM0034 2012JJ00 14.3 18.7 22.5 18.3
80200 1 GM0484 2012JJ00 9.7  .    .    15.5

(...)

80200 3 GM0642 2012JJ00 15.6 .    .    19.6
80200 3 GM0193 2012JJ00 .    .    .    .   
END DATA.
VAR LABELS
               Key0      "Leeftijd"/
               Key1      "Cijfersoort"/
               Key2      "Regio's"/
               Key3      "Perioden"/
               Key4      "Mantelzorger"/
               Key5      "Zwaar belaste mantelzorgers"/
               Key6      "Uren mantelzorg per week"/
               Key7      "Ernstig overgewicht".

VALUE LABELS
               Key0      "80200"  "65 jaar of ouder"/
               Key1      "1"  "Percentages"
                         "2"  "Ondergrens"
                         "3"  "Bovengrens"/
               Key2      "GM1680"  "Aa en Hunze"
                         "GM0738"  "Aalburg"
                         "GM0358"  "Aalsmeer"
                         "GM0197"  "Aalten"
                         (...)
                         "GM1896"  "Zwartewaterland"
                         "GM0642"  "Zwijndrecht"
                         "GM0193"  "Zwolle"/
               Key3      "2012JJ00"  "2012".

LIST           /CASES TO 10.

SAVE           /OUTFILE "Gezondheidsmonitor__regio,_2012,_bevolking_van_19_jaar_of_ouder.SAV".

Solution

Some sample code to get you started. I'm not the best Python programmer, so improvements are welcome. The steps still to add are a method to load the labels and build a list of dicts for the VALUE LABELS.

# Parse a StatLine .sps file: title, column layout (DATA LIST) and data rows.
spss_keys = []            # [name, start column, end column, format] per variable
data = []                 # one list of string values per data row
title = None
num_records = None        # set once the DATA LIST line has been seen
begin_data_step = False
end_data_step = False

with open('Bevolking_per_maand__100214211711.sps', 'r') as f:
    for l in f:
        # first look for TITLE
        if 'TITLE' in l:
            start_pos = l.find('"') + 1
            end_pos = l.find('"', start_pos)
            title = l[start_pos:end_pos]
            print("title:", title)

        if 'DATA LIST' in l:
            num_records = l[l.find('=') + 1:].strip()
            print("number of records =", num_records)

        # column definitions sit between DATA LIST and BEGIN DATA
        if num_records == '1' and 'Key' in l and not begin_data_step:
            spss_keys.append([l[15:22].strip(), int(l[23:29].strip()),
                              int(l[32:36].strip()), l[37:].strip()])

        if 'END DATA.' in l:
            end_data_step = True

        # fixed-width data rows: slice each value by its 1-based column range
        if begin_data_step and not end_data_step and l.strip():
            data.append([l[start - 1:end].strip()
                         for _, start, end, _ in spss_keys])

        # checked after the data branch so the BEGIN DATA line itself is skipped
        if 'BEGIN DATA' in l:
            begin_data_step = True

# more to follow: VAR LABELS and VALUE LABELS
print(data)
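
The label step that is still missing could look roughly like the sketch below. It assumes the VAR LABELS and VALUE LABELS blocks follow the layout of the sample in the question: a variable name, then quoted strings, with continuation lines indented and the next command starting in the first column. The helper names (`parse_var_labels`, `parse_value_labels`, `apply_labels`) are made up for illustration, and labels containing embedded double quotes are not handled.

```python
import re

# A variable name followed by one or more quoted strings.
# Assumption: names match \w+ and labels contain no embedded double quotes.
ENTRY = re.compile(r'(\w+)\s+((?:"[^"]*"\s*)+)')
QUOTED = re.compile(r'"([^"]*)"')


def _block(text, keyword):
    """Text after the `keyword` line, up to the next command (a line that
    does not start with whitespace) or the end of the text."""
    pattern = r'^%s[ \t]*\n(.*?)(?=^\S|\Z)' % re.escape(keyword)
    m = re.search(pattern, text, re.M | re.S)
    return m.group(1) if m else ""


def parse_var_labels(text):
    """Map variable name -> column header, e.g. {'Key0': 'Leeftijd'}."""
    return {m.group(1): QUOTED.findall(m.group(2))[0]
            for m in ENTRY.finditer(_block(text, "VAR LABELS"))}


def parse_value_labels(text):
    """Map variable name -> {code: label}."""
    out = {}
    for m in ENTRY.finditer(_block(text, "VALUE LABELS")):
        strings = QUOTED.findall(m.group(2))
        # quoted strings alternate: code, label, code, label, ...
        out[m.group(1)] = dict(zip(strings[0::2], strings[1::2]))
    return out


def apply_labels(rows, names, value_labels):
    """Replace coded values by their label; values without a label pass through."""
    return [[value_labels.get(name, {}).get(value, value)
             for name, value in zip(names, row)]
            for row in rows]


# Shortened excerpt of the footer shown in the question.
sample = '''VAR LABELS
               Key0      "Leeftijd"/
               Key1      "Cijfersoort".

VALUE LABELS
               Key0      "80200"  "65 jaar of ouder"/
               Key1      "1"  "Percentages"
                         "2"  "Ondergrens"
                         "3"  "Bovengrens".

LIST           /CASES TO 10.
'''

var_labels = parse_var_labels(sample)
value_labels = parse_value_labels(sample)
rows = apply_labels([["80200", "1"], ["80200", "3"]],
                    ["Key0", "Key1"], value_labels)
print(var_labels)
print(rows)
```

With the parser above, `names` would be the first element of each entry in `spss_keys` and `rows` would be `data`. Numeric columns such as Key4 to Key7 have no VALUE LABELS, so their values are left unchanged.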

OTHER TIPS

From my point of view, I would not bother with the SPSS file option, but would take the HTML version and scrape that instead. It looks like the tables are nicely formatted with classes, which would make scraping and parsing the HTML much easier.
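
As a rough idea of what the scraping route involves, here is a minimal sketch using only the standard library's `html.parser`. The HTML snippet and the class name `data` are invented for the example and are not taken from StatLine, so the tag handling would have to be adapted to the real page; the `TableRows` class is likewise just an illustrative name.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical markup: NOT copied from StatLine, only here to exercise the parser.
html = """
<table class="data">
  <tr><th>Regio's</th><th>Mantelzorger</th></tr>
  <tr><td>Aalburg</td><td>13.2</td></tr>
  <tr><td>Aalten</td><td>13.7</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
rows = parser.rows
print(rows)
```

The trade-off compared with the .sps route is that the HTML already carries the labels in place of the codes, but the column types and the code-to-label mapping are lost.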

Another question to answer: are you going to download the files manually, or do you want to automate that step as well?

Licensed under: CC-BY-SA with attribution