Question

I would like to extract structured data from free-text descriptions in a rather big table (about 350,000 observations). What strategy would you recommend?

Let's say that I have the following table:

|ID |                           Description                          | 
|12 | Mr A is thirty-five years old and works as an accountant in ...|
|34 | Mr B, 24 and has set up a retail business since 2004.          |
|55 | Mr C aged 58, lives in town A and has a hardware shop ...      |

...

I would like to get the age, the town and the profession from each observation (when the data is available).

I started with SAS, using Perl-style regular expressions. I spent a lot of time building the regular expressions to capture the data, but it worked rather well. I'm aware that regular expressions might not be the best strategy, but I would like to capture most of the data automatically as the number of observations grows.


Solution

I see two issues here: first, extracting structured data; second, presenting it graphically. I would start with the first.

I don't think the following is an exact solution, and it will not win any algorithm award; for 350,000 rows it may take a few nights to run. But if you would like to try this path, it may give you a few hints. (As some have mentioned, this may be a very bumpy path, or even a dead end.)

Add a few columns to the table, then use the DBI module to iterate over the rows, with a separate function that tries to guess each parameter.

See e.g. PerlMonks for some efficient ways to do the database updates.

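# meta code alert: the DSN, credentials, table and column names are placeholders
use DBI;

my ($dsn, $user, $password) = ('dbi:...', '...', '...');    # fill in your own
my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });
my $sth = $dbh->prepare("SELECT ID, THETEXT FROM ATABLE");
$sth->execute();
while (my $row = $sth->fetchrow_hashref) {
    # the hash key must match the selected column name
    my $age = guess_age($row->{THETEXT});
    if ($age > 0) {
        ...;    # update the database row here
    }
}
# end meta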

sub guess_age {
    my $text = shift;
    my $age  = -1;    # flag: not found

    # look for a run of number words joined by '-' or whitespace; the run must
    # start with a number word, otherwise a lone space would already match.
    # The word list is elided ('...') and has to be completed before use.
    if ($text =~ /\b((?:one|two|three|...|ninety)(?:[-\s]+(?:one|two|three|...|ninety))*)\b/i) {
        # placeholder: some function that turns number words into a number
        $age = some_number_from_text_function($1);
    # see if we have a prefix word in front of a number
    } elsif ($text =~ /\baged?\s*(\d+)/i) {
        $age = $1;
    # see if we have a postfix word after a number
    } elsif ($text =~ /(\d+)\s*(?:old|of age|years)/i) {
        $age = $1;
    # see if we have a comma followed by a number early in the sentence
    # (the match has to start before position 50 in the text)
    } elsif ($text =~ /,\s*(\d+)/ && $-[0] < 50) {
        $age = $1;
    }
    # ... further heuristics can be added as extra elsif branches
    return $age;
}
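
The guess_age idea above becomes runnable once the number-word list is spelled out. What follows is only a sketch under my own assumptions, not a tested solution for the real data: the %SMALL and %TENS tables are hypothetical helpers, spelled-out ages are limited to 1-99, and digit ages are limited to one to three digits so that years such as 2004 are skipped.

```perl
use strict;
use warnings;

# Hypothetical lookup tables for spelled-out ages (assumption, not from the post).
my %SMALL = (
    one => 1, two => 2, three => 3, four => 4, five => 5, six => 6,
    seven => 7, eight => 8, nine => 9, ten => 10, eleven => 11,
    twelve => 12, thirteen => 13, fourteen => 14, fifteen => 15,
    sixteen => 16, seventeen => 17, eighteen => 18, nineteen => 19,
);
my %TENS = (
    twenty => 20, thirty => 30, forty => 40, fifty => 50,
    sixty => 60, seventy => 70, eighty => 80, ninety => 90,
);
my $small_re = join '|', keys %SMALL;
my $ones_re  = join '|', grep { $SMALL{$_} < 10 } keys %SMALL;
my $tens_re  = join '|', keys %TENS;

# Returns the guessed age, or -1 when none of the heuristics matches.
sub guess_age {
    my ($text) = @_;

    # 1. spelled-out number: "thirty-five", "forty" or "nineteen"
    if ($text =~ /\b(?:($tens_re)(?:[-\s]+($ones_re))?|($small_re))\b/i) {
        return $SMALL{ lc($3) } if defined $3;
        return $TENS{ lc($1) } + (defined $2 ? $SMALL{ lc($2) } : 0);
    }
    # 2. prefix word in front of a number: "aged 58", "age 40"
    return $1 if $text =~ /\baged?\s*(\d{1,3})\b/i;
    # 3. postfix word after a number: "35 years", "35 old"
    return $1 if $text =~ /\b(\d{1,3})\s*(?:years|of age|old)\b/i;
    # 4. comma followed by a number, starting before position 50: "Mr B, 24 and ..."
    return $1 if $text =~ /,\s*(\d{1,3})\b/ && $-[0] < 50;

    return -1;    # flag: not found
}
```

On the three sample rows from the question this returns 35, 24 and 58. Note that step 1 fires on any number word, so a text like "has two shops" would be read as age 2; requiring a following "years" would be one way to tighten it.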

But again, this may be a dead end...

For the town, I guess any unexpected capitalization may be something to look for, e.g. /[a-z]\W([A-Z]\w+)/ (i.e. a lowercase letter followed by a non-word character, followed by a capital letter plus one or more word characters). For the profession I am really out of clues. Maybe do a word match against a big hash with many professions?
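
A minimal sketch of both guesses, to make them concrete. The function names and the tiny %PROFESSION hash are hypothetical; a real list would have to be much larger and would probably need to handle multi-word professions as well.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny list of professions (assumption).
my %PROFESSION = map { $_ => 1 } qw(accountant retailer plumber teacher);

# Return the first capitalized word that follows a lowercase letter and a
# non-word character, or undef when there is none.
sub guess_town {
    my ($text) = @_;
    return $text =~ /[a-z]\W([A-Z]\w+)/ ? $1 : undef;
}

# Return the first word of the text that is a key of %PROFESSION, or undef.
sub guess_profession {
    my ($text) = @_;
    my @words = map { lc $_ } $text =~ /([A-Za-z]+)/g;
    for my $word (@words) {
        return $word if $PROFESSION{$word};
    }
    return undef;
}
```

For a made-up row such as "Mr D, 41, lives in Lyon and works as a plumber" this gives "Lyon" and "plumber". The capitalization rule is crude: it needs at least two characters, so the placeholder "town A" is not found, and it would happily pick up the surname in "Mr Smith" as a town.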

Licensed under: CC-BY-SA with attribution