Help: Extracting data tuples from text… Regex or Machine learning?

https://stackoverflow.com/questions/6323740

27-10-2019

Question

I would really appreciate your thoughts on the best approach to the following problem. To give an idea, I am using a car classified listing example, which is similar in nature to my actual data.

Problem: Extract a data tuple from the given text.

Here are some characteristics of the data.

The vocabulary (words) in the text is limited to a specific domain. Let's assume 100-200 words at the most.
The text to be parsed is a headline, like the car ad data shown below. Each record corresponds to one tuple (row).
Some attributes may be missing. For example, in raw data row #5 below, the year is missing.
Some words go together (bigrams), like "low miles".
Historical data available = 10,000 records
Incoming New Data volume = 1000-1500 records / week

The expected output should be in the form (Year, Make, Model, Feature), so the output should look like:

1 -> (2009, Ford, Fusion, SE)
2 -> (1997, Ford, Taurus, Wagon)
3 -> (2000, Mitsubishi, Mirage, DE)
4 -> (2007, Ford, Expedition, EL Limited)
5 -> ( , Honda, Accord, EX)
....
....

Raw Headline Data:

1 -> 2009 Ford Fusion SE - $7000
2 -> 1997 Ford Taurus Wagon - $800 (san jose east)
3 -> '00 Mitsubishi Mirage DE - $2499 (saratoga) pic
4 -> 2007 Ford Expedition EL Limited - $7800 (x)
5 -> Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic
6 -> 2004 HONDA ODASSEY LX 68K MILES - $10800 (danville / san ramon)
7 -> 93 LINCOLN MARK - $2000 (oakland east) pic
8 -> #######2006 LEXUS GS 430 BLACK ON BLACK 114KMI ####### - $19700 (san rafael) pic
9 -> 2004 Audi A4 1.8T FWD - $8900 (Sacramento) pic
10 -> #######2003 GMC C2500 HD EX-CAB 6.0 V8 EFI WHITE 4X4 ####### - $10575 (san rafael) pic
11 -> 1990 Toyota Corolla RUNS GOOD! GAS SAVER! 5SPEED CLEAN! REG 2011 O.B.O - $1600 (hayward / castro valley) pic img
12 -> HONDA ACCORD EX 2000 - $4900 (dublin / pleasanton / livermore) pic
13 -> 2009 Chevy Silverado LT Crew Cab - $23900 (dublin / pleasanton / livermore) pic
14 -> 2010 Acura TSX - V6 - TECH - $29900 (dublin / pleasanton / livermore) pic
15 -> 2003 Nissan Altima - $1830 (SF) pic

Possible choices:

A machine learning Text Classifier (Naive Bayes etc)
Regex

What I am trying to figure out is whether regex is too complicated for the job and whether a text classifier is overkill.

If the choice is to go with a text classifier, which would you consider the easiest to implement?

Thanks in advance for your kind help.

Solution

This is a well-studied problem called information extraction. What you want to do is not straightforward, and it is not as simple as you make it sound (i.e. machine learning is not overkill). There are several techniques; you should read an overview of the research area.

OTHER TIPS

Check this IE library for writing extraction rules; I think it will work best for your problem. There is also an example of how to create fast dictionary matching.
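To make the dictionary-matching idea concrete, here is a minimal sketch in plain Python. It does not use the library mentioned above. The names MAKES, MODELS, normalize_year and extract are hypothetical, the dictionaries are tiny hand-written samples (in practice they would be built from the historical records), and the cutoff for two-digit years is an assumption. It fills in only year, make and model; the feature field would need its own dictionary of trim levels.

```python
import re

# Hypothetical, hand-built dictionaries; a real one would be derived
# from the 10,000 historical records.
MAKES = {"ford": "Ford", "honda": "Honda", "mitsubishi": "Mitsubishi"}
MODELS = {"fusion": "Fusion", "taurus": "Taurus", "mirage": "Mirage",
          "accord": "Accord", "expedition": "Expedition"}

# Matches 2009, 93 and '00 (optional leading apostrophe).
YEAR_RE = re.compile(r"^'?(\d{2}|\d{4})$")


def normalize_year(token):
    """Return a four-digit year for tokens like 2009, '00 or 93, else None."""
    match = YEAR_RE.match(token)
    if not match:
        return None
    digits = match.group(1)
    if len(digits) == 4:
        year = int(digits)
        return year if 1900 <= year <= 2100 else None
    # Assumption: two-digit years above 30 belong to the 1900s.
    value = int(digits)
    return 1900 + value if value > 30 else 2000 + value


def extract(headline):
    """Return (year, make, model) with None for any field not found."""
    # Drop the price and everything after it, e.g. " - $7000 (san jose)".
    text = headline.split(" - $")[0]
    year = make = model = None
    for token in text.split():
        key = token.lower()
        if make is None and key in MAKES:
            make = MAKES[key]
        elif model is None and key in MODELS:
            model = MODELS[key]
        elif year is None:
            year = normalize_year(token)
    return (year, make, model)


print(extract("2009 Ford Fusion SE - $7000"))
print(extract("'00 Mitsubishi Mirage DE - $2499 (saratoga) pic"))
print(extract("Honda Accord ex low miles - $2800 (dublin / pleasanton / livermore) pic"))
```

Because lookups are case-insensitive and do not depend on token order, this handles headlines such as "HONDA ACCORD EX 2000". Misspellings such as "ODASSEY" would need fuzzy matching or extra dictionary entries.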

I think that the ARX or Phoebus systems may suit your needs if you already have annotated data and a list of words associated with each field. Their approach is a mix of information extraction and information integration.

There are a few good entity recognition libraries. Have you taken a look at Apache OpenNLP?

For a user looking for a specific model of car, the task is easier. I'm pretty sure I could classify, say, most Ford Rangers with a regexp, since I know what to look for.

I think your best bet is to write a function for each car model with type String -> Maybe Tuple. Then run all these on each input and throw away those inputs resulting in zero or too many tuples.
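A rough sketch of that idea, with Python's Optional standing in for Maybe. The names make_extractor, EXTRACTORS and classify are hypothetical, and the pattern makes simplifying assumptions: a four-digit year directly before the make, and a one-word feature directly after the model.

```python
import re
from typing import Callable, List, Optional, Tuple

CarTuple = Tuple[Optional[int], str, str, str]
Extractor = Callable[[str], Optional[CarTuple]]


def make_extractor(make: str, model: str) -> Extractor:
    """Build one extractor (String -> Maybe Tuple) for a make/model pair."""
    pattern = re.compile(
        r"(?:(?P<year>(?:19|20)\d{2})\s+)?"      # optional four-digit year
        + re.escape(make) + r"\s+" + re.escape(model)
        + r"(?:\s+(?P<feature>[A-Za-z]+))?",    # optional one-word feature
        re.IGNORECASE,
    )

    def extractor(headline: str) -> Optional[CarTuple]:
        match = pattern.search(headline)
        if match is None:
            return None
        year = int(match.group("year")) if match.group("year") else None
        # Assumption: features are normalized to upper case.
        feature = (match.group("feature") or "").upper()
        return (year, make, model, feature)

    return extractor


# One extractor per car model; only a few are listed for illustration.
EXTRACTORS: List[Extractor] = [
    make_extractor("Ford", "Fusion"),
    make_extractor("Ford", "Taurus"),
    make_extractor("Honda", "Accord"),
]


def classify(headline: str) -> Optional[CarTuple]:
    """Keep a headline only if exactly one extractor accepts it."""
    results = [r for r in (f(headline) for f in EXTRACTORS) if r is not None]
    return results[0] if len(results) == 1 else None


print(classify("2009 Ford Fusion SE - $7000"))
print(classify("Honda Accord ex low miles - $2800 (dublin) pic"))
print(classify("2003 Nissan Altima - $1830 (SF) pic"))
```

Headlines that no extractor accepts, or that more than one accepts, come back as None and can be set aside for manual review. Years written as '00 or placed after the model are not picked up by this pattern.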

You should use a tool like Amazon Mechanical Turk for this: human microtasking. Another alternative is to hire a data entry freelancer; Upwork is a great place to look. You can get excellent quality results either way, and the cost of each is very reasonable.

Licensed under: CC-BY-SA with attribution
