Question

This is a case of me wanting to search for something online but not knowing what it's called.

I have a collection of job descriptions in text files, some only a sentence or two long, most a paragraph or two. I want to write a script that, given a set of rules, will notify me when it finds a job description I would want.

For example, let's say I am looking for a job in PHP programming, but not a full-time position and not a designing position. So my "rule book" could be:

want: PHP
want: web programming
want: telecommuting
do not want: designing
do not want: full-time position

What is a method I could use to sort these files into a "pass" (descriptions that match what I'm looking for) and a "fail" (descriptions that are not relevant)? Some ideas I was considering:

  • Count the occurrences of the phrases in the text file that are also in my "rule book", and reject descriptions that contain phrases I do not want. This doesn't always work, though, because what if a description says "web designing not required"? Then my algorithm would say "that contains the word designing, so it is not relevant" when it really was relevant!
  • When searching the text for phrases that I do and do not want, count phrases within a certain Levenshtein distance as the same phrase. For example, designing and design should be treated the same way, as well as misspellings of words, such as programing. (A rough sketch of these first two ideas appears after this list.)
  • I have a large collection of descriptions that I have looked through manually. Is there a way I could "teach" the program "these are examples of good descriptions, these are examples of bad ones"?
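To make ideas 1 and 2 concrete, here is a rough Python sketch of what I have in mind. The phrase lists and the distance threshold are just placeholders I made up:

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance.
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,               # deletion
                                curr[j - 1] + 1,           # insertion
                                prev[j - 1] + (ca != cb))) # substitution
            prev = curr
        return prev[-1]

    WANT = ["php", "web programming", "telecommuting"]
    AVOID = ["designing", "full-time position"]

    def fuzzy_contains(text, phrase, max_dist=2):
        # Slide a window with the same word count as the phrase over the
        # text; call it a match if the window is within max_dist edits.
        # A fixed threshold is crude: "designing" vs "design" is already
        # distance 3, so this would need tuning per phrase.
        words = text.lower().split()
        n = len(phrase.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if levenshtein(window, phrase.lower()) <= max_dist:
                return True
        return False

    def passes(description):
        # Idea 1: require at least one wanted phrase and no unwanted ones.
        wanted = any(fuzzy_contains(description, p) for p in WANT)
        unwanted = any(fuzzy_contains(description, p) for p in AVOID)
        return wanted and not unwanted

As my first bullet point says, this still rejects a description like "web designing not required", which is why idea 3 appeals to me.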

Does anyone know what this "filtering process" is called, and/or have any advice or methods on how I can accomplish this?

Solution

You basically have a text classification or document classification problem. This is a specific case of binary classification, which is itself a specific case of supervised learning. It's a well-studied problem, and there are many tools to do it. Basically, you give a set of good documents and bad documents to a learning or training process, which finds words that correlate strongly with the positive and negative documents and outputs a function capable of classifying unseen documents as positive or not. Naive Bayes is the simplest learning algorithm for this kind of task, and it will do a decent job. There are fancier algorithms like Logistic Regression and Support Vector Machines which will probably do somewhat better, but they are more complicated.
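For illustration, here is a minimal sketch of the Naive Bayes approach using scikit-learn. The training documents and labels below are made up; in practice you would load the collection you already sorted by hand:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy stand-ins for descriptions sorted manually (1 = want, 0 = don't want).
    train_docs = [
        "PHP web programming, telecommuting welcome",
        "Remote PHP developer needed for a small web shop",
        "Full-time web designing position, on-site only",
        "Senior designer wanted, full-time position",
    ]
    train_labels = [1, 1, 0, 0]

    # Turn each document into a vector of word counts.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)

    # Train the classifier on the labeled vectors.
    classifier = MultinomialNB()
    classifier.fit(X, train_labels)

    # Classify an unseen description: 1 means "pass", 0 means "fail".
    new_doc = ["Part-time PHP web programmer, telecommuting OK"]
    print(classifier.predict(vectorizer.transform(new_doc)))

With a real training set of hand-sorted descriptions, this also handles cases like "web designing not required" more gracefully than keyword rules, because the classifier weighs the evidence from all the words in a document rather than vetoing on a single phrase.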

To determine which word variants are actually equivalent to each other, you want to do some kind of stemming. The Porter stemmer is a common choice here.
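For example, with NLTK's implementation of the Porter stemmer (assuming NLTK is installed), using the words from your question:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["design", "designing", "designs", "programming", "programing"]:
        print(word, "->", stemmer.stem(word))
    # All five should reduce to "design" or "program", so the variant
    # "designing" and the misspelled "programing" map to the same stems.

Stemming combines naturally with the classifier above: scikit-learn's CountVectorizer accepts a custom preprocessor or analyzer, so you can stem each word before it is counted.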
