Question

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:

"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",

what's a light-weight/ easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun).

To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.

Was it helpful?

Solution

Use the NLTK, in particular chapter 7 on Information Extraction.

You say you want to extract meaning, and there are modules for semantic analysis, but I think IE is all you need--and honestly one of the only areas of NLP computers can handle right now.

See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramerez as a person, Dodgers as a sports organization, and Houston Astros as another sports organization, or whatever suits your domain) and Relationship Extraction. There is a NER chunker that you can plugin once you have the NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:

>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print nltk.ne_chunk(sent) 
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)

Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.

OTHER TIPS

You need to look at the Natural Language Toolkit, which is for exactly this sort of thing.

This section of the manual looks very relevant: Categorizing and Tagging Words - here's an extract:

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.

Natural Language Processing (NLP) is the name for parsing, well, natural language. Many algorithms and heuristics exist, and it's an active field of research. Whatever algorithm you will code, it will need to be trained on a corpus. Just like a human: we learn a language by reading text written by other people (and/or by listening to sentences uttered by other people).

In practical terms, have a look at the Natural Language Toolkit. For a theoretical underpinning of whatever you are going to code, you may want to check out Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze.

alt text
(source: stanford.edu)

Here is the book I stumbled upon recently: Natural Language Processing with Python

What you want is called NP (noun phrase) chunking, or extraction.

Some links here

As pointed out, this is very problem domain specific stuff. The more you can narrow it down, the more effective it will be. And you're going to have to train your program on your specific domain.

This is a really really complicated topic. Generally, this sort of stuff falls under the rubric of Natural Language Processing, and tends to be tricky at best. The difficulty of this sort of stuff is precisely why there still is no completely automated system for handling customer service and the like.

Generally, the approach to this stuff REALLY depends on precisely what your problem domain is. If you're able to winnow down the problem domain, you can gain some very serious benefits; to use your example, if you're able to determine that your problem domain is baseball, then that gives you a really strong head start. Even then, it's a LOT of work to get anything particularly useful going.

For what it's worth, yes, an existing corpus of words is going to be useful. More importantly, determining the functional complexity expected of the system is going to be critical; do you need to parse simple sentences, or is there a need for parsing complex behavior? Can you constrain the inputs to a relatively simple set?

Regular expressions can help in some scenario. Here is a detailed example: What’s the Most Mentioned Scanner on CNET Forum, which used a regular expression to find all mentioned scanners in CNET forum posts.

In the post, a regular expression as such was used:

(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))

in order to match either of the following:

  • two words, then model number (including all-in-one), then “scanner”
  • “scanner”, then one or two words, then model number (including all-in-one)

As a result, the text extracted from the post was like,

  1. discontinued HP C9900A photo scanner
  2. scanning his old x-rays
  3. new Epson V700 scanner
  4. HP ScanJet 4850 scanner
  5. Epson Perfection 3170 scanner

This regular expression solution worked in a way.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top