Question

Is there a best practice, algorithm, or software (open source with a permissive license is required) that can extract specific kinds of information from bodies of text? For example:

  • find all email addresses in a text
  • find all mentions of cities
  • find all mentions of states
  • find all urls
  • find all mentions of telephone numbers
  • find all mentions of zip codes

... with the ability to add more entity types later.

I heard RapidMiner should be able to do text mining like this, but AGPL is not an acceptable license for my purpose.

Is there anything 'standard' to do this kind of analysis?


Solution

Read about Named Entity Recognition (NER). You can try Apache OpenNLP or Apache UIMA, both of which are released under the permissive Apache License 2.0.

OTHER TIPS

For these entity types you can use a rule-based NER tool such as gexp.
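
To give an idea of what a rule-based approach looks like, here is a minimal sketch in plain Python using only the standard `re` module. It is not gexp's API or how any of the tools above work internally. The regular expressions are deliberately simplified (US-style phone numbers and zip codes are assumed), and the `GAZETTEER` lists of cities and states are hypothetical placeholders you would replace with a real list.

```python
import re

# Simplified, illustrative patterns (assumptions, not a complete grammar).
# Real-world emails, URLs and phone numbers vary far more than this.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # The final character class keeps trailing punctuation out of the match.
    "url": re.compile(r"https?://[^\s<>\"']+[^\s<>\"'.,;:!?)]"),
    # US-style numbers such as (555) 123-4567 or 555-123-4567.
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    # US zip codes, with optional +4 extension.
    "zipcode": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

# Hypothetical gazetteer: cities and states are matched by dictionary lookup,
# because they cannot be described by a character pattern.
GAZETTEER = {
    "city": ["Austin", "Boston", "San Francisco"],
    "state": ["Texas", "California", "Massachusetts"],
}


def gazetteer_pattern(names):
    """Build one regex that matches any of the given names as whole words."""
    # Longest names first, so a longer name wins over a shorter prefix of it.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def extract_entities(text):
    """Return a dict mapping each entity type to the list of matches found."""
    results = {}
    for label, pattern in PATTERNS.items():
        results[label] = pattern.findall(text)
    for label, names in GAZETTEER.items():
        results[label] = gazetteer_pattern(names).findall(text)
    return results


if __name__ == "__main__":
    sample = (
        "Contact jane.doe@example.com or call (555) 123-4567. "
        "Our office is in Austin, Texas 78701. "
        "See https://example.com/docs for details."
    )
    for label, matches in extract_entities(sample).items():
        print(label, matches)
```

Adding another entity type means adding one more entry to `PATTERNS` or `GAZETTEER`. Pattern-like entities (emails, URLs, phone numbers, zip codes) suit regular expressions, while cities and states usually need a word list or a trained NER model, since a name like "Austin" can also be a person.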

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow