Question

Is there a best practice, algorithm, or piece of software (open source under a permissive license is required) that can extract information from bodies of text? I'm referring to tasks like:

  • find all email addresses in a text
  • find all mentions of cities
  • find all mentions of states
  • find all urls
  • find all mentions of telephone numbers
  • find all mentions of zip codes

... with the ability to add more.

I heard RapidMiner should be able to do text mining like this, but AGPL is not an acceptable license for my purpose.

Is there anything 'standard' to do this kind of analysis?


Solution

Read about Named Entity Recognition (NER). You can try Apache OpenNLP or Apache UIMA, both of which are released under the, well, Apache License.

Other tips

For entity types like these, you can use a rule-based NER tool such as gexp.
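To give a feel for what "rule-based" means here, the following is a minimal sketch using only Python's standard `re` module. It is not gexp's API. The patterns, the `PATTERNS` table, and the `gazetteer_pattern` and `extract_entities` helpers are all hypothetical names chosen for illustration, and the regular expressions are deliberately simple (US-style phone numbers and zip codes, a tiny made-up city list), so expect both false positives and misses on real text.

```python
import re


def gazetteer_pattern(names):
    """Build a regex that matches any name from a fixed list (a "gazetteer")."""
    # Longest names first so that a longer name wins over a shorter prefix.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


# Illustrative patterns only; each one is an assumption, not a standard.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "us_phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    # Cities and states are easiest to handle with a lookup list.
    "city": gazetteer_pattern(["Springfield", "New York", "Boston"]),
}


def extract_entities(text, patterns=PATTERNS):
    """Return a dict mapping each entity label to the list of matches found."""
    return {label: pattern.findall(text) for label, pattern in patterns.items()}


if __name__ == "__main__":
    sample = (
        "Contact alice@example.com or visit https://example.org/about for details. "
        "Call (555) 123-4567, Springfield, IL 62704."
    )
    for label, matches in extract_entities(sample).items():
        print(label, matches)
```

Adding a new entity type is just another entry in the pattern table, which covers the "ability to add more" requirement for well-structured items. Free-form entities such as arbitrary city names in running text are where a statistical NER model tends to do better than hand-written rules.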

License: CC-BY-SA with attribution
Not affiliated with StackOverflow