Question

Is there a best practice, algorithm, or software (open source with a permissive license is required) that can extract information from bodies of text? I'm referring to tasks such as:

  • find all email addresses in a text
  • find all mentions of cities
  • find all mentions of states
  • find all urls
  • find all mentions of telephone numbers
  • find all mentions of ZIP codes

... with the ability to add more entity types later.

I heard RapidMiner should be able to do text mining like this, but AGPL is not an acceptable license for my purpose.

Is there anything 'standard' to do this kind of analysis?


Solution

Read up on Named Entity Recognition (NER). You can try Apache OpenNLP or Apache UIMA, both of which are released under the permissive Apache License 2.0.

Other tips

For entity types like these, you can use a rule-based NER tool such as gexp.
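
To give a rough idea of what the rule-based approach looks like, here is a minimal sketch in plain Python using only the standard `re` module. It is not how gexp or OpenNLP work internally. The patterns, the tiny city/state lists, and the names `PATTERNS`, `gazetteer_pattern` and `extract_entities` are all assumptions made up for illustration, and the phone and ZIP patterns only cover simple US-style formats.

```python
import re


def gazetteer_pattern(names):
    """Build a whole-word pattern from a list of known names (a gazetteer)."""
    # Longest names first, so a longer name wins over a shorter prefix of it.
    alternatives = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in alternatives) + r")\b")


# Illustrative rules only: each pattern is deliberately simple and will
# miss edge cases that a real NER tool or a stricter grammar would handle.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "us_phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    # Hypothetical, tiny gazetteers; real ones would be loaded from a data file.
    "city": gazetteer_pattern(["Austin", "Boston", "New York"]),
    "state": gazetteer_pattern(["Texas", "Massachusetts"]),
}


def extract_entities(text, patterns=PATTERNS):
    """Return a dict mapping each entity label to the list of matches found."""
    return {label: pattern.findall(text) for label, pattern in patterns.items()}


sample_text = (
    "Contact jane.doe@example.com or call 555-123-4567. "
    "Office: Austin, Texas 78701. See https://example.com/contact for details."
)
found = extract_entities(sample_text)
for label, matches in found.items():
    print(label, matches)
```

Adding a new entity type is then just a matter of adding one more entry to the dictionary. Regular expressions work reasonably well for emails, URLs, phone numbers and ZIP codes, while cities and states usually need a lookup list or a trained NER model, since their names follow no fixed format.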

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow