Question

I am looking to extract specific items from a large pool of unstructured documents. Each document is 1-5 pages of text, formatted in various ways by the user, but in most cases contains at least:

  • Name
  • Address (physical)
  • Email Address
  • Phone number
  • Website URL

I'm looking for a semantic parser that can attempt to extract these elements from the documents so that I can load that information into a relational database and work with these records as contacts.

Other services I've looked at, while valuable for other purposes, do not address this specific need.

Any thoughts, suggestions or leads?

Solution

Have you found a lead on this yet? I found some research articles:

www.cis.upenn.edu/~pereira/papers/crf.pdf

citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.9192&rep=rep1&type=pdf

www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04extracting.pdf

However, I found no specific code examples implementing any of these ideas.

Take a look at this too: stackoverflow.com/questions/953150/general-address-parser-for-freeform-text

(Sorry, I left out the http prefix; this system does not allow me to post more than one URL/link.)
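
For the fields in the question that have a fairly regular shape (email address, phone number, website URL), a plain regular-expression pass may be a reasonable first baseline. The sketch below is not the CRF approach from the articles above; it is a much simpler stand-in. The function name `extract_contact_fields` and all three patterns are my own assumptions, and the phone pattern only covers North American style numbers. Names and physical addresses are too irregular for this and would likely need a statistical tagger like the ones those articles describe.

```python
import re

# Assumed, deliberately simple patterns; real documents will need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>()]+", re.IGNORECASE)
# Assumption: North American style numbers such as (555) 123-4567.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def extract_contact_fields(text):
    """Return candidate emails, phones, and URLs found in free-form text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        # Drop trailing punctuation that often follows a URL in prose.
        "urls": [u.rstrip(".,;") for u in URL_RE.findall(text)],
    }


if __name__ == "__main__":
    sample = (
        "Jane Doe\n123 Main St\n"
        "jane.doe@example.com\n(555) 123-4567\nwww.example.com."
    )
    print(extract_contact_fields(sample))
```

The results are only candidates: each list can hold several matches per document, so some rule for choosing among them (for example, the first match) would still be needed before loading rows into a database.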

Licensed under: CC-BY-SA with attribution