Extracting Demographic and Contact Information from unstructured text files

https://stackoverflow.com/questions/2946875

05-10-2019
Question

I am looking to extract specific items out of a large pool of unstructured documents. These documents could be 1-5 pages of text formatted in various ways by the user, but in most cases would contain at least:

Name
Address (physical)
Email Address
Phone number
Website URL

I'm looking for a semantic parser that can attempt to extract these elements from the documents so that I can load that information into a relational database and work with these records as contacts.

The other services I've looked at, while valuable for other purposes, do not address this specific need.

Any thoughts, suggestions or leads?

Solution

Have you found a lead on this yet? I found some research papers on the topic:

www.cis.upenn.edu/~pereira/papers/crf.pdf

citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.9192&rep=rep1&type=pdf

www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04extracting.pdf

However, I found no specific code examples implementing any of these ideas.

Take a look at this too: stackoverflow.com/questions/953150/general-address-parser-for-freeform-text

(Sorry, I left off the http prefix; this system does not allow me to post more than one URL/link.)
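
For the fields that follow a recognizable pattern (email address, phone number, website URL), plain regular expressions may serve as a first pass before turning to the statistical models in the papers above. The sketch below is not the CRF approach those papers describe, just a simple baseline. The function name `extract_contact_fields` and the sample text are made up for illustration, and the phone pattern assumes North American style numbers. Names and physical addresses are not covered, since they rarely follow a fixed pattern.

```python
import re

# Deliberately simple patterns; real documents will likely need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>]+", re.IGNORECASE)
# Assumption: North American style numbers, e.g. (555) 123-4567 or 555.123.4567
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def extract_contact_fields(text):
    """Return pattern-like contact fields found in text, in order of appearance."""
    emails = EMAIL_RE.findall(text)
    # Strip punctuation that commonly trails a URL at the end of a sentence.
    urls = [u.rstrip(".,;)") for u in URL_RE.findall(text)]
    phones = PHONE_RE.findall(text)
    return {"emails": emails, "urls": urls, "phones": phones}


# Hypothetical sample document.
sample = """Jane Doe
123 Main St, Springfield, IL 62701
Email: jane.doe@example.com
Phone: (555) 123-4567
Web: www.example.com/contact."""

fields = extract_contact_fields(sample)
print(fields)
```

Each returned list could then be mapped to columns of a contacts table. Documents with several people per page would need extra logic to group the fields by contact.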

Licensed under: CC-BY-SA with attribution
