Extracting Demographic and Contact Information from Unstructured Text Files
05-10-2019
Question
I am looking to extract specific items out of a large pool of unstructured documents. These documents could be 1-5 pages of text formatted in various ways by the user, but in most cases would contain at least:
- Name
- Address (physical)
- Email Address
- Phone number
- Website URL
I'm looking for a semantic parser that can attempt to extract these elements from the documents so that I can load that information into a relational database and work with these records as contacts.
Other services I've looked at, while valuable for other purposes, do not address this specific need.
Any thoughts, suggestions or leads?
Solution
Have you found an answer to your question yet? I found some research articles:
www.cis.upenn.edu/~pereira/papers/crf.pdf
citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.9192&rep=rep1&type=pdf
www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04extracting.pdf
But I found no specific code examples implementing any of these ideas.
Take a look at this too: stackoverflow.com/questions/953150/general-address-parser-for-freeform-text
(Sorry, I left off the http; this system does not allow me to post more than one URL/link.)
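
For the more regular fields (email, phone, URL), plain regular expressions may get you part of the way before you need anything from the papers above. Here is a minimal sketch in Python. The patterns, the function name extract_contact_fields, and the sample text are my own assumptions, not from any of the linked sources; the phone pattern only covers North American style numbers, and names and physical addresses are too irregular for this approach, which is where the CRF-style methods come in.

```python
import re

# Hypothetical patterns for illustration only; real documents will need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>()]+", re.IGNORECASE)
# Assumes North American formatting, e.g. (555) 123-4567 or 555-123-4567.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)


def extract_contact_fields(text):
    """Return every email, phone number, and URL found in free-form text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        # Drop sentence punctuation that tends to trail a URL in prose.
        "urls": [url.rstrip(".,;") for url in URL_RE.findall(text)],
    }


sample = (
    "Jane Doe\n"
    "123 Main St, Springfield\n"
    "Email: jane.doe@example.com\n"
    "Phone: (555) 123-4567\n"
    "Web: www.example.com/about."
)

fields = extract_contact_fields(sample)
print(fields)
```

Each value is a list because a 1-5 page document can easily contain several of each item, so you would still have to decide which one belongs to the contact before loading it into a database table.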