Question

I want to parse a resume to extract its titles and content, including bullets, paragraphs, and URLs. The resume is in .doc/.docx format. My research so far has led to this plan:

1. Build an XML file from the .doc file, and then
2. Build an XML parser using JDOM.
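
As a rough illustration of those two steps (not the asker's actual code): a .docx file is a ZIP archive whose main body is stored in word/document.xml, so for .docx the XML already exists and only needs to be read. The sketch below uses the JDK's built-in DOM parser rather than JDOM, and assumes the XML has already been extracted into a string; the class and method names are made up for the example. A binary .doc file would need converting first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DocxParagraphs {
    // WordprocessingML namespace used by word/document.xml inside a .docx archive.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the plain text of each w:p paragraph, in document order. */
    static List<String> paragraphs(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList paras = doc.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paras.getLength(); i++) {
                // A paragraph's text is split across one or more w:t elements.
                NodeList texts = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    sb.append(texts.item(j).getTextContent());
                }
                result.add(sb.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        // Tiny hand-written sample standing in for a real word/document.xml.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Experience</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Built a </w:t></w:r><w:r><w:t>parser</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        paragraphs(xml).forEach(System.out::println);
    }
}
```

This only recovers paragraph text; bullets and hyperlinks are encoded in other elements and attributes that this sketch ignores.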

Is there another approach or a better way to do this? Is there an algorithm that would help identify the structure of a resume?

Solution 2

It looks like you are heading in the right direction. A simple approach: once you have identified a piece of information, use it as a reference point, traverse a calculated number of steps forward or backward (+/-) from it, and pick out the results.
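
One possible reading of that idea, as a minimal sketch: treat known section titles as anchors and walk forward from each one, collecting lines until the next anchor appears. The title list and the names `SectionSplitter` and `split` are assumptions made for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class SectionSplitter {
    // Hypothetical anchor titles; a real resume parser would need a far larger list.
    static final Set<String> TITLES =
            Set.of("summary", "experience", "education", "skills");

    /** Groups non-empty lines under the most recent title line that precedes them. */
    static Map<String, List<String>> split(List<String> lines) {
        Map<String, List<String>> sections = new LinkedHashMap<>();
        List<String> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no content
            }
            if (TITLES.contains(line.toLowerCase(Locale.ROOT))) {
                // Found an anchor: start (or resume) that section.
                current = sections.computeIfAbsent(line, k -> new ArrayList<>());
            } else if (current != null) {
                current.add(line);
            }
            // Lines before the first anchor (name, contact details) are skipped here.
        }
        return sections;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Jane Doe", "Experience", "- Built a parser", "", "Education", "BSc");
        split(lines).forEach((title, body) -> System.out.println(title + " -> " + body));
    }
}
```

Exact title matching is brittle; real resumes vary their headings, which is where proximity heuristics or NLP would come in.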

I assume you are using NLP techniques, which can help you extract data by proximity; you can then remove the noise based on your experience.

Or simply use an existing product. I recommend RChilli CV Parsing, or others such as HireAbility or Sovren; discuss your needs with them. I am sure you will get some useful information.

Thanks -K

Other tips

Interesting. I worked on a solution where we used Solr to identify entities.

Another approach: index the documents into Apache Solr and retrieve the results with a faceted search.
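
For illustration only, a faceted request against Solr's select handler might be built like this. The host, the core name `resumes`, and the field `section_title` are assumptions; `q`, `rows`, `facet`, and `facet.field` are standard Solr query parameters. The sketch only builds the URL and does not send the request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SolrFacetUrl {
    /** Builds a select URL that asks Solr for facet counts on one field. */
    static String facetUrl(String baseUrl, String core, String facetField) {
        return baseUrl + "/solr/" + core + "/select"
                + "?q=" + URLEncoder.encode("*:*", StandardCharsets.UTF_8) // match all documents
                + "&rows=0"                                               // counts only, no documents
                + "&facet=true"
                + "&facet.field=" + URLEncoder.encode(facetField, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical local Solr instance, core, and field name.
        System.out.println(facetUrl("http://localhost:8983", "resumes", "section_title"));
    }
}
```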

The only challenge is how to build the library. This will be much shorter and simpler than using Apache POI.

Let me know if you need any help.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow