Extract structured data from plain text [closed]
14-04-2021
Question
As input I have plain text (in my case, typically HTML) and a "grammar specification" (some way of describing how to extract structured data from that text). As output I need structured data (JSON is fine, but maybe something better exists?).
Are there any libraries for this task? What are good ways to write the "grammar spec"? What are the best approaches to this kind of problem?
Solution
Some tools for grammar-based transformations:
- TXL http://www.txl.ca/
- Stratego/XT http://strategoxt.org/
- ASF+SDF http://www.meta-environment.org/
Addition:
- FPP (http://jffp.sourceforge.net/) is a flat file parsing library in Java that can be useful
- If the input is only HTML, jsoup (http://jsoup.org/) is a Java HTML parser
- Other Java HTML parsers: http://htmlparser.sourceforge.net/, http://mozillaparser.sourceforge.net/ and http://jericho.htmlparser.net/docs/index.html
OTHER TIPS
To parse HTML you will need a DOM parser that is lenient enough to cope with the quality of the HTML you are given. You then apply your grammar spec to the parsed tree and map the results onto the data structure you want; there are libraries that do this for you.
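One way to picture the "parse to a DOM, then apply a spec" idea is to treat the spec as a map from field names to XPath expressions. The sketch below is only an illustration under stated assumptions: the class name `SpecExtractor`, the `extract` method and the sample fields are hypothetical, and it uses the JDK's strict XML parser, so it only works on well-formed markup. For real-world HTML a lenient parser such as jsoup would have to take over the parsing step.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: the "grammar spec" is a map of field name -> XPath expression.
class SpecExtractor {

    /** Parses the markup once, then evaluates every XPath in the spec against the DOM. */
    static Map<String, String> extract(String wellFormedMarkup, Map<String, String> spec) {
        try {
            // Assumption: the input is well-formed; the JDK parser rejects tag soup.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> field : spec.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match,
                // or an empty string when nothing matches.
                result.put(field.getKey(), xpath.evaluate(field.getValue(), doc).trim());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not apply spec to input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Widget 3000</title></head>"
                + "<body><span class=\"price\">19.99</span></body></html>";

        // Sample spec with made-up field names.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("title", "/html/head/title");
        spec.put("price", "//span[@class='price']");

        System.out.println(extract(page, spec));
    }
}
```

The resulting map is flat; a nested spec (for example, one XPath selecting rows and a sub-spec applied to each row) would be a natural extension, and the map can be handed to any JSON library for serialization.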
Have a look at jilapi.
It takes unstructured plain text as input and returns structured JSON.
Well, if the structure of the plain text files is well-formed, why not use the Java DOM API (or JDOM) combined with a DOCTYPE to create a DOM object? From there, you could iterate through that object and easily convert it to JSON, using something like the google-gson library.
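A minimal sketch of that DOM-then-JSON route, assuming well-formed input with no external DOCTYPE to resolve. To stay self-contained it writes the JSON by hand instead of using google-gson, and it keeps only tag names, text and child elements (attributes and CDATA sections are ignored). The class name `DomToJson` and the output shape (`tag`, `text`, `children`) are choices made for this example, not anything the libraries above prescribe.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: walks a DOM tree and emits one JSON object per element.
class DomToJson {

    /** Parses well-formed markup and returns a JSON string describing its element tree. */
    static String toJson(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            StringBuilder out = new StringBuilder();
            write(doc.getDocumentElement(), out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    // Recursively serializes an element as {"tag":..., "text":..., "children":[...]}.
    private static void write(Element element, StringBuilder out) {
        out.append("{\"tag\":").append(quote(element.getTagName()));

        StringBuilder text = new StringBuilder();
        StringBuilder children = new StringBuilder();
        NodeList nodes = element.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                text.append(node.getNodeValue());
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                if (children.length() > 0) {
                    children.append(',');
                }
                write((Element) node, children);
            }
        }

        // "text" and "children" are omitted when empty to keep the output compact.
        String trimmed = text.toString().trim();
        if (!trimmed.isEmpty()) {
            out.append(",\"text\":").append(quote(trimmed));
        }
        if (children.length() > 0) {
            out.append(",\"children\":[").append(children).append(']');
        }
        out.append('}');
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("<ul><li>first</li><li>second</li></ul>"));
    }
}
```

In a real project a JSON library would replace the hand-rolled `quote` and `write` methods; they are spelled out here only so the example runs on a bare JDK.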