Question

As input I have plain text (in my case typically HTML) and a "grammar specification" (some way of describing how to extract structured data from that text). As output I need structured data (JSON is fine, but maybe something better exists?).

Are there any libraries for this task? What are good ways to write such a "grammar spec"? What are the best approaches to this problem?

Solution

Some tools for grammar based transformations:

Addition:

OTHER TIPS

Take a look at jsoup for HTML parsing and gson for Java-to-JSON conversion.

To parse HTML you will need a DOM parser that is lenient enough for the quality of the HTML you receive. You then apply your grammar spec to the parsed document and map the result onto the data structure you want; there are libraries that do this for you.
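One way to picture the "grammar spec" step is as a map from output field names to XPath expressions that are evaluated against the parsed document. The sketch below is only an illustration under several assumptions: the class name SpecExtractor, the field names, and the sample markup are all hypothetical, and it uses the JDK's built-in XML parser, which accepts only well-formed markup. For real-world HTML a lenient parser such as jsoup would take the place of the parsing step.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: the "spec" is a map of field name -> XPath expression.
final class SpecExtractor {

    /** Evaluates every XPath rule in the spec and returns field name -> trimmed text. */
    static Map<String, String> extract(String markup, Map<String, String> spec) {
        try {
            // The JDK parser is strict, so the markup must be well-formed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : spec.entrySet()) {
                // evaluate(expr, node) returns the string value of the first match.
                result.put(rule.getKey(), xpath.evaluate(rule.getValue(), doc).trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup or evaluate spec", e);
        }
    }

    public static void main(String[] args) {
        String markup =
                "<html><body><h1>Widget</h1><span class=\"price\">9.99</span></body></html>";
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("title", "//h1");
        spec.put("price", "//span[@class='price']");
        System.out.println(extract(markup, spec)); // prints {title=Widget, price=9.99}
    }
}
```

The resulting map can then be handed to a JSON library; keeping the spec as plain data means it could also be loaded from a configuration file rather than hard-coded.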

Have a look at jilapi

This takes in unstructured plain text and gives out structured JSON.

If the plain text files are well-formed, why not use the Java DOM API (or JDOM) combined with a DOCTYPE to create a DOM object? From there, you could iterate through that object and easily convert it to JSON, using something like the google-gson library.
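A minimal sketch of that idea, assuming the input is well-formed and has no DOCTYPE that points at an external DTD. The class name DomToJson and the output shape (tag, text, children) are hypothetical choices made for this illustration. To stay free of dependencies it writes the JSON by hand; with google-gson one would more likely build a Map or a small class per element and call new Gson().toJson(...) on it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: walks a DOM tree and emits one JSON object per element.
final class DomToJson {

    /** Parses well-formed markup and returns a JSON string describing the element tree. */
    static String toJson(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            write(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    // Appends {"tag":...,"text":...,"children":[...]} for the element, recursively.
    private static void write(Element el, StringBuilder out) {
        StringBuilder text = new StringBuilder();     // text directly inside this element
        StringBuilder children = new StringBuilder(); // serialized child elements
        NodeList nodes = el.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                if (children.length() > 0) children.append(',');
                write((Element) n, children);
            } else if (n.getNodeType() == Node.TEXT_NODE) {
                text.append(n.getNodeValue());
            }
        }
        out.append("{\"tag\":").append(quote(el.getTagName()))
           .append(",\"text\":").append(quote(text.toString().trim()))
           .append(",\"children\":[").append(children).append("]}");
    }

    // Wraps a string in double quotes and escapes characters JSON does not allow raw.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("<ul><li>a</li><li>b</li></ul>"));
    }
}
```

This version ignores attributes and CDATA sections; both are available on the DOM nodes and could be added to the output object if they carry data you need.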

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow