There is a DSL format for creating and distributing dictionaries. Every dictionary article in this format looks like this:

algorithm
    [m0][b]al·go·rithm[/b] {{id=000001018}} [c rosybrown]\[[/c][c darkslategray][b]algorithm[/b][/c] [c darkslategray][b]algorithms[/b][/c][c rosybrown]\][/c] [p]BrE[/p] [c darkgray] [/c][c darkcyan]\[ˈælɡərɪðəm\][/c] [s]z_algorithm__gb_1.wav[/s] [p]NAmE[/p] [c darkgray] [/c][c darkcyan]\[ˈælɡərɪðəm\][/c] [s]z_algorithm__us_1.wav[/s] [c orange] noun[/c] [c darkgray] ([/c][c green]computing[/c][c darkgray])[/c]
    [m1]{{d}}a set of rules that must be followed when solving a particular problem{{/d}} [m3] 
    {{Word Origin}}[m3][c darkslategray][u]Word Origin:[/u][/c]
    [m3][c darkgray] [/c]{{d}}late 17th cent.{{/d}} [c dimgray]{{etymology}} (denoting the Arabic or decimal notation of numbers): variant (influenced by {{/etymology}} [/c][c darkslategray]{{lang}}Greek{{/lang}} [/c][c darkgray] [/c][c darkcyan]{{ff}}arithmos{{/ff}} [/c][c darkgray] [/c][c darkslateblue][b]{{etym_tr}}‘number’{{/etym_tr}}[/b][/c][c dimgray]{{etymology}}) of {{/etymology}} [/c][c darkslategray]{{lang}}Middle English{{/lang}} [/c][c darkgray] [/c][c darkslategray]{{etym_i}}algorism{{/etym_i}}[/c][c dimgray]{{etymology}}, via {{/etymology}} [/c][c darkslategray]{{lang}}Old French{{/lang}} [/c][c dimgray]{{etymology}} from {{/etymology}} [/c][c darkslategray]{{lang}}medieval Latin{{/lang}} [/c][c darkgray] [/c][c darkcyan]{{ff}}algorismus{{/ff}}[/c][c dimgray]{{etymology}}. The {{/etymology}} [/c][c darkslategray]{{lang}}Arabic{{/lang}} [/c][c dimgray]{{etymology}} source, {{/etymology}} [/c][c darkcyan]{{ff}}al-K̲wārizmī{{/ff}} [/c][c darkgray] [/c][c darkslateblue][b]{{etym_tr}}‘the man of K̲wārizm’{{/etym_tr}} [/b][/c][c dimgray]{{etymology}} (now Khiva), was a name given to the 9th-cent. mathematician Abū Ja‘far Muhammad ibn Mūsa, author of widely translated works on algebra and arithmetic.{{/etymology}} [/c]

I need to convert it to HTML in a Java application.
My question is how to do it. I have thought about two options:

  • write multiple regular expressions that together cover all cases
  • parse it into something like a semantic tree by dividing it into nodes, then parse each node on its own

I have absolutely no experience with this kind of task, so I am asking for advice and possible pitfalls. Any help will be appreciated!


Solution

You could build a bunch of regular expressions and then write a bunch of code that figures out which regex should be used when. This is actually not a terrible idea for something fairly simple. For something like this, however, you probably want to define a grammar and use a tool like ANTLR to build a lexer/parser.

It can be a little intimidating at first but there are lots of resources that can help. I would try one of the tutorials and build a simple language parser first. You should find a lot of overlap with regular expressions and the way you use them.

Other tips

Why not convert it to a form where existing tools can parse it?

This DSL looks like it could easily be converted to XML.

There appear to be four types of 'Elements' in the example.

  • Elements that follow the pattern [tag-name]*[/tag-name]
  • Elements that follow the pattern {{tag-name}}*{{/tag-name}}
  • Self-closing Elements that have the pattern {{tag-name=value-string}}
  • A free-standing text element with no surrounding tags

Speculation: It appears that the bracket elements and the curly-brace elements are always properly nested and never interleaved, i.e. the following is not expected to occur:

  • [x]{{y}}blah [/x] blah{{/y}}

How about a cheap-to-code conversion to XML?

If this is correct you could decide how you would like each of these patterns converted to XML elements. For example:

  • Convert "[tag-name]" to <tag-name> and "[/tag-name]" to </tag-name>

  • Convert "{{tag-name}}" and "{{/tag-name}}" to <cc-tag-name> and </cc-tag-name>

  • Convert "{{tag-name=value-string}}" to <tag-name>value-string</tag-name> or <tag-name value="value-string"/> for example

  • Leave free-standing text nodes as-is. XML (and HTML) supports those already.

You could do this conversion using a global replace with three regexes. Alternatively, you could do it with a hand-coded recursive parser-converter (that would be pretty small).
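The three-regex route might look roughly like the sketch below. The class name DslRegexToXml is made up for illustration, and the backslash rule is only an assumption drawn from the `\[` and `\]` in the sample entry. It covers just the three patterns listed above: tags with arguments such as [c rosybrown], names containing spaces such as {{Word Origin}}, and unclosed tags such as [m1] pass through unchanged, so the result is not guaranteed to be well-formed XML.

```java
import java.util.regex.Pattern;

final class DslRegexToXml {
    // {{name=value}} -> <name value="value"/>
    private static final Pattern SELF_CLOSING =
            Pattern.compile("\\{\\{([\\w-]+)=([^}]*)\\}\\}");
    // {{name}} and {{/name}} -> <cc-name> and </cc-name>
    private static final Pattern CURLY =
            Pattern.compile("\\{\\{(/?)([\\w-]+)\\}\\}");
    // [name] and [/name] -> <name> and </name>; the lookbehind skips an escaped "\["
    private static final Pattern SQUARE =
            Pattern.compile("(?<!\\\\)\\[(/?)([\\w-]+)\\]");

    static String convert(String dsl) {
        // Escape XML special characters first, before any tags are inserted.
        String s = dsl.replace("&", "&amp;").replace("<", "&lt;")
                      .replace(">", "&gt;").replace("\"", "&quot;");
        s = SELF_CLOSING.matcher(s).replaceAll("<$1 value=\"$2\"/>");
        s = CURLY.matcher(s).replaceAll("<$1cc-$2>");
        s = SQUARE.matcher(s).replaceAll("<$1$2>");
        // Assumption: "\[" and "\]" are escaped literal brackets, as the sample suggests.
        return s.replace("\\[", "[").replace("\\]", "]");
    }

    public static void main(String[] args) {
        System.out.println(convert("[b]al[/b] {{id=000001018}} {{d}}a set of rules{{/d}}"));
    }
}
```

The order matters: the self-closing pattern runs before the plain curly pattern, and text is escaped before any angle brackets are generated.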

Although most people would probably prefer regex for this job, I would tend to write the hand-coded converter in this case, because I would like to put some structure validation into the process, e.g. to log or handle the case where elements are interleaved in a way XML will not support.

I would like to find out early on whether the source data violates my assumptions.
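As a rough illustration of that validation step (not a full converter), a stack-based nesting check could look like this. DslNestingChecker and its message texts are hypothetical, and the tag pattern assumes simple names without arguments, so tags like [c rosybrown] are ignored here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DslNestingChecker {
    // Groups 1-2 capture [name] / [/name]; groups 3-4 capture {{name}} / {{/name}}.
    private static final Pattern TAG = Pattern.compile(
            "(?<!\\\\)\\[(/?)([\\w-]+)\\]|\\{\\{(/?)([\\w-]+)\\}\\}");

    /** Returns a list of nesting problems; an empty list means the tags nest cleanly. */
    static List<String> check(String dsl) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(dsl);
        while (m.find()) {
            boolean square = m.group(2) != null;
            boolean closing = !(square ? m.group(1) : m.group(3)).isEmpty();
            // Prefix curly names so [d] and {{d}} are tracked as different elements.
            String name = square ? m.group(2) : "cc-" + m.group(4);
            if (!closing) {
                open.push(name);
            } else if (name.equals(open.peek())) {
                open.pop();
            } else {
                problems.add("closing tag '" + name + "' at offset " + m.start()
                        + " does not match innermost open tag '" + open.peek() + "'");
                // Recover by dropping the most recent matching opener, if there is one.
                open.removeFirstOccurrence(name);
            }
        }
        for (String name : open) {
            problems.add("tag '" + name + "' is never closed");
        }
        return problems;
    }

    public static void main(String[] args) {
        check("[x]{{y}}blah [/x] blah{{/y}}").forEach(System.out::println);
    }
}
```

Running such a check over the whole dictionary before converting anything is a cheap way to learn how often the nesting assumption is violated.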

You need to know the valid character set and escaping rules.

If you have a large volume of input data to read, you are likely to encounter odd cases, such as unfamiliar Unicode characters that may need normalizing, or characters that are expressed in some form of "entity" notation in the data.

At the very least you need to know how [, ], {{, }}, / are escaped when they occur in text and are not part of markup. If you don't handle this there's a good chance your process will fail on some dictionary entries.
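To make the escaping concern concrete, here is a small hand-written lexer sketch. It assumes, based only on the `\[` and `\]` in the sample entry, that a backslash makes the next character literal; check the format documentation before relying on that. DslLexer and Token are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

final class DslLexer {
    /** A piece of input: either a markup tag (delimiters kept) or unescaped text. */
    static final class Token {
        final boolean markup;
        final String value;
        Token(boolean markup, String value) { this.markup = markup; this.value = value; }
    }

    static List<Token> tokens(String dsl) {
        List<Token> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < dsl.length()) {
            char c = dsl.charAt(i);
            if (c == '\\' && i + 1 < dsl.length()) {
                // Assumed rule: backslash escapes the following character.
                text.append(dsl.charAt(i + 1));
                i += 2;
                continue;
            }
            String close = c == '[' ? "]" : dsl.startsWith("{{", i) ? "}}" : null;
            int end = close == null ? -1 : dsl.indexOf(close, i);
            if (end < 0) {
                // Ordinary character, or an opener that is never closed: keep as text.
                text.append(c);
                i++;
                continue;
            }
            if (text.length() > 0) {
                out.add(new Token(false, text.toString()));
                text.setLength(0);
            }
            out.add(new Token(true, dsl.substring(i, end + close.length())));
            i = end + close.length();
        }
        if (text.length() > 0) {
            out.add(new Token(false, text.toString()));
        }
        return out;
    }
}
```

Because text tokens are already unescaped and kept apart from markup tokens, a later stage can XML-escape them without having to guess whether a bracket was markup.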

The documentation might tell you these things. It's worth a look.

OK, great. We have XML, but we wanted HTML

The point of converting into XML is that you are then free to use the many free open-source tools for parsing XML, transforming XML, and deserializing XML into objects of your own design.
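For instance, once an entry is XML, the XSLT support that ships with the JDK (javax.xml.transform) can turn it into HTML. The sketch below is only one possible setup: the wrapping <entry> root element, the class EntryXmlToHtml, and the choice of <strong> and <span class="definition"> as output are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class EntryXmlToHtml {
    // Hypothetical stylesheet: maps <entry>, <b> and <cc-d> to HTML elements.
    // Text inside unmapped elements is copied by XSLT's built-in rules.
    private static final String XSLT =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='xml' omit-xml-declaration='yes' indent='no'/>"
          + "<xsl:template match='entry'>"
          + "<div class='entry'><xsl:apply-templates/></div></xsl:template>"
          + "<xsl:template match='b'>"
          + "<strong><xsl:apply-templates/></strong></xsl:template>"
          + "<xsl:template match='cc-d'>"
          + "<span class='definition'><xsl:apply-templates/></span></xsl:template>"
          + "</xsl:stylesheet>";

    static String toHtml(String entryXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(entryXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Malformed input XML or a broken stylesheet ends up here.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml("<entry><b>al</b> <cc-d>a set of rules</cc-d></entry>"));
    }
}
```

Keeping the mapping in a stylesheet means presentation decisions can change without touching the DSL-to-XML step.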

When you generate HTML you will make a lot of decisions that are not included in this input data.

  • Navigation: How will users navigate to different entries? Will entries always display in full detail, or will there also be a short form?

  • Formatting, presentation: Mapping this markup into your HTML structure and CSS style sheets. Some sections of the input look like they might want margins/indenting that are not included in the markup.

  • Search: For decent user experience you may need to also extract indexing terms from this markup and load them to a text search service.

  • Accessibility: if that's part of your requirements.

  • Images? If there are any, they may need preprocessing.

You may want to parse it into POJOs (e.g. with Jackson or another reflection-based deserializer) and use HTML templates (e.g. Freemarker, Thymeleaf, Velocity, even JSP). You'll probably want CSS styles kept separate from the main text of the documents, if only for the flexibility to change presentation without reprocessing all the data.

Putting the data in XML or some other well-supported format lets you choose from many tools for these more complex tasks.

But it is not necessary, and if you don't think it would save you time you probably should not do it.

You can, of course, parse the input directly into your chosen Java object model or directly into HTML. ANTLR is great, but as the previous answer suggests, there is a significant setup cost and learning curve.

Licensed under: CC-BY-SA with attribution