Question

I've been looking at writing a Textile parser using Scala's parser combinator library (basically a PEG parser), and I'm wondering what approach I should use to parse the inline modifiers:

This is *bold* text, _italic_ text, +underlined+ text, etc.

In this case it's pretty clear what's what and what should be parsed. However, there are many edge cases where it's not so clear. Focusing only on bold text:

Which sections get bolded: 
*onomato*poeia* ?
bold *word*, without a space after?
tyr*annos*aurus
a bold word in a (*bracket*)?
How about *This *case?

Obviously this is a mix of subjective questions (which things should count as bold) and objective ones (how to make the parsing rules parse it correctly).

I'm leaning towards a PEG along these lines:

wordChar = [a-zA-Z]
nonWordChar = [^a-zA-Z]
boldStart = nonWordChar ~ "*" ~ wordChar
boldEnd = wordChar ~ "*" ~ nonWordChar
boldSection = boldStart ~ rep(not(boldEnd) ~ anyChar) ~ boldEnd

Which would parse the above as follows:

<b>onomato*poeia</b> ?
bold <b>word</b>, without a space after?
tyr*annos*aurus    <- fails because of lack of whitespace
a bold word in a (<b>bracket</b>)?
How about *This *case? <- fails because there is no correct closing *
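
To check the rule against the five examples, here is a minimal sketch that expresses the same boundary conditions as a regular expression with lookarounds instead of parser combinators. The object name QuestionBoldRule and the render helper are hypothetical, and it assumes that the start and end of the input count as non-word positions, which the grammar above does not say.

```scala
import scala.util.matching.Regex

object QuestionBoldRule {
  // Opening '*': not preceded by a letter, and followed by a letter.
  // Closing '*': preceded by a letter, and not followed by a letter.
  // Assumption: start and end of input behave like non-word characters.
  private val Bold: Regex =
    """(?<![a-zA-Z])\*(?=[a-zA-Z])(.*?)(?<=[a-zA-Z])\*(?![a-zA-Z])""".r

  // Wraps every section that satisfies the rule in <b>...</b>;
  // text that does not satisfy it is returned unchanged.
  def render(line: String): String =
    Bold.replaceAllIn(line, m => Regex.quoteReplacement(s"<b>${m.group(1)}</b>"))
}
```

Calling QuestionBoldRule.render on each of the five example lines should give the results listed above, with the two failing lines coming back unchanged.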

However, I'm not sure whether this method holds for all use cases and is well defined for all edge cases. Is there a standard way of doing this that I can copy and rely on? I'd rather not rely on my ad-hoc, not-well-thought-through language spec if I can avoid it.


Solution

There is no standard in the case of Markdown, and implementations differ on edge cases. For one set of choices, you could look at peg-markdown, which is also used in MultiMarkdown. Of course, Markdown is more complex than Textile in this respect, because it uses ** for bold and * for italics, which gives rise to even more decisions about how to treat things like *hello**there**.

Michel Fortin, developer of PHP Markdown Extra, has a test suite that includes a number of edge cases for bold/italics. However, I don't think there is universal agreement on his decisions here, and many implementations parse these cases differently.

That said, I think the following decisions are fairly uncontroversial in Markdown:

  • * only starts emphasis if the next character is non-whitespace.
  • * only ends emphasis if the preceding character is non-whitespace.
  • Emphasis can occur within a word, so in he*ll*o, the two l's are emphasized (though some Markdown implementations disable this feature for the _ character, since underscores are common in identifiers).
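
The three rules above can be sketched as a single regular expression. This is only an illustration and not how peg-markdown or any other implementation works: the object name MarkdownStarRule and the render helper are hypothetical, and the sketch assumes that runs of ** are left untouched rather than parsed as bold.

```scala
import scala.util.matching.Regex

object MarkdownStarRule {
  // Opening '*': the next character is neither whitespace nor '*'.
  // Closing '*': the previous character is neither whitespace nor '*'.
  // No condition on word boundaries, so emphasis inside a word is allowed.
  // Assumption: a '*' adjacent to another '*' is never a delimiter here,
  // which leaves '**' runs alone instead of treating them as bold.
  private val Emphasis: Regex =
    """(?<!\*)\*(?![\s*])(.+?)(?<![\s*])\*(?!\*)""".r

  // Wraps every span that satisfies the rules in <em>...</em>.
  def render(line: String): String =
    Emphasis.replaceAllIn(line, m => Regex.quoteReplacement(s"<em>${m.group(1)}</em>"))
}
```

With this sketch, MarkdownStarRule.render("he*ll*o") emphasizes the two l's, while "a * b * c" and "*hello *" come back unchanged because one of the asterisks sits next to whitespace on the wrong side.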

OTHER TIPS

After searching around for a while, I found the inline markup recognition rules for reStructuredText.

They do not follow the rules of Markdown; in particular, things like t*hi*s are not parsed as inline tags. Still, the two are pretty similar and have a similar overall purpose.

It's also a somewhat complex spec (e.g. with special casing for brackets and punctuation), but it's well specified, and its rules are thoroughly explained and justified. I found it a pretty solid base to build on.

Licensed under: CC-BY-SA with attribution