Question

I'm trying to write a parser for org mode files using the instaparse library. This library takes EBNF notation and turns it into a parsing function. Org-mode files uses lines starting with stars to make outline headlines, where the number of stars set the level in the outline tree, like this

* Headline
** Sub headline1
** Sub headline2

My first try is to put all headlines on the same level in the resulting tree:

(def outline
  (insta/parser
   "<S> = Headline-node *
    Headline-node =  Level <' '> Headline
    Level = #'^\\*+'
    Headline = #'\\S'+ <'\n'>"))
(outline "* Headline\n** Subheadline\n")
;=>
([:Headline-node [:Level "*"] [:Headline "H" "e" "a" "d" "l" "i" "n" "e"]] 
 [:Headline-node [:Level "**"] [:Headline "S" "u" "b" "h" "e" "a" "d" "l" "i" "n" "e"]])

Maybe I could transform the tree afterwards to put sub-headlines inside headlines. But I would prefer to create a tree with subheadlines inside headlines from the beginning. My only idea so far is to manually create different levels like this:

(def outline
  (insta/parser
   "<S> = Headline-node1 *
    Headline-node1 =  <#'^\\* '> Headline (Headline-node2)*
    Headline-node2 =  <#'^\\*\\* '> Headline
    Headline = #'\\S'+ <'\n'>"))
(outline "* Headline\n** Subheadline\n")
;=>
([:Headline-node1 [:Headline "H" "e" "a" "d" "l" "i" "n" "e"] 
                  [:Headline-node2 [:Headline "S" "u" "b" "h" "e" "a" "d" "l" "i" "n" "e"]]])

But I would like to create infinite levels of headlines. Is there a way to convey this in EBNF?

Was it helpful?

Solution

I don't think that is possible. Pushdown automata can recognize precisely context-free languages. But think about how a pushdown automaton would have to operate to decide the level of a heading in your example.

A pushdown automaton has a finite set of states. So it cannot use the states to track the level of a heading if we are to allow arbirtarily deep levels. So it must use its stack. The only way to count using the stack is to push a stack symbol (there are a finite number of stack symbols too) each time it reads a *. Say it is somewhere in the input string at the beginning of some *s. It proceeds by pushing a stack symbol for every *. But when it comes to the end of the *s in the input string, the only things the automaton can use to decide its action are: the symbol in the input string, its state, and the top symbol on the stack. We "already" knew that the input symbol is not a *, so that does not help. We know that the top of the stack is the symbol that we pushed once for every *, so that does not help either. And as previously noted, the set of states are finite, so we can not somehow use the states to count the number of symbols on the stack.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top