What are reasonable upper bounds for the number of states, symbols, and rules of LR(1) grammars?

StackOverflow https://stackoverflow.com/questions/14151239

  •  12-01-2022

Question

I'm making an LR(1) parser, and I've run across performance bottlenecks in various places.

I'd like to try optimizing the data structures for the parser, but in order to do so, I need a rough idea of how many states, rules, and terminal symbols are reasonable for (possibly complicated) computer languages, like C++.

My guesses are that a typical grammar for a complicated language would have:

  • ≤ 100 terminal symbols
  • ≤ 50 symbols per production
  • ≤ 2,000 rules
  • ≤ 10,000 states

but I really don't know how correct they are.
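
(For a rough sense of scale, with those bounds a dense ACTION table would be about 10,000 states × 100 terminals = 1,000,000 entries, i.e. a couple of megabytes at 2 bytes per entry before any row compression, plus a GOTO table indexed by nonterminals.)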

Note that I assume each rule is of the form nonterminal: symbol symbol symbol ..., so a single compound "rule" that looks like foo: (bar | baz)+ might actually consist of, say, 5 rules instead of just 1 rule.
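
For instance, a compound rule like foo: (bar | baz)+ might be lowered into plain rules along these lines (one possible expansion; the helper nonterminal names are made up for illustration):

    foo      : foo_seq
    foo_seq  : foo_elem
    foo_seq  : foo_seq foo_elem
    foo_elem : bar
    foo_elem : baz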

Are these reasonable? If not, where can I find some numbers on them?

Solution

The DMS system I developed processes a production IBM Enterprise COBOL front-end grammar daily, in about 7 seconds on a crummy laptop (measured just now on that laptop).

The grammar has about 500 terminals and 2,500 productions, averaging about 2.5 tokens per production. Our productions are exactly as you describe them (no EBNF; it just doesn't buy enough to matter, and yes, we're big DSL fans. Sometimes the geegaws people put into a DSL aren't worth it). The parser generator produces 3,800 states. (These values were also measured just now.)

DMS has a full C++11 grammar with lots of extra stuff to handle the GCC and MS dialects, as well as OpenMP. The grammar has 457 terminals and some 3,000 productions, averaging about 2.3 tokens per production. The parser generator produces 5,800 states and takes somewhat longer to run: 11 seconds, on an i7. What you might find surprising is that it takes some tens of seconds to generate the lexer (really, multiple lexers); there's a lot more lexing weirdness in C++11 than you'd expect.

The generator is a GLR generator of our own implementation.

We didn't do a lot to optimize the generation time. It can likely be sped up by a factor of 10 or more; we don't do the sophisticated cycle-detection optimization suggested in most papers on LR parser generation. The consequence is that it takes longer to generate the tables, but nothing is lost in functionality. We've never had enough motivation to do that optimization, because there are so many other things to do with a language front end besides worrying about parser table generation time.

I doubt the data structures matter a lot, if designed reasonably. We don't worry much about sizes of rules, item sets or states; we just use dynamic arrays and they take care of themselves. We do pack lookaheads into dense bit vectors.
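
To make the last point concrete, here is a minimal C++ sketch of the dense-bit-vector idea (just an illustration of the shape, not our production code; the type and member names are made up). With ~500 terminals, each lookahead set is eight 64-bit words, and merging lookaheads during item-set construction is a handful of word-wide ORs:

    // Lookahead set as a dense bit vector: one bit per terminal symbol.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct LookaheadSet {
        std::vector<std::uint64_t> bits;  // ceil(num_terminals / 64) words

        explicit LookaheadSet(std::size_t num_terminals)
            : bits((num_terminals + 63) / 64, 0) {}

        void add(std::size_t terminal) {
            bits[terminal / 64] |= std::uint64_t{1} << (terminal % 64);
        }

        bool contains(std::size_t terminal) const {
            return (bits[terminal / 64] >> (terminal % 64)) & 1u;
        }

        // Merging lookaheads (the hot operation during closure/propagation)
        // is a word-at-a-time OR; returns true if the set grew.
        bool mergeFrom(const LookaheadSet& other) {
            bool changed = false;
            for (std::size_t i = 0; i < bits.size(); ++i) {
                std::uint64_t merged = bits[i] | other.bits[i];
                changed = changed || (merged != bits[i]);
                bits[i] = merged;
            }
            return changed;
        }
    };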

As additional background data, you'll probably find this paper useful: Tiago Alves and Joost Visser, Metrication of SDF Grammars. Technical Report, DI-Research.PURe-05.05.01, Departamento de Informática, Universidade do Minho, May 2005.

The parser generator isn't where you have a difficult time with grammars; the hard part is getting the grammar rules right for the specific implementations.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow