Regular Expressions and Language Generators

https://softwareengineering.stackexchange.com/questions/231295

02-10-2020
|

Question

I have few CS textbooks with me which discuss languages, well actually 2 plus old course notes supplied a few years ago. I have been searching the web too any only seem to come up with vague responses just like the text books I have.

My question is about language recognisers verses generators.

I get the overriding principle of a recogniser. That it analyses a language and is able to determine nay or yay if a String belongs to a language. This is at least what I have picked up from the books and notes. However, it's much more complex than that is it not? A tokeniser and syntax analyser (which I assume to be recognisers) do not just say yes or no, they also say where don't they...?

However, language generators. No one seems to be very clear about what they are. The typical description I get is For example Sebasta's Concepts of Programming Languages says 'A language generator is a device that can be used to generate the sentences of a language. We can think of a generator as a push button that produces a sentence of a language every time it is pressed.' Seriously? That's it?? Your kidding right....

I read that Regex is an example of a Generator, then why when people talk of generators do they not talk of the inputs. For example Regex has a target String, and the Regex with defines both the accepted alphabet and it's grammar rules.

Can someone provide for me a clearer distinction of what a recognizer is?

Solution

I'm surprised you didn't find your answer in Sebesta's Concepts of Programming Languages... In short:

Recogniser: sentences -> grammar. Application: to check if given sentences are part of the language grammar. This is an essential part of a compiler or interpreter.

Generator: grammar -> sentences. Application: to provide examples of valid sentences in the language. This is an essential part of a programming language manual.

For instance, the user manual for a certain programming language may explain in words what a valid assignment looks like, and even provide a BNF rule (something like: assign -> var = expr), but it will also include examples of valid assignments so that the programmer understands better (e.g., a = b or x = y + 2*z). Such examples are generated from the BNF through a process known as derivation.

OTHER TIPS

First off, these are not difficult subjects in their general sense, so do not be surprised that generating sentences of a language is it.

There is one concept shared by both a language recognizer, as well as a language generator, and that is a language.

Languages can be defined as simple as L = {a, b, c} where the language L only knows three words/sentences. Regular expressions themselves are neither recognizers nor generators. Instead, a regular expression is a way of defining a language. For example, the regular expression a+ defines the language L' = {a, aa, aaa, aaaa, ... }.

Now if you look into those textbooks of yours, I am sure, there is a section on how to convert a regular expression into a (non-)deterministic finite automaton. Such an automaton is now a language recognizer, that fails to accept a b, but does accept aaa when constructed from the regular expression a+.

On the other hand, a language generator is something that will output one element of L' at a time, eventually, generating all of L'. For example, a generator for L' could give you aaa, then a, then aa, then aaaa, and so on.

Generally, languages are infinitely large, hence, generators need to generate an infinite number of sentences in that language. Consider the language L1 defined by the regular expression a+b+. If you have a generator that outputs ab, aab, aaab, ..., then it can run infinitely long, but you will never ever see a word that has more than one b. Such generators tend to be rather useless due to their biased nature.

Therefore, to reduce the "that's it" effect, we typically demand from a generator, that given any word from the language, the generator eventually produces it. So you could pick aabb and your generator must have a finite number of steps that it needs to produce that word. It cannot just infinitely increase the number of as anymore.

Now, to start making things really complicated, assume a more interesting language like, for example, Java. When you have a Java language generator, then this thing will eventually produce all of the Java classes that you have written and will ever write in your life as a Java programmer. I leave it as an exercise to you to think about how to realize such a thing - and do not forget: only valid Java programs are allowed. It must not output gibberish, but only sentences of the language.

Usage of language generators is primarily in CS theory. For practical purposes their infinite nature makes any kind of meaningful usage rather problematic. Apart from pathological cases, where the language is trivial and finite, one could use these generators for testing tools that are supposed to process sentences of the language. For example, you could test a Java parser against a Java generator by having it continuously parse the generated progarms.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange