Regular languages vs. non-regular ones [closed]

Question 1

Regular languages are defined recursively over an alphabet A as follows:

The empty set \null is regular.
The set { \eps } is regular, where \eps is the empty string.
The set { a } is regular for all a \in A
If X and Y are regular, then the set { xy | x \in X, y \in Y } is also regular.
If X and Y are regular, then X \union Y is also regular.
If X is regular then so is { x^n | x \in X and n >= 0 }

In 6, the definition of x^n is x concatenated with itself n times, and x^0 = \eps.

It follows from all these steps except 6 that every finite set of strings over A is regular. All the interesting stuff happens when we consider infinite sets.

Regular expressions are just a "programming language" for representing regular languages. They work like this. I'll use "regex" as an abbreviation for regular expression.

The regex \NULL stands for the null set.
The regex \EPS stands for { \eps }.
The regex a stands for the set { a }. Note the boldface a connotes a regex rather than the character a that it represents.
If the regex x stands for the language X and y stands for Y, then the regex xy stands for the language { xy | x \in X, y \in Y }.
If the regex x stands for the language X and y stands for Y, then the regex x|y stands for the language X \union Y.
If the regex x stands for the language X, then the regex x* stands for the language { x^n | x \in X, n >= 0 }.

It's directly implied by the definition that every regex describes a regular language. It's not hard to go the other way and show that every regular language must have a corresponding regex.

So the examples of regular languages you requested are all those that some regular expression stands for. Example: ab* is the language of all strings beginning with a followed by any number of b's and so forth.

There are some pretty cool languages that seem too complex to be regular, but actually are. My favorite is the set S_k of binary representations (alphabet {0, 1}) for all numbers N such than N==0 mod k. You can choose any positive integer k you like.

There is a wonderful theorem due to Kleene that goes farther. It shows that the languages recognizable by Deterministic Finite Automata - simple state machines - and Nondeterministic Finite Automata - state machines with transitions on the empty string and that allow multiple transitions on each character - are exactly the regular languages. They all have the same expressive power. That is, if you give me any one of { regex, DFA, NFA }, I will be able to convert to the other two every time.

The regex for any set S_k, choose k as you wish, described above isfiendishly complex, but the DFA that recognizes it is quite simple. Kleene's Theorem lets you proceed with the best tool.

Finite automata effectively have finite memory, so you'd expect - and you'd be right - that languages with some kind of infinite structure would not be regular. The simplest such language is probably { a^n b^n | n >= 0 }. That is the set of all strings of a's followed by an equal number of b's. Any finite automaton (hence regex or NFA) must fail to "store" the value of n recorded while looking at the a's if n is large enough. Therefore it must fail at looking for an equal number of b's that come along later in the input.

Another statement of the same idea: If you claim you have a DFA of N states that will recognize { a^n b^n | n >= 0 }, I will give you strings { a^k b^k | k > N } where it will have to fail because it must "loop", i.e. repeat at least one state. At that point is has lost track of how much it has read so far. It is doomed to get the wrong answer for some of these long strings.

The Pumping Lemma exploits this fact. It provides a mathematically rigorous way of proving (by contradiction of the Lemma) that languages are NOT regular. Every good computer science student learns to "pump it up (or down)" in the fashion necessitated by the PL to prove sets are non-regular.

Examples of non-regular languages include the aforementioned { a^n b^n | n >= 0 } along with similar languages that require "balancing" of various kinds: { a^n b^m | n > m }, { a^n b a^n }, and an infinitude of similar ones.

The non-regular languages can be further subdivided into sets of increasing complexity: the context free languages, the context sensitive ones, the recursive sets, recursively enumerable sets, and undecidable sets. Yet these are just the beginning of an infinite hierarchy of ever more complex sets of strings. This infinitude of sets is infinitely complex! Enjoy it.

Question 2

An expression is regular if one can decompose it in four basic language concepts:

a single character. For instance a, b, c
a concatenation between two regular expressions. For instance ab, abc
a disjunction between two regular expressions. For instance a|b
a Kleene-star: for instance (ab)*

Other aspects in regular expressions are simply syntactical sugar. For instance (ab)+ is short for (ab)(ab)* and [A-F] is short for A|B|C|D|E|F.

One can proof some language (set of strings) cannot be expressed with a regular expression using the Pumping lemma. For instance a language {ab,aabb,aaabbb,...} where there are as many a's left as b's cannot be expressed using a regular expression.

There is a hierarchy defined by Chomsky about how such languages can be recognized (for instance using Context Free Grammars (CFG), Context Sensitive Grammars (CSG), Turing machines and Oracle Machines.