Simulating regular expressions with deterministic finite automata and the size of the alphabet

StackOverflow https://stackoverflow.com/questions/18852319

  •  28-06-2022

Question

I'm currently working my way through the "Dragon Book" (Compilers: Principles, Techniques, & Tools) and I'm kind of stuck at the lexical analysis chapter, which uses DFAs (Deterministic finite automata).

A DFA is represented as a two-dimensional array: the first dimension is indexed by the state and the second by the input symbol. This means that every DFA state has an entry for every symbol of the alphabet. The examples in the book use a small language (usually two symbols), and they make the following note at the end of the chapter: "since a typical lexical analyzer has several hundred states in its DFA and involves the ASCII alphabet of 128 input characters, the array consumes less than a megabyte".
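To make the layout concrete, here is a minimal sketch (my own code, not from the book) of such a table-driven DFA for the regular expression (a|b)*abb, the example the Dragon Book commonly uses. Each row is a state and each column is one symbol of the two-symbol alphabet:

```python
# Sketch of a table-driven DFA for (a|b)*abb over the alphabet {a, b}.
# Rows are states, columns are input symbols, so every state stores one
# entry per symbol of the alphabet.

ALPHABET = {'a': 0, 'b': 1}      # column index for each input symbol

# TRANSITIONS[state][symbol_index] -> next state
TRANSITIONS = [
    [1, 0],   # state 0 (start)
    [1, 2],   # state 1: seen a trailing "a"
    [1, 3],   # state 2: seen a trailing "ab"
    [1, 0],   # state 3: seen "abb" (accepting)
]
ACCEPTING = {3}

def matches(text: str) -> bool:
    state = 0
    for ch in text:
        if ch not in ALPHABET:   # symbol outside the alphabet: reject
            return False
        state = TRANSITIONS[state][ALPHABET[ch]]
    return state in ACCEPTING

assert matches("aababb")
assert not matches("abab")
```

With only two symbols the table is tiny; the question below is what happens when the alphabet grows to all of Unicode.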

However, for matching string literals I want to accept any character, which means the entire character set, and many input files use UTF-8 encoding. This causes the alphabet, and with it the size of the DFA's transition table, to grow enormously.

This is the point where I'm stuck. How do lexical analyzers, or regular expression simulators in general, handle this?

Thanks!


Solution

I've had an epiphany on this problem. In lexical analysis, about the only time you want to match characters beyond the ASCII range is during wildcard matching, for example inside string literals or comments. Because those characters are only ever matched as "any character" and never individually, all characters with a value of 128 or higher can be collapsed into a single 'other' symbol. The alphabet and the DFA stay small this way, while I can still use transition tables and match the entire Unicode character set.
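A minimal sketch of that idea (the state and class names are mine, not from the answer): every character is first mapped to a small equivalence class, and all code points at or above 128 collapse into one OTHER class, so the transition table's width stays at 129 columns regardless of how large Unicode is. The toy DFA below recognizes a double-quoted string literal in which any character except '"' may appear:

```python
OTHER = 128                               # single class for all non-ASCII characters
NUM_CLASSES = 129                         # 0..127 for ASCII, plus OTHER

def char_class(ch: str) -> int:
    cp = ord(ch)
    return cp if cp < 128 else OTHER

# Tiny DFA recognizing a double-quoted string literal such as "héllo",
# where anything except '"' may appear between the quotes.
START, IN_STRING, DONE, REJECT = 0, 1, 2, 3

def build_table() -> list[list[int]]:
    table = [[REJECT] * NUM_CLASSES for _ in range(4)]
    table[START][ord('"')] = IN_STRING
    table[IN_STRING] = [IN_STRING] * NUM_CLASSES   # wildcard: any char stays inside
    table[IN_STRING][ord('"')] = DONE              # closing quote ends the literal
    return table

TABLE = build_table()

def is_string_literal(text: str) -> bool:
    state = START
    for ch in text:
        state = TABLE[state][char_class(ch)]
    return state == DONE

assert is_string_literal('"héllo, wörld"')
assert not is_string_literal('"unterminated')
```

Production lexer generators take the same idea further and compute equivalence classes over the whole alphabet, not just the split at 128, but the effect is the same: the table is indexed by a small class number instead of a raw code point.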

OTHER TIPS

Here's an interesting tool that converts a regular expression into a non-deterministic finite automaton.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow