DFA minimization

https://stackoverflow.com/questions/11132319

16-06-2021
|

Question

I have a question about DFA minimization. So I've used very well known techniques to convert regular expression into NFA and then construct DFA from it, using goto / closure algorithm. Now the question is how do I minimize it? I've watched lections about it here: http://www.youtube.com/watch?v=T9Z66NF5YRk and I still cannot get the point. What is DFA minimization? Is this just merging IDENTICAL states (states that goes to the same states on the same characters) or something different?

So, I've started with the following grammar:

%digit = '0'..'9'
%letter = 'a'..'z' | 'A'..'Z'
%exponent = ("e" | "E") ("+" | "-")? digit+

T_INT = digit+
T_FLOAT = T_INT exponent
T_IDENTIFIER = (letter | "$" | "_") (letter | "$" | "_" | digit)*

and end up with the following DFA (represented as JSON):

{
    "START": [{
        "type": "range",
        "from": 36,
        "to": 36,
        "shift": "1"
    }, {
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "2"
    }, {
        "type": "range",
        "from": 65,
        "to": 90,
        "shift": "1"
    }, {
        "type": "range",
        "from": 95,
        "to": 95,
        "shift": "1"
    }, {
        "type": "range",
        "from": 97,
        "to": 122,
        "shift": "1"
    }],
    "1": [{
        "type": "range",
        "from": 36,
        "to": 36,
        "shift": "1"
    }, {
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "1"
    }, {
        "type": "range",
        "from": 65,
        "to": 90,
        "shift": "1"
    }, {
        "type": "range",
        "from": 95,
        "to": 95,
        "shift": "1"
    }, {
        "type": "range",
        "from": 97,
        "to": 122,
        "shift": "1"
    }, {
        "shift": ["t_identifier"]
    }],
    "2": [{
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "2"
    }, {
        "type": "range",
        "from": 69,
        "to": 69,
        "shift": "3"
    }, {
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "3"
    }, {
        "shift": ["t_int"]
    }],
    "3": [{
        "type": "range",
        "from": 43,
        "to": 43,
        "shift": "5"
    }, {
        "type": "range",
        "from": 45,
        "to": 45,
        "shift": "5"
    }, {
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "4"
    }],
    "4": [{
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "4"
    }, {
        "shift": ["t_float"]
    }],
    "5": [{
        "type": "range",
        "from": 48,
        "to": 57,
        "shift": "4"
    }]
}

So how do I minimize it?

UPDATE:

Okay so here is my algorithm. Given following DFA:

{
    0: [{
        from: 97,
        to: 97,
        shift: 1
    }],
    1: [{
        from: 97,
        to: 97,
        shift: 3
    }, {
        from: 98,
        to: 98,
        shift: 2
    }],
    2: [{
        from: 98,
        to: 98,
        shift: 4
    }],
    3: [{
        from: 97,
        to: 97,
        shift: 3
    }, {
        from: 98,
        to: 98,
        shift: 4
    }],
    4: [{
        from: 98,
        to: 98,
        shift: 4
    }]
}

This is what I do to minimize it:

For each state (numbered as 0, 1, 2, 3, 4 in my example) obtain unique hash that identifies such state (for example for state0 this will be: from=97,to=97,shift=1, for state1 this will be: from=97,to=97,shift=3&from=98,to=98,shift=2 and so on…)

Compare obtained hashes and if we find two identical ones, then merge it. In my example, state2 hash will be: from=98&shift=4&to=98, and state4 hash will be: from=98&shift=4&to=98, they are the same, so we can merge them into the new state5, after that DFA will look like this:

{
0: [{
    from: 97,
    to: 97,
    shift: 1
}],
1: [{
    from: 97,
    to: 97,
    shift: 3
}, {
    from: 98,
    to: 98,
    shift: 5
}],
3: [{
    from: 97,
    to: 97,
    shift: 3
}, {
    from: 98,
    to: 98,
    shift: 5
}],
5: [{
    from: 98,
    to: 98,
    shift: 5
}]

}

Continue this ’till we get no changes, so the next step (merging states 1 and 3) will transform DFA into the following form:

{
0: [{
    from: 97,
    to: 97,
    shift: 6
}],
6: [{
    from: 97,
    to: 97,
    shift: 6
}, {
    from: 98,
    to: 98,
    shift: 5
}],
5: [{
    from: 98,
    to: 98,
    shift: 5
}]

}

There are no more identical states, that means we’re done.

SECOND UPDATE:

Okay, so given the following regular expression: 'a' ('ce')* ('d' | 'xa' | 'AFe')+ | 'b' ('ce')* ('d' | 'xa' | 'AFe')+ 'ce'+ I've got the following DFA (START -> start state, ["accept"] -> so to say transition to the accepting state):

{
    "START": [{
        "type": "range",
        "from": 98,
        "to": 98,
        "shift": "1.2"
    }, {
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "17.18"
    }],
    "1.2": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "10"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "6.7"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "8"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "4"
    }],
    "10": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "6.7"
    }],
    "6.7": [{
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "15"
    }, {
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "13"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "6.7"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "11"
    }],
    "15": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "14.accept"
    }],
    "14.accept": [{
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "16"
    }, {
        "shift": ["accept"]
    }],
    "16": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "14.accept"
    }],
    "13": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "6.7"
    }],
    "11": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "12"
    }],
    "12": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "6.7"
    }],
    "8": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "9"
    }],
    "9": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "6.7"
    }],
    "4": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "2.3"
    }],
    "2.3": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "10"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "6.7"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "8"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "5"
    }],
    "5": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "2.3"
    }],
    "17.18": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "25"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "22.accept"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "23"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "20"
    }],
    "25": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "22.accept"
    }],
    "22.accept": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "28"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "22.accept"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "26"
    }, {
        "shift": ["accept"]
    }],
    "28": [{
        "type": "range",
        "from": 97,
        "to": 97,
        "shift": "22.accept"
    }],
    "26": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "27"
    }],
    "27": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "22.accept"
    }],
    "23": [{
        "type": "range",
        "from": 70,
        "to": 70,
        "shift": "24"
    }],
    "24": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "22.accept"
    }],
    "20": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "18.19"
    }],
    "18.19": [{
        "type": "range",
        "from": 120,
        "to": 120,
        "shift": "25"
    }, {
        "type": "range",
        "from": 100,
        "to": 100,
        "shift": "22.accept"
    }, {
        "type": "range",
        "from": 65,
        "to": 65,
        "shift": "23"
    }, {
        "type": "range",
        "from": 99,
        "to": 99,
        "shift": "21"
    }],
    "21": [{
        "type": "range",
        "from": 101,
        "to": 101,
        "shift": "18.19"
    }]
}

Story is the same, how do I minimize it? If I follow classical Hopcroft's algorithm with all this table construction, determining indistinguishable states, merging them together and so on, then I'm going to get DFA that contains 15 states (use this tool: http://regexvisualizer.apphb.com/ with this regular expression a(ce)(d|xa|AFe)+|b(ce)(d|xa|AFe)+ce+ to check that). Here is how DFA looks like after minification with Hopcroft's algorithm:

Hopcroft's minimized DFA

The algorithm that I came up with, after "rethinking" Hopcroft's algorithm, builds DFA that is smaller than the one you can see above (sorry for the image quality, I had to redraw it step by step to understand why it's smaller):

my algorithm

And here is how it works, decision about "state equivalence" is based on the result of the hash function that is given the state (for example "START") builds short strings that can be constructed from the DFA if we start from that state. Given the DFA above, and the START state, we can construct following strings: 98->120, 98->100, 98->65, 98->99, 97->120, 97->100, 97->65, 97->99 so let it be the result of the hash function for START state. If we run this function for each state in the DFA, we will see that for the some states this function gives us identical results ("1.2", "6.7", "2.3" AND "10", "13" AND "15", "16" AND "11", "8", "26", "23" AND "12", "9", "4", "5", "20", "21" AND "17.18", "18.19" AND "25", "28" AND "27", "24") so all we need to do is to merge those states together.

I see that I'm wrong somewhere, but don't understand what is wrong with minimized DFA produced by my algorithm?

Solution

Your proposed algorithm does not do a full minimization, because it does not detect complex structures that behave identically. To understand this look at this DFA (drawn by JFLAP):

enter image description here

Minimization would combine q1 and q2, but the outlined algorithm does not manage to.

In contrast to this, Hopcroft's algorithm would initially partition like this:

   {q0, q1, q2}, {q3}

then split the first set, because q0 does not have a transition to q3:

   {q0}, {q1, q2}, {q3}

and not split further, because q1 and q2 behave identically.

OTHER TIPS

Let your original DFA be called M1. To put it in simple terms, constructing a minimized DFA (call it M2) implies converting it into a DFA that contains minimum number of states. So, the number of states in M2 will be less than number of states in M1. An important point to note here is that M1 and M2 have to be equivalent, which means that they must define the SAME regular language. Constructing a minimized DFA not only involves looking for identical states, but also the following:

Removal of "unreachable" states:
This involves removing the states that are not reachable from the initial state of the DFA, for any input string.
Removal of "dead" or "trap" states:
This involves removal of non-accepting states that terminates on itself. They are also called as TRAP states.
Removal of "Nondistinguishable" states:
This involves removing the states those that cannot be distinguished from one another for any input string.

There are also some popular algorithms used for DFA minimization:

It may be worthwhile to go through these algorithms!

Given that you have code to determinize a NFA to a DFA, the simplest solution to minimize it is Brzozowski's algorithm. You will need to implement a function to reverse the NFA, but this is fairly simple. (reverse the transitions, swap start and accept states)

Once you have a determinize and reverse function, Brzozowski minimization is implemented as:

minimize(nfa) = determinize(reverse(determinize(reverse(nfa))))

IMHO, this is a very elegant solution

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow