Question

I am trying to construct an artificial intelligence unit. I plan to do this by first collecting sensory input ('observations') into a short-term working-memory list, continually forming patterns found in this list ('ideas'), and committing those ideas to long-term storage when they reach a substantial size, perhaps seven chained observations. For any philosophy folks: this is similar to Locke's An Essay Concerning Human Understanding, but it won't be tabula rasa; there needs to be an encoded underlying structure.

Thus, my question is:

Is there a good algorithm (and where can I find it) for dynamically consolidating, or 'pattern-izing', the largest repeated substrings of this constantly growing observation string? For example: if I have so far been given ABCDABCABC, I want an ABC idea, a D, and two other ABC ideas; then, if another D is observed and added to the short-term memory, I want an ABCD token, an ABC token, and another ABCD token. I don't want to use Shortest Common Substring, because I would need to rerun it after an arbitrary number of character additions. I think I'd prefer some easily searchable/modifiable tree structure.
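To make the behaviour concrete, here is a rough Python sketch of the kind of thing I have in mind (names such as IdeaTrie and segment are only illustrative, not an established algorithm): a trie of committed 'ideas' plus a greedy longest-match segmentation, so only new observations need to be reconsidered as they arrive.

    # Rough sketch only: a trie of committed "ideas" plus a greedy
    # longest-match segmentation of the observation string.

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_idea = False          # marks the end of a committed idea

    class IdeaTrie:
        def __init__(self):
            self.root = TrieNode()

        def add_idea(self, pattern):
            node = self.root
            for ch in pattern:
                node = node.children.setdefault(ch, TrieNode())
            node.is_idea = True

        def longest_match(self, s, start):
            """Length of the longest known idea that prefixes s[start:]."""
            node, best, length = self.root, 0, 0
            for i in range(start, len(s)):
                node = node.children.get(s[i])
                if node is None:
                    break
                length += 1
                if node.is_idea:
                    best = length
            return best

    def segment(s, trie):
        """Greedy left-to-right split into known ideas and single tokens."""
        out, i = [], 0
        while i < len(s):
            n = trie.longest_match(s, i) or 1   # unknown token stays alone
            out.append(s[i:i + n])
            i += n
        return out

    trie = IdeaTrie()
    trie.add_idea("ABC")
    print(segment("ABCDABCABC", trie))    # ['ABC', 'D', 'ABC', 'ABC']
    trie.add_idea("ABCD")                 # a longer idea gets committed
    print(segment("ABCDABCABCD", trie))   # ['ABCD', 'ABC', 'ABCD']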

Does this look like a decent enough solution? http://www.cs.ucsb.edu/~foschini/files/licenza_spec_thesis.pdf. If nothing else, I think other data miners may enjoy it.


Solution

First step: the tokeniser. Define what you consider {A, B, C, D} and what you do not (a small sketch follows the list below).

  • you need at least one extra token for garbage/miscellaneous stuff (the good news is that if this token occurs, the state machine that follows will always be reset to its starting state)
  • you may or may not want to preserve whitespace (which would again cause an extra token, and a lot of extra states later in the DFA or NFA recogniser)
  • maybe you need some kind of equivalence class: e.g. map all numeric strings to one token type, fold lower/upper case, accept some degree of misspelling (difficult!)
  • you might need special dummy token types for beginning of line, end of line and the like.
  • you must make some choice about the number of false positives versus the number of false negatives that you allow.
  • if there is text involved, make sure that all the sources are in the same canonical encoding, or preprocess them to bring them into the same encoding.
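To make those points concrete, here is a minimal tokeniser sketch; the token names (NUMBER, WORD, GARBAGE, ...) and the regex rules are illustrative assumptions, not a fixed design:

    import re
    import unicodedata

    TOKEN_RULES = [
        ("NUMBER",  re.compile(r"\d+")),     # equivalence class: any numeric string
        ("WORD",    re.compile(r"[a-z]+")),  # matched after case folding
        ("NEWLINE", re.compile(r"\n")),      # doubles as an end-of-line dummy token
        ("SPACE",   re.compile(r"[ \t]+")),  # preserved or dropped, your choice
    ]

    def tokenise(text, keep_whitespace=False):
        # Canonicalise first (here: Unicode NFC normalisation plus lowercasing).
        text = unicodedata.normalize("NFC", text).lower()
        tokens, i = [("BOL", "")], 0         # dummy begin-of-line token
        while i < len(text):
            for name, rx in TOKEN_RULES:
                m = rx.match(text, i)
                if m:
                    if name != "SPACE" or keep_whitespace:
                        tokens.append((name, m.group()))
                    i = m.end()
                    break
            else:
                # Anything unmatched becomes the garbage token, which should
                # reset the downstream state machine to its starting state.
                tokens.append(("GARBAGE", text[i]))
                i += 1
        return tokens

    print(tokenise("Call 911 now!\n"))
    # [('BOL', ''), ('WORD', 'call'), ('NUMBER', '911'), ('WORD', 'now'),
    #  ('GARBAGE', '!'), ('NEWLINE', '\n')]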

Building the tokeniser is an excellent way to investigate your corpus: if it is real data from the outside world, you will be amazed by the funky cases you did not even know existed when you started!

The second step (the recogniser) will probably be much easier, given the right tokenisation. For a normal deterministic state machine (with predefined sequences to recognise) you can use the standard algorithms from the Dragon Book or from Crochemore.
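As a sketch of the recogniser (not the full DFA construction from those books): keeping a set of active partial matches simulates the automaton directly, which is easier to follow at the cost of some speed. The patterns and the single-character tokens here are just for illustration.

    # Sketch: recognise predefined token sequences by tracking partial matches.
    PATTERNS = [("A", "B", "C"), ("A", "B", "C", "D")]    # illustrative sequences

    def recognise(tokens, patterns=PATTERNS):
        """Yield (end_index, pattern) for every occurrence in the token stream."""
        active = []                              # (pattern, tokens matched so far)
        for i, tok in enumerate(tokens):
            if tok == "GARBAGE":                 # the garbage token resets the machine
                active = []
                continue
            active.append((None, 0))             # every token may start a new match
            next_active = []
            for pat, n in active:
                for p in (patterns if pat is None else [pat]):
                    if p[n] == tok:
                        if n + 1 == len(p):
                            yield (i, p)         # a complete pattern ends here
                        else:
                            next_active.append((p, n + 1))
            active = next_active

    print(list(recognise(list("ABCDABCABCD"))))
    # matches end at indices 2, 3, 6, 9 and 10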

For fuzzy, self-learning matchers, I would start by building Markov chains or trees (maybe Bayes trees; I am not an expert on this). I don't think it will be very hard to start with a standard state machine and add some weights and counts to the nodes and edges, and then dynamically add edges to the graph, or remove them (this is where I expect it to start getting hard).
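A rough sketch of that direction, assuming a first-order Markov chain whose edges carry counts and are created the first time a transition is seen; the prune() threshold is just one simple way of removing weak edges again:

    from collections import defaultdict

    class MarkovChain:
        def __init__(self):
            # counts[a][b] = how often token b followed token a
            self.counts = defaultdict(lambda: defaultdict(int))

        def observe(self, tokens):
            for a, b in zip(tokens, tokens[1:]):
                self.counts[a][b] += 1            # add or reinforce an edge

        def prune(self, min_count=2):
            for a in list(self.counts):
                for b in list(self.counts[a]):
                    if self.counts[a][b] < min_count:
                        del self.counts[a][b]     # drop edges seen too rarely

        def probability(self, a, b):
            total = sum(self.counts[a].values())
            return self.counts[a][b] / total if total else 0.0

    chain = MarkovChain()
    chain.observe(list("ABCDABCABC"))
    print(chain.probability("C", "D"))            # 0.5: C was followed by D once, by A once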

A strategic decision: do you need a database? If your model fits in core, you won't need one and you should not use one (databases are not intended to fetch one row, process it, store it, then fetch the next row, etc.). If your data does not fit in core, you'll have more than a data-modelling problem. BTW: all the DNA assemblers/matchers that I know of work in core and with flat files (maybe backed by a database for easy management and inspection).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow