Question

Do the space-related benefits of using an on-the-fly parser outweigh the time-related benefits of a pre-generated lookup table?


Long version:

I am authoring a chemistry reference tool, and am including a feature that will automatically name formulae conforming to a specific pattern; e.g. C[n]H[2n+2] => [n]ane; where [n] is an integer for the LHS; and an index into an array of names on the RHS. (meth, eth, …)

As far as I can see, this can be implemented in one of two ways:

  1. I pre-generate a key/value dual lookup dictionary of formula <=> name pairs; either when the application starts (slower startup), or a static list which is published with the application (slower download).

  2. Formulae are evaluated on the fly by a custom-built parser.

In approach 1. name => formula lookup becomes simpler by an order of magnitude; but the generator will, unless I want to ship dozens of megabytes of data with the application, have to have a preset, and fairly low, value for n.

Compounding this is the fact that formulae can have several terms; such as C[n]H[2n+1]OC[n']H[2n'+1]; and for each of these, the number of possible matches increases geometrically with n. Additionally, using this approach would eat RAM like nobody's business.

Approach 2. lets me support fairly large values of n using a fairly small lookup table, but makes name => formula lookup somewhat more complex. Compared to the pre-generation to file for shipping with the application, it also lets me correct errors in the generation logic without having to ship new data files.

This also requires that each formula be matched against a cursory test for several rules, determining if it could fit; which, if there are a lot of rules, takes time that might lead to noticeable slowdowns in the interface.

The question then, is:

  1. Are there any considerations in the tradeoff I have failed to account for, or approaches that I haven't considered?

  2. Do the benefits of using an on-the-fly parser justify the increased implementation complexity?

Was it helpful?

Solution

You should go with the second approach.

One possible solution is a greedy algorithm. Define your set of transformations as a regular expression (used to test the pattern) and a function which is given the regexp match object and returns the transformed string.

Regular expressions aren't quite powerful enough to handle what you want directly. Instead you'll have to do something like:

m = re.match(r"C\[(\d+)\]H\[(\d+)]\]", formula)
if m:
    C_count, H_count = int(m.group(1)), int(m.group(2))
    match_size = len(m.group(0))
    if C_count*2+2 == H_count:
        replacement = alkane_lookup[C_count]
    elif C_count*2 == H_count:
        replacement = alkene_lookup[C_count]
    ...
    else:
        replacement = m.group(0)  # no replacement available

(plus a lot more for the other possibilities)

then embed that in a loop which looks like:

formula = "...."
new_formula = ""
while formula:
    match_size, replacement = find_replacement(formula)
    new_formula += replacement
    formula = formula[match_size:]

(You'll need to handle the case where nothing matches. One possible way is to include a list of all possible elements at the end of find_replacement(), which only returns the next element and counts.)

This is a greedy algorithm, which doesn't guarantee the smallest solution. That's more complicated, but since chemists themselves have different ideas of the right form, I wouldn't worry so much about it.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top