Parsing a chemical formula from a string in C#?

https://stackoverflow.com/questions/4116786

29-09-2019

Question

I am trying to parse a chemical formula (in a format such as Al2O3, O3, C, or C11H22O12) from a string in C#. It works fine unless there is only one atom of a particular element (e.g. the oxygen atom in H2O). How can I fix that problem, and is there a better way to parse a chemical formula string than the way I am doing it?

ChemicalElement is a class representing a chemical element. It has properties AtomicNumber (int), Name (string), Symbol (string). ChemicalFormulaComponent is a class representing a chemical element and atom count (e.g. part of a formula). It has properties Element (ChemicalElement), AtomCount (int).
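
For readers who want something to compile the snippets in this thread against, the two classes described above might look roughly like the sketch below. Only the property names and ElementFromSymbol appear in this post; the constructors, the four-entry lookup table, and the ToString override are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public sealed class ChemicalElement
{
    public int AtomicNumber { get; }
    public string Name { get; }
    public string Symbol { get; }

    private ChemicalElement(int atomicNumber, string name, string symbol)
    {
        AtomicNumber = atomicNumber;
        Name = name;
        Symbol = symbol;
    }

    // Hypothetical lookup table; a real one would list every element.
    private static readonly Dictionary<string, ChemicalElement> BySymbol =
        new Dictionary<string, ChemicalElement>
        {
            ["H"] = new ChemicalElement(1, "Hydrogen", "H"),
            ["C"] = new ChemicalElement(6, "Carbon", "C"),
            ["O"] = new ChemicalElement(8, "Oxygen", "O"),
            ["Al"] = new ChemicalElement(13, "Aluminium", "Al"),
        };

    // Assumed behaviour: throw when the symbol is not in the table.
    public static ChemicalElement ElementFromSymbol(string symbol)
    {
        if (symbol != null && BySymbol.TryGetValue(symbol, out ChemicalElement element))
            return element;
        throw new ArgumentException("Unknown element symbol: " + symbol, nameof(symbol));
    }

    // Assumption: printing an element shows its symbol.
    public override string ToString() => Symbol;
}

public sealed class ChemicalFormulaComponent
{
    public ChemicalElement Element { get; }
    public int AtomCount { get; }

    public ChemicalFormulaComponent(ChemicalElement element, int atomCount)
    {
        Element = element;
        AtomCount = atomCount;
    }
}
```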

The rest should be clear enough to understand (I hope) but please let me know with a comment if I can clarify anything, before you answer.

Here is my current code:

    /// <summary>
    /// Parses a chemical formula from a string.
    /// </summary>
    /// <param name="chemicalFormula">The string to parse.</param>
    /// <exception cref="FormatException">The chemical formula was in an invalid format.</exception>
    public static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
    {
        Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();

        string nameBuffer = string.Empty;
        int countBuffer = 0;

        for (int i = 0; i < chemicalFormula.Length; i++)
        {
            char c = chemicalFormula[i];

            if (!char.IsLetterOrDigit(c) || !char.IsUpper(chemicalFormula, 0))
            {
                throw new FormatException("Input string was in an incorrect format.");
            }
            else if (char.IsUpper(c))
            {
                // Add the chemical element and its atom count
                if (countBuffer > 0)
                {
                    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer));

                    // Reset
                    nameBuffer = string.Empty;
                    countBuffer = 0;
                }

                nameBuffer += c;
            }
            else if (char.IsLower(c))
            {
                nameBuffer += c;
            }
            else if (char.IsDigit(c))
            {
                if (countBuffer == 0)
                {
                    countBuffer = c - '0';
                }
                else
                {
                    countBuffer = (countBuffer * 10) + (c - '0');
                }
            }
        }

        return formula;
    }

Solution

I rewrote your parser using regular expressions. Regular expressions fit the bill perfectly for what you're doing. Hope this helps.

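// Requires: using System; using System.Collections.Generic;
// using System.Collections.ObjectModel; using System.Text.RegularExpressions;
public static void Main(string[] args)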
{
    var testCases = new List<string>
    {
        "C11H22O12",
        "Al2O3",
        "O3",
        "C",
        "H2O"
    };

    foreach (string testCase in testCases)
    {
        Console.WriteLine("Testing {0}", testCase);

        var formula = FormulaFromString(testCase);

        foreach (var element in formula)
        {
            // Uses the property names given in the question (Symbol, AtomCount).
            Console.WriteLine("{0} : {1}", element.Element.Symbol, element.AtomCount);
        }
        Console.WriteLine();
    }

    /* Produced the following output

    Testing C11H22O12
    C : 11
    H : 22
    O : 12

    Testing Al2O3
    Al : 2
    O : 3

    Testing O3
    O : 3

    Testing C
    C : 1

    Testing H2O
    H : 2
    O : 1
    */
}

private static Collection<ChemicalFormulaComponent> FormulaFromString(string chemicalFormula)
{
    Collection<ChemicalFormulaComponent> formula = new Collection<ChemicalFormulaComponent>();
    string elementRegex = "([A-Z][a-z]*)([0-9]*)";
    string validateRegex = "^(" + elementRegex + ")+$";

    if (!Regex.IsMatch(chemicalFormula, validateRegex))
        throw new FormatException("Input string was in an incorrect format.");

    foreach (Match match in Regex.Matches(chemicalFormula, elementRegex))
    {
        string name = match.Groups[1].Value;

        int count =
            match.Groups[2].Value != "" ?
            int.Parse(match.Groups[2].Value) :
            1;

        formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(name), count));
    }

    return formula;
}
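
As a possible variation on the code above, validation and extraction can be folded into one anchored match, because .NET keeps every capture of a repeated group in Group.Captures. The sketch below assumes that approach; the class name RegexFormulaParser and the tuple return type are made up for illustration and stand in for the ChemicalFormulaComponent collection.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexFormulaParser
{
    // The whole string must be a run of "Symbol[count]" pairs.
    private static readonly Regex Pattern =
        new Regex(@"^(?:(?<sym>[A-Z][a-z]*)(?<num>[0-9]*))+$");

    public static List<(string Symbol, int Count)> Parse(string formula)
    {
        Match m = Pattern.Match(formula);
        if (!m.Success)
            throw new FormatException("Input string was in an incorrect format.");

        // Each repetition of the outer group adds one capture to "sym" and one
        // (possibly empty) capture to "num", so the two collections line up by index.
        CaptureCollection symbols = m.Groups["sym"].Captures;
        CaptureCollection numbers = m.Groups["num"].Captures;

        var result = new List<(string Symbol, int Count)>();
        for (int i = 0; i < symbols.Count; i++)
        {
            string digits = numbers[i].Value;
            // A missing number means a single atom.
            result.Add((symbols[i].Value, digits.Length == 0 ? 1 : int.Parse(digits)));
        }
        return result;
    }
}
```

Calling RegexFormulaParser.Parse("H2O") should yield the pairs (H, 2) and (O, 1), while an input such as "h2o" fails the anchored match and throws FormatException.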

Other tips

The problem with your method is here:

            // Add the chemical element and its atom count
            if (countBuffer > 0)

When an element has no number after it, countBuffer is still 0, so that element is never added. I think this will work:

            // Add the chemical element and its atom count
            if (countBuffer > 0 || nameBuffer != String.Empty)

This will work for formulas like HO2. Note that countBuffer is still 0 for an element without a number, so a count of 0 has to be treated as 1 when the component is created. I also believe that your method never inserts the last element of the chemical formula into the formula collection.

You should add the last element in the buffer to the collection before returning the result, like this:

    formula.Add(new ChemicalFormulaComponent(ChemicalElement.ElementFromSymbol(nameBuffer), countBuffer > 0 ? countBuffer : 1));

    return formula;
}
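
Putting both fixes together, a corrected version of the character loop might look like the sketch below. It is an illustration under assumptions, not the asker's final code: the class name LoopFormulaParser and the tuple return type are hypothetical, and a sawDigit flag replaces the countBuffer > 0 test so that "no number" and "count not yet read" cannot be confused.

```csharp
using System;
using System.Collections.Generic;

public static class LoopFormulaParser
{
    public static List<(string Symbol, int Count)> Parse(string formula)
    {
        // The formula must start with an upper-case letter.
        if (string.IsNullOrEmpty(formula) || formula[0] < 'A' || formula[0] > 'Z')
            throw new FormatException("Input string was in an incorrect format.");

        var result = new List<(string Symbol, int Count)>();
        string nameBuffer = string.Empty;
        int countBuffer = 0;
        bool sawDigit = false;

        foreach (char c in formula)
        {
            if (c >= 'A' && c <= 'Z')
            {
                // A new symbol starts: flush the previous one, defaulting to 1 atom.
                if (nameBuffer.Length > 0)
                    result.Add((nameBuffer, sawDigit ? countBuffer : 1));
                nameBuffer = c.ToString();
                countBuffer = 0;
                sawDigit = false;
            }
            else if (c >= 'a' && c <= 'z' && !sawDigit)
            {
                nameBuffer += c;
            }
            else if (c >= '0' && c <= '9')
            {
                countBuffer = countBuffer * 10 + (c - '0');
                sawDigit = true;
            }
            else
            {
                // Anything else, including a lower-case letter after a digit.
                throw new FormatException("Input string was in an incorrect format.");
            }
        }

        // Flush the last element, which the original loop never added.
        result.Add((nameBuffer, sawDigit ? countBuffer : 1));
        return result;
    }
}
```

With this sketch, LoopFormulaParser.Parse("H2O") should return (H, 2) followed by (O, 1), and Parse("C") should return the single pair (C, 1).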

First of all: I haven't used a parser generator in .NET, but I'm pretty sure you could find something appropriate. This would allow you to write the grammar of chemical formulas in a far more readable form. See, for example, this question for a start.

If you want to keep your approach: is it possible that you never add your last element, whether or not it has a number? You might want to run your loop with i <= chemicalFormula.Length and, when i == chemicalFormula.Length, also add what you have buffered to your formula. You then also have to remove your if (countBuffer > 0) condition, because countBuffer can actually be zero!

Regex should work fine for simple formulas. If you want to split something like:

(Zn2(Ca(BrO4))K(Pb)2Rb)3

it might be easier to use a parser for it (because of the compound nesting). Any parser should be capable of handling it.
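
To make the nesting point concrete, here is a minimal hand-written recursive-descent sketch. It is not tied to any particular parser library; the class name NestedFormulaParser and the choice to return total atom counts per symbol (rather than a tree) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class NestedFormulaParser
{
    public static Dictionary<string, int> CountAtoms(string formula)
    {
        int pos = 0;
        Dictionary<string, int> totals = ParseSequence(formula, ref pos);
        if (pos != formula.Length) // stopped early, e.g. on a stray ')'
            throw new FormatException("Unexpected character at position " + pos + ".");
        return totals;
    }

    // sequence := (ATOM | '(' sequence ')') NUM? ... until ')' or end of input
    private static Dictionary<string, int> ParseSequence(string s, ref int pos)
    {
        var totals = new Dictionary<string, int>();
        int start = pos;
        while (pos < s.Length && s[pos] != ')')
        {
            Dictionary<string, int> part;
            if (s[pos] == '(')
            {
                pos++; // consume '('
                part = ParseSequence(s, ref pos);
                if (pos >= s.Length || s[pos] != ')')
                    throw new FormatException("Missing ')'.");
                pos++; // consume ')'
            }
            else if (s[pos] >= 'A' && s[pos] <= 'Z')
            {
                int begin = pos++;
                while (pos < s.Length && s[pos] >= 'a' && s[pos] <= 'z') pos++;
                part = new Dictionary<string, int> { [s.Substring(begin, pos - begin)] = 1 };
            }
            else
            {
                throw new FormatException("Unexpected character at position " + pos + ".");
            }

            // An optional number multiplies the atom or the whole group.
            int multiplier = ParseOptionalNumber(s, ref pos);
            foreach (KeyValuePair<string, int> pair in part)
            {
                totals.TryGetValue(pair.Key, out int current);
                totals[pair.Key] = current + pair.Value * multiplier;
            }
        }
        if (pos == start)
            throw new FormatException("Empty formula or group.");
        return totals;
    }

    private static int ParseOptionalNumber(string s, ref int pos)
    {
        int begin = pos;
        while (pos < s.Length && s[pos] >= '0' && s[pos] <= '9') pos++;
        return pos == begin ? 1 : int.Parse(s.Substring(begin, pos - begin));
    }
}
```

For the formula above, CountAtoms should give Zn 6, Ca 3, Br 3, O 12, K 3, Pb 6 and Rb 3, since the trailing 3 multiplies everything inside the outer parentheses.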

I spotted this problem a few days ago and thought it would be a good example of how one can write a grammar for a parser, so I included a simple chemical formula grammar in my NLT suite. The key rules are, for the lexer:

"(" -> LPAREN;
")" -> RPAREN;

/[0-9]+/ -> NUM, Convert.ToInt32($text);
/[A-Z][a-z]*/ -> ATOM;

and for the parser:

comp -> e:elem { e };

elem -> LPAREN e:elem RPAREN n:NUM? { new Element(e,$(n : 1)) }
      | e:elem++ { new Element(e,1) }
      | a:ATOM n:NUM? { new Element(a,$(n : 1)) }
      ;

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow