Writing a mini-language

https://stackoverflow.com/questions/615040

03-07-2019
|

Question

I have an application that needs to allow users to write expressions similar to excel:

(H1 + (D1 / C3)) * I8

and more complex things like

If(H1 = 'True', D3 * .2, D3 * .5)

I can only do so much with regular expressions. Any suggestions as to the right approach to doing this as well as any resources I can learn from would be much appreciated.

Thanks!

Solution

Some other question, you will find hints in:

Good luck!

OTHER TIPS

When faced with a similar situation - the need to handle short one-line expressions - I wrote a parser. The expressions were boolean logic, of the form

n1 = y and n2 > z
n2 != x or (n3 > y and n4 = z)

and so on. In english you could say that there are atoms joined by AND and OR, and each atom has three elements - a left-hand-side attribute, an operator, and a value. Because it was so succint I think the parsing was easier. The set of possible attributes is known and limited (eg: name, size, time). The operators vary by attribute: different attributes take different sets of operators. And the range and format of possible values vary according to attribute as well.

To parse, I split the string on whitespace using String.Split(). I later realized that prior to Split(), I needed to normalize the input string - inserting whitespace before and after parens. I did that with a regex.Replace().

The output of the split is an array of tokens. Then parsing occurs in a big for loop with a switch on the left-hand-side attribute value. With each go-round of the loop, I was set to slurp in a group of tokens. If the first token was an open-paren, then the group was just one token in length: the paren itself. For tokens that were well-known names - my attribute values - the parser had to slurp in a group of 3 tokens, one each for the name, the operator, and the value. If at any point there are not enough tokens, the parser throws an exception. Based on the stream of tokens, the parser state would change. A conjunction (AND,OR,XOR) meant to push the prior atom onto a stack, and when the next atom was finished, I'd pop the prior atom and join those two atoms into a compound atom. And so on. The state management happened at the end of each loop of the parser.

Atom current;
for (int i=0; i < tokens.Length; i++) 
{
  switch (tokens[i].ToLower())
  {
    case "name":
        if (tokens.Length <= i + 2)
            throw new ArgumentException();
        Comparison o = (Comparison) EnumUtil.Parse(typeof(Comparison), tokens[i+1]);
        current = new NameAtom { Operator = o, Value = tokens[i+2] };
        i+=2;
        stateStack.Push(ParseState.AtomDone);
        break;
    case "and": 
    case "or":
        if (tokens.Length <= i + 3) 
          throw new ArgumentException();
        pendingConjunction = (LogicalConjunction)Enum.Parse(typeof(LogicalConjunction), tokens[i].ToUpper());
        current = new CompoundAtom { Left = current, Right = null, Conjunction = pendingConjunction };
        atomStack.Push(current);
        break;

    case "(":
        state = stateStack.Peek();
        if (state != ParseState.Start && state != ParseState.ConjunctionPending && state != ParseState.OpenParen)
          throw new ArgumentException();
        if (tokens.Length <= i + 4)
          throw new ArgumentException();
        stateStack.Push(ParseState.OpenParen);
        break;

    case ")":
        state = stateStack.Pop();
        if (stateStack.Peek() != ParseState.OpenParen)
            throw new ArgumentException();
        stateStack.Pop();
        stateStack.Push(ParseState.AtomDone);
        break;

    // more like that...
    case "":
       // do nothing in the case of whitespace
       break;
    default:
        throw new ArgumentException(tokens[i]);
  }

  // insert housekeeping for parse states here

}

That's simplified, just a little. But the idea is that each case statement is fairly simple. It's easy to parse in an atomic unit of the expression. The tricky part was joining them all together appropriately.

That trick was accomplished in the housekeeping section, at the end of each slurp-loop, using the state stack and the atom stack. Different stuff can happen according to the parser state. As I said, in each case statement, the parser state might change, with the prior state getting pushed onto a stack. Then at the end of the switch statement, if the state said I had just finished parsing an atom, and there was a pending conjunction, I'd move the just-parsed atom into the CompoundAtom. The code looks like this:

            state = stateStack.Peek();
            if (state == ParseState.AtomDone)
            {
                stateStack.Pop();
                if (stateStack.Peek() == ParseState.ConjunctionPending)
                {
                    while (stateStack.Peek() == ParseState.ConjunctionPending)
                    {
                        var cc = critStack.Pop() as CompoundAtom;
                        cc.Right = current;
                        current = cc; // mark the parent as current (walk up the tree)
                        stateStack.Pop();   // the conjunction is no longer pending 

                        state = stateStack.Pop();
                        if (state != ParseState.AtomDone)
                            throw new ArgumentException();
                    }
                }
                else stateStack.Push(ParseState.AtomDone); 
            }

The one other bit of magic was the EnumUtil.Parse. That allows me to parse things like "<" into an enum value. Suppose you define your enums like this:

internal enum Operator
{
    [Description(">")]   GreaterThan,
    [Description(">=")]  GreaterThanOrEqualTo,
    [Description("<")]   LesserThan,
    [Description("<=")]  LesserThanOrEqualTo,
    [Description("=")]   EqualTo,
    [Description("!=")]  NotEqualTo
}

Normally Enum.Parse looks for the symbolic name of the enum value, and < is not a valid symbolic name. EnumUtil.Parse() looks for the thing in the description. The code looks like this:

internal sealed class EnumUtil
{
    /// <summary>
    /// Returns the value of the DescriptionAttribute if the specified Enum value has one.
    /// If not, returns the ToString() representation of the Enum value.
    /// </summary>
    /// <param name="value">The Enum to get the description for</param>
    /// <returns></returns>
    internal static string GetDescription(System.Enum value)
    {
        FieldInfo fi = value.GetType().GetField(value.ToString());
        var attributes = (DescriptionAttribute[])fi.GetCustomAttributes(typeof(DescriptionAttribute), false);
        if (attributes.Length > 0)
            return attributes[0].Description;
        else
            return value.ToString();
    }

    /// <summary>
    /// Converts the string representation of the name or numeric value of one or more enumerated constants to an equivilant enumerated object.
    /// Note: Utilised the DescriptionAttribute for values that use it.
    /// </summary>
    /// <param name="enumType">The System.Type of the enumeration.</param>
    /// <param name="value">A string containing the name or value to convert.</param>
    /// <returns></returns>
    internal static object Parse(Type enumType, string value)
    {
        return Parse(enumType, value, false);
    }

    /// <summary>
    /// Converts the string representation of the name or numeric value of one or more enumerated constants to an equivilant enumerated object.
    /// A parameter specified whether the operation is case-sensitive.
    /// Note: Utilised the DescriptionAttribute for values that use it.
    /// </summary>
    /// <param name="enumType">The System.Type of the enumeration.</param>
    /// <param name="value">A string containing the name or value to convert.</param>
    /// <param name="ignoreCase">Whether the operation is case-sensitive or not.</param>
    /// <returns></returns>
    internal static object Parse(Type enumType, string stringValue, bool ignoreCase)
    {
        if (ignoreCase)
            stringValue = stringValue.ToLower();

        foreach (System.Enum enumVal in System.Enum.GetValues(enumType))
        {
            string description = GetDescription(enumVal);
            if (ignoreCase)
                description = description.ToLower();
            if (description == stringValue)
                return enumVal;
        }

        return System.Enum.Parse(enumType, stringValue, ignoreCase);
    }

}

I got that EnumUtil.Parse() thing from somewhere else. Maybe here?

A little recursive-descent parser is perfect for this. You probably don't even have to build a parse tree - you can do the evaluation as you parse.

 /* here's a teeny one in C++ */
void ScanWhite(const char* &p){
  while (*p==' ') p++;
}

bool ParseNum(const char* &p, double &v){
  ScanWhite(p);
  if (!DIGIT(*p)) return false;
  const char* p0 = p;
  while(DIGIT(*p)) p++;
  if (*p == '.'){
    p++;
    while(DIGIT(*p)) p++;
  }
  v = /* value of characters p0 up to p */;
  return true;
}

bool ParseId(const char* &p, double &v){
  ScanWhite(p);
  if (ALPHA(p[0]) && DIGIT(p[1])){
    v = /* value of cell whose name is p[0], p[1] */;
    p += 2;
    return true;
  }
  return false;
}

bool ParseChar(const char* &p, char c){
  ScanWhite(p);
  if (*p != c) return false;
  p++;
  return true;
}

void ParseExpr(const char* &p, double &v); /* forward declaration */

void ParsePrimitive(const char* &p, double &v){
  if (ParseNum(p, v));
  else if (ParseId(p, v));
  else if (ParseChar(p, '(')){
    ParseExpr(p, v);
    if (!ParseChar(p, ')'){/* throw syntax error */}
  }
  else {/* throw syntax error */}
}
#define PARSE_HIGHER ParsePrimitive

void ParseUnary(const char* &p, double &v){
  if (ParseChar(p, '-')){
    ParseUnary(p, v);
    v = -v;
  }
  else {
    PARSE_HIGHER(p, v);
  }
}
#undef  PARSE_HIGHER
#define PARSE_HIGHER ParseUnary

void ParseProduct(const char* &p, double &v){
  double v2;
  PARSE_HIGHER(p, v);
  while(true){
    if (ParseChar(p, '*')){
      PARSE_HIGHER(p, v2);
      v *= v2;
    }
    else if (ParseChar(p, '/')){
      PARSE_HIGHER(p, v2);
      v /= v2;
    }
    else break;
  }
}
#undef  PARSE_HIGHER
#define PARSE_HIGHER ParseProduct

void ParseSum(const char* &p, double &v){
  double v2;
  PARSE_HIGHER(p, v);
  while(true){
    if (ParseChar(p, '+')){
      PARSE_HIGHER(p, v2);
      v += v2;
    }
    else if (ParseChar(p, '-')){
      PARSE_HIGHER(p, v2);
      v -= v2;
    }
    else break;
  }
}
#undef  PARSE_HIGHER
#define PARSE_HIGHER ParseSum

void ParseExpr(const char* &p, double &v){
  PARSE_HIGHER(p, v);
}

double ParseTopLevel(const char* buf){
  const char* p = buf;
  double v;
  ParseExpr(p, v);
  return v;
}

Now if you just call ParseTop, it will calculate the value of an expression for you.

The reason for the PARSE_HIGHER macro is to make it easier to add operators at intermediate levels of precedence.

To do the "if" statement is a little more involved. Each parse routine needs an additional "enable" argument, so it does no calculation unless it's enabled. Then you parse the word "if", parse the test expression, and then parse the two result expressions, with the inactive one disabled.

You could use the .NET JScript compiler, or interface with IronPython, IronRuby or IronScheme (named alphabetically, not preference ;p ).

I've got a counter-example of how not to do it: Will o’ the Wisp (since this is my own code I feel confident criticizing it).

What's good about the Code?

It uses a design pattern consequently: The interpreter pattern
It has a rather clean design
It uses attributes in a nice way.
It produces nice graphics. ;-)

Turtle graphics http://i3.codeplex.com/Project/Download/FileDownload.aspx?ProjectName=wisp&DownloadId=34823

What's bad about the code?

It's slow!
The language is ill-defined with regards to lists (data vs. code).

Check out ANTLR. You define a language syntax, test it using a GUI tool and generate source code in a variety of languages. Open Source.

I would recommend the book Constructing Little Languages. It takes you through many compiler basics needed for completing this task properly.

You've brought up the fact that regular expressions will not work unless you have some stringent limits on your language. Like others have said, a Recursive Descent Parser will do the trick.

The choice next would be whether to use a Parser Generator like ANTLR, or to write one from scratch.

Have a look at this open source project:

Excel Financial Functions

I recommend to look at CoreCalc/FunCalc work: http://www.itu.dk/people/sestoft/funcalc/

I've used their grammar for COCO\R parser generator in production and it works really fast.

All you need to do is: 1. get excel grammar from corecalc 2. run coco.exe on it (generates parser for excel-like expressions) 3. translate expression tree to reverse polish notation 4. simple calc

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow