Question

As an answer to my question Tokenizing an infix string in Java, I got the regex (?<=[^\.a-zA-Z\d])|(?=[^\.a-zA-Z\d]. However, now I'm writing the same code in Javascript, and I'm stuck as to how I would get a Javascript regex to do the same thing.

For example, if I have the string sin(4+3)*2, I would need it parsed into ["sin","(","4","+","3",")","*","2"]

What regex would I use to tokenize the string into each individual part.

Before, what I did is I just did a string replace of every possible token, and put a space around it, then split on that whitespace. However, that code quickly became very bloated.

The operators I would need to split on would be the standard math operators (+,-,*,/,^), as well as function names (sin,cos,tan,abs,etc...), and commas

What is a fast, efficient way to do this?

Was it helpful?

Solution

You can take advantage of regular expression grouping to do this. You need a regex that combines the different possible tokens, and you apply it repeatedly.

I like to separate out the different parts; it makes it easier to maintain and extend:

var tokens = [
  "sin",
  "cos",
  "tan",
  "\\(",
  "\\)",
  "\\+",
  "-",
  "\\*",
  "/",
  "\\d+(?:\\.\\d*)?"
];

You glue those all together into a big regular expression with | between each token:

var rtok = new RegExp( "\\s*(?:(" + tokens.join(")|(") + "))\\s*", "g" );

You can then tokenize using regex operations on your source string:

function tokenize( expression ) {
  var toks = [], p;

  rtok.lastIndex = p = 0; // reset the regex
  while (rtok.lastIndex < expression.length) {
    var match = rtok.exec(expression);

    // Make sure we found a token, and that we found
    // one without skipping garbage

    if (!match || rtok.lastIndex - match[0].length !== p)
      throw "Oops - syntax error";

    // Figure out which token we matched by finding the non-null group
    for (var i = 1; i < match.length; ++i) {
      if (match[i]) {
        toks.push({
          type: i,
          txt: match[i]
        });
        // remember the new position in the string
        p = rtok.lastIndex;
        break;
      }
    }
  }
  return toks;
}

That just repeatedly matches the token regex against the string. The regular expression was created with the "g" flag, so the regex machinery will automatically keep track of where to start matching after each match we make. When it doesn't see a match, or when it does but has to skip invalid stuff to find it, we know there's a syntax error. When it does match, it records in the token array which token it matched (the index of the non-null group) and the matched text. By remembering the matched token index, it saves you the trouble of having to figure out what each token string means after you've tokenized; you just have to do a simple numeric comparison.

Thus calling tokenize( "sin(4+3) * cos(25 / 3)" ) returns:

[ { type: 1, txt: 'sin' },
  { type: 4, txt: '(' },
  { type: 10, txt: '4' },
  { type: 6, txt: '+' },
  { type: 10, txt: '3' },
  { type: 5, txt: ')' },
  { type: 8, txt: '*' },
  { type: 2, txt: 'cos' },
  { type: 4, txt: '(' },
  { type: 10, txt: '25' },
  { type: 9, txt: '/' },
  { type: 10, txt: '3' },
  { type: 5, txt: ')' } ]

Token type 1 is the sin function, type 4 is left paren, type 10 is a number, etc.

edit — if you want to match identifiers like "x" and "y", then I'd probably use a different set of token patterns, with one just to match any identifiers. That'd mean that the parser would not find out directly about "sin" and "cos" etc. from the lexer, but that's OK. Here's an alternative list of token patterns:

var tokens = [
  "[A-Za-z_][A-Za-z_\d]*",
  "\\(",
  "\\)",
  "\\+",
  "-",
  "\\*",
  "/",
  "\\d+(?:\\.\\d*)?"
];

Now any identifier will be a type 1 token.

OTHER TIPS

I don't know if this will do everything of what you want to achieve, but it works for me:

'sin(4+3)*2'.match(/\d+\.?\d*|[a-zA-Z]+|\S/g);

// ["sin", "(", "4", "+", "3", ")", "*", "2"]

You may replace [a-zA-Z]+ part with sin|cos|tan|etc to support only math functions.

Just offer up a few possibilities:

[a-zA-Z]+|\d+(?:\.\d+)?|.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top