You can take advantage of regular expression grouping to do this. You need a regex that combines the different possible tokens, and you apply it repeatedly.
I like to separate out the different parts; it makes it easier to maintain and extend:
var tokens = [
    "sin",
    "cos",
    "tan",
    "\\(",
    "\\)",
    "\\+",
    "-",
    "\\*",
    "/",
    "\\d+(?:\\.\\d*)?"   // number: digits, optionally followed by "." and more digits
];
You glue those all together into a big regular expression with |
between each token:
var rtok = new RegExp( "\\s*(?:(" + tokens.join(")|(") + "))\\s*", "g" );
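For the token list above, the pattern string that gets passed to new RegExp works out to:

\s*(?:(sin)|(cos)|(tan)|(\()|(\))|(\+)|(-)|(\*)|(/)|(\d+(?:\.\d*)?))\s*

Each token pattern sits in its own capturing group, and the \s* on either side lets the tokenizer skip whitespace between tokens.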
You can then tokenize using regex operations on your source string:
function tokenize( expression ) {
    var toks = [], p;
    rtok.lastIndex = p = 0; // reset the regex
    while (rtok.lastIndex < expression.length) {
        var match = rtok.exec(expression);
        // Make sure we found a token, and that we found
        // one without skipping garbage
        if (!match || rtok.lastIndex - match[0].length !== p)
            throw "Oops - syntax error";
        // Figure out which token we matched by finding the non-null group
        for (var i = 1; i < match.length; ++i) {
            if (match[i]) {
                toks.push({
                    type: i,
                    txt: match[i]
                });
                // remember the new position in the string
                p = rtok.lastIndex;
                break;
            }
        }
    }
    return toks;
}
That just repeatedly matches the token regex against the string. Because the regular expression was created with the "g" flag, the regex machinery automatically keeps track of where to start matching after each successful match. If it doesn't find a match at all, or finds one only by skipping over invalid characters, we know there's a syntax error. When it does match, it records in the token array which token was matched (the index of the non-null capture group) along with the matched text. Recording that group index saves you the trouble of figuring out what each token string means after you've tokenized; the parser only has to do a simple numeric comparison.
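As a quick check of that error path (the input here is just an example containing a character no token matches), a stray "$" makes the function throw:

try {
    tokenize( "sin(4 + $)" );
} catch (e) {
    console.log(e);   // "Oops - syntax error"
}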
On valid input, calling tokenize( "sin(4+3) * cos(25 / 3)" )
returns:
[ { type: 1, txt: 'sin' },
{ type: 4, txt: '(' },
{ type: 10, txt: '4' },
{ type: 6, txt: '+' },
{ type: 10, txt: '3' },
{ type: 5, txt: ')' },
{ type: 8, txt: '*' },
{ type: 2, txt: 'cos' },
{ type: 4, txt: '(' },
{ type: 10, txt: '25' },
{ type: 9, txt: '/' },
{ type: 10, txt: '3' },
{ type: 5, txt: ')' } ]
Token type 1 is the sin
function, type 4 is left paren, type 10 is a number, etc.
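If you'd rather not have magic numbers in the parser, you can give those group indexes names; the constant names here are just illustrative, not part of the code above:

// Group indexes from the tokens array (1-based), named for readability
var T_SIN = 1, T_COS = 2, T_TAN = 3,
    T_LPAREN = 4, T_RPAREN = 5,
    T_PLUS = 6, T_MINUS = 7, T_STAR = 8, T_SLASH = 9,
    T_NUMBER = 10;

// e.g. in the parser:
// if (tok.type === T_NUMBER) { /* handle a numeric literal */ }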
Edit: if you want to match identifiers like "x" and "y", then I'd probably use a different set of token patterns, with a single pattern that matches any identifier. That means the parser would no longer learn directly from the lexer whether a name is "sin", "cos", etc., but that's OK; it can check the matched text itself. Here's an alternative list of token patterns:
var tokens = [
    "[A-Za-z_][A-Za-z_\\d]*",   // identifier: letter or "_", then letters, digits, or "_"
    "\\(",
    "\\)",
    "\\+",
    "-",
    "\\*",
    "/",
    "\\d+(?:\\.\\d*)?"
];
Now any identifier will be a type 1 token.
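If you rebuild rtok from this second list the same way as before, a call like tokenize( "sin(x) + y" ) should produce something along these lines (type 1 is now the identifier group, 2 is "(", 3 is ")", 4 is "+"):

[ { type: 1, txt: 'sin' },
  { type: 2, txt: '(' },
  { type: 1, txt: 'x' },
  { type: 3, txt: ')' },
  { type: 4, txt: '+' },
  { type: 1, txt: 'y' } ]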