BigQuery REGEXP_MATCH and accents: boundary wildcard fails?

Question 1

BigQuery's behavior is correct with respect to the RE2 syntax documentation. (No surprise, because BigQuery uses RE2 to implement regexps.)

The relevant RE2 definitions (empty-width assertions and character classes) are:

\b = at word boundary (\w on one side and \W, \A, or \z on the other)
\w = word characters (≡ [0-9A-Za-z_])
\W = not word characters (≡ [^0-9A-Za-z_])
\A = beginning of text
\z = end of text
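
To see why these definitions leave à without a boundary, here is a minimal JavaScript sketch that re-implements the rule by hand. The helper names isAsciiWord and isWordBoundary are made up for illustration; they are not part of RE2 or BigQuery, and the sample text is an assumption.

```javascript
// Hypothetical helpers: a hand-written version of the RE2 definitions above.
// \w is ASCII-only: [0-9A-Za-z_].
function isAsciiWord(ch) {
  return ch !== undefined && /^[0-9A-Za-z_]$/.test(ch);
}

// \b holds at position i when exactly one neighbour is a \w character.
// A missing neighbour (start or end of text) counts as "not \w".
function isWordBoundary(text, i) {
  return isAsciiWord(text[i - 1]) !== isAsciiWord(text[i]);
}

var text = "est à Paris";
var pos = text.indexOf("à");

console.log(isWordBoundary(text, pos));     // position between " " and "à"
console.log(isWordBoundary(text, pos + 1)); // position between "à" and " "
console.log(isWordBoundary(text, 0));       // position between start of text and "e"
```

Both positions next to à have a non-word character on each side (a space and à itself), so neither counts as a boundary, while the start of the text before "e" does.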

In other words, \b only recognizes boundaries next to ASCII word characters, so it cannot mark the edge of a word that begins or ends with an accented character such as à. RE2 has plenty of support for Unicode characters, though, so you can most likely craft an alternative regexp using something like \pL.
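
As a rough sketch of that \pL idea, the pattern below replaces each \b with an explicit "start of text or non-letter" group before the word and a "non-letter or end of text" group after it. It is written in JavaScript, where the Unicode letter class is spelled \p{L} and requires the u flag (so it needs an engine with Unicode property escapes); in RE2 the corresponding classes are \pL and \PL. The function name findWholeWord and the sample strings are illustrative only, and the sketch assumes the word contains no regexp metacharacters. Treat it as untested against BigQuery.

```javascript
// Illustrative only: match a whole word without relying on \b.
// (?:^|\P{L}) = start of text, or a non-letter just before the word
// (?:\P{L}|$) = a non-letter just after the word, or end of text
function findWholeWord(text, word) {
  // Assumption: `word` contains no regexp metacharacters.
  var re = new RegExp("(?:^|\\P{L})(" + word + ")(?:\\P{L}|$)", "u");
  var m = re.exec(text);
  return m ? m[1] : null; // the capture group holds the word itself
}

console.log(findWholeWord("la séance est à Paris", "à"));
console.log(findWholeWord("Voilà Paris", "à"));
```

The first call finds the standalone à; the second returns null because the à in "Voilà" is preceded by a letter. Unlike \b, these groups consume the neighbouring character, which is why the word is read from the capture group rather than from the whole match.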

I'm not sure why Google Apps Script doesn't follow the RE2 spec here, but I'll follow up with that team to figure out what's going on.

Question 2

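Check this out. Making each \b optional lets the pattern match à, although it then also matches an à inside a longer word: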

SELECT REGEXP_EXTRACT(StringToParse, r'\b?(à)\b?') AS Extract,
       REGEXP_MATCH(StringToParse, r'\b?(à)\b?') AS match
FROM
  (SELECT 'la séance est à Paris' AS StringToParse)

Hope this helps

Question 3

The answer is: in BQ, don't use \b with accented characters; rewrite the regular expression instead:

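// \b becomes an explicit delimiter character, or the start/end of the text
frenRegExp = frenRegExp.replace(/\\b/g, "(?:[- .,;!?()]|$|^)");
// \w becomes ASCII letters plus the accented letters used in French
frenRegExp = frenRegExp.replace(/\\w/g, "(?:[A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ])");
// \W becomes the negation of that same set
frenRegExp = frenRegExp.replace(/\\W/g, "(?:[^A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ])");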
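
Here is a minimal, self-contained sketch of how those three substitutions behave on a sample pattern. The wrapper name rewriteForBigQuery and the constants BOUNDARY and LETTERS are hypothetical, introduced only for this example. It runs the rewritten pattern through the JavaScript engine rather than BigQuery, so it only shows what the rewritten expression looks like and what it matches locally.

```javascript
// Hypothetical wrapper around the three replacements above.
var BOUNDARY = "(?:[- .,;!?()]|$|^)";
var LETTERS = "A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ";

function rewriteForBigQuery(pattern) {
  // Replacer functions are used so that "$" in BOUNDARY can never be
  // interpreted as a special replacement sequence.
  return pattern
    .replace(/\\b/g, function () { return BOUNDARY; })
    .replace(/\\w/g, function () { return "(?:[" + LETTERS + "])"; })
    .replace(/\\W/g, function () { return "(?:[^" + LETTERS + "])"; });
}

var rewritten = rewriteForBigQuery("\\bs\\w+an\\w*");
var sample = " Voilà la séance générale qui est à Paris";
var found = new RegExp(rewritten, "i").exec(sample);

console.log(rewritten);
console.log(found ? "|" + found[0] + "|" : "not found");
```

One difference from a real \b is worth keeping in mind: the boundary group consumes the delimiter, so the match here is " séance" with its leading space. Wrapping the part of interest in a capture group is one way to get the word on its own.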

Also, although the GAS specification lists RE2 as its regexp engine, it is only partially implemented. (I really don't know what it actually uses, since unlike BQ it does not exclude accented characters from \w.) For example, \pL does not match a letter.

Here is some test code that works in Apps Script, but not in BQ without the substitution above.

////////////////////// TEST ///////////////////

function test_regExp() {
  var str = " Voilà la séance générale qui est à Paris";
  var RegExpString = "\\bs\\w+an\\w*";
  Logger.log(RegExpString);
  var RegExpCompiled = new RegExp(RegExpString, "i");
  Logger.log(RegExpCompiled.source);
  var found = RegExpCompiled.exec(str);
  if (found) {
    Logger.log("|" + found[0] + "|");
    // Log the text before the match, the match itself, and the text after it.
    Logger.log([str.substring(0, found.index), found[0], str.substring(found.index + found[0].length)]);
  } else {
    Logger.log("Oops: not found");
  }
}

Output:

[16-02-09 22:15:59:659 PST] \bs\w+an\w*
[16-02-09 22:15:59:660 PST] \bs\w+an\w*
[16-02-09 22:15:59:660 PST] |séance|
[16-02-09 22:15:59:661 PST] [ Voilà la , séance,  générale qui est à Paris]