Question

In Google Apps Script (GAS), I can correctly match accented characters with a regular expression that uses word boundaries, such as \bà\b. The character à is matched only when it is a separate word. This works in GAS:

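function test_regExp() {
  var str = "la séance est à Paris";
  // Pattern string; named so that it does not shadow the built-in RegExp constructor.
  var patternString = "\\bà\\b";
  var patReg = new RegExp(patternString);
  var found = patReg.exec(str);
  if (found) {
    // Log the text before the match, the match itself, and the text after it.
    Logger.log([str.substring(0, found.index), found[0], str.substring(found[0].length + found.index)]);
  } else {
    Logger.log("oops! Did not match");
  }
}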

In BigQuery, if boundary characters are next to accents the patterns do not match. \bséance\b matches séance:

SELECT [row],etext,ftext FROM [hcd.hdctextx] WHERE (REGEXP_MATCH(ftext,"\\bséance\\b") ) LIMIT 100;

\bà\b does not match à as a word:

SELECT [row],etext,ftext FROM [hcd.hdctextx] WHERE (REGEXP_MATCH(ftext,"\\bà\\b") ) LIMIT 100;

My guess is that BigQuery, unlike GAS, includes accented characters in the boundary character set. So \bséance\b works because é can function properly as a boundary in that configuration, while \bà\b, \bétranger\b, or \bmarché\b do not work because accent + \b is interpreted as \b\b, which never matches anything. (OK, I'm grasping at straws here, because I can't find a better explanation, other than a bug.)

I don't think it is a unicode problem, because it only crops up at boundary positions.

For the moment, therefore, I see no way to use \b in those particular configurations of accents.

Is there a way to set the locale in BigQuery, or some other fix?

Workaround: substitute (?:[^a-zA-Zéàïëâê]) and so on for \b.
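
A minimal sketch of this workaround in JavaScript, assuming an engine whose \b is ASCII-only (BigQuery behaves this way; standard ECMAScript engines without the u flag do as well). The helper name wholeWord and the letter list are assumptions made for this example, not part of any API:

```javascript
// Illustrative, incomplete list of letters, taken from the workaround above.
const LETTERS = "a-zA-Zéàïëâê";

// Hypothetical helper: builds a whole-word pattern without relying on \b.
// The word is inserted as-is, so it must not contain regex metacharacters.
function wholeWord(word) {
  // A non-letter (or the start/end of the text) must surround the word.
  // The word is captured in group 1 because the delimiters are consumed.
  return "(?:^|[^" + LETTERS + "])(" + word + ")(?:[^" + LETTERS + "]|$)";
}

const str = "la séance est à Paris";
const plain = new RegExp("\\bà\\b").test(str);            // relies on \b
const match = new RegExp(wholeWord("à")).exec(str);       // relies on the letter list
const inside = new RegExp(wholeWord("à")).test("voilà");  // à inside a longer word
console.log(plain, match && match[1], inside);
```

With an ASCII-only \b, the first test fails while the substitute pattern finds à as a separate word and rejects the à at the end of voilà. Because the delimiters are consumed, read the word from the capture group rather than from the whole match.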

Thanks!

Solution

BigQuery's behavior is correct with respect to the RE2 syntax documentation. (No surprise, because BigQuery uses RE2 to implement regexps.)

The relevant RE2 syntax is:

\b = at word boundary (\w on one side and \W, \A, or \z on the other)
\w = word characters (≡ [0-9A-Za-z_])
\W = not word characters (≡ [^0-9A-Za-z_])
\A = beginning of text
\z = end of text

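In other words, \b only recognizes a boundary next to an ASCII word character. \bséance\b matches because the word starts and ends with ASCII letters, while \bà\b cannot match because à counts as \W, so there is no boundary between it and a neighboring space. RE2 has plenty of support for Unicode characters, though, so you can most likely craft an alternative regexp using something like \pL.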
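
One way to apply the \pL suggestion is sketched below in JavaScript. It assumes an engine that supports Unicode property escapes: ECMAScript needs the u flag and the braced form \p{L}, and RE2 accepts the braced form too. The names wordPattern, before and after are illustrative assumptions, not taken from the BigQuery documentation:

```javascript
// A letter-aware stand-in for \b: the word must be surrounded by
// non-letters (\P{L}) or by the start/end of the text.
const before = "(?:^|\\P{L})";
const after = "(?:\\P{L}|$)";

// Hypothetical helper; the word is inserted as-is (no escaping).
function wordPattern(word) {
  // Group 1 holds the word; the surrounding delimiters are consumed.
  return before + "(" + word + ")" + after;
}

const text = "la séance est à Paris, au marché étranger";
const results = ["à", "marché", "étranger", "éance"].map(function (w) {
  const m = new RegExp(wordPattern(w), "u").exec(text);
  return m ? m[1] : null;
});
console.log(results);
```

The first three words are found as whole words, while éance is rejected because it is preceded by the letter s. The same pattern text, without the JavaScript string escaping, should be usable as a starting point for REGEXP_MATCH.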

I'm not sure why Google Apps Script doesn't follow the RE2 spec here, but I'll follow up with that team to figure out what's going on.

OTHER TIPS

Check this out:

SELECT REGEXP_EXTRACT(StringToParse, r'\b?(à)\b?') AS Extract,
  REGEXP_MATCH(StringToParse, r'\b?(à)\b?') AS match
FROM
  (SELECT 'la séance est à Paris' AS StringToParse)

Hope this helps

The answer is: in BQ, don't use \b with accents; rewrite the regular expression instead:

frenRegExp = frenRegExp.replace(/\\b/g, "(?:[- .,;!?()]|$|^)");      
frenRegExp = frenRegExp.replace(/\\w/g, "(?:[A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ])"); 
frenRegExp = frenRegExp.replace(/\\W/g, "(?:[^A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ])");  
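
For context, the three substitutions might be wrapped in a helper like the following sketch. The function name toBigQueryPattern is hypothetical, and frenRegExp is assumed to be a pattern string that still contains literal \b, \w and \W escapes. Replacer functions are used so that the $ inside the boundary substitute can never be read as a special replacement pattern:

```javascript
const FRENCH = "A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ";
const BOUNDARY = "(?:[- .,;!?()]|$|^)";

// Hypothetical helper: rewrites \b, \w and \W for an engine whose
// built-in classes are ASCII-only.
function toBigQueryPattern(frenRegExp) {
  return frenRegExp
    .replace(/\\b/g, function () { return BOUNDARY; })
    .replace(/\\w/g, function () { return "(?:[" + FRENCH + "])"; })
    .replace(/\\W/g, function () { return "(?:[^" + FRENCH + "])"; });
}

const rewritten = toBigQueryPattern("\\bs\\w+an\\w*");
const found = new RegExp(rewritten, "i").exec(" Voilà la séance générale qui est à Paris");
console.log(rewritten);
console.log(found ? "|" + found[0] + "|" : "not found");
```

Unlike \b, the substitute consumes the delimiter, so the match here is " séance" with its leading space. Wrap the word part in a capture group if only the word itself is wanted.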

Also, GAS is supposed to use RE2 as its regular expression engine (though I really don't know what it uses, since unlike BQ it does not exclude accented characters from \w), but if so it is only partially implemented. For example, \pL does not match a letter.

Here is some test code that works in Apps Script, but not in BQ without a substitution.

////////////////////// TEST ///////////////////

function test_regExp() {
  var str = " Voilà la séance générale qui est à Paris";
  var RegExpString = "\\bs\\w+an\\w*";
  Logger.log(RegExpString);
  var RegExpCompiled = new RegExp(RegExpString, "i");
  Logger.log(RegExpCompiled.source);
  var found = RegExpCompiled.exec(str);
  if (found) {
    Logger.log("|" + found[0] + "|");
    Logger.log([str.substring(0, found.index), found[0], str.substring(found[0].length + found.index)]);
  } else {
    Logger.log("Oops: not found");
  }
}

Output:

[16-02-09 22:15:59:659 PST] \bs\w+anc\w*
[16-02-09 22:15:59:660 PST] \bs\w+an\w*
[16-02-09 22:15:59:660 PST] |séance|
[16-02-09 22:15:59:661 PST] [ Voilà la , séance,  générale qui est à Paris]
Licensed under: CC-BY-SA with attribution