Question

My question is inspired by an interesting question somebody asked at http://tex.stackexchange.com and by my attempt to provide an AWK solution. Note that AWK here means NAWK since, as we know, gawk != awk. I am reproducing part of that answer here.

Original question:

I have a rather large document with lots of math notation. I've used |foo| throughout to indicate the absolute value of foo. I'd like to replace every instance of |foo| with \abs{foo}, so that I can control the notation via an abs macro I define.

My answer:

This post is inspired by cmhughes's proposed solutions. His post is one of the most interesting posts on TeX editing I have ever read. I just spent 2 hours trying to produce a nawk solution. During that process I learned that AWK not only lacks non-greedy regular expressions, which is to be expected since it is sed's cousin, but, even worse, AWK regular expressions do not capture groups: the replacement string of sub() and gsub() cannot refer to them. A simple AWK script

#!/usr/bin/awk -f

NR>0{
gsub(/\|([^|]*)\|/,"\\abs{\1}")
print
}

Applied to the file

$|abs|$ so on and so fourth
$$|a|+|b|\geq|a+b|$$
who is affraid of wolf $|abs|$

will unfortunately produce

$\abs{}$ so on and so fourth
$$\abs{}+\abs{}\geq\abs{}$$
who is affraid of wolf $\abs{}$

An obvious fix for the above solution is to use gawk instead, as in

awk '{print gensub(/\|([^|]*)\|/, "\\abs{\\1}", "g", $0)}'

However, I wonder if there is a way to use an external regex library from AWK, for example tre. More generally, how does one interface AWK with C code (a pointer to documentation would be OK)?


Solution

In the case of nawk, the answer is: not without modifying the source.

Two of the problems are:

  • regular expressions are part of the language (~ and //), as well as the defined language functions (match() etc.)
  • nawk uses its own regex code (in the file b.c), so unlike a program which uses one regex library, using a different library with alternate implementations of regcomp() and regexec() will not help.

One way gawk has approached this is to extend match() with a third argument. (There's also gensub() as you note, but I try to avoid it where possible.)
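As a rough sketch of the match() approach (gawk only; the shell function name abs_gawk and the variable names are made up for illustration), the third argument receives the captured groups in an array, so the text between the bars is available as m[1]:

```shell
# Sketch only: requires gawk, since the third argument to match() is a gawk extension.
abs_gawk() {
    gawk '
    {
        out = ""; s = $0
        # m[1] holds the text captured by the parenthesised group
        while (match(s, /\|([^|]*)\|/, m)) {
            out = out substr(s, 1, RSTART - 1) "\\abs{" m[1] "}"
            s = substr(s, RSTART + RLENGTH)   # continue after this match
        }
        print out s
    }' "$@"
}
```

Calling abs_gawk file prints the converted text; with no argument it reads standard input.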

gawk also supports loadable extensions, which would be a way to interface with a PCRE library to provide new "builtin" functions (though not replace ~ or any internal functions). This API is the new "4.1" way of doing extensions; previous versions had a substantially different implementation.
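To give a feel for how an extension's functions appear once loaded, here is a minimal sketch using ordchr, one of the sample extensions distributed with gawk. It assumes gawk 4.1 or later with the sample extensions installed. It has nothing to do with regexes, and a PCRE extension would have to be written separately; the wrapper name ord_demo is made up.

```shell
# Assumption: gawk >= 4.1 with its bundled sample extensions installed.
ord_demo() {
    # -l loads the shared library; ord() and chr() then behave like builtins
    gawk -l ordchr 'BEGIN { print ord("A"), chr(66) }'
}
```

The same extension can be loaded from inside a script with @load "ordchr" instead of the -l option.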

Finally, one nawk way to achieve the required substitution is:

match($0,/\|[^|]*\|/) {
    do {
        sub(/\|[^|]*\|/,"\\abs{" substr($0,RSTART+1,RLENGTH-2) "}",$0)
    } while (match($0,/\|[^|]*\|/))
}
{ print }
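One caveat with the sub() call above, as far as I can tell: sub() treats an & in the replacement string as "the matched text", so input such as |a & b| would be re-expanded and the loop might never finish. A possible variant, sketched below, builds the output with substr() only and never passes the matched text through sub(). The names abs_awk and wrap_abs are made up for illustration.

```shell
# Sketch only: plain awk, no gawk extensions assumed.
abs_awk() {
    awk '
    # Wrap each |...| pair in \abs{...}; the matched text is copied
    # with substr(), so an & inside it stays literal.
    function wrap_abs(s,    out) {
        out = ""
        while (match(s, /\|[^|]*\|/)) {
            out = out substr(s, 1, RSTART - 1) "\\abs{" substr(s, RSTART + 1, RLENGTH - 2) "}"
            s = substr(s, RSTART + RLENGTH)   # drop everything up to the end of the match
        }
        return out s
    }
    { print wrap_abs($0) }
    ' "$@"
}
```

Calling abs_awk file converts a file; with no argument it reads standard input.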

OTHER TIPS

This is my nawk-based solution using the split function:

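awk '{
   # split() returns the number of pieces, so length(arr) is not needed
   n = split($0, arr, "|");
   for (i=1; i<=n; i++) {
      if (i%2)
         printf("%s", arr[i]);            # odd pieces lie outside the bars
      else
         printf("\\abs{%s}", arr[i]);     # even pieces lie between a pair of bars
   }
   printf("%s", ORS)
}' file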

OUTPUT:

$\abs{abs}$ so on and so fourth
$$\abs{a}+\abs{b}\geq\abs{a+b}$$
who is affraid of wolf $\abs{abs}$

Live Demo: http://ideone.com/lMf2hL

Licensed under: CC-BY-SA with attribution