Question

I have a file like this:

all <div class="first">these</div> <div class="second">words</div> <div class="second">are</div> <div class="second">marked</div> <div class="second">but</div> these words are not.
<div class="first">this</div> is <div class="second">another</div> <div class="second">example</div> with <div class="second">some</div> unmarked words.

I need to place braces around all words that have a space before and after, e.g., the output would be this:

all <div class="first">these</div> <div class="second">words</div> <div class="second">are</div> <div class="second">marked</div> <div class="second">but</div> {these} {words} {are} not.
<div class="first">this</div> {is} <div class="second">another</div> <div class="second">example</div> {with} <div class="second">some</div> {unmarked} words.
  • all was not given braces, because there is no space before it.
  • not. and words. were not substituted, because there is no space after them.

I have tried many different things with awk, but nothing works quite right. This is the closest I can get:

awk '{ gsub(/.[[:blank:]][[:alpha:]][[:blank:]]*/, "{&}"); }1'
  • Words can only contain these letters: a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, and ü and the upper-case equivalents.
  • Words cannot contain any symbols not listed above. For example, if 1, á, or < appears anywhere between the two spaces, it is not considered a match.

Solution 3

This awk script works on the sample data:

awk '{ for (i = 1; i <= NF; i++)
         if ($i ~ /^[[:alpha:]]+$/ && (i != 1 || $0 ~ /^ /))
            $i = "{" $i "}"
       print $0
     }' data

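For the given input, the output is exactly the desired output. The condition requires a field to be entirely alphabetic and, if it is the first field, requires the line as a whole to start with a blank. If a line could end with an all-alphabetic word, add the condition && (i != NF || $0 ~ / $/) to the if statement so that a last word with no space after it is left alone. Note that assigning to $i makes awk rebuild $0 with a single space between fields, so tabs and runs of blanks in a modified line are collapsed.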

I used [[:alpha:]] based on the question, assuming that ü counts as an alphabetic character in your locale. In such a locale, [[:alpha:]] typically matches other accented letters such as á as well. If you need only plain Latin letters plus ü (U+00FC, LATIN SMALL LETTER U WITH DIAERESIS) and Ü (U+00DC, LATIN CAPITAL LETTER U WITH DIAERESIS), replace that character class with [a-zA-ZüÜ]. Only EBCDIC might get screwed up by the use of a-zA-Z, and you'd know if that's a problem for you. You can revise as necessary to get the characters you're interested in.
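
As a sketch of both suggestions combined (the explicit character class and the extra end-of-line condition), the script could look like this. The wrapper name brace_words is hypothetical, and the sample line is shortened from the question.

```shell
# brace_words: hypothetical wrapper around the field-based awk script.
# Assumption: words are separated by single spaces, because assigning
# to $i makes awk rebuild the line with single spaces between fields.
brace_words() {
    awk '{
        for (i = 1; i <= NF; i++)
            # Only letters from the explicit class; the first field needs
            # a leading blank and the last field needs a trailing blank.
            if ($i ~ /^[a-zA-ZüÜ]+$/ &&
                (i != 1  || $0 ~ /^ /) &&
                (i != NF || $0 ~ / $/))
                $i = "{" $i "}"
        print
    }' "$@"
}

printf '%s\n' 'all <div class="first">these</div> these words are not.' | brace_words
# prints: all <div class="first">these</div> {these} {words} {are} not.
```

With no file arguments the function reads standard input; otherwise it passes the file names on to awk.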

OTHER TIPS

To do this with a single substitution, you need lookahead and lookbehind assertions, which are not supported in awk or sed. With Perl, you could do the following:

perl -pe 's/(?<= )([a-zA-ZüÜ]+)(?= )/{$1}/g' file

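With GNU sed you can create a loop and put braces around the words. The loop is needed because each match consumes the space after the word, so the word that follows immediately is only picked up on the next pass.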

$ sed -r ':a;s/ ([[:alpha:]]+) / {\1} /;ta' file
all <div class="first">these</div> <div class="second">words</div> <div class="second">are</div> <div class="second">marked</div> <div class="second">but</div> {these} {words} {are} not.
<div class="first">this</div> {is} <div class="second">another</div> <div class="second">example</div> {with} <div class="second">some</div> {unmarked} words.

The character class can be modified to suit your requirements.

With GNU awk for gensub() and \s:

awk '{while((new=gensub(/(\s)([[:alpha:]]+)(\s)/,"\\1{\\2}\\3","g")) != $0) $0=new}1' file
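
If gensub() is not available, the same kind of loop can be sketched in POSIX awk with match(), RSTART and RLENGTH. This is an illustration rather than one of the original answers: the function name brace_words_portable is made up, the character class [a-zA-ZüÜ] follows the question's letter list, and only a literal space counts as a delimiter.

```shell
# brace_words_portable: hypothetical helper using only POSIX awk features.
brace_words_portable() {
    awk '{
        out  = ""
        rest = $0
        # Repeatedly find " word " in the unprocessed remainder of the line.
        while (match(rest, / [a-zA-ZüÜ]+ /)) {
            # Copy everything up to and including the leading space,
            # then the word wrapped in braces.
            out = out substr(rest, 1, RSTART) \
                      "{" substr(rest, RSTART + 1, RLENGTH - 2) "}"
            # Keep the trailing space in the remainder so it can act as
            # the leading space of the next word.
            rest = substr(rest, RSTART + RLENGTH - 1)
        }
        print out rest
    }' "$@"
}

printf '%s\n' 'a foo bar baz.' | brace_words_portable
# prints: a {foo} {bar} baz.
```

Because the line is never split into fields, tabs and repeated blanks elsewhere in the line are left as they are.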
Licensed under: CC-BY-SA with attribution