Question

I am trying to understand gawk in shell scripting. The command below is meant to count the paragraphs in a file, where two or more consecutive newlines mark the end of a paragraph.

gawk 'END{print "Number of paragraphs: "NR}' RS="" tmp.txt

How does it work?

Solution

The GNU awk manual says of RS:

The empty string "" (a string without any characters) has a special meaning as the value of RS. It means that records are separated by one or more blank lines and nothing else.
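A quick way to see this rule in action is to feed awk a few lines with irregular blank-line spacing. This is only a sketch: the input text is made up, and plain awk is used on the assumption that it honors the same rule as gawk (POSIX specifies this behavior for an empty RS).

```shell
# Hypothetical input: two leading blank lines, a two-line paragraph,
# three blank lines, then a one-line paragraph.
records=$(printf '\n\nalpha\nbeta\n\n\n\ngamma\n' |
  awk -v RS="" '{ print "record " NR " starts with " $1 }')

# Leading blank lines are skipped and the run of blank lines counts as
# a single separator, so only two records are reported.
echo "$records"
```

Both "alpha" and "beta" land in record 1, because a single newline does not end a record in this mode.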

So, your program can be parsed as follows:

gawk 'END{print "Number of paragraphs: "NR}' RS="" tmp.txt
  1. Run the gawk command.
  2. The gawk script is END{print "Number of paragraphs: "NR} (the single quotes are removed by the shell). Once all input has been read, it prints the value of NR preceded by a phrase. NR is the number of records read. Note that this uses the implicit concatenation operator between the phrase and NR. It could also be written print "Number of paragraphs:", NR and it would produce the same result, because the comma inserts the output field separator (a single space by default).
  3. RS="" is actually seen by gawk as RS= (the double quotes are removed by the shell). Because this assignment comes before the file name on the command line, it takes effect before tmp.txt is read. It sets the special mode referenced from the manual. Here, two or more consecutive newlines will be counted as the end of a paragraph, as will EOF.
  4. The file processed is tmp.txt.
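The steps above can be tried on a small sample file. This is a minimal sketch, not part of the original command: the temporary file and its three paragraphs are invented for illustration, and plain awk stands in for gawk on the assumption that both treat an empty RS the same way.

```shell
# Hypothetical sample: three paragraphs, separated by one blank line
# and then by two blank lines.
tmp=$(mktemp)
printf 'first paragraph\nstill first\n\nsecond paragraph\n\n\nthird paragraph\n' > "$tmp"

# Implicit concatenation, as in the question.
count_concat=$(awk 'END { print "Number of paragraphs: " NR }' RS="" "$tmp")

# Comma form: the comma inserts OFS, a single space by default.
count_comma=$(awk 'END { print "Number of paragraphs:", NR }' RS="" "$tmp")

echo "$count_concat"
echo "$count_comma"
rm -f "$tmp"
```

Both commands should print the same line, reporting three paragraphs.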

So, the command works because of a special case built into gawk.
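To make the special case visible, the same script can be run with and without the RS assignment. Again a sketch under stated assumptions: the sample file is hypothetical and plain awk is assumed to behave like gawk here.

```shell
# Hypothetical sample: two paragraphs separated by one blank line.
sample=$(mktemp)
printf 'one\ntwo\n\nthree\n' > "$sample"

# Default RS (a newline): every line is a record, including the blank one.
line_count=$(awk 'END { print NR }' "$sample")

# Empty RS: records are paragraphs separated by blank lines.
para_count=$(awk 'END { print NR }' RS="" "$sample")

echo "lines: $line_count, paragraphs: $para_count"
rm -f "$sample"
```

Without the assignment NR counts lines (4 here); with it NR counts paragraphs (2 here).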

Everything in this discussion also applies to standard awk.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow