Question

I have a file:

To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8, 
żeby 
było śmieszniej, haha.
ą
a

Example gawk:

gawk '{printf "%-80s %-s\n", $0, length}' file

In gawk, I get the correct result:

To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8,         73
żeby                                                                             5
było śmieszniej, haha.                                                           22
ą                                                                                1
a                                                                                1


Example mawk:

mawk '{printf "%-80s %-s\n", $0, length}' file
To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8,  80
żeby                                                                            6
było śmieszniej, haha.                                                         24
ą                                                                               2
a                                                                                1

The mawk result above is incorrect.

How can I get mawk to produce the same result as gawk?

Solution

mawk is a minimal-featured awk designed for speed of execution over functionality. You should not expect it to behave exactly the same as gawk or a POSIX awk. If you're going to use mawk, you need to get a mawk manual describing how IT behaves; don't rely on any other documentation describing how other awks behave.

IMHO there is no correct result for the formatting string %-s, as it is meaningless to align a string without specifying a width within which to align it. There are also different interpretations of what length means on its own: it could be shorthand for length($0), or it could be something else in a non-POSIX awk. Some non-POSIX awk might not even have a length function and so might take it as an undefined variable name. And how does any given awk handle non-English characters? Judging by your output, mawk is counting bytes while gawk is counting characters: every accented letter in your sample takes two bytes in UTF-8, which accounts for 80 versus 73, 6 versus 5, and so on.
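To make the byte-versus-character difference concrete, here is a small sketch in Python rather than awk, chosen only because Python keeps text and bytes as separate types. It assumes the file is UTF-8 and that mawk's length is a byte count, which is what the numbers in the question suggest; the `lengths` helper is made up for this illustration.

```python
# Sample lines from the question; the trailing spaces are intentional.
lines = [
    "To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8, ",
    "żeby ",
    "było śmieszniej, haha.",
    "ą",
    "a",
]


def lengths(text):
    """Return (code points, UTF-8 bytes) for one line (hypothetical helper)."""
    return len(text), len(text.encode("utf-8"))


for line in lines:
    chars, nbytes = lengths(line)
    print(f"{chars:3d} code points, {nbytes:3d} bytes: {line!r}")
```

The first number in each pair matches the gawk column from the question and the second matches the mawk column.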

As I said - if you're going to use a non-POSIX awk, you need to check the manual for THAT awk for all of the gory details...

Other tips

UPDATE 1: I realized I could massively streamline it.

  • The only thing needed is to add the count of UTF-8 continuation bytes back into the total width. With FS set to the continuation-byte range, that count is always NF - 1 for non-empty lines, and the number printed at the end of the line is the UTF-8 character count (strictly speaking, a code-point count).

    Caveat: this code takes a leap of faith and assumes the input is valid UTF-8 to begin with; it performs no data validation.

=

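# works with mawk 1, mawk 2, or gawk -b (byte mode)
mawk '{
    n = NF ? NF - 1 : 0    # continuation bytes on this line (0 for an empty line)
    printf "%-*s %d\n", 80 + n, $0, length($0) - n
}' FS='[\\200-\\277]' file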

=

To jest długi string z wieloma polskimi literami ąółżęś kodowany w UTF8,         73
żeby                                                                             5
było śmieszniej, haha.                                                           22
ą                                                                                1
a                                                                                1
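For readers who find the awk one-liner hard to follow, the same idea can be sketched in Python. This is an illustration of the continuation-byte trick under the same assumption as above (valid UTF-8 input), not a translation of the exact awk program; `continuation_bytes` and `format_line` are names invented here. It also assumes every code point occupies one display column, which does not hold for combining or wide characters.

```python
def continuation_bytes(raw: bytes) -> int:
    """Count UTF-8 continuation bytes (0x80-0xBF, octal 200-277)."""
    return sum(1 for b in raw if 0x80 <= b <= 0xBF)


def format_line(text: str, width: int = 80) -> bytes:
    """Pad by bytes, as a byte-oriented printf does, but widen the field
    by the continuation-byte count so the text still spans `width` columns."""
    raw = text.encode("utf-8")            # assumes valid UTF-8, no validation
    extra = continuation_bytes(raw)       # bytes that do not start a character
    code_points = len(raw) - extra        # lead bytes plus ASCII bytes
    return raw.ljust(width + extra) + b" " + str(code_points).encode("ascii")


for sample in ["żeby ", "ą", "a"]:
    print(format_line(sample).decode("utf-8"))
```

Every multi-byte character contributes exactly one lead byte plus one or more continuation bytes, so subtracting the continuation bytes from the byte length leaves the code-point count, and adding them to the field width cancels out the extra bytes that a byte-based pad would otherwise count as columns.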

I assume you are using different systems, because the awk installed on a system is usually a symlink to either gawk or mawk.

All awk versions are compatible as long as the versions coincide.

I therefore assume that the issue you are facing is due to the use of an older and a newer version of the programs.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow