Finding punctuation and counting the number of each from the Unix Command line
21-02-2021
Question
I want to find all of the punctuation marks used in my .txt file and get a count of the number of occurrences of each one. How would I go about doing this? I am new at this but I am trying to learn. This is not homework. So far I have been researching grep and sed.
Solution
$ perl -CSD -nE '$seen{$1}++ while /(\pP)/g; END { say "$_ $seen{$_}" for keys %seen }' sometextfile.utf8
For example, sorted by count:
$ perl -CSD -nE '$seen{$1}++ while /(\pP)/g; END { say "$_ $seen{$_}" for keys %seen }' programming_perl_4th_edition.pod | sort -k2rn
, 21761
. 19578
; 10986
( 8856
) 8853
- 7606
: 7420
" 7300
_ 5305
’ 4906
/ 4528
{ 2966
} 2947
\ 2258
@ 2121
# 2070
* 1991
' 1715
“ 1406
” 1404
[ 1007
] 1003
% 881
! 838
? 824
& 555
— 330
‑ 72
– 41
‹ 16
› 16
‐ 10
⁂ 10
… 8
· 3
「 2
」 2
« 1
» 1
‒ 1
― 1
‘ 1
• 1
‥ 1
⁃ 1
・ 1
If you want symbols as well as punctuation, use [\pP\pS] in your pattern. Whatever you do, though, don’t use the old-style POSIX classes.
Other tips
Use sed, tr, sort and uniq (and no perl):
sed -E 's/[^[:punct:]]//g;s/(.)/\1x/g' myfile.txt | tr 'x' '\n' | sort | uniq -c
I did it this way (sed + tr) so it will work on both Unix and Mac. The Mac's sed needs an embedded literal linefeed in the replacement, whereas GNU sed accepts \n. Going through tr avoids the difference, so it works everywhere.
This will work on non-Mac Unix (GNU sed):
sed -E 's/[^[:punct:]]//g;s/(.)/\1\n/g' myfile.txt | sort | uniq -c
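Another way to sidestep the newline difference between sed implementations is to drop sed entirely and let tr and fold do the splitting. This is only a sketch, not part of the original answer: the function name count_ascii_punct is hypothetical, and it assumes that only ASCII punctuation matters (forcing the C locale means characters such as ’ or — are discarded).

```shell
# Hypothetical helper (name is an assumption): count ASCII punctuation
# characters in the file given as the first argument.
count_ascii_punct() {
    # tr -cd keeps only punctuation; LC_ALL=C restricts [:punct:] to ASCII
    # and also deletes the newlines, leaving one long run of characters.
    LC_ALL=C tr -cd '[:punct:]' < "$1" |
        fold -w 1 |          # put each remaining character on its own line
        LC_ALL=C sort |      # group identical characters together
        uniq -c |            # prefix each distinct character with its count
        LC_ALL=C sort -rn    # most frequent first
}
```

It could be called as count_ascii_punct myfile.txt. Since the newlines are removed before fold runs, lines without any punctuation do not produce empty entries in the output.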
Licensed under: CC-BY-SA with attribution