Question

I'm trying to sort a text by the frequency of certain consonant clusters, in Cygwin.

The command I first used is:

tr 'a-zöäü' 'A-ZÖÄÜ' < text.txt | tr -sc 'BCDFGHJKLMNPQRSTVWXYZ' '\n' | 
sort | uniq -c | sort -nr

What I think it does:

Translate all lowercase letters to uppercase, eliminate everything that doesn't match the first regex, and print a newline after every string.

It gave me a list like this:

300 N
181 R
157 D
116 S
 91 T
 82 G
 81 M
 69 B
 65 ND

That is already pretty nice, BUT I'm only interested in clusters of two or more letters (so the first match of interest to me would be 'ND'). Now I'm trying to eliminate every string with fewer than two letters.

What I tried:

 tr 'a-zöäü' 'A-ZÖÄÜ' < text.txt | tr -sc [BCDFGHJKLMNPQRSTVWXYZ]{2} '\n' | 
 sort | uniq -c | sort -nr

I thought that adding {2} would match any combination of two consonants and exclude the single letters cluttering my list (N, R, D, ...), but it didn't change anything; the list stayed the same.

Can anyone help me out?

Thanks in advance.

Solution

You could post-process with grep:

... | grep -E '[[:digit:]]+ [[:alnum:]]{2,}$'

That keeps only the lines that end in two or more alphanumeric characters preceded by their count.
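
For context, tr takes sets of characters rather than regular expressions, so the {2} in the second attempt only adds the characters {, 2 and } to the set, which is why the output did not change. A minimal sketch of the whole pipeline with the grep filter attached, assuming made-up sample text in place of text.txt and ASCII-only input (the umlauts are left out here because many tr implementations work on bytes):

```shell
# Hypothetical sample text standing in for text.txt
sample='Hund und Land, Sand und Strand'

# 1. uppercase, 2. turn every run of non-consonants into one newline,
# 3. count each cluster, most frequent first,
# 4. keep only the lines whose cluster has two or more characters
clusters=$(printf '%s\n' "$sample" |
  tr 'a-z' 'A-Z' |
  tr -sc 'BCDFGHJKLMNPQRSTVWXYZ' '\n' |
  sort | uniq -c | sort -nr |
  grep -E '[[:digit:]]+ [[:alnum:]]{2,}$')

printf '%s\n' "$clusters"
```

With this sample, only ND (count 6) and STR (count 1) survive the filter; the single letters H, L and S are dropped. The amount of padding in front of the counts depends on the uniq implementation.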

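Other tips

This can be fairly straightforward with a little Perl. Note that the pattern below matches any run of two or more alphanumeric characters, so swap in the consonant class from the question to restrict it to consonant clusters: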

perl -0777 -ne 'print map("$_\n", m/[[:alnum:]]{2,}/g)' < text.txt | \
    sort | uniq -c | sort -nr

And it will be more efficient to let Perl do the counting and sorting too:

perl -0777 -ne 'for (m/[[:alnum:]]{2,}/g) { $m{$_}++ } END { print map("$m{$_}: $_\n", sort { $m{$b} <=> $m{$a} } keys %m) }' < text.txt

That one-liner Perl script expanded and commented:

# run with: perl -0777 -n (reads the whole input into $_ at once)
# match and iterate over alphanumeric sequences of length >= 2
for (m/[[:alnum:]]{2,}/g) {
    # increment the count of the current item, building a map of counts
    $m{$_}++
}
END {
    # print the map as COUNT: ITEM, sorted by count, descending
    print map("$m{$_}: $_\n", sort { $m{$b} <=> $m{$a} } keys %m)
}
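
As a shell-only alternative to the Perl approach (a different technique, not the one used above), grep -o can pull out the clusters directly, so single consonants never enter the count. This is a sketch that assumes a grep supporting the -o option (GNU and BSD grep do; POSIX does not require it) and made-up ASCII-only sample text in place of text.txt:

```shell
# Hypothetical sample text standing in for text.txt
sample='Hund und Land, Sand und Strand'

# Uppercase the text, print each run of two or more consonants on its own
# line, then count the runs and list the most frequent first
counts=$(printf '%s\n' "$sample" |
  tr 'a-z' 'A-Z' |
  grep -oE '[BCDFGHJKLMNPQRSTVWXYZ]{2,}' |
  sort | uniq -c | sort -nr)

printf '%s\n' "$counts"
```

For this sample the result lists ND with a count of 6, followed by STR with a count of 1.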
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow