Question

Background

I created a script to count the frequency of words in a plain-text file. The script performs the following steps:

  1. Count the frequency of each word in a corpus.
  2. Retain only the corpus words that appear in a dictionary.
  3. Write the word frequencies to a comma-separated file.

The script is at: http://pastebin.com/VAZdeKXs

#!/bin/bash

# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt | \
  tr '[:upper:]' '[:lower:]' | \
  sort | \
  uniq -c | \
  sort -rn > frequency.txt

echo Creating corpus lexicon...
rm -f corpus-lexicon.txt

for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 "^$i\$" dictionary.txt >> corpus-lexicon.txt;
done

echo Creating lexicon...
rm -f lexicon.txt

for i in $(cat corpus-lexicon.txt); do
  grep -E -m 1 "^[0-9 ]* $i\$" frequency.txt | \
    awk '{print $2, $1}' | \
    tr ' ' ',' >> lexicon.txt;
done

Problem

The following lines continually cycle through the dictionary to match words:

for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
  grep -m 1 "^$i\$" dictionary.txt >> corpus-lexicon.txt;
done

It works, but it is slow: for every word found in the corpus, it scans the dictionary from the top to check whether the word is present. (The -m 1 option stops the scan as soon as a match is found, but a word that is not in the dictionary still forces a pass over the entire file.)

Question

How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of the words will not be in the dictionary.

Thank you!

Solution

You can use grep -f to search for all of the words in one pass over frequency.txt:

awk '{print $2}' frequency.txt | grep -Fxf dictionary.txt > corpus-lexicon.txt
  • -F to search for fixed strings.
  • -x to match whole lines only.
  • -f to read the search patterns from dictionary.txt.
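
For illustration, suppose frequency.txt and dictionary.txt held the following (entirely hypothetical) contents:

$ head -3 frequency.txt
    127 the
     53 qzx
     41 cat
$ cat dictionary.txt
cat
the

The pipeline then emits only the words that the dictionary contains:

$ awk '{print $2}' frequency.txt | grep -Fxf dictionary.txt
the
cat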

In fact, you could even combine this with the second loop and eliminate the intermediate corpus-lexicon.txt file. The two for loops can be replaced by a single grep:

grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'

Notice that I changed -x to -w: each line of frequency.txt contains the count as well as the word, so whole-line matching would never succeed; -w instead matches each dictionary entry as a whole word within the line.
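
Putting it all together, the whole script collapses to a single pipeline. Here is a sketch, assuming the same corpus.txt, dictionary.txt, and lexicon.txt file names as the original script:

#!/bin/bash
# Tokenize, strip non-letters, lower-case, tally, filter against the
# dictionary, and emit word,count lines in one pass.
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt | \
  tr '[:upper:]' '[:lower:]' | \
  sort | \
  uniq -c | \
  sort -rn | \
  grep -Fwf dictionary.txt | \
  awk '{print $2 "," $1}' > lexicon.txt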

OTHER TIPS

This is typically one of those scripts that you'd write in Perl for speed. But if, like me, you hate write-only programming languages, you can do it all in Awk:

awk '
    # Read every dictionary word into an associative array.
    BEGIN {
        while ((getline < "dictionary.txt") > 0)
            dict[$1] = 1
    }
    # $1 is the count, $2 the word; print the word if the dictionary has it.
    ($2 && $2 in dict) { print $2 }
' < frequency.txt > corpus-lexicon.txt

No need for the rm -f corpus-lexicon.txt in this version.
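
If you also want the final word,count output in the same pass, the same idea extends naturally. This is just a sketch; here lexicon.txt replaces the output of the second loop:

awk '
    # Read every dictionary word into an associative array.
    BEGIN {
        while ((getline < "dictionary.txt") > 0)
            dict[$1] = 1
    }
    # $1 is the count, $2 the word; emit word,count for dictionary words.
    ($2 && $2 in dict) { print $2 "," $1 }
' < frequency.txt > lexicon.txt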

Use a real programming language. All of the app start-ups and file scans are killing you. For instance, here's an example I just whipped up in Python (minimizing lines of code):

import sys, re

# Split the input file into words.
words = re.findall(r'(\w+)', open(sys.argv[1]).read())

# Tally the frequency of each word.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Write one word,count pair per line.
open(sys.argv[2], 'w').write("\n".join(w + ',' + str(c) for (w, c) in counts.items()))

Testing against a large text file I had sitting around (1.4 MB, 80,000 words according to wc), this completes in under a second (18k unique words) on a five-year-old PowerMac.
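
The script takes the input and output paths as its two arguments; assuming you saved it as wordfreq.py (a name I just made up):

$ python wordfreq.py corpus.txt lexicon.txt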

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow