Question

How might I create an AppleScript that counts duplicate words in a PDF and then displays the results ranked, with the most duplicated word (and its count) at the top, the second most duplicated next, and so on? I'd like to use this in school: after converting PowerPoint presentations to PDF, I could run the script to see what is most important in the presentation.

Ideally it would filter out words such as: the, so, it, etc.

Solution

The last part you are looking for, the filtering, is simple: set up a list of words to ignore and check whether each word is in it.

    set ignoreList to {"to", "is"}
    set reportFile to "/Users/USERNAME/Desktop/Word Frequencies.txt"
    set theTextFile to "/Users/USERNAME/Desktop/foo.txt"

    set word_list to every word of (do shell script "cat " & quoted form of theTextFile)

    set word_frequency_list to {}

    repeat with the_word_ref in word_list
        set the_current_word to contents of the_word_ref
        if the_current_word is not in ignoreList then

            set word_info to missing value

            repeat with record_ref in word_frequency_list
                if the_word of record_ref = the_current_word then
                    set word_info to contents of record_ref
                    exit repeat
                end if
            end repeat

            if word_info = missing value then
                set word_info to {the_word:the_current_word, the_count:1}
                set end of word_frequency_list to word_info
            else
                set the_count of word_info to (the_count of word_info) + 1
            end if

        end if
    end repeat
    --return word_frequency_list

    set the_report_list to {}
    repeat with word_info in word_frequency_list
        set end of the_report_list to quote & the_word of word_info & ¬
            quote & "  - appears " & the_count of word_info & " times."
    end repeat

    -- Join the lines with linefeeds so the report has Unix line endings
    set AppleScript's text item delimiters to linefeed
    set the_report to the_report_list as text
    do shell script "echo " & quoted form of the_report & " > " & quoted form of reportFile
    set AppleScript's text item delimiters to ""
    delay 1
    do shell script "open " & quoted form of reportFile

I have also changed some of the code to use do shell script to read and write the file, only because I prefer that to scripting TextEdit.
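Note that the question asks for the most frequent word first, and the script above writes the words in the order they were first seen. A minimal sketch of one way to rank the report afterwards from the shell, assuming the report lines keep exactly the format written above ("word"  - appears N times.), so that the count is the fourth whitespace-separated field. The function name sort_report is made up for illustration. The tr step turns any carriage returns into newlines, in case the report was joined with AppleScript's return.

```shell
#!/bin/sh

# sort_report REPORTFILE   (hypothetical helper, not part of the original answer)
# Prints the report lines with the highest count first.
sort_report() {
    # Normalise CR line endings to LF, drop empty lines,
    # then sort numerically (descending) on the 4th field, the count.
    tr '\r' '\n' < "$1" | grep -v '^$' | sort -k4,4nr
}
```

It could be called as sort_report "/Users/USERNAME/Desktop/Word Frequencies.txt", or from AppleScript through do shell script once the function is saved in a script file.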

OTHER TIPS

While this is doable in AppleScript, as shown by markhunte, it is VERY slow, especially if you are processing larger pieces of text or lots of files. In my tests I gave up on it. So here is a short shell script that is very fast, which you can call from AppleScript if you need to. It takes the file of words to exclude as its first argument and the text file as its second.

    #!/bin/sh

    # Both arguments are required
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "$0 [wordsfile] [textfile]"
        exit 1
    fi

    INFILE="$2"          # text to analyse
    WORDS="${2}.words"   # temporary file of extracted words
    EXWORDS="$1"         # words to exclude, one per line

    echo "File $INFILE has $(wc -w < "$INFILE") words."
    echo "Excluding the $(wc -w < "$EXWORDS") words."

    echo "Extracting words from file and removing common words..."
    # Keep words of three or more characters, then drop any that match
    # a line of the exclusion file exactly (ignoring case)
    grep -o -E '\w{3,}' "$INFILE" | grep -x -i -v -f "$EXWORDS" > "$WORDS"

    echo "Top 10 most frequent words in $INFILE are..."
    tr '[:upper:]' '[:lower:]' < "$WORDS" | sort | uniq -c | sort -rn | head -10

    # Clean up
    rm "$WORDS"

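The temporary .words file is not strictly needed. Here is a rough sketch of the same pipeline as a single function, under a few assumptions of mine: the name top_words and the optional third argument (how many words to show) are made up, and I have swapped \w for the POSIX class [[:alpha:]], which drops digits and underscores and should behave the same across grep implementations. The exclusion file is assumed to hold one word per line with no blank lines.

```shell
#!/bin/sh

# top_words WORDSFILE TEXTFILE [N]   (hypothetical helper)
# Prints the N most frequent words (default 10) of TEXTFILE,
# skipping any word listed in WORDSFILE.
top_words() {
    exwords=$1
    infile=$2
    n=${3:-10}
    # 1. extract runs of three or more letters, one per line
    # 2. lowercase them so "The" and "the" count together
    # 3. drop exact (case-insensitive) matches from the exclusion file
    # 4. count duplicates and list the highest counts first
    grep -o -E '[[:alpha:]]{3,}' "$infile" |
        tr '[:upper:]' '[:lower:]' |
        grep -v -x -i -f "$exwords" |
        sort | uniq -c | sort -rn | head -n "$n"
}
```

For example, top_words common.txt foo.txt 20 would list the twenty most frequent words, where common.txt stands in for your own exclusion list. The width of the count column printed by uniq -c differs between systems, so pipe the result through awk '{print $1, $2}' if you need a fixed layout.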
Licensed under: CC-BY-SA with attribution