Question

I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).

My plan of attack is:

  1. Process the file word by word
  2. Pass each word to the hunspell spellchecker ("echo the-word | hunspell -l" prints nothing if the word is valid, and prints the word back if it is misspelled)
  3. If it is a bad word, AND it has uppercase Is in it, then replace these with lowercase l and try again. If it is now a valid word, replace the original word.

I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path I was wondering if it is possible to use awk and/or sed for these kinds of conditional operations at the word-level?

Any other suggested approaches would also be very welcome!

Solution

You don't really need more than bash for this:

while IFS= read -r line; do
  # split the line into words (read -a avoids accidental glob expansion)
  read -r -a words <<< "$line"
  for ((i=0; i<${#words[@]}; i++)); do
    word=${words[i]}
    if [[ $(hunspell -l <<< "$word") ]]; then
      # hunspell had some output, so the word is not in the dictionary
      tmp=${word//I/l}
      if [[ $tmp != "$word" ]] && [[ -z $(hunspell -l <<< "$tmp") ]]; then
        # no output for new word, therefore it's a dictionary word
        words[i]=$tmp
      fi
    fi
  done
  # print the new line (runs of whitespace collapse to single spaces)
  printf '%s\n' "${words[*]}"
done < filename > filename.new

That said, this starts at least one hunspell process per word, so it does seem to make more sense to pass the whole file to hunspell once and parse its output.
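A minimal sketch of that whole-file idea, in case it is useful. It assumes GNU sed (for the \b word boundary) and that the misspelled words hunspell reports contain no characters that are special to sed. The names list_bad_words and fix_ocr_file are made up for this example; list_bad_words is just a thin wrapper around "hunspell -l" so the checker can be swapped out. hunspell runs twice in total, regardless of how many words the file has.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: reads text on stdin, prints misspelled words one per line.
list_bad_words() { hunspell -l; }

# Hypothetical helper: prints a corrected copy of the file given as $1.
fix_ocr_file() {
  local file bad still_bad script
  file=$1
  # Pass 1: every misspelled word that contains an uppercase I, de-duplicated
  bad=$(list_bad_words < "$file" | grep I | sort -u)
  # Pass 2: swap I for l in all of them and see which candidates are still wrong
  still_bad=$(printf '%s\n' "$bad" | tr I l | list_bad_words)
  # Pass 3: one sed substitution per word whose candidate is now a dictionary word
  script=$(printf '%s\n' "$bad" | while IFS= read -r word; do
    [ -n "$word" ] || continue
    fixed=$(printf '%s' "$word" | tr I l)
    printf '%s\n' "$still_bad" | grep -Fxq -- "$fixed" ||
      printf 's/\\b%s\\b/%s/g\n' "$word" "$fixed"
  done)
  if [ -n "$script" ]; then
    sed -e "$script" "$file"
  else
    cat "$file"   # nothing to fix, pass the file through unchanged
  fi
}

# Only run when a file name is given on the command line
if [ "$#" -gt 0 ]; then
  fix_ocr_file "$1"
fi
```

Saved as, say, fix_ocr.sh, it would be run as "bash fix_ocr.sh filename > filename.new". Unlike the word-by-word loop, sed leaves the original spacing and punctuation of each line untouched.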

OTHER TIPS

Two suggestions:

  1. Fix the problem closer to where it originates, i.e. in the OCR software. Can it be made to consult a dictionary so that it never produces non-words containing 'I' in the first place? If not, try a different OCR program that can.
  2. Running each word through hunspell creates a process for each word, which is a massive waste of CPU cycles. Try using several passes instead: first find all words containing 'I', then filter out the ones that are already correct, then replace each remaining word that becomes valid once 'I' is changed to 'l'.
Licensed under: CC-BY-SA with attribution