Using awk for conditional find/replace
11-02-2021
Question
I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).
My plan of attack is:
- Process the file word by word
- Pass each word to the hunspell spellchecker ("echo the-word | hunspell -l" produces no response at all if it is valid, and a response if it is bad)
- If it is a bad word, AND it has uppercase Is in it, then replace these with lowercase l and try again. If it is now a valid word, replace the original word.
I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path I was wondering if it is possible to use awk and/or sed for these kinds of conditional operations at the word-level?
Any other suggested approaches would also be very welcome!
Solution
You don't really need more than bash for this:
while IFS= read -r line; do
    # split the line into words; read -a avoids glob expansion of the words
    read -ra words <<< "$line"
    for ((i = 0; i < ${#words[@]}; i++)); do
        word=${words[i]}
        if [[ -n $(hunspell -l <<< "$word") ]]; then
            # hunspell had some output, so it rejects the word
            tmp=${word//I/l}
            if [[ $tmp != "$word" && -z $(hunspell -l <<< "$tmp") ]]; then
                # no output for the new word, therefore it's a dictionary word
                words[i]=$tmp
            fi
        fi
    done
    # print the new line (runs of whitespace collapse to single spaces)
    printf '%s\n' "${words[*]}"
done < filename > filename.new
It does seem to make more sense to pass the whole file to hunspell and parse its output, since the loop above starts a new hunspell process for every word.
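As a rough sketch of that whole-file idea: the function below runs hunspell twice in total, once over the file and once over the candidate corrections, and prints tab-separated wrong/right pairs. The name build_fix_map is made up for illustration, and the sketch assumes only that "hunspell -l" reads stdin and prints the words it rejects, one per line. How hunspell tokenizes words with apostrophes or attached punctuation is not handled here.

```shell
# build_fix_map FILE: print "wrong<TAB>right" pairs, one per line.
# Hypothetical helper; assumes `hunspell -l` prints rejected words on stdout.
build_fix_map() {
    # Run 1: every word hunspell rejects, limited to those containing "I".
    bad=$(hunspell -l < "$1" | grep 'I' | sort -u)
    [ -n "$bad" ] || return 0

    # Run 2: the same words with I -> l; collect the ones still rejected.
    still_bad=$(printf '%s\n' "$bad" | tr 'I' 'l' | hunspell -l)

    printf '%s\n' "$bad" | while IFS= read -r word; do
        fixed=$(printf '%s' "$word" | tr 'I' 'l')
        # Keep the pair only if the candidate passed the second run.
        if ! printf '%s\n' "$still_bad" | grep -qxF -e "$fixed"; then
            printf '%s\t%s\n' "$word" "$fixed"
        fi
    done
}
```

Calling "build_fix_map filename > fixes.tsv" would then give a list you can review by hand before touching the subtitle file.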
OTHER TIPS
Two suggestions:
- Fix the problem closer to where it originates, i.e. in the OCR software. Can it be made to consult a dictionary so that it never produces non-words containing 'I' in the first place? If not, try a different OCR program that can.
- Running each word through hunspell creates a process for each word, which is a massive waste of CPU cycles. Try using several passes instead: the first pass finds all words containing 'I', the second filters out the ones that are already correct, and the third replaces each correctable word.
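Since the question asked about awk, the final "replace each correctable word" pass could look something like the sketch below. It assumes the earlier passes have already produced a tab-separated file of wrong/right pairs; apply_fix_map and that file layout are made-up names for illustration. Words are taken to be runs of ASCII letters, so punctuation and spacing pass through untouched, but non-ASCII letters are not treated as part of a word.

```shell
# apply_fix_map MAPFILE TEXTFILE: print TEXTFILE with mapped words replaced.
# Hypothetical helper; MAPFILE holds "wrong<TAB>right" pairs, one per line.
apply_fix_map() {
    awk -v map="$1" '
        BEGIN {
            # Load the wrong -> right pairs into an array.
            while ((getline line < map) > 0) {
                split(line, pair, "\t")
                fix[pair[1]] = pair[2]
            }
            close(map)
        }
        {
            out = ""
            rest = $0
            # Walk the line one word at a time, copying everything
            # between words (spaces, punctuation) unchanged.
            while (match(rest, /[A-Za-z]+/)) {
                word = substr(rest, RSTART, RLENGTH)
                out = out substr(rest, 1, RSTART - 1) ((word in fix) ? fix[word] : word)
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    ' "$2"
}
```

Because only exact whole-word matches are replaced, a valid standalone "I" or a name like "Ian" is left alone unless it appears in the map.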
Licensed under: CC-BY-SA with attribution