How to remove all non-English characters and words using NLTK [closed]

Question

I have never worked with NLTK before, so there could be a better solution. My code snippet simply does the following:

It reads the file to be checked for English/non-English words, frequencyList.txt, into a variable named lines.
It then opens a new file named eng_words_only.txt, which will hold the English words only. Initially this file is empty; after the script runs, it contains all the English words present in frequencyList.txt.
For every word in frequencyList.txt, it checks whether the word is also present in WordNet. If it is, the word is written to eng_words_only.txt; otherwise nothing is done. Note that I am using WordNet for demo purposes only. It doesn't contain all English words!

Code:

# Requires the WordNet data: run nltk.download('wordnet') once beforehand
from nltk.corpus import wordnet

# Read the words to check, one per line
with open("frequencyList.txt", "r") as fList:
    lines = fList.readlines()

# Append mode: rerunning the script adds to the existing file
with open("eng_words_only.txt", "a") as eWords:
    for line in lines:
        w = line.strip()  # drop trailing spaces and the newline before the lookup
        if not w:
            continue  # skip blank lines
        if not wordnet.synsets(w):  # no synsets: treat the word as non-English
            print("not " + w)
        else:  # at least one synset: treat the word as English
            print("yes " + w)
            eWords.write(w + "\n")  # write the word to the output file

Testing: I first created a file named frequencyList.txt with the following contents:

cat 
meoooow 
mouse

Executing the code snippet then prints the following to the console:

yes cat
not meoooow
yes mouse

A file named eng_words_only.txt is then created, containing the words that were recognised as English: cat and mouse. Note that every line read from the file keeps its trailing whitespace and newline. Without the strip() call the lookup is done on "cat " plus a newline rather than on "cat", and cat is wrongly reported as non-English. Even with that fixed, WordNet covers only nouns, verbs, adjectives and adverbs, so common words such as "the" and "of" are rejected. This is the reason why you should use a good source instead of WordNet. Please note: the Python script and frequencyList.txt should be in the same directory. Also, instead of frequencyList.txt you can use any file that you want to check. In that case don't forget to change the file names in the code snippet too.
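
Since the word source is the part most likely to change, the check can be factored into a function that takes the lookup as a parameter. The following is only a minimal sketch: the helper name filter_english and the tiny vocabulary set are made up for illustration and are not part of NLTK. Each line is stripped before the lookup, because lines read from a file keep their trailing whitespace and newline.

```python
def filter_english(lines, is_english):
    """Return the stripped words from `lines` that `is_english` accepts."""
    kept = []
    for line in lines:
        word = line.strip()  # drop trailing spaces and the newline
        if word and is_english(word):
            kept.append(word)
    return kept


# Stand-in vocabulary for illustration only; not a real word source.
vocabulary = {"cat", "mouse"}

# Same contents as the test file above, including the trailing spaces.
lines = ["cat \n", "meoooow \n", "mouse"]
result = filter_english(lines, lambda w: w.lower() in vocabulary)
print(result)
```

With NLTK installed, the same function could presumably be called with lambda w: bool(wordnet.synsets(w)) as the lookup, so swapping WordNet for a better word list later would not require touching the loop.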

Second Solution: Although you didn't ask for it, there is another way to do this English word test.

Here is the code. wordlist-eng.txt is the file that contains the English words. You have to keep wordlist-eng.txt, frequencyList.txt and the Python script in the same directory.

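# Load the reference word list into a set for fast membership tests
with open("wordlist-eng.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

# Read the words to check, one per line
with open("frequencyList.txt", "r") as fList:
    lines = fList.readlines()

# Append mode: rerunning the script adds to the existing file
with open("eng_words_only.txt", "a") as eWords:
    for w in lines:
        # Compare the stripped, lowercased word; write the original line if it matches
        if w.strip().lower() in english_words:
            eWords.write(w)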

After executing the script, eng_words_only.txt will contain all the English words that were present in frequencyList.txt.
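
If the input is running text rather than one word per line, the same set lookup can be applied token by token. This is a rough sketch under some assumptions: the helper keep_english_tokens is a hypothetical name, the small set stands in for the contents of wordlist-eng.txt, and the regular expression treats only runs of ASCII letters as words, so digits, punctuation and non-ASCII characters are dropped as well.

```python
import re

# Runs of ASCII letters; everything else acts as a separator and is discarded.
TOKEN_RE = re.compile(r"[A-Za-z]+")


def keep_english_tokens(text, english_words):
    """Return the tokens of `text` whose lowercase form is in `english_words`."""
    return [tok for tok in TOKEN_RE.findall(text) if tok.lower() in english_words]


# Stand-in for the set built from wordlist-eng.txt.
english_words = {"the", "cat", "sat"}

# "\u043a\u043e\u0442" is a word written in Cyrillic letters.
sample = "The cat sat, meoooow! \u043a\u043e\u0442 123"
cleaned = " ".join(keep_english_tokens(sample, english_words))
print(cleaned)
```

One caveat of this pattern: a word containing an accented letter is split at that letter, so the pieces are looked up separately and will usually be rejected.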

I hope this was helpful.