Question

I read the question on how to check dictionary words, and it gave me the idea of checking my text file against a dictionary. I have read the pyenchant instructions, and I thought I could use get_tokenizer to give me back all the dictionary words in the text file.

So here is where I'm stuck: I want my program to give me all groups of dictionary words, each group as a paragraph. As soon as it encounters any junk characters, it should treat that as a paragraph break and ignore everything from there until it finds X consecutive dictionary words again.
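The rule described above can be sketched without pyenchant, using a plain set of words as a stand-in for the spell-check dictionary (the KNOWN_WORDS set, the regex tokenizer, and the min_words threshold below are illustrative assumptions, not part of the original script):

```python
import re

# Stand-in "dictionary"; the real script asks enchant.Dict("en_US").check() instead.
KNOWN_WORDS = {"english", "cricket", "cuts", "ties", "with", "zimbabwe"}

def word_groups(text, min_words=3):
    """Yield runs of consecutive dictionary words at least min_words long."""
    run = []
    for token in re.findall(r"[A-Za-z']+", text):
        if token.lower() in KNOWN_WORDS:
            run.append(token)
        else:
            # Junk token: treat it as a paragraph break.
            if len(run) >= min_words:
                yield " ".join(run)
            run = []
    if len(run) >= min_words:
        yield " ".join(run)

print(list(word_groups("English cricket cuts x9z! ties with Zimbabwe qq")))
# -> ['English cricket cuts', 'ties with Zimbabwe']
```

Runs shorter than min_words are silently dropped, which is exactly the "ignore everything until X consecutive words" behaviour described above.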

I want it to read text files named in the sequence filename_nnn.txt, parse each one, and write the result to parsed_filename_nnn.txt. I have not yet got around to any of the file manipulation.

What I have so far:

import enchant
from enchant.tokenize import get_tokenizer, HTMLChunker

dictCheck = enchant.Dict("en_US")        # spell-check dictionary
dictSentCheck = get_tokenizer("en_US")   # word tokenizer
sentCheck = raw_input("Check sentence: ")

def check_dictionary(wordCheck):
    outcome = dictCheck.check(wordCheck)  # True if wordCheck is a dictionary word
    test = [w[0] for w in dictSentCheck(sentCheck)]  # list of tokenized words

------ sample text -----

English cricket cuts ties with Zimbabwe Wednesday, 25 June, 2008 text<void(0);><void(0);> <void(0);>email <void(0);>print EMAIL THIS ARTICLE your name: your email address: recipient's name: recipient's email address: <;>add another recipient your comment: Send Mail<void(0);> close this form <http://ad.au.doubleclick.net/jump/sbs.com.au/worldnews;sz=300x250;tile=2;ord=123456789?> The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year.

The script should return:

English cricket cuts ties with Zimbabwe Wednesday

The England and Wales Cricket Board (ECB) announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year

I accepted abarnert's response. Below is my final script. Note that it is VERY inefficient and could use some cleanup. Also, as a disclaimer: I have not coded since college, a LONG time ago.

import enchant
from enchant.tokenize import get_tokenizer
import os

def clean_files():
    os.chdir("TARGET_DIRECTORY")
    for files in os.listdir("."):
        # Get the number out of the file name
        file_number = files[files.rfind("_")+1:files.rfind(".")]

        # Print status to screen
        print "Working on file: ", files

        # Read the original file
        original_file = open("name_"+file_number+".txt", "r")
        read_original_file = original_file.read()
        original_file.close()

        # Start the parsing of the file
        token_words = tokenize_words(read_original_file)
        parse_result = '\n'.join(split_on_angle_brackets(token_words, file_number))

        # Commit changes to the parsed file
        parsed_file = open("name_"+file_number+"_parse.txt", "wb")
        parsed_file.write(parse_result)
        parsed_file.close()

def tokenize_words(file_words):
    tokenized_sentences = get_tokenizer("en_US")
    word_tokens = tokenized_sentences(file_words)
    token_result = [w[0] for w in word_tokens]
    return token_result

def check_dictionary(dict_word):
    # Note: creating the Dict on every call is slow; it could be built once at module level
    check_word = enchant.Dict("en_US")
    validated_word = check_word.check(dict_word)
    return validated_word

def split_on_angle_brackets(token_words, file_number):
    para = []
    bracket_stack = 0
    ignored_words_per_file = open("name_"+file_number+"_ignored_words.txt", "wb")
    for word in token_words:
        if bracket_stack:
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            elif word != 'amp':
                if check_dictionary(word):
                    para.append(word)
                    #print "append ", word
                else:
                    print "Ignored word: ", word
                    ignored_words_per_file.write(word + " \n")
    if para:
        yield ' '.join(para)

    #Close opened files
    ignored_words_per_file.close()

clean_files()

Solution

I'm still not sure what exactly your problem is, or what your code is supposed to do.

But this line seems to be the key:

test = [w[0] for w in dictSentCheck(sentCheck)]

That gives you a list of all the words. It includes things like lt and gt as words, and you want to strip out anything inside an lt/gt pair.

And, as you say in your comments, "I may set the required number of consecutive words to 7".

So, something like this:

def split_on_angle_brackets(words):
    para = []
    bracket_stack = 0
    for word in words:
        if bracket_stack:
            if word == 'gt':
                bracket_stack -= 1
            elif word == 'lt':
                bracket_stack += 1
        else:
            if word == 'lt':
                if len(para) >= 7:
                    yield ' '.join(para)
                para = []
                bracket_stack = 1
            else:
                para.append(word)
    if para:
        yield ' '.join(para)

If you use it with your sample data:

print('\n'.join(split_on_angle_brackets(test)))

You get this:

English cricket cuts ties with Zimbabwe Wednesday June text
print EMAIL THIS ARTICLE your name your email address recipient's name recipient's email address
add another recipient your comment Send Mail
The England and Wales Cricket Board ECB announced it was suspending all ties with Zimbabwe and was cancelling Zimbabwe's tour of England next year

That doesn't match your sample output, but I can't think of any rule that would provide your sample output, so instead I'm trying to implement the rule you described.

License: CC-BY-SA with attribution