Question

I have written a function that uses an nltk tokenizer to preprocess .txt files. Basically, the function takes a .txt file, modifies it so that each sentence appears on a separate line, and overwrites the old file with the modified text.

I would like to modify the function (or maybe create another function) to also insert spaces before punctuation, and sometimes after it, as in the case of an opening parenthesis. In other words, in addition to what the function already does, I would also like it to change "I want to write good, clean sentences." into "I want to write good , clean sentences ."

I am a beginner, and I suspect I probably am just missing something pretty simple. A little help would be much appreciated.

My existing code is below:

import nltk.data

def readtowrite(filename):
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    with open(filename, 'r+') as f:
        fout = str(f.read())
        stuff = str('\n'.join(sent_detector.tokenize(fout.strip())))
        f.seek(0)
        f.write(stuff)

Solution

Here is the answer I came up with. Basically, I created a separate function to insert spaces before and after the punctuation in a sentence. I then called that function in the readtowrite function.

Code below:

import string 
import nltk.data

def strip_punct(sentence):
    wordlist = []
    for word in sentence:
        cleanword = ""
        for char in word:
            # Pad every punctuation character with a space on each side
            if char in string.punctuation:
                char = " " + char + " "
            cleanword += char
        wordlist.append(cleanword)
    return ''.join(wordlist)
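
As a quick check, this is what the helper produces on the sentence from the question (note the doubled space after the comma and the trailing space, a side effect of padding every punctuation mark on both sides):

    >>> strip_punct("I want to write good, clean sentences.")
    'I want to write good ,  clean sentences . '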

def readtowrite(filename):
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    with open(filename, 'r+') as f:
        fout = str(f.read())
        # One sentence per line, then pad the punctuation
        stuff = str('\n'.join(sent_detector.tokenize(fout.strip())))
        morestuff = str(strip_punct(stuff))
        f.seek(0)
        f.write(morestuff)
        f.truncate()  # drop any leftover text if the new content is shorter than the old
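
Calling the function rewrites the file in place; the file name below is just a placeholder:

    readtowrite('sentences.txt')  # hypothetical file; each sentence ends up on its own line with spaced-out punctuation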

Other tips

Loading 'tokenizers/punkt/english.pickle' with nltk.data.load() gives you the same Punkt sentence tokenizer that NLTK's sent_tokenize() uses under the hood; word_tokenize() then splits each sentence into words and separates punctuation into its own tokens.
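
Roughly, that means the two-step version below should give you the spacing you want (a minimal sketch; the sample text is made up):

    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "I want to write good, clean sentences. Here is another one."
    for sent in sent_tokenize(text):
        print(" ".join(word_tokenize(sent)))
    # I want to write good , clean sentences .
    # Here is another one .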

Maybe this script will be more helpful:

from nltk.tokenize import sent_tokenize, word_tokenize

def readtowrite(infile, outfile):
    with open(outfile, 'w') as fout:
        with open(infile, 'r') as fin:
            output = "\n".join(" ".join(word_tokenize(sent)) for sent in sent_tokenize(fin.read()))
            fout.write(output)
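
Called like this (the file names are just placeholders), it writes a new file rather than overwriting the input:

    readtowrite('raw_notes.txt', 'tokenized_notes.txt')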