I’m having difficulty removing stop words from and tokenizing a .txt file using nltk. I keep getting the following AttributeError: 'list' object has no attribute 'lower'.

I just can’t figure out what I’m doing wrong, although this is my first time doing something like this. Below are my lines of code. I’d appreciate any suggestions, thanks.

    import nltk
    from nltk.corpus import stopwords
    s = open("C:\zircon\sinbo1.txt").read()
    tokens = nltk.word_tokenize(s)
    def cleanupDoc(s):
            stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
        return cleanup
    cleanupDoc(s)

Solution

You can use the stopwords lists from NLTK, see How to remove stop words using nltk or python.

Most probably you would also like to strip off punctuation as well; for that you can use string.punctuation (see http://docs.python.org/2/library/string.html):

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = set(stopwords.words('english') + list(string.punctuation))
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
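
Putting this together for the original question, here is a minimal sketch, assuming the file holds plain English text (the path is the asker's own; cleanup_doc is just an illustrative name):

    import string

    import nltk
    from nltk.corpus import stopwords

    # Requires the NLTK data packages: nltk.download('punkt') and
    # nltk.download('stopwords').
    def cleanup_doc(text):
        # One lookup set covering both stopwords and punctuation tokens.
        stop = set(stopwords.words('english') + list(string.punctuation))
        # Lowercase before tokenizing so the stopword lookup matches.
        return [tok for tok in nltk.word_tokenize(text.lower()) if tok not in stop]

    # A raw string (r"...") keeps the backslashes in the Windows path literal.
    with open(r"C:\zircon\sinbo1.txt") as f:
        print(cleanup_doc(f.read()))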

Other tips

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        # Tokenize, then keep only the tokens that are not stopwords.
        tokens = nltk.word_tokenize(s)
        cleanup = " ".join(filter(lambda word: word not in stopset, tokens))
        return cleanup

    s = "I am going to disco and bar tonight"
    x = cleanupDoc(s)
    print(x)

This code can help in solving the above problem.

From the error message, it seems like you're trying to convert a list, not a string, to lowercase. Your tokens = nltk.word_tokenize(s) is probably not returning what you expect (which seems to be a string).

It would be helpful to know what format your sinbo.txt file is in.

A few syntax issues:

  1. Import should be in lowercase: import nltk

  2. The line s = open("C:\zircon\sinbo1.txt").read() reads the whole file in, not a single line at a time. This may be problematic because word_tokenize works on a single sentence, not on an arbitrary sequence of tokens. The current line assumes that your sinbo.txt file contains a single sentence. If it doesn't, you may want to either (a) use a for loop over the file instead of read(), or (b) split the text into sentences first with NLTK's Punkt sentence tokenizer; see the sketch after this list.

  3. The first line of your cleanupDoc function is not properly indented. Your function should look like this (even if the functions within it change):

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower() for token in tokens
                   if token.lower() not in stopset and len(token) > 2]
        return cleanup
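
For point 2 above, here is a minimal sketch of option (b), assuming the file contains more than one sentence: nltk.sent_tokenize (backed by the Punkt model) splits the raw text into sentences before each sentence is word-tokenized. The path is the asker's own.

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(text):
        stopset = set(stopwords.words('english'))
        cleanup = []
        # Split into sentences first; word_tokenize is meant for
        # sentence-sized input, not a whole file at once.
        for sent in nltk.sent_tokenize(text):
            cleanup.extend(token.lower() for token in nltk.word_tokenize(sent)
                           if token.lower() not in stopset and len(token) > 2)
        return cleanup

    with open(r"C:\zircon\sinbo1.txt") as f:
        print(cleanupDoc(f.read()))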
    

In your particular case the error is in cleanup = [token.lower()for token in tokens.lower() not in stopset and len(token)>2]

tokens is a list, so you cannot call .lower() on it. The corrected version of that line would be:

cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]

I hope this helps.
