I’m having difficulty removing stop words from and tokenizing a .txt file using nltk. I keep getting the following AttributeError: 'list' object has no attribute 'lower'.

I just can’t figure out what I’m doing wrong, although this is my first time doing something like this. Below are my lines of code. I’d appreciate any suggestions, thanks.

    import nltk
    from nltk.corpus import stopwords
    s = open("C:\zircon\sinbo1.txt").read()
    tokens = nltk.word_tokenize(s)
    def cleanupDoc(s):
            stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
        return cleanup
    cleanupDoc(s)

Solution

You can use the stopwords lists from NLTK, see How to remove stop words using nltk or python.

Most probably you would also like to strip off punctuation as well; for that you can use string.punctuation (see http://docs.python.org/2/library/string.html):

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = set(stopwords.words('english') + list(string.punctuation))
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
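
Putting this together for the original question, here is a minimal sketch, assuming the file holds plain English text (the path is the asker's own; cleanup_doc is just an illustrative name):

    import string

    import nltk
    from nltk.corpus import stopwords

    # Requires the NLTK data packages: nltk.download('punkt') and
    # nltk.download('stopwords').
    def cleanup_doc(text):
        # One lookup set covering both stopwords and punctuation tokens.
        stop = set(stopwords.words('english') + list(string.punctuation))
        # Lowercase before tokenizing so the stopword lookup matches.
        return [tok for tok in nltk.word_tokenize(text.lower()) if tok not in stop]

    # A raw string (r"...") keeps the backslashes in the Windows path literal.
    with open(r"C:\zircon\sinbo1.txt") as f:
        print(cleanup_doc(f.read()))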

Other tips

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        # Tokenize, then keep only the tokens that are not stopwords.
        tokens = nltk.word_tokenize(s)
        cleanup = " ".join(filter(lambda word: word not in stopset, tokens))
        return cleanup

    s = "I am going to disco and bar tonight"
    x = cleanupDoc(s)
    print(x)

This code can help in solving the above problem.

From the error message, it seems like you're trying to convert a list, not a string, to lowercase. Your tokens = nltk.word_tokenize(s) is probably not returning what you expect (which seems to be a string).

It would be helpful to know what format your sinbo.txt file is in.

A few syntax issues:

  1. Import should be in lowercase: import nltk

  2. The line s = open("C:\zircon\sinbo1.txt").read() reads the whole file in, not a single line at a time. This may be problematic because word_tokenize works on a single sentence, not on an arbitrary sequence of tokens. The current line assumes that your sinbo.txt file contains a single sentence. If it doesn't, you may want to either (a) use a for loop over the file instead of read(), or (b) split the text into sentences first with NLTK's Punkt sentence tokenizer; see the sketch after this list.

  3. The first line of your cleanupDoc function is not properly indented. Your function should look like this (even if the functions within it change):

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower() for token in tokens
                   if token.lower() not in stopset and len(token) > 2]
        return cleanup
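
For point 2 above, here is a minimal sketch of option (b), assuming the file contains more than one sentence: nltk.sent_tokenize (backed by the Punkt model) splits the raw text into sentences before each sentence is word-tokenized. The path is the asker's own.

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(text):
        stopset = set(stopwords.words('english'))
        cleanup = []
        # Split into sentences first; word_tokenize is meant for
        # sentence-sized input, not a whole file at once.
        for sent in nltk.sent_tokenize(text):
            cleanup.extend(token.lower() for token in nltk.word_tokenize(sent)
                           if token.lower() not in stopset and len(token) > 2)
        return cleanup

    with open(r"C:\zircon\sinbo1.txt") as f:
        print(cleanupDoc(f.read()))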
    

In your particular case the error is in cleanup = [token.lower()for token in tokens.lower() not in stopset and len(token)>2]

tokens is a list, so you cannot call .lower() on it. The corrected version of that line would be:

cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]

I hope this helps.
