Question

I have a CSV file with 10 rows of text in one column. For each row, I would like to remove the stopwords and get back the same CSV file, just without the stopwords.

This is my code:

def remove_stopwords(filename):
  new_text_list=[]
  cr = csv.reader(open(filename,"rU").readlines()[1:])
  cachedStopWords = stopwords.words("english")

  for row in cr:
    text = ' '.join([word for word in row.split() if word not in cachedStopWords])
    print text
    new_text_list.append(text)

However, I keep getting this error:

AttributeError: 'list' object has no attribute 'split'

So it seems that the rows in my CSV file cannot be split using .split because they are lists. How can I get around this?

Here is what my CSV file looks like:

Text

I am very pleased with the your software for contractors. It is tailored quite neatly for the construction industry.

We have two different companies, one is real estate management and one is health and nutrition services. It works great for both.

So the above example is the first 3 lines of my CSV file. When I run this code:

cr = csv.reader(open(filename,"rU").readlines()[1:])
print cr[2]

I get this:

['We have two different companies, one is real estate management and one is health and nutrition services. It works great for both.']

Thanks,


Solution

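csv.reader returns each row as a list of fields, and a list has no .split method, which is why you get the AttributeError. Your file has a single column of free text, with the words separated by whitespace rather than commas, so you don't need the csv module for this. Instead, just read each line from the file and use row = line.split() to split the line on whitespace.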

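from nltk.corpus import stopwords

def remove_stopwords(filename):
    new_text_list = []
    # A set makes each "word not in ..." check fast
    cachedStopWords = set(stopwords.words("english"))
    # "rU" is Python 2's universal-newline mode; on Python 3 use "r"
    # (the "U" flag was removed in Python 3.11)
    with open(filename, "rU") as f:
        next(f)  # skip the header line
        for line in f:
            row = line.split()
            text = ' '.join([word for word in row
                             if word not in cachedStopWords])
            print(text)
            new_text_list.append(text)
    return new_text_list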
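
If you also want to write the cleaned rows back out as a CSV, as the question asks, something along these lines might work. This is only a sketch for Python 3: the function name strip_stopwords_csv and its parameters are made up for illustration, it assumes a header row followed by one text column, and the stop words are passed in so it doesn't depend on NLTK (in practice you would pass stopwords.words("english")). Lowercasing each word before the check is my addition, so that capitalised words like "It" match a lowercase stop-word list.

```python
import csv

def strip_stopwords_csv(src_path, dst_path, stop_words):
    """Copy a one-column CSV, dropping stop words from every data row."""
    stop_words = set(stop_words)  # a set gives fast membership tests
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # copy the header row unchanged
        for row in reader:
            if not row:  # csv.reader yields [] for blank lines; skip them
                continue
            # row is a list of fields; the text lives in the first one
            kept = [w for w in row[0].split() if w.lower() not in stop_words]
            writer.writerow([" ".join(kept)])
```

You would call it as, for example, strip_stopwords_csv("reviews.csv", "reviews_clean.csv", stopwords.words("english")), where both file names are placeholders. Note that words with punctuation attached (such as "both.") are compared as-is, so they will not match a stop word unless you strip the punctuation first.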

By the way, checking membership in a set is an O(1) operation on average, while checking membership in a list is an O(n) operation. So it's advantageous to make cachedStopWords a set:

cachedStopWords = set(stopwords.words("english"))
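
If you want to see the difference for yourself, a rough timing comparison could look like the following. The word list here is a made-up stand-in for stopwords.words("english"), and the actual numbers will vary by machine, so treat it as an illustration rather than a benchmark.

```python
import timeit

# Hypothetical stand-in for stopwords.words("english")
stop_list = ["word%d" % i for i in range(200)]
stop_set = set(stop_list)

probe = "zzz"  # not present: the worst case for the list scan

# Both containers give the same answer...
assert (probe in stop_list) == (probe in stop_set)

# ...but the list compares against every element, while the set
# does a single hash lookup.
list_time = timeit.timeit(lambda: probe in stop_list, number=100000)
set_time = timeit.timeit(lambda: probe in stop_set, number=100000)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```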
Licensed under: CC-BY-SA with attribution