Creating a Set is giving different output than expected

https://stackoverflow.com/questions/19721814

02-07-2022

Question

I am combining the processed data of two essays into one. I want to create a set to count how many different words are used, as well as for other analysis. However, when I combine them and call set(entire), I get back just a set of letters. The code and the output I am getting are below. I would like the output to be all the words being used.

print set(entire)
set([' ', '1', '0', '3', '2', '5', '4', '6', '9', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x'])



from __future__ import division
import nltk
import csv
import re
from string import punctuation
import enchant
from enchant.checker import SpellChecker

dictionary = enchant.Dict("en_US")
chkr = SpellChecker("en_US")

with open('2012ShortAnswers.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=",")

    writer = csv.writer(open('2012output.csv', 'wb'))

    for row in data:

        row3 = row[3]
        row3 = row3.lower().replace('  ', ' ')
        row4 = row[4]
        row4 = row4.lower().replace('  ', ' ')

        row3 = row3.replace('\n', '')
        row4 = row4.replace('\n', '')

        for p in list(punctuation):
            row3 = row3.replace(p, '')
            row4 = row4.replace(p, '')

        entire = row3 + row4
        set(entire)

Solution

row3 and row4 are strings. At no point do you split them into words. When you call set() on a string, it iterates over the string character by character, so you get a set of the characters in the string.
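A minimal sketch of the difference, using a made-up sample string (not data from the question). It only illustrates how set() treats a string versus a list of words:

```python
# Hypothetical sample text, standing in for one processed essay.
text = "the cat saw the dog"

# set() iterates over its argument; iterating a string yields characters.
chars = set(text)

# str.split() with no arguments splits on runs of whitespace,
# so the set is built from whole words instead.
words = set(text.split())

print(sorted(chars))
print(sorted(words))
```

The first set contains single characters (including the space), which matches the kind of output shown in the question; the second contains the distinct words.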

Perhaps try row3 = row3.split() and likewise for row4, then do set(row3+row4).

That won't really fix it, though, since right now you aren't doing anything with that set. You should create some other set outside the loop and add to it on each loop iteration. Right now you create a set on each iteration but just throw it away.
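One way this could look, as a sketch rather than a drop-in fix: the rows below are invented stand-ins for what csv.reader would yield, and all_words is a hypothetical name. Splitting each column before combining also avoids gluing the last word of one column to the first word of the next, which row3 + row4 on raw strings would do.

```python
# Hypothetical rows, shaped like what csv.reader yields (lists of strings).
rows = [
    ["1", "a", "b", "the cat sat", "on the mat"],
    ["2", "a", "b", "a dog sat", "on a log"],
]

# Created once, outside the loop, so it survives across iterations.
all_words = set()

for row in rows:
    # Split each column into words first, then combine the word lists.
    words = row[3].lower().split() + row[4].lower().split()
    # update() adds every element of the iterable to the set.
    all_words.update(words)

print(len(all_words))
print(sorted(all_words))
```

After the loop, all_words holds the distinct words from every row, not just the last one.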

OTHER TIPS

You are processing each line of input and overwriting all the previous lines, so in the end, your variables are just reflecting whatever the last line was.

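You either need to create a set before entering the loop (myset = set()) and add to it inside the loop, or append to a list inside the loop and build the set from it after the loop exits. Note that myset.add(row3) adds the whole string as a single element; to add the individual words, use myset.update(row3.split()).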
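Both options can be sketched side by side. The input strings and the names word_list, from_list and whole are made up for illustration:

```python
# Hypothetical processed text, one string per input row.
rows_text = ["the cat sat", "the dog sat"]

# Option 1: a set created before the loop, filled inside it.
myset = set()
# Option 2: a list filled inside the loop, converted afterwards.
word_list = []

for text in rows_text:
    myset.update(text.split())      # adds each word separately
    word_list.extend(text.split())  # keeps duplicates and order

from_list = set(word_list)

# For contrast: add() inserts its argument as one element.
whole = set()
whole.add("the cat sat")

print(sorted(myset))
print(sorted(from_list))
print(whole)
```

The two options give the same set. The list version also keeps every occurrence, which may be useful if the later analysis needs word frequencies and not just the distinct words.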

Also, you know that row[3] is really the 4th column of the data, as split on commas, right? What are you hoping to get from the csv reader if this is an essay?

Licensed under: CC-BY-SA with attribution
