Parse a CSV file using python (to make a decision tree later) [closed]

https://stackoverflow.com/questions/2726167

02-10-2019
|

Question

First off, full disclosure: This is going towards a uni assignment, so I don't want to receive code. :). I'm more looking for approaches; I'm very new to python, having read a book but not yet written any code.

The entire task is to import the contents of a CSV file, create a decision tree from the contents of the CSV file (using the ID3 algorithm), and then parse a second CSV file to run against the tree. There's a big (understandable) preference to have it capable of dealing with different CSV files (I asked if we were allowed to hard code the column names, mostly to eliminate it as a possibility, and the answer was no).

The CSV files are in a fairly standard format; the header row is marked with a # then the column names are displayed, and every row after that is a simple series of values. Example:

# Column1, Column2, Column3, Column4
Value01, Value02, Value03, Value04
Value11, Value12, Value13, Value14

At the moment, I'm trying to work out the first part: parsing the CSV. To make the decisions for the decision tree, a dictionary structure seems like it's going to be the most logical; so I was thinking of doing something along these lines:

Read in each line, character by character
If the character is not a comma or a space
    Append character to temporary string
If the character is a comma
    Append the temporary string to a list
    Empty string
Once a line has been read
    Create a dictionary using the header row as the key (somehow!)
    Append that dictionary to a list

However, if I do things that way, I'm not sure how to make a mapping between the keys and the values. I'm also wondering whether there is some way to perform an action on every dictionary in a list, since I'll need to be doing things to the effect of "Everyone return their values for columns Column1 and Column4, so I can count up who has what!" - I assume that there is some mechanism, but I don't think I know how to do it.

Is a dictionary the best way to do it? Would I be better off doing things using some other data structure? If so, what?

Solution

Python has some pretty powerful language constructs builtin. You can read lines from a file like:

with open(name_of_file,"r") as file:
    for line in file:
         # process the line

You can use the string.split function to separate the line along commas, and you can use string.strip to eliminate intervening whitespace. Python has very powerful lists and dictionaries.

To create a list, you simply use empty brackets like [], while to create an empty dictionary you use {}:

mylist = []; # Creates an empty list
mydict = {}; # Creates an empty dictionary

You can insert into the list using the .append() function, while you can use indexing subscripts to insert into the dictionary. For example, you can use mylist.append(5) to add 5 to the list, while you can use mydict[key]=value to associate the key key with the value value. To test whether a key is present in the dictionary, you can use the in keyword. For example:

if key in mydict:
   print "Present"
else:
   print "Absent"

To iterate over the contents of a list or dictionary, you can simply use a for-loop as in:

for val in mylist:
    # do something with val

for key in mydict:
    # do something with key or with mydict[key]

Since, in many cases, it is necessary to have both the value and index when iterating over a list, there is also a builtin function called enumerate that saves you the trouble of counting indices yourself:

for idx, val in enumerate(mylist):
    # do something with val or with idx. Note that val=mylist[idx]

The code above is identical in function to:

idx=0
for val in mylist:
   # process val, idx
   idx += 1

You could also iterate over the indices if you so chose:

for idx in xrange(len(mylist)):
    # Do something with idx and possibly mylist[idx]

Also, you can get the number of elements in a list or the number of keys in a dictionary using len.

It is possible to perform an operation on each element of a dictionary or list via the use of list comprehension; however, I would recommend that you simply use for-loops to accomplish that task. But, as an example:

>>> list1 = range(10)
>>> list1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list2 = [2*x for x in list1]
>>> list2
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

When you have the time, I suggest you read the Python tutorial to get some more in-depth knowledge.

OTHER TIPS

Example using the csv module from docs.python.org:

import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    print row

Instead of printing the rows, you could just save each row into a list, and then process it in the ID3 later.

database.append(row)

Short answer: don't waste time and mental energy (1) reimplementing the built-in csv module (2) reading the csv module's source (it's written in C) -- just USE it!

Look at csv.DictReader.

Example:

import csv
reader = csvDictReader(open('my_file.csv','rb') # 'rb' = read binary
for d in reader:
    print d # this will print out a dictionary with keys equal to the first row of the file.

Take a look at the built-in CSV module. Though you probably can't just use it, you can take a sneak peek at the code...

If that's a no-no, your (pseudo)code looks perfectly fine, though you should make use of the str.split() function and use that, reading the file line-by-line.

Parse the CSV correctly

I'd avoid using str.split() to parse the fields because str.split() will not recognize quoted values. And many real-world CSV files use quotes. http://en.wikipedia.org/wiki/Comma-separated_values

Example record using quoted values:

1997,Ford,E350,"Super, luxurious truck"

If you use str.split(), you'll get a record like this with 5 fields:

('1997', 'Ford', 'E350', '"Super', ' luxurious truck"')

But what you really want are records like this with 4 fields:

('1997', 'Ford', 'E350', 'Super, luxurious truck')

Also, besides commas being in the data, you may have to deal with newlines "\r\n" or just "\n" in the data. For example:

1997,Ford,E350,"Super
luxurious truck"
1997,Ford,E250,"Ok? Truck"

So be careful using:

file = open('filename.csv', 'r')
for line in file:
    # problem here, "line" may contain partial data

Also, like John mentioned, the CSV standard is, that in quotes, if you get a double-double quote, then it turns into one quote.

1997,Ford,E350,"Super ""luxurious"" truck"

('1997', 'Ford', 'E350', 'Super "luxurious" truck')

So I'd suggest to modify your finite state machine like this:

Parse each character at a time.
Check to see if it's a quote, then set the state to "in quote"
If "in quote", store all the characters in the current field until there's another quote.
If "in quote", and there's another quote, store the quote character in the field data. (not the end, because a blank field shouldn't be `data,"",data` but instead `data,,data`)
If not "in quote", store the characters until you find a comma or newline.
If comma, save field and start a new field.
If newline, save field, save record, start a new record and a new field.

On a side note, interestingly, I've never seen a header commented out using # in a CSV. So to me, that would imply that you may have to look for commented lines in the data too. Using # to comment out a line in a CSV file is not standard.

Adding found fields into a record dictionary using header keys

Depending on memory requirements, if the CSV is small enough (maybe 10k to 100k records), using a dictionary is fine. Just store a list of all the column names so you can access the column name by index (or number). Then in the finite state machine, increment the column index when you find a comma, and reset to 0 when you find a newline.

So if your header is header = ['Column1', 'Column2'] Then when you find a data character, add it like this:

record[header[column_index]] += character

I don't know too much about the builtin csv module that @Kaloyan Todorov talks about, but, if you're reading comma separated lines, then you can easily do this:

for line in file:
    columns = line.split(',')
    for column in columns:
        print column.strip()

This will print all the entries of each line without the leading a tailing whitespaces.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow