Python Gurus,
In the past, I've been using Perl to go through very large text files for data mining. Recently I've decided to switch over since I believe Python makes it easier for me to go through my code and figure out what's going on.
The unfortunate (or maybe fortunate?) thing about Python is that it's extremely difficult to store and organize data when compared to Perl since I can't create hashes of hashes via autovivification. I'm also not able to sum the elements of the dictionary of dictionaries.
Maybe there's an elegant solution to my problem.
I have hundreds of files with several hundred rows of data (all can fit in memory). The goal is to combine these two files, but with certain criteria:
For each level (only showing one level below) I need to create a row for each defect class that was found in all the files. Not all files have the same defects.
For each level and defect class sum up all the GEC & BEC values found in all the files.
Final output should look like (updated sample output, typo):
Level defectClass BECtotals GECtotals
1415PA, 0, 643, 1991
1415PA, 1, 1994, 6470
...and so on.....
File one:
Level, defectClass, BEC, GEC
1415PA, 0, 262, 663
1415PA, 1, 1138, 4104
1415PA, 107, 2, 0
1415PA, 14, 3, 4
1415PA, 15, 1, 0
1415PA, 2, 446, 382
1415PA, 21, 5, 0
1415PA, 23, 10, 5
1415PA, 4, 3, 16
1415PA, 6, 52, 105
File two:
level, defectClass, BEC, GEC
1415PA, 0, 381, 1328
1415PA, 1, 856, 2366
1415PA, 107, 7, 11
1415PA, 14, 4, 1
1415PA, 2, 315, 202
1415PA, 23, 4, 7
1415PA, 4, 0, 2
1415PA, 6, 46, 42
1415PA, 7, 1, 7
I'm having the biggest problem being able to do the summations on the dictionaries. This is the code I have so far (not working):
import os
import sys
class AutoVivification(dict):
"""Implementation of perl's autovivification feature. Has features from both dicts and lists,
dynamically generates new subitems as needed, and allows for working (somewhat) as a basic type.
"""
def __getitem__(self, item):
if isinstance(item, slice):
d = AutoVivification()
items = sorted(self.iteritems(), reverse=True)
k,v = items.pop(0)
while 1:
if (item.start < k < item.stop):
d[k] = v
elif k > item.stop:
break
if item.step:
for x in range(item.step):
k,v = items.pop(0)
else:
k,v = items.pop(0)
return d
try:
return dict.__getitem__(self, item)
except KeyError:
value = self[item] = type(self)()
return value
def __add__(self, other):
"""If attempting addition, use our length as the 'value'."""
return len(self) + other
def __radd__(self, other):
"""If the other type does not support addition with us, this addition method will be tried."""
return len(self) + other
def append(self, item):
"""Add the item to the dict, giving it a higher integer key than any currently in use."""
largestKey = sorted(self.keys())[-1]
if isinstance(largestKey, str):
self.__setitem__(0, item)
elif isinstance(largestKey, int):
self.__setitem__(largestKey+1, item)
def count(self, item):
"""Count the number of keys with the specified item."""
return sum([1 for x in self.items() if x == item])
def __eq__(self, other):
"""od.__eq__(y) <==> od==y. Comparison to another AV is order-sensitive
while comparison to a regular mapping is order-insensitive. """
if isinstance(other, AutoVivification):
return len(self)==len(other) and self.items() == other.items()
return dict.__eq__(self, other)
def __ne__(self, other):
"""od.__ne__(y) <==> od!=y"""
return not self == other
for filename in os.listdir('/Users/aleksarias/Desktop/DefectMatchingDatabase/'):
if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
continue
path = '/Users/aleksarias/Desktop/DefectMatchingDatabase/' + filename
for filename2 in os.listdir(path):
if filename2[0] == '.':
continue
path2 = path + "/" + filename2
techData = AutoVivification()
for file in os.listdir(path2):
if file[0:13] == 'SummaryRearr_':
dataFile = path2 + '/' + file
print('Location of file to read: ', dataFile, '\n')
fh = open(dataFile, 'r')
for lines in fh:
if lines[0:5] == 'level':
continue
lines = lines.strip()
elements = lines.split(',')
if techData[elements[0]][elements[1]]['BEC']:
techData[elements[0]][elements[1]]['BEC'].append(elements[2])
else:
techData[elements[0]][elements[1]]['BEC'] = elements[2]
if techData[elements[0]][elements[1]]['GEC']:
techData[elements[0]][elements[1]]['GEC'].append(elements[3])
else:
techData[elements[0]][elements[1]]['GEC'] = elements[3]
print(elements[0], elements[1], techData[elements[0]][elements[1]]['BEC'], techData[elements[0]][elements[1]]['GEC'])
techSumPath = path + '/Summary_' + filename + '.csv'
fh2 = open(techSumPath, 'w')
for key1 in sorted(techData):
for key2 in sorted(techData[key1]):
BECtotal = sum(map(int, techData[key1][key2]['BEC']))
GECtotal = sum(map(int, techData[key1][key2]['GEC']))
fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
print('Created file at:', techSumPath)
input('Go check the file!!!!')
Thanks for taking a look at this!!!!!
Alex