Question

Python Gurus,

In the past I used Perl to work through very large text files for data mining. I've recently decided to switch to Python, since I find it easier to read my own code and figure out what's going on. The unfortunate (or maybe fortunate?) thing about Python is that, compared to Perl, it makes storing and organizing data harder for me: there's no built-in autovivification, so I can't just create hashes of hashes on the fly, and summing the elements of a dictionary of dictionaries has been a struggle.
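
The closest I've found to autovivification is a recursive collections.defaultdict, which at least gives me hashes of hashes on demand:

from collections import defaultdict

def tree():
    # Every missing key autovivifies into another nested tree, Perl-style.
    return defaultdict(tree)

data = tree()
data['1415PA']['0']['BEC'] = [262, 381]   # intermediate dicts spring into existence
print(sum(data['1415PA']['0']['BEC']))    # 643

That gets me the nested structure, but the bookkeeping for the sums across hundreds of files still feels clumsy.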

Maybe there's an elegant solution to my problem.

I have hundreds of files, each with several hundred rows of data (everything fits in memory). The goal is to combine these files (two samples are shown below), subject to certain criteria:

  1. For each level (only one level is shown below), I need a row for every defect class that appears in any of the files; not all files contain the same defect classes.

  2. For each level and defect class, sum the BEC and GEC values across all the files (for example, defect class 0 has BEC 262 + 381 = 643).

  3. The final output should look like this:

Level, defectClass, BECtotals, GECtotals
1415PA, 0, 643, 1991
1415PA, 1, 1994, 6470
...and so on.....

File one:

Level,  defectClass,  BEC,   GEC
1415PA,   0,          262,   663
1415PA,   1,          1138,  4104
1415PA, 107,          2,     0
1415PA,  14,          3,     4
1415PA,  15,          1,     0
1415PA,   2,          446,   382
1415PA,  21,          5,     0
1415PA,  23,          10,    5
1415PA,   4,          3,     16
1415PA,   6,          52,    105

File two:

level,  defectClass,  BEC,  GEC
1415PA,   0,          381,  1328
1415PA,   1,          856,  2366
1415PA, 107,          7,    11
1415PA,  14,          4,    1
1415PA,   2,          315,  202
1415PA,  23,          4,    7
1415PA,   4,          0,    2
1415PA,   6,          46,   42
1415PA,   7,          1,    7

My biggest problem is doing the summations on the dictionaries. This is the code I have so far (not working):

import os
import sys


class AutoVivification(dict):
    """Implementation of Perl's autovivification feature. Has features from both
    dicts and lists, dynamically generates new subitems as needed, and allows for
    working (somewhat) as a basic type.
    """
    def __getitem__(self, item):
        if isinstance(item, slice):
            d = AutoVivification()
            items = sorted(self.items(), reverse=True)
            k, v = items.pop(0)
            while 1:
                if item.start < k < item.stop:
                    d[k] = v
                elif k > item.stop:
                    break
                if item.step:
                    for x in range(item.step):
                        k, v = items.pop(0)
                else:
                    k, v = items.pop(0)
            return d
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            # Autovivify: create, store, and return a new empty child node.
            value = self[item] = type(self)()
            return value

    def __add__(self, other):
        """If attempting addition, use our length as the 'value'."""
        return len(self) + other

    def __radd__(self, other):
        """If the other type does not support addition with us, this addition method will be tried."""
        return len(self) + other

    def append(self, item):
        """Add the item to the dict, giving it a higher integer key than any currently in use."""
        largestKey = sorted(self.keys())[-1]
        if isinstance(largestKey, str):
            self.__setitem__(0, item)
        elif isinstance(largestKey, int):
            self.__setitem__(largestKey + 1, item)

    def count(self, item):
        """Count the number of keys whose value is the specified item."""
        return sum(1 for x in self.values() if x == item)

    def __eq__(self, other):
        """av.__eq__(y) <==> av==y. Comparison to another AV is order-sensitive
        while comparison to a regular mapping is order-insensitive."""
        if isinstance(other, AutoVivification):
            return len(self) == len(other) and self.items() == other.items()
        return dict.__eq__(self, other)

    def __ne__(self, other):
        """av.__ne__(y) <==> av!=y"""
        return not self == other

for filename in os.listdir('/Users/aleksarias/Desktop/DefectMatchingDatabase/'):
    if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
        continue
    path = '/Users/aleksarias/Desktop/DefectMatchingDatabase/' + filename

    for filename2 in os.listdir(path):
        if filename2[0] == '.':
            continue
        path2 = path + "/" + filename2
        techData = AutoVivification()

        for file in os.listdir(path2):
            if file[0:13] == 'SummaryRearr_':
                dataFile = path2 + '/' + file
                print('Location of file to read:', dataFile, '\n')

                with open(dataFile, 'r') as fh:
                    for line in fh:
                        # Skip the header whether it reads 'level' or 'Level'.
                        if line.lower().startswith('level'):
                            continue
                        # Strip whitespace from each field so keys match across files.
                        elements = [e.strip() for e in line.split(',')]

                        # Keep each value in a list so it can be summed later.
                        if techData[elements[0]][elements[1]]['BEC']:
                            techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                        else:
                            techData[elements[0]][elements[1]]['BEC'] = [elements[2]]

                        if techData[elements[0]][elements[1]]['GEC']:
                            techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                        else:
                            techData[elements[0]][elements[1]]['GEC'] = [elements[3]]

                        print(elements[0], elements[1],
                              techData[elements[0]][elements[1]]['BEC'],
                              techData[elements[0]][elements[1]]['GEC'])

    techSumPath = path + '/Summary_' + filename + '.csv'
    with open(techSumPath, 'w') as fh2:
        for key1 in sorted(techData):
            for key2 in sorted(techData[key1]):
                BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
    print('Created file at:', techSumPath)
    input('Go check the file!!!!')

Thanks for taking a look at this!!!!!
Alex


Solution

I'm going to suggest a different approach: if you're processing tabular data, you should look at the pandas library. Your code becomes something like

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # or whatever

dfs = []
for filename in filenames:
    # skipinitialspace swallows the stray spaces after the commas
    df = pd.read_csv(filename, skipinitialspace=True)
    # normalize the header: one file says "level", the other "Level"
    df = df.rename(columns={"level": "Level"})
    dfs.append(df)

# stack all files into one frame, then total BEC and GEC per (Level, defectClass)
df_comb = pd.concat(dfs)
df_totals = df_comb.groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)

which produces

dsm@notebook:~/coding/pand$ cat combined.csv 
Level,defectClass,BEC,GEC
1415PA,0,643,1991
1415PA,1,1994,6470
1415PA,2,761,584
1415PA,4,3,18
1415PA,6,98,147
1415PA,7,1,7
1415PA,14,7,5
1415PA,15,1,0
1415PA,21,5,0
1415PA,23,14,12
1415PA,107,9,11

Here I've read every file into memory at once and combined them into one big DataFrame (like an Excel sheet), but we could just as easily have done the groupby operation file by file, so that only one file needs to be in memory at a time.
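If memory ever became a concern, a minimal sketch of that file-at-a-time variant (reusing the same assumed file names as above) might look like:

import pandas as pd

filenames = "fileone.txt", "filetwo.txt"  # same assumed names as above

running = None
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    # Collapse this one file to a single row per (Level, defectClass)...
    totals = df.groupby(["Level", "defectClass"]).sum()
    # ...then fold it into the running totals; fill_value=0 covers defect
    # classes that appear in some files but not in others.
    running = totals if running is None else running.add(totals, fill_value=0)

running.astype(int).reset_index().to_csv("combined.csv", index=False)

DataFrame.add aligns on the (Level, defectClass) index, which is why defect classes like 7, 15, and 21 that occur in only one of the files still come through with the right totals.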

Licensed under: CC-BY-SA with attribution