Question

I made the following code which works but I want to improve it. I don't want to re-read the file, but if I delete sales_input.seek(0) it won't iterate throw each row in sales. How can i improve this?

def computeCritics(mode, cleaned_sales_input = "data/cleaned_sales.csv"):
    if mode == 1:
        print "creating customer.critics.recommendations"
        critics_output = open("data/customer/customer.critics.recommendations", 
                              "wb")
        ID = getCustomerSet(cleaned_sales_input)
        sales_dict = pickle.load(open("data/customer/books.dict.recommendations", 
                                      "r"))
    else: 
        print "creating books.critics.recommendations"
        critics_output = open("data/books/books.critics.recommendations", 
                              "wb")
        ID = getBookSet(cleaned_sales_input)
        sales_dict = pickle.load(open("data/books/users.dict.recommendations", 
                                      "r"))
    critics = {}
    # make critics dict and pickle it 
    for i in ID:
        with open(cleaned_sales_input, 'rb') as sales_input:
            sales = csv.reader(sales_input)  # read new 
            for j in sales:
                if mode == 1:
                    if int(i) == int(j[2]):
                        sales_dict[int(j[6])] = 1
                else: 
                    if int(i) == int(j[6]):
                        sales_dict[int(j[2])] = 1
            critics[int(i)] = sales_dict
    pickle.dump(critics, critics_output)
    print "done"

cleaned_sales_input looks like

6042772,2723,3546414,9782072488887,1,9.99,314968
6042769,2723,3546414,9782072488887,1,9.99,314968
...

where number 6 is the book ID and number 0 is the customer ID

I want to get a dict wich looks like

critics = {
    CustomerID1: {
        BookID1: 1,
        BookID2: 0,
        ........
        BookIDX: 0
    },
    CustomerID2: {
        BookID1: 0,
        BookID2: 1,
        ...
    }
}

or

critics = {
    BookID1: {
        CustomerID1: 1,
        CustomerID2: 0,
        ........
        CustomerIDX: 0
    },
    BookID1: {
        CustomerID1: 0,
        CustomerID2: 1,
        ...
        CustomerIDX: 0
    }
}

I hope this isn't to much information

Was it helpful?

Solution

Here are some suggestions:

Let's first look at this code pattern:

for i in ID:
    for j in sales:
        if int(i) == int(j[2])

notice that i is only being compared with j[2]. That's its only purpose in the loop. int(i) == int(j[2]) can only be True at most once for each i.

So, we can completely remove the for i in ID loop by rewriting it as

for j in sales:
    key = j[2]
    if key in ID:

Based on the function names getCustomerSet and getBookSet, it sounds as if ID is a set (as opposed to a list or tuple). We want ID to be a set since testing membership in a set is O(1) (as opposed to O(n) for a list or tuple).


Next, consider this line:

critics[int(i)] = sales_dict

There is a potential pitfall here. This line is assigning sales_dict to critics[int(i)] for each i in ID. Each key int(i) is being mapped to the very same dict. As we loop through sales and ID, we are modifying sales_dict like this, for example:

sales_dict[int(j[6])] = 1

But this will cause all values in critics to be modified simultaneously, since all keys in critics point to the same dict, sales_dict. I doubt that is what you want.

To avoid this pitfall, we need to make copies of the sales_dict:

critics = {i:sales_dict.copy() for i in ID}

def computeCritics(mode, cleaned_sales_input="data/cleaned_sales.csv"):
    if mode == 1:
        filename = 'customer.critics.recommendations'
        path = os.path.join("data/customer", filename)
        ID = getCustomerSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/customer/books.dict.recommendations", "r"))
        key_idx, other_idx = 2, 6
    else:
        filename = 'books.critics.recommendations'
        path = os.path.join("data/books", filename)        
        ID = getBookSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/books/users.dict.recommendations", "r"))
        key_idx, other_idx = 6, 2

    print "creating {}".format(filename)
    ID = {int(item) for item in ID}
    critics = {i:sales_dict.copy() for i in ID}
    with open(path, "wb") as critics_output:
        # make critics dict and pickle it
        with open(cleaned_sales_input, 'rb') as sales_input:
            sales = csv.reader(sales_input)  # read new
            for j in sales:
                key = int(j[key_idx])
                if key in ID:
                    other_key = int(j[other_idx])
                    critics[key][other_key] = 1                    
                critics[key] = sales_dict
        pickle.dump(dict(critics), critics_output)
        print "done"

OTHER TIPS

@unutbu's answer is better but if you are stuck with this structure you can put the whole file in memory:

sales = []
with open(cleaned_sales_input, 'rb') as sales_input:
     sales_reader = csv.reader(sales_input)       
     [sales.append(line) for line in sales_reader]

     for i in ID:
        for j in sales:
            #do stuff
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top