Question

I'm attempting to create a graph of users in Python using the networkx package. My raw data is individual payment transactions, where the payment data includes a user, a payment instrument, an IP address, etc. My nodes are users, and I am creating edges if any two users have shared an IP address.

From that transaction data, I've created a Pandas dataframe of unique [user, IP] pairs. To create edges, I need to find [user_a, user_b] pairs where both users share an IP. Let's call this DataFrame 'df' with columns 'user' and 'ip'.
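For concreteness, a minimal sketch of what such a frame might look like (the 'transactions' frame and its values here are made up purely for illustration; only the 'user' and 'ip' columns matter):

    import pandas as pd

    # hypothetical raw transaction data; only 'user' and 'ip' are relevant here
    transactions = pd.DataFrame({
        'user':   ['A', 'A', 'B', 'B', 'C'],
        'ip':     ['1.1.1.1', '2.2.2.2', '1.1.1.1', '3.3.3.3', '3.3.3.3'],
        'amount': [10.0, 5.0, 7.5, 1.0, 2.0],
    })

    # unique [user, ip] pairs, as described above
    df = transactions[['user', 'ip']].drop_duplicates().reset_index(drop=True)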

I keep running into memory problems, and have tried a few different solutions, outlined below. For reference, the raw transaction list contains ~500,000 transactions, covering ~130,000 users and ~30,000 IPs, and likely implies ~30,000,000 links.

  1. Join df to itself on IP, sort each pair, and remove duplicates (so that [X, Y] and [Y, X] don't both show up as unique pairs).

    import numpy as np
    import pandas as pd
    df_pairs = df.merge(df, on='ip', suffixes=('l', 'r'))  # self-join on shared IP
    df_sorted_pairs = [np.sort([df_pairs['userl'][i], df_pairs['userr'][i]]) for i in range(len(df_pairs))]
    edges = np.asarray(pd.DataFrame(df_sorted_pairs).drop_duplicates())
    

    This works pretty well, but gives me a Memory Error fairly quickly, as joining a table to itself grows very quickly. (A vectorized version of the sort-and-dedup step is sketched just after this list.)

  2. Create a matrix, where users are the rows, IPs are the columns, and matrix elements are 1 if that user transacted on the IP and 0 otherwise. Then X.dot(X.transpose()) is a square matrix whose elements (i,j) represent how many IPs were shared by user i and user j.

    user_list = df['user'].unique()
    ip_list = df['ip'].unique()
    df_x = pd.DataFrame(0, index=user_list, columns=ip_list)  # dense user x IP matrix of zeros
    for row in range(len(df)):
        # .at avoids chained indexing when setting a single cell
        df_x.at[df['user'].iloc[row], df['ip'].iloc[row]] = 1
    df_links = df_x.dot(df_x.transpose())
    

    This works extremely well as long as len(ip_list) stays small (roughly under 5,000). Beyond that it falls over: just creating the empty dataframe of, say, 500,000 rows x 200,000 columns gives a Memory Error.

  3. Brute force. Iterate across the users one by one. For each user, find the distinct IPs. For each IP, find the distinct users. Those resulting users are therefore linked to the user in the current iteration. Add each resulting [User1, User2] pair to a master list of links.

    user_list = df['user'].unique()
    ip_list = df['ip'].unique()
    links=[]
    for user in user_list:
        related_ip_list = df[df['user'] == user]['ip'].unique()
        for ip in related_ip_list:
            related_user_list = df[df['ip'] == ip]['user'].unique()
            for related_user in related_user_list:
                if related_user != user:
                    links.append([user, related_user])
    

    This works, but is extremely slow. It ran for 3 hours and finally gave me a Memory Error. Because links was being built up along the way, I could check how big it got - about 23,000,000 links.
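
As a side note on approach 1: the sort-and-dedup step can also be done with a vectorized filter instead of a per-row np.sort. This is just a sketch of that one step (it assumes the self-join on 'ip' has produced the 'userl'/'userr' columns as above, and it also drops self-pairs); it does not fix the memory blow-up of the join itself.

    # keep each unordered pair exactly once and drop self-pairs ([X, X])
    mask = df_pairs['userl'] < df_pairs['userr']
    edges = np.asarray(df_pairs.loc[mask, ['userl', 'userr']].drop_duplicates())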

Any advice would be appreciated. Have I simply gone too far into "Big Data" territory, where traditional methods like the above aren't going to cut it? I didn't think 500,000 transactions qualified as "Big Data", but I guess storing a 130,000 x 30,000 matrix or building a list with 30,000,000 elements is pretty large?


Solution

I think your problem is that a matrix representation is not going to cut it:

Note that, memory-wise, you are doing some very inefficient things. For example, you create a matrix full of zeros that all have to be allocated in RAM. It would be a lot more efficient to have no object in RAM at all for a connection that does not exist, rather than a zero float. You are "abusing" linear algebra to solve your problem, which is what makes you use so much RAM: the number of elements in your matrix is 130k * 30k = a gazillion (about 4 billion), but you "only" have 30m links that you actually care about.

I truly feel for you, because pandas was the first library I learned, and I used to try to solve almost every problem with it. I noticed over time, though, that the matrix approach is not optimal for a lot of problems.

There is a "spare matrix" somewhere in numpy, but let's not go there.

Let me suggest another approach:

Use a simple defaultdict:

from collections import defaultdict

# a dict that creates an empty set for any key that doesn't exist yet
shared_ips = defaultdict(set)

# for each IP, collect the set of users seen on it
# (unique_user_ip_pairs is your dataframe of unique [user, ip] pairs)
for _, row in unique_user_ip_pairs.iterrows():
    shared_ips[row['ip']].add(row['user'])

# filter the dict for IPs that have more than 1 user
shared_ips = {ip: users for ip, users in shared_ips.items() if len(users) > 1}
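
If iterating with iterrows turns out to be slow on ~500k pairs, the same dict can be built with a groupby instead; this is just an equivalent sketch, assuming the same unique_user_ip_pairs frame:

# group the unique pairs by IP and collect the users on each IP into a set
shared_ips = unique_user_ip_pairs.groupby('ip')['user'].apply(set).to_dict()
shared_ips = {ip: users for ip, users in shared_ips.items() if len(users) > 1}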

I'm not sure if this is 100% going to solve your use case, but note the efficiency:

This will at most double the RAM usage compared to your initial unique user-IP pairs object, but you will get the information about which IP was shared amongst which users.
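
From there, getting to the graph the question actually asks for is cheap: each IP's user set just needs to be expanded into pairwise edges. A minimal sketch, assuming the shared_ips dict built above and that you still want a networkx graph of users:

import itertools
import networkx as nx

G = nx.Graph()

# for every IP shared by more than one user, connect each pair of those users
for ip, users in shared_ips.items():
    G.add_edges_from(itertools.combinations(users, 2))

Since nx.Graph ignores duplicate edges, pairs of users who share several IPs are only stored once.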

The big lesson is this:

If most cells in a matrix represent the same piece of information (here: the absence of a link), don't reach for a matrix approach once you run into memory problems.

I've seen so many pandas solutions to problems that could have been solved with simple use of Python's built-in types like dict, set, frozenset and Counter. Especially people coming to Python from statistical toolboxes like MATLAB, R or Excel are very prone to this (they sure like their tables). I suggest trying not to make pandas the personal built-in library you reach for first...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow