Question

I'm attempting to create a graph of users in Python using the networkx package. My raw data is individual payment transactions, where the payment data includes a user, a payment instrument, an IP address, etc. My nodes are users, and I am creating edges if any two users have shared an IP address.

From that transaction data, I've created a Pandas dataframe of unique [user, IP] pairs. To create edges, I need to find [user_a, user_b] pairs where both users share an IP. Let's call this DataFrame 'df' with columns 'user' and 'ip'.
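For concreteness, a minimal sketch of what such a frame might look like (the 'transactions' frame and its values here are made up purely for illustration; only the 'user' and 'ip' columns matter):

    import pandas as pd

    # hypothetical raw transaction data; only 'user' and 'ip' are relevant here
    transactions = pd.DataFrame({
        'user':   ['A', 'A', 'B', 'B', 'C'],
        'ip':     ['1.1.1.1', '2.2.2.2', '1.1.1.1', '3.3.3.3', '3.3.3.3'],
        'amount': [10.0, 5.0, 7.5, 1.0, 2.0],
    })

    # unique [user, ip] pairs, as described above
    df = transactions[['user', 'ip']].drop_duplicates().reset_index(drop=True)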

I keep running into memory problems, and have tried a few different solutions, outlined below. For reference, the raw transaction list contains ~500,000 transactions, covering ~130,000 users and ~30,000 IPs, and likely implies ~30,000,000 links.

  1. Join df to itself on IP, sort each pair, and remove duplicates (so that [X, Y] and [Y, X] don't both show up as unique pairs).

    import numpy as np
    import pandas as pd
    df_pairs = df.merge(df, on='ip', suffixes=('l', 'r'))  # self-join on shared IP
    df_sorted_pairs = [np.sort([df_pairs['userl'][i], df_pairs['userr'][i]]) for i in range(len(df_pairs))]
    edges = np.asarray(pd.DataFrame(df_sorted_pairs).drop_duplicates())
    

    This works pretty well, but gives me a Memory Error fairly quickly, as joining a table to itself grows very quickly. (A vectorized version of the sort-and-dedup step is sketched just after this list.)

  2. Create a matrix, where users are the rows, IPs are the columns, and matrix elements are 1 if that user transacted on the IP and 0 otherwise. Then X.dot(X.transpose()) is a square matrix whose elements (i,j) represent how many IPs were shared by user i and user j.

    user_list = df['user'].unique()
    ip_list = df['ip'].unique()
    df_x = pd.DataFrame(0, index=user_list, columns=ip_list)  # dense user x IP matrix of zeros
    for row in range(len(df)):
        # .at avoids chained indexing when setting a single cell
        df_x.at[df['user'].iloc[row], df['ip'].iloc[row]] = 1
    df_links = df_x.dot(df_x.transpose())
    

    This works extremely well as long as len(ip_list) stays small (roughly under 5,000). Beyond that it falls over: just creating the empty dataframe of, say, 500,000 rows x 200,000 columns gives a Memory Error.

  3. Brute force. Iterate across the users one by one. For each user, find the distinct IPs. For each IP, find the distinct users. Those resulting users are therefore linked to the user in the current iteration. Add each resulting [User1, User2] pair to a master list of links.

    user_list = df['user'].unique()
    ip_list = df['ip'].unique()
    links=[]
    for user in user_list:
        related_ip_list = df[df['user'] == user]['ip'].unique()
        for ip in related_ip_list:
            related_user_list = df[df['ip'] == ip]['user'].unique()
            for related_user in related_user_list:
                if related_user != user:
                    links.append([user, related_user])
    

    This works, but is extremely slow. It ran for 3 hours and finally gave me a Memory Error. Because links was being built up along the way, I could check how big it got - about 23,000,000 links.
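
As a side note on approach 1: the sort-and-dedup step can also be done with a vectorized filter instead of a per-row np.sort. This is just a sketch of that one step (it assumes the self-join on 'ip' has produced the 'userl'/'userr' columns as above, and it also drops self-pairs); it does not fix the memory blow-up of the join itself.

    # keep each unordered pair exactly once and drop self-pairs ([X, X])
    mask = df_pairs['userl'] < df_pairs['userr']
    edges = np.asarray(df_pairs.loc[mask, ['userl', 'userr']].drop_duplicates())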

Any advice would be appreciated. Have I simply gone too far into "Big Data" territory, where traditional methods like the above aren't going to cut it? I didn't think 500,000 transactions qualified as "Big Data", but I guess storing a 130,000 x 30,000 matrix or building a list with 30,000,000 elements is pretty large?


Solution

I think your problem is that a matrix representation is not going to cut it:

Note that, memory-wise, you are doing some very inefficient things. For example, you create a matrix full of zeros that all have to be allocated in RAM. It would be a lot more efficient to have no object in RAM at all for a connection that does not exist, rather than a zero float. You are "abusing" linear algebra to solve your problem, which is what makes you use so much RAM: the number of elements in your matrix is 130k * 30k = a gazillion (about 4 billion), but you "only" have 30m links that you actually care about.

I truly feel for you, because pandas was the first library I learned, and I used to try to solve almost every problem with it. I noticed over time, though, that the matrix approach is not optimal for a lot of problems.

There is a "spare matrix" somewhere in numpy, but let's not go there.

Let me suggest another approach:

Use a simple defaultdict:

from collections import defaultdict

# a dict that creates an empty set for any key that doesn't exist yet
shared_ips = defaultdict(set)

# for each IP, collect the set of users seen on it
# (unique_user_ip_pairs is your dataframe of unique [user, ip] pairs)
for _, row in unique_user_ip_pairs.iterrows():
    shared_ips[row['ip']].add(row['user'])

# filter the dict for IPs that have more than 1 user
shared_ips = {ip: users for ip, users in shared_ips.items() if len(users) > 1}
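
If iterating with iterrows turns out to be slow on ~500k pairs, the same dict can be built with a groupby instead; this is just an equivalent sketch, assuming the same unique_user_ip_pairs frame:

# group the unique pairs by IP and collect the users on each IP into a set
shared_ips = unique_user_ip_pairs.groupby('ip')['user'].apply(set).to_dict()
shared_ips = {ip: users for ip, users in shared_ips.items() if len(users) > 1}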

I'm not sure if this is 100% going to solve your use case, but note the efficiency:

This will at most double the RAM usage compared to your initial unique user-IP pairs object, but you will get the information about which IP was shared amongst which users.
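
From there, getting to the graph the question actually asks for is cheap: each IP's user set just needs to be expanded into pairwise edges. A minimal sketch, assuming the shared_ips dict built above and that you still want a networkx graph of users:

import itertools
import networkx as nx

G = nx.Graph()

# for every IP shared by more than one user, connect each pair of those users
for ip, users in shared_ips.items():
    G.add_edges_from(itertools.combinations(users, 2))

Since nx.Graph ignores duplicate edges, pairs of users who share several IPs are only stored once.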

The big lesson is this:

If most cells in a matrix represent the same piece of information (here: the absence of a link), don't reach for a matrix approach once you run into memory problems.

I've seen so many pandas solutions to problems that could have been solved with simple use of Python's built-in types like dict, set, frozenset and Counter. Especially people coming to Python from statistical toolboxes like MATLAB, R or Excel are very prone to this (they sure like their tables). I suggest trying not to make pandas the personal built-in library you reach for first...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow