Question

I have a very large dictionary with tuples as keys and counts as values. It is supposed to represent an adjacency matrix of word co-occurrence vectors, e.g. 'work' appears with 'experience' 16 times and 'work' appears with 'services' 25 times. Whether or not this is the preferred storage method is another issue (with the massive amount of data I have, nested dictionaries became a nightmare for traversal), but it's simply what I have right now.

Frequency = {
('work', 'experience'): 16, 
('work', 'services'): 25, 
('must', 'services'): 15, 
('data', 'services'): 10,     
...
...}

Thanks to a previous post, I've been able to do a simple binary adjacency matrix with NetworkX, simply by using this methodology:

A=Frequency.keys()
networkx.Graph(A)

That result was great then, but my question is: what do I have to do to convert Frequency into an adjacency matrix that uses each co-occurrence count as the value in the matrix, so that the result would look something along the lines of this:

array([[ 0.,  16.,  25.,  0.],
       [ 16.,  0.,  1.,  0.],
       [ 25.,  1.,  0.,  1.],
       [ 10.,  0.,  0.,  0.]
       ...)

I apologize if this is similar to previous posts, but I just can't find the correct way to convert these tuples to a matrix that I can use in NetworkX. I'm assuming I would use numpy, but I cannot find any documentation for a method like this.

Thanks in advance,

Ron

Solution

This answer may be of help. With your sample data:

>>> import numpy as np
>>> frequency = {('work', 'experience'): 16, 
...              ('work', 'services'): 25, 
...              ('must', 'services'): 15, 
...              ('data', 'services'): 10}
>>> keys = np.array(list(frequency.keys()))
>>> vals = np.array(list(frequency.values()))
>>> keys
array([['work', 'services'],
       ['must', 'services'],
       ['work', 'experience'],
       ['data', 'services']], 
      dtype='|S10')
>>> vals
array([25, 15, 16, 10])
>>> unq_keys, key_idx = np.unique(keys, return_inverse=True)
>>> key_idx = key_idx.reshape(-1, 2)
>>> unq_keys
array(['data', 'experience', 'must', 'services', 'work'], 
      dtype='|S10')
>>> key_idx
array([[4, 3],
       [2, 3],
       [4, 1],
       [0, 3]])
>>> n = len(unq_keys)
>>> adj = np.zeros((n, n), dtype=vals.dtype)
>>> adj[key_idx[:, 0], key_idx[:, 1]] = vals
>>> adj
array([[ 0,  0,  0, 10,  0],
       [ 0,  0,  0,  0,  0],
       [ 0,  0,  0, 15,  0],
       [ 0,  0,  0,  0,  0],
       [ 0, 16,  0, 25,  0]])
>>> adj += adj.T
>>> adj
array([[ 0,  0,  0, 10,  0],
       [ 0,  0,  0,  0, 16],
       [ 0,  0,  0, 15,  0],
       [10,  0, 15,  0, 25],
       [ 0, 16,  0, 25,  0]])
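
The session above was recorded on Python 2. On Python 3, `dict.keys()` and `dict.values()` return views, so they need to be wrapped in `list()` before being handed to `np.array`, and the dictionary keeps insertion order, so the intermediate `keys` and `vals` arrays will print in a different order. Below is a minimal sketch of the same steps as a reusable function. The name `cooccurrence_matrix` is hypothetical, not something from the answer, and the sketch assumes each word pair is stored in only one order and that no word is paired with itself.

```python
import numpy as np


def cooccurrence_matrix(frequency):
    """Return (labels, adj) for a dict keyed by (word, word) tuples.

    Hypothetical helper that wraps the steps of the session above.
    """
    # list() is needed on Python 3, where keys()/values() are views.
    keys = np.array(list(frequency.keys()))
    vals = np.array(list(frequency.values()))

    # Sorted unique words, plus the position of every word of every pair
    # in that sorted list (one row per pair after the reshape).
    labels, key_idx = np.unique(keys, return_inverse=True)
    key_idx = key_idx.reshape(-1, 2)

    n = len(labels)
    adj = np.zeros((n, n), dtype=vals.dtype)
    adj[key_idx[:, 0], key_idx[:, 1]] = vals

    # Mirror across the diagonal. Assumption: each pair appears in one
    # order only, otherwise the two counts would be summed here.
    adj = adj + adj.T
    return labels, adj


frequency = {('work', 'experience'): 16,
             ('work', 'services'): 25,
             ('must', 'services'): 15,
             ('data', 'services'): 10}

labels, adj = cooccurrence_matrix(frequency)
print(labels)
print(adj)
```

Returning `labels` alongside the matrix keeps track of which word each row and column stands for. If the goal is a weighted graph rather than the matrix itself, NetworkX's `Graph.add_weighted_edges_from` accepts `(u, v, weight)` triples directly, which may avoid building the dense array at all.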

OTHER TIPS

You could build a dictionary that maps each word in your tuples to an integer by parsing the keys of Frequency, then create a numpy array of shape n x n, where n is the total number of distinct words, and finally fill that array from your Frequency dict.
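
The tip above can be sketched roughly as follows. This is one possible reading of it, not the poster's own code: the names `words`, `index` and `matrix` are made up for illustration, the words are sorted only to get a reproducible row order, and the matrix is filled symmetrically on the assumption that co-occurrence is undirected.

```python
import numpy as np

frequency = {('work', 'experience'): 16,
             ('work', 'services'): 25,
             ('must', 'services'): 15,
             ('data', 'services'): 10}

# Map each distinct word to a row/column index.
words = sorted({word for pair in frequency for word in pair})
index = {word: i for i, word in enumerate(words)}

# n x n array of zeros, where n is the number of distinct words.
n = len(words)
matrix = np.zeros((n, n), dtype=int)

# Fill both (i, j) and (j, i) so the matrix is symmetric.
for (a, b), count in frequency.items():
    i, j = index[a], index[b]
    matrix[i, j] = count
    matrix[j, i] = count

print(words)
print(matrix)
```

The explicit loop is slower than fancy indexing on very large dictionaries, but it makes the word-to-index mapping visible and easy to reuse when looking up a particular pair, e.g. `matrix[index['work'], index['services']]`.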

Licensed under: CC-BY-SA with attribution