Frage

I'm using python to generate a graph from a file. When I run my code it uses around 7 GB of my RAM!! (the graph has 1,600,00 nodes)

The input file is something like this:

1 2
1 3
1 4
2 4

each row represents an edge. in this example we have 4 nodes and 4 edges.

Here is my code:

class Graph(object):

    def __init__(self):
        self.node_list = []
        self.edge = []
        self.neighbors = {}
        with open(infile, "r") as source_file:
            for row in csv.reader(source_file, delimiter='\t'):
                self.node_list.append(int(row[0]))
                self.node_list.append(int(row[1]))
                self.edge.append(row)
        for node in self.edge:
            if node[0] in self.neighbors:
                self.neighbors[node[0]].append(node[1])
            else:
                self.neighbors[node[0]] = [node[1]]
            if node[1] in self.neighbors:
                self.neighbors[node[1]].append(node[0])
            else:
               self.neighbors[node[1]] = [node[0]]
        self.node_list = list(set(self.node_list))

g = Graph()

Thanks in advance.

War es hilfreich?

Lösung

You're using the wrong data structures in the wrong way.

Your code creates a huge number of redundant entries in the node list. Using your example, the node list will be:

[1, 2, 1, 3, 1, 4, 2, 4]

which when handling 1'600'000 nodes is going to add considerably to your data requirements. Then you store a complete copy of the input file in self.edge which is never released.

You don't even need a class for what you are doing:

import collections

graph = collections.defaultdict(list)
with open(infile) as inf:
    for line in inf.read():
        p, q = line.split()
        if p not in graph[q]:
            graph[q].append(p)
        if q not in graph[p]:
            graph[p].append(q)

graph now contains a minimal representation of your input file. This is a rather old pattern which has worked well. You may find a package like NetworkX to be useful if you want to do anything with the graph once created.

Andere Tipps

IMHO self.node_list is useless once __init__ has run. You shouldn't bind it to self. Moreover I cannot understand why you need self.edge in such a format. You probably don't need it at all.

self.neighbors probably provides a good enough descritpion of your graph.

You could use this solution or its variations to improve memory usage. Give a try.

The idea is that you associate to each node an integer ordinal tag and you use that tag to access to the list containing the neighors lists.

import csv 
class Graph(object):
        def __init__(self,infile):
        tags = {} 
        self.neighbors = []
        with open(infile, "r") as source_file:
            for row in csv.reader(source_file, delimiter=' '):
            node1 = int(row[0]) 
            node2 = int(row[1]) 
            if node1 not in tags:
                tags[node1] = len(tags) 
                self.neighbors.append([])
            if node2 not in tags:
                tags[node2] = len(tags) 
                self.neighbors.append([])
            self.neighbors[tags[node1]].append(tags[node2])  
            self.neighbors[tags[node2]].append(tags[node1])  
    def getNodes(self):
        return xrange(len(self.neighbors))
    def getNeighs(self,i):
        return self.neighbors[i]
g = Graph('infile.txt')

print "MyNodes:\n ",
print [i for i in g.getNodes()] 
print 'myNeighs\n ', 
for i in g.getNodes(): 
    print g.getNeighs(i),

let me show a possible variant. Notice that in both cases the dictionary 'tags' can be garbage collected (i.e. deleted) after __init__ has run because nothing refers to it. You are left with lists only.

import csv 
class Graph2(object):
        def __init__(self,infile):
        tags = {} 
        self.inverseTags = [] 
        self.neighbors = []
        with open(infile, "r") as source_file:
            for row in csv.reader(source_file, delimiter=' '):
            node1 = int(row[0]) 
            node2 = int(row[1]) 
            if node1 not in tags:
                tags[node1] = len(tags) 
                self.neighbors.append([])
                self.inverseTags.append(node1)
            if node2 not in tags:
                tags[node2] = len(tags) 
                self.neighbors.append([])
                self.inverseTags.append(node2)
            self.neighbors[tags[node1]].append(tags[node2])  
            self.neighbors[tags[node2]].append(tags[node1])  
    def getNodes(self):
        return xrange(len(self.neighbors))
    def getNodesOriginalNames(self):
        return (self.inverseTags[i] for i in xrange(len(self.neighbors)))
    def getNeighs(self,i):
        return self.neighbors[i]
    def getNeighsOriginalName(self,i):
        return [self.inverseTags[j] for j in self.neighbors[i]]
g = Graph2('infile.txt')

print "MyNodes:\n ",
print [i for i in g.getNodesOriginalNames()] 
print 'myNeighs\n ', 
for i in g.getNodes(): 
    print g.getNeighsOriginalName(i),

Of course if in your csv file the node's names are already zero-based ordinals you can save all the translation work that __init__ is doing.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top