The efficiency of using a big data structure in a function in Python
10-10-2019
Question
I need to use a big data structure, more specifically, a big dictionary to do the looking up job.
At first, my code was like this:
#build the dictionary
blablabla
#look up some information in the dictionary
blablabla
As I need to look things up many times, I realized it would be a good idea to implement the lookup as a function, say lookup(info).
Then here comes the problem, how should I deal with the big dictionary?
Should I use lookup(info, dictionary) to pass it as an argument, or should I just initialize the dictionary in main() and use it as a global variable?
The first one seems more elegant, because I think maintaining a global variable is troublesome. On the other hand, I'm not sure about the efficiency of passing a big dictionary to a function. It will be called many times, and it would certainly be a nightmare if argument passing were inefficient.
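For concreteness, a minimal sketch of the two alternatives being asked about (the names and sample data here are illustrative, not from the original code):

```python
# Alternative 1: pass the dictionary explicitly as an argument
def lookup(info, dictionary):
    # dict.get returns None for missing keys instead of raising
    return dictionary.get(info)

# Alternative 2: rely on a module-level global
big_dict = {"SP": 11, "RJ": 21}  # placeholder data

def lookup_global(info):
    return big_dict.get(info)
```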
Thanks.
Edit1:
I just ran an experiment comparing the two approaches.
Here's a snippet of the code. lookup1 implements the argument-passing lookup, while lookup2 uses the global data structure big_dict.
class CityDict():
    def __init__(self):
        self.code_dict = get_code_dict()

    def get_city(self, city):
        try:
            return self.code_dict[city]
        except Exception:
            return None

def get_code_dict():
    # initiate code dictionary from file
    return code_dict
def lookup1(city, city_code_dict):
    try:
        return city_code_dict[city]
    except Exception:
        return None

def lookup2(city):
    try:
        return big_dict[city]
    except Exception:
        return None
import time
import random

t = time.time()
d = get_code_dict()
for i in range(0, 1000000):
    lookup1(random.randint(0, 10000), d)
print "lookup1 is %f" % (time.time() - t)

t = time.time()
big_dict = get_code_dict()
for i in range(0, 1000000):
    lookup2(random.randint(0, 1000))
print "lookup2 is %f" % (time.time() - t)

t = time.time()
cd = CityDict()
for i in range(0, 1000000):
    cd.get_city(str(i))
print "class is %f" % (time.time() - t)
This is the output:
lookup1 is 8.410885
lookup2 is 8.157661
class is 4.525721
So it seems that the two approaches are almost the same, and yes, the global-variable method is slightly more efficient.
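To isolate just the argument-passing cost, here is a minimal sketch (all names are illustrative, written in Python 3 style) showing that the per-call overhead does not grow with the size of the dictionary, because only a reference is passed, never a copy:

```python
import time

def lookup(key, table):
    # table is merely a reference to the caller's dict; no copy occurs
    return table.get(key)

def bench(table, n=200000):
    t = time.time()
    for i in range(n):
        lookup(i % 10, table)
    return time.time() - t

small = {i: i for i in range(10)}
big = {i: i for i in range(1000000)}

t_small = bench(small)
t_big = bench(big)
# The two timings come out roughly equal despite the 100000x size difference.
```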
Edit2:
Added the class version suggested by Amber and tested the efficiency again. From the results we can see that Amber is right: we should use the class version. (One caveat: the class-version loop generates its keys with str(i) rather than random.randint, so the three timings are not strictly comparable.)
Solution
Neither. Use a class, which is specifically designed for grouping functions (methods) with data (members):
class BigDictLookup(object):
    def __init__(self):
        self.bigdict = build_big_dict()  # or some other means of generating it

    def lookup(self):
        # do something with self.bigdict
        pass

def main():
    my_bigdict = BigDictLookup()
    # ...
    my_bigdict.lookup()
    # ...
    my_bigdict.lookup()
Other tips
To answer the core question: parameter passing is not inefficient; your values will not get copied around. Python passes references, which is not to say that the way parameters are passed fits the well-known schemes of "pass-by-value" or "pass-by-reference".
It's best imagined as initializing a local variable in the called function with a reference provided by the caller, where the reference itself is passed by value.
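A minimal sketch of this calling convention (names are illustrative): the parameter is a new local name bound to the very same object the caller passed, so no copy is made and mutations through the parameter are visible to the caller.

```python
def is_same(d, original):
    # identity check: the callee received the same object, not a copy
    return d is original

def add_entry(d):
    # mutating through the parameter...
    d["added"] = True

table = {"a": 1}
print(is_same(table, table))  # True
add_entry(table)
print(table["added"])         # True: ...is visible to the caller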
Still, the suggestion to use a class is probably a good idea.