Append to JSON in Python (Optimally due to RAM constraint)
Question
I'm trying to find the optimal way to append data to a JSON file using Python. I have about 100 threads open, each storing data in an array. When they are done, they write that array to a JSON file using json.dump. However, since the array can take a few hours to build up, I eventually run out of RAM. So I'm looking for the approach that uses the least RAM in this process. The following is what I have, which consumes too much RAM.
i = 0
twitter_data = {}
for null in range(0,1):
    while True:
        try:
            for friends in Cursor(api.followers_ids,screen_name=self.ip).items():
                twitter_data[i] = {}
                twitter_data[i]['fu'] = self.ip
                twitter_data[i]['su'] = friends
                i = i + 1
        except tweepy.TweepError, e:
            print "ERROR on " + str(self.ip) + " Reason: ", e
            with open('C:/Twitter/errors.txt', mode='a') as a_file:
                new_ii = "ERROR on " + str(self.ip) + " Reason: " + str(e) + "\n"
                a_file.write(new_ii)
        break
## Save data
with open('C:/Twitter/user_' + str(self.id) + '.json', mode='w') as f:
    json.dump(twitter_data, f, indent=2, encoding='utf-8')
Thanks
Solution
My take, building on the idea from Glenn's answer but serializing a big dict as requested by the OP, and using the more Pythonic enumerate instead of manually incrementing i (errors can be taken into account by keeping a separate count for them and subtracting it from i before writing to f):
with open('C:/Twitter/user_' + str(self.id) + '.json', mode='w') as f:
    f.write('{')
    for i, friends in enumerate(Cursor(api.followers_ids,screen_name=self.ip).items()):
        if i>0:
            f.write(", ")
        # JSON object keys must be strings, so the index is dumped as str(i)
        f.write("%s:%s" % (json.dumps(str(i)), json.dumps(dict(fu=self.ip, su=friends))))
    f.write("}")
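To check that output streamed this way is still valid JSON, the same pattern can be sketched as a self-contained function. This is only a sketch under assumptions: write_pairs_streaming and fake_followers are hypothetical names, and fake_followers stands in for the tweepy Cursor loop so the example runs without network access. Keys are written as strings because JSON object keys must be strings.

```python
import io
import json


def write_pairs_streaming(f, user, follower_ids):
    """Write {"0": {...}, "1": {...}, ...} to f one entry at a time.

    Only the current entry is held in memory; follower_ids can be any
    iterable, including a lazy generator.
    """
    f.write("{")
    for i, follower in enumerate(follower_ids):
        if i > 0:
            f.write(", ")
        # JSON object keys must be strings, hence str(i).
        f.write("%s: %s" % (json.dumps(str(i)),
                            json.dumps({"fu": user, "su": follower})))
    f.write("}")


def fake_followers(n):
    """Hypothetical stand-in for Cursor(...).items(): yields n fake ids."""
    for k in range(n):
        yield 1000 + k


# Stream into an in-memory buffer, then parse it back to confirm validity.
buf = io.StringIO()
write_pairs_streaming(buf, "some_user", fake_followers(3))
parsed = json.loads(buf.getvalue())
```

Swapping the buffer for a real file opened with mode='w' gives the same result on disk, and an empty iterable still produces a valid empty object.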
OTHER TIPS
Output the individual items as an array as they're created, creating the JSON formatting for the array around it manually. JSON is a simple format, so this is trivial to do.
Here's a simple example that prints out a JSON array, without having to hold the entire contents in memory; only a single element in the array needs to be stored at once.
import json
import sys

def get_item():
    return { "a": 5, "b": 10 }

def get_array():
    # Yield the array piece by piece; only one element exists at a time.
    yield "["
    for x in range(5):
        if x > 0:
            yield ","
        yield json.dumps(get_item())
    yield "]"

if __name__ == "__main__":
    for s in get_array():
        sys.stdout.write(s)
    sys.stdout.write("\n")
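The same generator idea works for writing to a file instead of stdout. The following is a rough sketch, not the answerer's code: iter_json_array and squares are hypothetical names, and the temporary file path is only there to make the example runnable. It relies on the fact that a file object's writelines accepts any iterable of strings, so the chunks are written as they are produced.

```python
import json
import os
import tempfile


def iter_json_array(items):
    """Yield the text of a JSON array chunk by chunk."""
    yield "["
    for n, item in enumerate(items):
        if n > 0:
            yield ","
        yield json.dumps(item)
    yield "]"


def squares(count):
    """Hypothetical lazy data source; items are created one at a time."""
    for x in range(count):
        yield {"x": x, "sq": x * x}


# Write the chunks straight to disk without building the full list.
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w") as f:
    f.writelines(iter_json_array(squares(5)))

# Read it back to confirm the file holds a valid JSON array.
with open(path) as f:
    loaded = json.load(f)
```

Reading the file back with json.load does pull the whole array into memory, so that step is only for verification here.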