Question

I have a JSON file, data_large, of size 150.1 MB. The content of the file is of the form [{"score": 68},{"score": 78}]. I need to find the set of unique scores across all the items.

This is what I'm doing:

import ijson  # since json file is large, hence making use of ijson

f = open('data_large')
content = ijson.items(f, 'item') # json loads quickly here as compared to when json.load(f) is used.
print set(i['score'] for i in content) #this line is actually taking a long time to get processed.

Can the print set(i['score'] for i in content) line be made more efficient? It currently takes 201 seconds to execute.


Solution

This will give you the set of unique score values (only) as ints. You'll need about 150 MB of free memory to hold the file contents. It uses re.finditer() to parse, which is about three times faster than the json parser (on my computer).

import re
import time

t = time.time()
obj = re.compile(r'{.*?: (\d*?)}')  # capture the digits in each {"score": NN} object
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(m.group(1) for m in obj.finditer(data))  # deduplicate the matched strings first
s = set(map(int, s))  # then convert the distinct values to ints
print time.time() - t
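As a small variation on the same scan (just a sketch, not something I have timed), the int() conversion can be folded into the generator so the matches are only walked once:

s = set(int(m.group(1)) for m in obj.finditer(data))

Whether that is actually faster depends on how many duplicates there are, since the two-step version above converts each distinct string to int only once.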

Using re.findall() also seems to be about three times faster than the json parser; it consumes about 260 MB:

import re

obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(obj.findall(data))  # note: the values in this set are strings, not ints
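If the roughly 260 MB footprint is a concern, the same regex scan can also be run over the file in fixed-size chunks, so the whole file never sits in memory at once. The sketch below is untested against your data; the 1 MB chunk size is an arbitrary choice, and the carry-over buffer is only there so an object split across a chunk boundary is not lost:

import re

pattern = re.compile(r'{.*?: (\d*?)}')
scores = set()
carry = ''
with open('datafile.txt', 'r') as f:
    while True:
        chunk = f.read(1 << 20)  # read about 1 MB at a time
        if not chunk:
            break
        buf = carry + chunk
        last_end = 0
        for m in pattern.finditer(buf):
            scores.add(int(m.group(1)))
            last_end = m.end()
        carry = buf[last_end:]  # keep any trailing partial object for the next pass
print len(scores)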

OTHER TIPS

I don't think there is any way to improve things by much. The slow part is probably just the fact that at some point you need to parse the whole JSON file. Whether you do it all up front (with json.load) or little by little (when consuming the generator from ijson.items), the whole file needs to be processed eventually.

The advantage to using ijson is that you only need to have a small amount of data in memory at any given time. This probably doesn't matter too much for a file with a hundred or so megabytes of data, but would be a very big deal if your data file grew to gigabytes or more. Of course, this may also depend on the hardware you're running on. If your code is going to run on an embedded system with limited RAM, limiting your memory use is much more important. On the other hand, if it is going to be running on a high-performance server or workstation with lots and lots of RAM available, there may not be any reason to hold back.

So, if you don't expect your data to get too big (relative to your system's RAM capacity), you might try testing to see if using json.load to read the whole file at the start, then getting the unique values with a set is faster. I don't think there are any other obvious shortcuts.

On my system, the straightforward code below handles 10,000,000 scores (139 megabytes) in 18 seconds. Is that too slow?

#!/usr/local/cpython-2.7/bin/python

from __future__ import print_function

import json  # read and parse the whole file at once, rather than streaming with ijson

with open('data_large', 'r') as file_:
    content = json.load(file_)
    print(set(element['score'] for element in content))
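If you do want to measure rather than guess, here is a rough sketch of timing the two approaches side by side. It assumes the file is named data_large as in the question and that the ijson package is installed; the helper names are mine, not anything from either library:

from __future__ import print_function

import json
import time

import ijson


def time_parser(label, parse):
    # Run one parser and report how long it took and how many unique scores it found.
    start = time.time()
    count = parse()
    print('%s: %d unique scores in %.1f seconds' % (label, count, time.time() - start))


def with_json_load():
    # Read and parse the whole file up front.
    with open('data_large', 'r') as f:
        return len(set(item['score'] for item in json.load(f)))


def with_ijson_items():
    # Stream the items one at a time, deduplicating as we go.
    with open('data_large', 'r') as f:
        return len(set(item['score'] for item in ijson.items(f, 'item')))


time_parser('json.load', with_json_load)
time_parser('ijson.items', with_ijson_items)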

Try using a set:

set([x['score'] for x in scores])

For example:

>>> scores = [{"score" : 78}, {"score": 65} , {"score" : 65}]
>>> set([x['score'] for x in scores])
set([65, 78])
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow