Question

I was playing around with PostgreSQL 9.3 and decided to try out the JSON type. I wrote a query that counts certain JSON data and groups it by user_id, but the query took very long to finish. Could somebody give me a hint on how to improve the performance?

But let me show you the test data and the relation first:

I downloaded some test data that contained one JSON document per line. Each document was composed of one array named "hashtags", one array named "urls", and one field named "user_id" (non-unique).

Test data:

{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}
...

Then I created a simple relation with an auto-incrementing primary key and a column of type JSON, like this:

CREATE TABLE tweets(id BIGSERIAL PRIMARY KEY, tweet JSON);

I loaded the data into PostgreSQL, where one JSON document now corresponds to one row:

id | tweet
---+---------------------------------------------------------------------------------
1  |{"hashtags": ["a", "b", "c"],"user_id": 210824047, "urls": ["http://www.google.com"]}
2  |{"hashtags": ["b"], "user_id": 78149438, "urls": ["http://www.facebook.com"]}
3  |{"hashtags": ["a", "c"], "user_id": 123791448, "urls": ["http://www.twitter.com","http://www.facebook.com"]}

Now I wanted to know how often a hashtag was used by a certain user_id. To do that I had to use a sub-query with the json_array_elements() function, which expands a JSON array into a set of rows, one row per array element.
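For illustration (a standalone example, not part of my data), the function on its own behaves like this:

    -- returns three rows: "a", "b" and "c"
    SELECT json_array_elements('["a", "b", "c"]'::json);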

The query looks like this:

    select foo.uid, foo.tag, count(foo.*) from 
        (select (tweet::json->>'user_id')::int as uid, 
        json_array_elements(tweet::json->'hashtags')::text as tag from tweets) 
    as foo group by foo.uid, foo.tag;
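With just the three sample rows above, the result looks roughly like this (row order may vary; the ::text cast keeps the JSON quotes around each tag):

   uid     | tag | count
-----------+-----+-------
 210824047 | "a" |     1
 210824047 | "b" |     1
 210824047 | "c" |     1
  78149438 | "b" |     1
 123791448 | "a" |     1
 123791448 | "c" |     1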

I got the result I wanted, but the query took very long to finish. For example, I loaded 200,000 JSON documents into PostgreSQL, which is slightly less than 25 MB, and it took around 10 minutes to get a result. I also tried various indexes, but the performance improved by only about half a minute at best.

Now, being a PostgreSQL beginner, I ask myself: is it really that slow, or did I do something wrong? If it is the latter, what can I do about it?

Your help will be much appreciated.

Solution

JSON is still pretty new to PostgreSQL and isn't terribly well optimized yet, but is improving rapidly.

When 9.4 is released later this year, it looks like there will be a huge improvement in this area (I get about 2 seconds for your query with 196,608 records).

Actually that improvement is also in 9.3.3, which should be out in a week.
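If you want to check which version you are running and measure the query server-side, here is a generic sketch (not specific to this fix):

    -- confirm the server version (the improvement needs 9.3.3+ or 9.4)
    SELECT version();

    -- time the query inside the server, independent of client rendering
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT foo.uid, foo.tag, count(foo.*)
    FROM (SELECT (tweet->>'user_id')::int AS uid,
                 json_array_elements(tweet->'hashtags')::text AS tag
          FROM tweets) AS foo
    GROUP BY foo.uid, foo.tag;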

Licensed under: CC-BY-SA with attribution