Question

In a Scala program I use JDBC to get data from a simple table with 20 rows in a SQL database (Hive). The table contains movie titles rated by users, with rows in the following format:

user_id, movie_title, rating, date.

I start with a first JDBC cursor that enumerates users. Next, with a second JDBC cursor, for every user I find the movie titles they rated. Then, with a third JDBC cursor, for every title the current user rated, I find the other users who also rated that title. The result is a group of users in which every member rated at least one title in common with the user who started the group. I need to get all such groups that exist in the dataset.

So, to group users by movie, I run 3 nested select queries. In pseudo-code:

1) select distinct user_id  
     2) for each user_id: 
         select distinct movie_title  //select all movies that user saw
            3) for each movie_title:
                select distinct user_id  //select all users who saw this movie

On a local table with 20 rows these nested queries take 26 minutes! The program returns the first user_id only after a minute!

Given that the real app will have to deal with 10^6 users, is there any way to optimize the 3 nested selects in this case?

Solution

Without seeing the exact code it is difficult to assess why it is taking so long. Given that you only have 20 rows, there must be something fundamentally wrong there.

However, as general advice, I'd suggest revisiting the solution and checking whether it can be run as a single SQL query instead of hundreds of queries. That would let you benefit from features like indexes and save a huge amount of network traffic.

Assuming you have the table Movies(user_id: NUMERIC, movie_title: VARCHAR(50), rating: NUMERIC, date: DATE), try running something along these lines (I haven't tested it, so it might need a bit of tweaking):

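-- Every ordered pair of distinct users who rated at least one common title
SELECT DISTINCT m1.user_id AS user_id, m2.user_id AS other_user_id
FROM Movies m1
JOIN Movies m2 ON m1.movie_title = m2.movie_title
WHERE m1.user_id != m2.user_id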

Once you've got the results, you can group them in your Java/Scala code by the first user_id and load them into a Multimap-like data structure.
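The grouping step could be sketched in Scala roughly as follows. This is only an illustration, not tested against Hive: it assumes user_id fits in a Long, that the query returns the two user ids as its first and second columns, and that the full result fits in memory. The names CoRaters, PairQuery, readPairs, groupByFirstUser and load are hypothetical, and a plain Map[Long, Set[Long]] stands in for the Multimap.

```scala
import java.sql.{Connection, ResultSet}

object CoRaters {
  // Assumed query text; table and column names follow the assumed Movies schema.
  val PairQuery: String =
    """SELECT DISTINCT m1.user_id AS user_id, m2.user_id AS other_user_id
      |FROM Movies m1
      |JOIN Movies m2 ON m1.movie_title = m2.movie_title
      |WHERE m1.user_id != m2.user_id""".stripMargin

  // Drain a ResultSet of (user, other user) rows into an in-memory list of pairs.
  def readPairs(rs: ResultSet): List[(Long, Long)] = {
    val buf = List.newBuilder[(Long, Long)]
    while (rs.next()) buf += ((rs.getLong(1), rs.getLong(2)))
    buf.result()
  }

  // Group the pairs by the first user id; duplicate pairs collapse in the Set.
  def groupByFirstUser(pairs: Iterable[(Long, Long)]): Map[Long, Set[Long]] =
    pairs.foldLeft(Map.empty[Long, Set[Long]]) { case (acc, (user, other)) =>
      acc.updated(user, acc.getOrElse(user, Set.empty[Long]) + other)
    }

  // Run the single query once and build the user -> co-raters map.
  def load(conn: Connection): Map[Long, Set[Long]] = {
    val stmt = conn.createStatement()
    try groupByFirstUser(readPairs(stmt.executeQuery(PairQuery)))
    finally stmt.close()
  }
}
```

With an open JDBC connection conn, CoRaters.load(conn) issues one query instead of one per user and per title. For 10^6 users the pair list may be too large to hold in memory, in which case aggregating on the database side would be worth considering.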

Licensed under: CC-BY-SA with attribution