Question

I am working on a way to find the most popular likes in my friend network. "Most popular in my friend network" means "liked by the largest number of my friends".

Suppose each friend has a unique id and a number of liked pages. Given an array of such friends, I want to find the likes liked by the most friends, and also which friends liked them. Essentially, I want to show something like "Your friends X, Y & Z like this."

My first solution is to use a map (to store the reverse mapping like -> set of friends) and a priority queue (to find the top N). Here is my algorithm (in C++/STL-style pseudocode):

map< LikeId, set<FriendId> > like2friendsMap;
for each friend f {
  for each like l of f {
    like2friendsMap[l].insert(f); // populate the reverse mapping
  }
}

priority_queue< pair<int, LikeId> > pq; // max-heap keyed on the count
for each (l, friends) in like2friendsMap {
  int count = friends.size(); // no. of friends who like l, i.e. its "popularity"
  pq.push(make_pair(count, l)); // count comes first so it acts as the priority
}

vector< pair<int, LikeId> > result;
for i in 1 to N { // N is how many popular items I want
  result.push_back(pq.top()); // element with the highest count (most popular like)
  pq.pop();
}

As the STL implements map as a red-black tree and priority_queue as a binary heap, this approach seems pretty fast to me. But if I have hundreds of friends, each with hundreds of likes, the memory usage would be huge. Of course, instead of storing whole objects, I should use friend ids and like ids for all calculations, which would reduce memory usage a lot.
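For concreteness, here is a minimal compilable sketch of the approach above, using plain integer ids and a small hard-coded input (both just for illustration):

#include <iostream>
#include <map>
#include <queue>
#include <set>
#include <utility>
#include <vector>

using FriendId = int;
using LikeId = int;

int main() {
    // Hypothetical toy input: each friend's id paired with the pages they like.
    std::vector<std::pair<FriendId, std::vector<LikeId>>> friends = {
        {1, {10, 20}}, {2, {10, 30}}, {3, {10, 20}},
    };

    // Reverse mapping: like id -> set of friends who like it.
    std::map<LikeId, std::set<FriendId>> like2friendsMap;
    for (const auto& f : friends)
        for (LikeId l : f.second)
            like2friendsMap[l].insert(f.first);

    // Max-heap keyed on the friend count (first member of the pair).
    std::priority_queue<std::pair<int, LikeId>> pq;
    for (const auto& entry : like2friendsMap)
        pq.push({static_cast<int>(entry.second.size()), entry.first});

    const int N = 2; // how many popular items we want
    for (int i = 0; i < N && !pq.empty(); ++i) {
        auto [count, like] = pq.top();
        pq.pop();
        std::cout << "like " << like << " is liked by " << count << " friend(s):";
        for (FriendId f : like2friendsMap[like]) std::cout << ' ' << f;
        std::cout << '\n';
    }
}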

What other algorithms or data structures could improve efficiency (increase speed, reduce memory)? For some reason, I can't store the list of friends against each like; it must be computed at run time. I am developing this in C++, so solutions using the STL or Boost would be even better.


Solution

create an integer array allPages, indexed by page id
initialize every entry with 0
for every friend f
{
    for every page p liked by f
    {
        allPages[p]++;
    }
}
take the maximum over all entries of allPages
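In C++, a minimal sketch of this counting pass could look like the following, assuming integer page ids and substituting an unordered_map for the plain array so the ids need not be dense:

#include <algorithm>
#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // Hypothetical input: each inner vector holds the page ids one friend likes.
    std::vector<std::vector<int>> likesPerFriend = {
        {10, 20}, {10, 30}, {10, 20},
    };

    // allPages: page id -> number of friends who like it.
    std::unordered_map<int, int> allPages;
    for (const auto& pages : likesPerFriend)
        for (int p : pages)
            ++allPages[p];

    // Find the page with the maximum count.
    auto best = std::max_element(
        allPages.begin(), allPages.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    if (best != allPages.end())
        std::cout << "page " << best->first << " liked by "
                  << best->second << " friends\n";
}

Keeping only a count per page, rather than a set of friends, is what brings the space down to O(P).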

If P is the number of pages, this has a space complexity of O(P).

Let F be the number of friends and L the average number of pages liked by each friend. Then the time complexity is O(F*L). So even if you traverse all the friends again to find who liked the winning page, that won't add much to the complexity:

O(F*L) + O(F) remains O(F*L) (the second pass is O(F) if each friend's likes allow a constant-time membership check).

I think it's better to iterate again rather than storing the friends.

Or you can store the reverse reference itself: for every page, store the list of friends who liked it. That won't take much extra space and will do the job with minimal complexity.
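A sketch of that reverse-reference variant, again with illustrative names and integer ids:

#include <iostream>
#include <unordered_map>
#include <vector>

int main() {
    // Hypothetical input: (friend id, liked page ids) pairs.
    std::vector<std::pair<int, std::vector<int>>> friends = {
        {1, {10, 20}}, {2, {10, 30}}, {3, {10, 20}},
    };

    // For every page, store the friends who liked it; the count is just size().
    std::unordered_map<int, std::vector<int>> page2friends;
    for (const auto& [id, pages] : friends)
        for (int p : pages)
            page2friends[p].push_back(id);

    // The most popular page is the one with the longest friend list.
    const std::vector<int>* bestFriends = nullptr;
    int bestPage = -1;
    for (const auto& [page, fs] : page2friends) {
        if (!bestFriends || fs.size() > bestFriends->size()) {
            bestFriends = &fs;
            bestPage = page;
        }
    }
    if (bestFriends) {
        std::cout << "page " << bestPage << " liked by friends:";
        for (int f : *bestFriends) std::cout << ' ' << f;
        std::cout << '\n';
    }
}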

OTHER TIPS

I fail to see why you're using a priority_queue. Sure, it efficiently keeps track of the maximum element while the container is modified, but you only need a one-shot maximum after the first step. A single linear scan over the map is enough:

auto max_friends = like2friendsMap.begin();
for (auto i = like2friendsMap.begin(); i != like2friendsMap.end(); ++i) {
  if (max_friends->second.size() < i->second.size())
    max_friends = i; // remember the like with the largest friend set
}

Of course this works just for N=1, but that's sufficient for the "your friends X, Y and Z like this" top pick.

Since you're interested in finding "the most popular likes", does this mean you only care about the top few, e.g. the top 5 or top 10? If so, one possible approach is to reorder things so that you iterate per like, compute N, the number of friends associated with that like, and then only do further processing on that like if it makes it into the running 'top X' list. The tricky part is computing N efficiently with such a loop structure (a naive implementation would scan every friend's likes once per like, which is ugly), but the benefit is that if N is small enough, you can drop all data related to that like from memory and do no further processing on it.

That is, if you have a 'top 10' list, you've already added 10 likes to it, and the current like's N is smaller than the smallest N in the list, you know that like is irrelevant. Basically, you trade some redundant looping for a drastically reduced memory footprint. Those loops could also be parallelized reasonably well, so the extra looping may not be so bad. It's hard to say whether this is more efficient for your particular use case without trying it, but it's worth exploring if a 'top 10'-style output meets your requirements.
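A common way to implement the running 'top X' list described above is a min-heap capped at N elements; here is a sketch, assuming the per-like counts can be produced one at a time:

#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

int main() {
    // Hypothetical per-like counts, e.g. produced one like at a time.
    std::vector<std::pair<int, int>> countAndLike = {
        {3, 10}, {1, 30}, {2, 20}, {5, 40}, {4, 50},
    };
    const std::size_t N = 3; // size of the 'top N' list

    // Min-heap of at most N elements; its top is the smallest count kept so far.
    std::priority_queue<std::pair<int, int>,
                        std::vector<std::pair<int, int>>,
                        std::greater<>> topN;
    for (const auto& cl : countAndLike) {
        if (topN.size() < N) {
            topN.push(cl);
        } else if (cl.first > topN.top().first) {
            topN.pop();    // evict the current smallest
            topN.push(cl); // anything with a smaller count is dropped immediately
        }
    }

    while (!topN.empty()) { // prints the top N in ascending order of count
        std::cout << "like " << topN.top().second
                  << " with count " << topN.top().first << '\n';
        topN.pop();
    }
}

The heap's top is always the weakest entry kept so far, so the "is this like irrelevant?" test from the paragraph above is a single comparison.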

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow