How do sites like LinkedIn efficiently display 1st/2nd/3rd-level relationship next to each person's name?

https://stackoverflow.com/questions/1556451

21-09-2019
|

Question

I recently botched a job interview by poorly answering a straightforward question: how do sites like LinkedIn efficiently show the relationship distance (1st/2nd/3rd) from you to every person displayed on a page (e.g. in people search results, list of people working in a company, etc.)?

<EDIT> I got the essential "trick" of the solution: finding "distance from me" is a common operation (e.g. 20x+ on a single page, 100's per login session), so you can do part of the "distance of me to X", cache it, and then re-use that cached partial result many times in order to make other operations much cheaper. I also guessed that the partial result was likely to be my second-level connections, because "cache all 3rd-level connections" would be too costly in RAM and CPU.</EDIT>

But when trying to convert this insight into a solution, I came up with a bumbling answer involving creating persistent caches of 2nd-level connections of everyone on the site (which would have been hugely epensive in perf and complex to maintain), and I took an inexplicable detour into using Bloom Filters in an way that made little technical sense. I wouldn't have hired myself after an answer like that!

Later, as I thought about the problem without the pressure of an interview hanging over my head, I came up a more reasonable answer.

Build a very fast way to get the first-level connections for each of batch of user IDs (batch size up to ~1000?). This probably means a dedicated cluster of lots-of-RAM servers which can cache the entire network's 1st-level connections in memory. Luckily, 50M members x avg. 100 connections per member x 4 bytes per member ID = <25GB to cache in RAM, which is doable with reasonably-priced hardware. And the number of changes per day is going to be under 1%, so keeping the cache up-to-date is not too hard. (Note that a relational database would probably be a bad choice to implement this cache because the "lots of random I/O" access pattern kills relational DB performance.)
when a user logs in, cache his 2nd-level connections by fetching 1st-level connections of every 1st-level connections, and stick in a hashtable (key = 2nd-level ID, value = array of 1st-level connections which connect you). Also cache your first-level connections too so you can pull back both 1st- and 2nd-level via a single call back to your remote cache server. User IDs are easily partitionable, so a distributed cache like memcached may work well for this.
for any user ID, to find whether it's in your "network" and what relationship it is to you (1st, 2nd, 3rd), do the following:
1. if the ID is in your first-level connections, stop.
2. try looking up the ID in your cached 2nd-level connections hashtable. If found, return the array of connections which connect you.
3. fetch the ID's first level connections, and repeat step #2 for each of them. Aggregate all results into a single array and return them.
4. <EDIT> refactor into a batch implementation ("look up distance from me to N different users") so you can get all the remote results from step #3 without having to make up to N remote calls.</EDIT>

But I'm sure there are better answers to this. What's yours? If you want extra challenge, try simulating an inteview situation (can't look up solutions on the Web).

Note that the question was about an optimal solution, regardless of how LinkedIn actually does it today, which I looked up after I wrote my own answer above.

Solution

You may be able to leverage axioms about small world networks to optimize this type of traversal.

Small world networks are characterized by "hubs" the represent very dense interconnections of other nodes. Most nodes in the network will generally either connect within a few hops to a topologically nearby node (1-4 hops away) or will route through one or more such hubs. This is one of the main reasons that small world networks behave the way they do.

OTHER TIPS

Interestingly, 1970's technology would do a fair job of modeling this. The Network Database Model efficiently manages this type of relationship.

It's not efficient in terms of ad hoc queries or data model maintenance, so fell out of favor with the rise of relational data models.

If you think about it, doing this in SQL could be very processor intensive.

Given that and the fact that it will ultimately be used all over the place, and that space is relatively cheap...I would suggest creating an index using Lucene (or Lucene.NET) depending on your language preference. You could do a couple things this way.

You can either create a tree type data structure and recursively crawl your index looking for all the parent nodes or child nodes and their parent or child nodes depending on your needs at the time.

Or you could write out all the relationships as they are created (the space is cheap concept). This would be a write once process (which you wouldn't be updating all that often any ways). When a relationship is created or revoked you would queue an update to your index (queue because you wouldn't want to open for write for single requests...batch the index updates). Then you could read this really flat structure to get the IDs in question.

With the IDs in hand (from which ever search type you perform) you can then go to the DB to get the surrounding required information. Then cache your output to further minimize what would be a very fast search, db query, data building...but faster still if it just comes from cache.

Use something like Velocity, MemCached, or MemCached Win32 for your centralized caching across a web farm.

I'm not sure of the table structure, or complexity of the system, but here is a simple SQL Server example using a recursive CTE:

DECLARE @People table (PersonID int, Name varchar(10))
DECLARE @Network table (PersonID int, NetworkedPersonID int)
INSERT INTO @People VALUES (1,'AAA')
INSERT INTO @People VALUES (2,'BBB')
INSERT INTO @People VALUES (3,'CCC')
INSERT INTO @People VALUES (4,'DDD')
INSERT INTO @People VALUES (5,'EEE')
INSERT INTO @People VALUES (6,'FFF')
INSERT INTO @People VALUES (7,'GGG')
INSERT INTO @People VALUES (8,'HHH')
INSERT INTO @Network VALUES (1,2)
INSERT INTO @Network VALUES (1,3)
INSERT INTO @Network VALUES (2,5)
INSERT INTO @Network VALUES (2,7)
INSERT INTO @Network VALUES (4,8)
INSERT INTO @Network VALUES (7,8)
INSERT INTO @Network VALUES (7,3)
INSERT INTO @Network VALUES (8,9)
DECLARE @TargetPersonID  int
SET @TargetPersonID=1

;WITH NetworkLevels AS
(   SELECT
        NetworkedPersonID,1 AS NetworkLevel
        FROM @Network
        WHERE PersonID=@TargetPersonID
    UNION ALL
    SELECT
        n.NetworkedPersonID, l.NetworkLevel+1
        FROM @Network                n
            INNER JOIN NetworkLevels l ON n.PersonID=l.NetworkedPersonID
    WHERE l.NetworkLevel<=2
)
SELECT * FROM NetworkLevels

OUTPUT:

NetworkedPersonID NetworkLevel
----------------- ------------
2                 1
3                 1
5                 2
7                 2
8                 3
3                 3

(6 row(s) affected)

To implement

DistanceCategory(A,B): { 1, 2, 3+}

Use fact that connections are bidirectional.

Store 1st level connections as sorted list in some KV sore:

Key: [UserFromId,UserToId].
Value: UserToId

Pseudocode:

DistanceCategory(A,B)
{
    if ( exists([A,B]) )
        return 1;
    if ( firstCommonElement(getAll([A,B]), getAll([A,B])) != null )
        return 2;
    return 3;
}

Complexity: O(C1+C2). C1,C2 - number of connection of both users.

Isn't linkedin data represented as a big giant graph? and when a person logins, the system would have handle to its node, and then by doing breadth first traversal for 3 levels, the system would keep these nodes as a set(along with which level info), and when a person appears on webpage, the system does a lookup on this node set and gives out the relationship distance..

This is my guess. Please feel free to point out, what makes it impractical.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow