How to determine likely connections in a social network?

https://cs.stackexchange.com/questions/261

16-10-2019
|

Pergunta

I am curious in determining an approach to tackling a "suggested friends" algorithm.

Facebook has a feature in which it will recommended individuals to you which it thinks you may be acquainted with. These users normally (excluding the edge cases in which a user specifically recommends a friend) have a highly similar network to oneself. That is, the number of friends in common are high. I assume Twitter follows a similar path for their "Who To Follow" mechanism.

Stephen Doyle (Igy), a Facebook employee suggested that the related newsfeed that uses EdgeRank formula which seems to indicate that more is to valued than friends such as appearance is similar posts. Another user suggested the Google Rank system.

Facebook states their News Feed Optimization as $\sum u_{e}w_{e}d_{e}$ where

$u_{e}$ = affinity score between viewing user and edge creator
$w_{e}$ = weight for this edge (create, comment, like, tag, etc)
$d_{e}$ = time decay factor based on how long ago the edge was created

Summing these items is supposed to give an object's rank which I assume as Igy hinted, means something in a similar format is used for suggested friends.

So I'm guessing that this is the way in which connections for all types are done in general via a rank system?

Solução

You can think of the social graph as a matrix $\mathbf{M}$. One approach to the problem is to first calculate $\mathbf{M}^2$, which will give all of the paths of length two between two actors in the social network. This can be seen as the weight of the connection between these friends of friends. The next step is to select the columns from the row of $\mathbf{M}^2$ corresponding to the person of interest to get the best candidates for new friends.

Outras dicas

What you're looking for is a heuristic. No algorithm can say, given a graph of friends as the only input, whether two individuals not directly connected are friends or aren't; the friendship/acquaintance relation isn't guaranteed to be transitive (we can assume symmetry, but that might even be a stretch in real life). Any good heuristic will therefore need to be based on an understanding of how people interact, rather than some mathematical understanding of the nature of graphs of relations (although we will need to quantify the heuristic in these terms).

Suggesting friends of friends with equal probability is a relatively cheap but inaccurate heuristic. For instance, my father has friends, but I wouldn't say I'm friends with any of them (although I'd probably say I'm a friend of my father's for the purposes of, e.g., a social network). Having a person at a relatively close distance doesn't necessarily make them a great candidate.

Suggesting people to whom you have a great many extended connections also seems like a poor choice in general, because this will tend to lead to exponential growth of friends of people who pull ahead early on (the seven degrees of separation from Kevin Bacon game is an example of this).

I suggest a circuit-based model. Assume that each link is a resistor of resistance $R$. Then the best candidate for a new friend might be the individual with the lowest equivalent resistance. Here's a poorly-executed ASCII graphics example:

  _____
 /     \
a---c   f
|   | /
b   d---e
| \ |
g   h   i

Say we want to find new friends for a. a's current friends are b, c, and f. We evaluate the net equivalent resistance between a and each of d, e, g, h, and i:

pair   resistance
(a,d)   6/7
(a,e)  13/7
(a,g)   7/4
(a,h)   1/1
(a,i)   inf

According to this heuristic, d is the best candidate friend, followed closely by h. g is the next best bet, followed closely by e. i can never be a candidate friend by this heuristic. Whether you find the results of this heuristic to be representative of real human social interactions is what's important. Computationally speaking, this would involve finding a subgraph containing all paths between two individuals (or, perhaps interestingly, some meaningfully selected truncation of this), then evaluating the equivalent resistance between the source and sink nodes.

EDIT: So what's my social motivation for this? Well, this might be a rough model of how hard it is to get in touch with, and subsequently communicate possibly significant amounts of information through, intermediaries (friends). In CS terms (rather than physics terms), this might be construed as bandwidth between two nodes in a graph. Extensions of this system would be to allow different kinds of links between people with different weights (resistance, bandwidth, etc.) and proceed as above.

There is a lot of work done on this problem as the popularity of social networking has taken off. The problem is typically termed "Link Prediction" and very good and comprehensive surveys can be found here and here. The methods range from the very simple (e.g. Jaccard similarity between nodes) to the very complex (e.g. constructing statistical models of the generative connection process). It depends a lot on the specific features you have available in your dataset (e.g. just network structure, node attributes?, edge attributes, ...), but these surveys will give you a good idea for where to start.

Disclaimer: I am wildly guessing here; I have not read any genre research.

You could look at how many connections to nodes share relatively to the number of conncetions a node has. This is a very naive (as local) idea, but here goes.

Every node $N$ (person or some other concept) has a set of connections $C_N$. Now, given two nodes $N_1$ and $N_2$, suggest $N_2$ to $N_1$ if

$\qquad \displaystyle \frac{|C_{N_1} \cap C_{N_2}|}{|C_{N_1}|} \geq \alpha$

for some reasonable $\alpha \in [0,1]$ (and the other way round).

Another idea is more global: determine a set of nodes similar to the one at hand and propose connections that many of them share. So, define the set of similar nodes

$\qquad \displaystyle S_N = \left\{M : \frac{|C_N \cap C_M|}{N} \geq \alpha\right\}$

and the set plausible suggestions by

$\qquad \displaystyle \left\{ S : \frac{\sum_{M \in S_N} [S \in M]}{|S_N|} \geq \beta \right\}$

again for reasonable $\alpha, \beta \in [0,1]$.

In reality, you would certainly want to weight connections individually; for instance, elements of $S_N$ that you are already connected with should have larger import than such that are far away from you.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange