Question

I have 3 main questions about the algorithms in intelligent web (web 2.0)

Here the book I'm reading http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665 and I want to learn the algorithms in deeper

1. People You may follow (Twitter)

How can one determine the nearest result to my requests ? Data mining? which algorithms?

2. How you’re connected feature (Linkedin)

Simply algorithm works like that. It draws the path between two nodes let say between Me and the other person is C. Me -> A, B -> A connections -> C . It is not any brute force algorithms or any other like graph algorithms :)

3. Similar to you (Twitter, Facebook) This algorithms is similar to 1. Does it simply work the max(count) friend in common (facebook) or the max(count) follower in Twitter? or any other algorithms they implement? I think the second part is true because running the loop

 dict{count, person}
 for person in contacts:
        dict.add(count(common(person)))
 return dict(max)

is a silly act in every refreshing page.

4. Did you mean (Google) I know that they may implement it with phonetic algorithm http://en.wikipedia.org/wiki/Phonetic_algorithm simply soundex http://en.wikipedia.org/wiki/Soundex and here is the Google VP of Engineering and CIO Douglas Merrill speak http://www.youtube.com/watch?v=syKY8CrHkck#t=22m03s

What about first 3 questions? Any ideas are welcome !

Thanks

Was it helpful?

Solution

People who you may follow

You can use the factors based calculations:

factorA = getFactorA(); // say double(0.3)
factorB = getFactorB(); // say double(0.6)
factorC = getFactorC(); // say double(0.8)

result = (factorA+factorB+factorC) / 3 // double(0.5666666666666667)
// if result is more than 0.5, you show this person

So say in the case of Twitter, "People who you may follow" can based on the following factors (User A is the user viewing this "People who you may follow" feature, there may be more or less factors):

  • Relativity between frequent keywords found in User A's and User B's tweets
  • Relativity between the profile description of both users
  • Relativity between the location of User A and B
  • Are people User A is following follows User B?

So where do they compare "People who you may follow" from? The list probably came from a combination of people with high amount of followers (they are probably celebrities, alpha geeks, famous products/services, etc.) and [people whom User A is following] is following.

Basically there's a certain level of data mining to be done here, reading the tweets and bios, calculations. This can be done on a daily or weekly cron job when the server load is least for the day (or maybe done 24/7 on a separate server).

How are you connected

This is probably a smart work here to make you feel that loads of brute force has been done to determine the path. However after some surface research, I find that this is simple:

Say you are User A; User B is your connection; and User C is a connection of User B.

In order for you to visit User C, you need to visit User B's profile first. By visiting User B's profile, the website already save the info indiciating that User A is at User B's profile. So when you visit User C from User B, the website immediately tells you that 'User A -> User B -> User C', ignoring all other possible paths.

This is the max level as at User C, User Acannot go on to look at his connections until User C is User A's connection.

Source: observing LinkedIN

Similar to you

It's the exact same thing as #1 (People you may follow), except that the algorithm reads in a different list of people. The list of people that the algorithm reads in is the people whom you follow.

Did you mean

Well you got it right there, except that Google probably used more than just soundex. There's language translation, word replacement, and many other algorithms used for the case of Google. I can't comment much on this because it will probably get very complex and I am not an expert to handle languages.

If we research a little more into Google's infrastructure, we can find that Google has servers dedicated to Spelling and Translation services. You can get more information on Google platform at http://en.wikipedia.org/wiki/Google_platform.

Conclusion

The key to highly intensified algorithms is caching. Once you cache the result, you don't have to load it every page. Google does it, Stack Overflow does it (on most of the pages with list of questions) and Twitter not surprisingly too!

Basically, algorithms are defined by developers. You may use others' algorithms, but ultimately, you can also create your own.

OTHER TIPS

People you may follow

Could be one of many types of recommendation algorithms, maybe collaborative filtering?

How you are connected

This is just a shortest path algorithm on the social graph. Assuming there is no weight to the connections, it will simply use breadth-first.

Similar to you

Simply a re-arrangement of the data set using the same algorithm as People you may follow.

Check out the book Programming Collective Intelligence for a good introduction to the type of algorithms that are used for People you may follow and Similar to you, it has great python code available too.

  1. People You may follow From Twitter blog - "suggestions are based on several factors, including people you follow and the people they follow" http://blog.twitter.com/2010/07/discovering-who-to-follow.html So if you follow A and B and they both follow C, then Twitter will suggest C to you...
  2. How you’re connected feature I think you have answered this one.
  3. Similar to you As above and as you say, although the results are probably cached - so its only done once per session or maybe even less frequently...

Hope that helps, Chris

I don't use twitter; but with that in mind:

1). On the surface, this isn't that difficult: For each person I follow, see who they follow. Then for each of the people they follow, see who they follow, etc. The deeper you go, of course, the more number crunching it takes.

You can take this a bit further, if you can also efficiently extract the reverse: For those I follow, who also follows them?

For both ways, what's unsaid is a way to weight the tweeters to see if they're someone I'd really want to follow: A liberal follower may also follow a conservative tweeter, but that doesn't mean I'd want follow the conservative (see #3).

2). Not sure, thinking about it...

3). Assuming the bio and tweets are the only thing to go on, the hard parts are:

  • Deciding what attributes should exist (political affiliation, topic types, etc.)
  • Cleaning each 140 characters to data-mine.

Once you have the right set of attributes, then two different algorithms come to mind:

  • K means clustering, to decide which attributes I tend to discriminate on.
  • N-Nearest neighbor, to find the N most similar tweeters to you given the attributes I tend to give weight to.
  • EDIT: Actually, a decision tree is probably a FAR better way to do all of this...

This is all speculative, but it sounds fun if one were getting paid to do this.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top