Pergunta

I want to write algorithm that will find recommended friends for given user. I was wondering if K-nearest neighbors algorithm will work fine for that task? Firstly, I'm not sure if I understood this algorithm correctly, so please help me verify that.

Lets assume I want to match people by interests and I have data that consits of userId and user's rating on few interests, for example:

{
  'userId' : 1,
  'ratings': [
    'sport' : 5.0,
    'art' : 2.5,
    'games': 3.5,
    'books': 2.0 
  ]
}

My first questions is, can I simply put it on 2 separate euclidean spaces like on the picutre, where red point is user that we want to suggest friends? Another pair, which is game and books will be just another euclidean space.

enter image description here

Now, lets say my k = 3, I'll find nearest neighbors in both cases and... now what exactly? Do I look for union of those 2 results? Also, is it ok to simply match two unrelated categories like that? Or maybe I should match opposite features together, like for example some hard left politcal view versus hard right political view?

Another questions regarding that, can I use KNN when I have data that is not based on any ratings, but for example:

{
  'username' : 'john',
  'education': 'harvard', 
  'residence': 'new york',
  'occupation': 'ux developer at XYZ',
  'friends' : ['user1', 'user2', 'user3', ...]
}

And I'd like to match users based on all of those features and choose lets say 10 best matches. Is that possible with KNN, or I should seek for different algorithm?

#Edit:

Let's simplify that for now, I'm firstly trying to make sure I understand the algorithm correctly so I made dirty implementation of that algorithm. So just to be clear, performance, clean code, etc are not the factor here, treat it as working pseudo code, just to verify if logic is done right.

Firstly, I made User as simple as that

@AllArgsConstructor
class User {
    final int id;
    final double books;
    final double sport;
    final double games;
}

Each double represents his intrest in given category (1 - not intrested at all, 5 - very intrested). Best match would be someone who gave exact same rates for each category.

What I need to do is to represent each user as point in 3D space, so just to make it a bit more clear, I created class for that

@AllArgsConstructor
class Point3D {
    final int id;
    final double x;
    final double y;
    final double z;
}

And the whole algorithm is calculating user distance to every other user and then returns k best results.

public class KNN {
    private final Integer k;
    private final Set<Point3D> sphere;

    public KNN(Integer k, Set<User> users) {
        this.k = k;
        this.sphere = users.stream().map(this::asPoint).collect(Collectors.toSet());
    }

    public Map<Integer, Double> runAlgorithm(User user) {
        val distances = sphere.stream()
                .collect(Collectors.toMap(point -> point.id, point -> distance(point, asPoint(user))));
        val sortedDistances = distances
                .entrySet().stream().sorted(Map.Entry.comparingByValue())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (k, v) -> k, LinkedHashMap::new));
        return sortedDistances.entrySet()
                .stream().limit(k).collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    private Point3D asPoint(User user) {
        return new Point3D(user.id, user.books, user.sport, user.games);
    }

    private double distance(Point3D pointA, Point3D pointB) {
        val x = square(pointB.x - pointA.x);
        val y = square(pointB.y - pointA.y);
        val z = square(pointB.z - pointA.z);
        return Math.sqrt(x + y + z);
    }

    private double square(double val) {
        return Math.pow(val, 2);
    }
}

From now, if I want to compare users by more criteria, lets say art and movies, I should just turn this Point3D into Point5D (and figure out how I calculate distance in 5 or more dimensions). Rest would stay pretty much the same?

#Edit2

Thanks for feedback, to wrap things up, I'd like to draw big picture how I imagine whole application could look like.

  1. I create set of randomly generated users, each one of them has 5 randomly rated categories out of 15 total existing categories.
  2. I ask user to manualy rank 5 random categories.
  3. I run algorithm for first time, then I suggest 5 best matches to the user and I ask him to pick one.
  4. Step 3 might be repeated few times, untill user has lets say 5 different friends.
  5. Now I'll start adding weight, for example if only one choosen friend likes books and 3 of them likes sport, then sport should be more important. So next calculation should take that into consideration.
  6. From now on, I should just keep on repeating step 5, which will add next friend and next calculations will take this friend interests into consideration. If I understand correctly, guesses should be better with every iteration. 7*. Also, in the background I can run another task that will generate more users every few seconds, I'm not sure how many users exactly I need, I'm about to use 1-10 ratings on each category.

Please correct me if my thinking is bad? Also, how exactly I should apply weights to algorithm I sent earlier?

Foi útil?

Solução

That is correct.

To extend this to more dimensions you just need to expand this function, after adding the extra properties:

private double distance(Point5D pointA, Point5D pointB) {
    val v = square(pointB.v - pointA.v);
    val w = square(pointB.w - pointA.w);
    val x = square(pointB.x - pointA.x);
    val y = square(pointB.y - pointA.y);
    val z = square(pointB.z - pointA.z);
    return Math.sqrt(v + w + x + y + z);
}

But you might want to change the way you structure this:

struct Point
{
    List<double> Vector;
}

private double distance(Point pointA, Point pointB) {
    if (pointA.Vector.Count != pointB.Vector.Count)
        throw new Exception("Can't calculate for different sized vectors");

    double squared = 0;
    for (int i = 0; i != pointA.Vector.Count; ++i)
        squared += (pointB.Vector[i] - pointA.Vector[i]);
    return Math.sqrt(squared);
}

Of course presuming that each attribute is collected and in order.

For Edit #2

To integrate weights...

private double distance(Point pointA, Point pointB, List<double> weightVector) {
    if (pointA.Vector.Count != pointB.Vector.Count
     || pointA.Vector.Count != weightVector.Count)
        throw new Exception("Can't calculate for different sized vectors");

    double squared = 0;
    for (int i = 0; i != pointA.Vector.Count; ++i)
        squared += (weightVector[i] * (pointB.Vector[i] - pointA.Vector[i]));
    return Math.sqrt(squared);
}

A value in the weightVector of 1 does nothing to change the distance. If its less than one distance is shrunk. If its greater than 1 it is expanded.

Licenciado em: CC-BY-SA com atribuição
scroll top