Question

I'm developing the backend for a dating app in which each user has:

  1. a profile of his/her characteristics

  2. a profile of his/her ideal match's characteristics

There are dozens of characteristics, such as gender, height, looks and so on. Some characteristics are strings, others are numbers or arrays. Each characteristic is assigned an importance factor ranging from 0 to 4, where 0 means not important at all and 4 means absolutely necessary.

So a user's match object looks like this:

    {      
      {
         gender: 'female',
         importance: 4
      }
      {
        eyeColor: ['blue', 'green'],
        importance: 2   
      } ,
      {
       ethnicity: [],
       importance: 0
      }
      heightMin: 150,
      heightMax: 200, 
      heightImportance: 3,
      ....    
    }

The data is stored in MongoDB and the backend runs on Node.js.

I'm new to data science. I just know that there are formulas for measuring similarity or distance between vectors, such as Euclidean distance or cosine similarity. But I'm not sure which method (if any) is the most relevant in these circumstances.

I'd appreciate any hints.

Solution

Identify the different kinds of characteristics

Your sample data illustrates very well that different kinds of characteristics need to be handled in different ways:

  • Height is a scalar attribute: a profile has one numeric value, but the ideal always looks for a range.
  • Ethnicity is a unique attribute: a profile has only one, but the ideal may identify several alternatives.
  • Eyes could be a multiple-value attribute: although most of us have only one color in our profile, some people have several. And the ideal can list several colors with the intent of finding any one of them. For example, if the ideal is "green,blue" it should be understood as "green OR blue". A profile having both should match, but a profile having only blue should match as well.
  • Hobbies (not in your example) would be an option attribute: a profile could have several, and the ideal would have several. Then, the more hobbies match, the higher the affinity.
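To make the four categories concrete, here is a minimal sketch with one matcher per kind of characteristic. The function names and the convention of returning a score between 0 and 1 are assumptions made for illustration; they are not part of your schema.

```javascript
// Hypothetical matchers, one per kind of characteristic.
// Each returns a score in [0, 1].

// Scalar: the profile has one number, the ideal gives a [min, max] range.
function matchScalar(value, min, max) {
  return value >= min && value <= max ? 1 : 0;
}

// Unique: the profile has one value, the ideal lists acceptable alternatives.
function matchUnique(value, accepted) {
  return accepted.includes(value) ? 1 : 0;
}

// Multiple-value: the profile may hold several values; one hit is enough (OR).
function matchAny(values, accepted) {
  return values.some((v) => accepted.includes(v)) ? 1 : 0;
}

// Option: the more of the ideal's options the profile has, the higher the score.
function matchOptions(values, wanted) {
  if (wanted.length === 0) return 1; // assumption: nothing requested, nothing to miss
  const hits = wanted.filter((w) => values.includes(w)).length;
  return hits / wanted.length;
}

console.log(matchScalar(175, 150, 200));            // 1
console.log(matchAny(['blue'], ['blue', 'green'])); // 1
console.log(matchOptions(['chess', 'hiking'],
  ['hiking', 'chess', 'cooking', 'yoga']));         // 0.5
```

Keeping every matcher on the same 0 to 1 scale is a design choice that makes the later aggregation step independent of the attribute type.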

Define a scoring function

Once all the characteristics are properly categorized in this way, you are ready to build a general scoring function that:

  • Scores each pair of characteristics: this can be as simple as 1 (match) and 0 (no match). It can be more subtle and show that a match is more or less strong, with 1.0 (all options are there), 0.8 (4 out of 5 options are there) ... 0 (no match). It could also be a more elaborate calculation with thresholds, ceilings, etc.
  • Aggregates the global score of a profile: here, you need to experiment in order to find a meaningful aggregation. For example, should 2 matching characteristics of importance 1 outweigh a match of importance 2? Another example: shouldn't a missing match on a characteristic of importance 3 reduce the score?
  • Eliminates unacceptable results: importance 4 is absolutely necessary, so a no-match on such a criterion shall result in a global score of 0, whatever the results on the other criteria are.
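One possible aggregation, sketched below, is an importance-weighted average with a hard veto for importance 4. The `aggregate` function and the `{ score, importance }` result shape are hypothetical, and the weighted average is only one of many aggregations you could experiment with.

```javascript
// Hypothetical aggregation: importance-weighted average with a hard veto.
// `results` is an array of { score, importance } with score in [0, 1]
// and importance in 0..4.
const MANDATORY = 4;

function aggregate(results) {
  let weighted = 0;
  let totalWeight = 0;
  for (const { score, importance } of results) {
    // Importance 4 is absolutely necessary: a miss zeroes the global score.
    if (importance === MANDATORY && score === 0) return 0;
    weighted += score * importance; // importance 0 contributes nothing
    totalWeight += importance;
  }
  // Assumption: with no weighted criteria there is nothing to rank on.
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}

console.log(aggregate([
  { score: 1, importance: 4 },
  { score: 0, importance: 2 },
  { score: 1, importance: 3 },
])); // 7 / 9, roughly 0.78
```

With this particular choice, two matches of importance 1 weigh exactly as much as one match of importance 2, which is precisely the kind of behavior to tune by experiment.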

Improve performance

You then have to complement your scoring with:

  • a preselection logic that uses at least some of the ideal's criteria to select a subset of relevant records: this avoids calculating the matching score for every profile in your database.
  • a filter to eliminate scores that are too low, especially if there are many matches.
  • a final sort to present the most successful profiles first.
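These three steps could be sketched as follows. The shape of the `ideal` object, the field names `gender` and `height`, and both function names are assumptions for illustration. The returned filter uses the standard MongoDB query operators `$in`, `$gte` and `$lte` and could be handed to a `find()` call, but no database access is shown here.

```javascript
// Hypothetical ideal shape:
// { gender: { accepted: [...], importance }, height: { min, max, importance } }

// Preselection: turn only the mandatory (importance 4) criteria into a filter.
function buildPreselectionFilter(ideal) {
  const filter = {};
  if (ideal.gender && ideal.gender.importance === 4) {
    filter.gender = { $in: ideal.gender.accepted };
  }
  if (ideal.height && ideal.height.importance === 4) {
    filter.height = { $gte: ideal.height.min, $lte: ideal.height.max };
  }
  return filter;
}

// Score the preselected candidates, drop the weak ones, best first.
function rank(candidates, scoreFn, minScore) {
  return candidates
    .map((profile) => ({ profile, score: scoreFn(profile) }))
    .filter((entry) => entry.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

const ideal = {
  gender: { accepted: ['female'], importance: 4 },
  height: { min: 150, max: 200, importance: 3 },
};
console.log(JSON.stringify(buildPreselectionFilter(ideal)));
// {"gender":{"$in":["female"]}}
```

Only the mandatory criteria go into the filter here because they can never be traded off against other criteria; softer criteria are left to the scoring step.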

Future improvements

You could think of the following, but at a later stage:

  • Should the score be unidirectional only? Think a moment: the nice young lady will get her profile matched by an awful lot of old men, and after a series of uninteresting solicitations she'll leave the site. What if you somehow combined score(ideal 1, profile 2) with score(ideal 2, profile 1)?
  • String values compare very inefficiently. So in the end you may think of a different encoding scheme that can be processed more quickly (you spoke of vectors). But this is the cherry on the cake. Start simple.
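Both ideas can be sketched briefly. For the bidirectional score, a geometric mean is one possible combination (an assumption, not the only option): it drops to 0 as soon as either side scores 0. For the encoding, a bitmask over a fixed vocabulary is one way to replace string comparisons with a single integer operation. `EYE_COLORS`, `mutualScore` and `encode` are hypothetical names.

```javascript
// scoreAB = score(ideal of A, profile of B); scoreBA is the reverse direction.
// Geometric mean: a one-sided match cannot produce a high mutual score.
function mutualScore(scoreAB, scoreBA) {
  return Math.sqrt(scoreAB * scoreBA);
}

// Hypothetical fixed vocabulary; each value gets one bit.
// Bitwise operators work on 32-bit integers, so keep it under 32 entries.
const EYE_COLORS = ['blue', 'green', 'brown', 'grey'];

function encode(values, vocabulary) {
  let mask = 0;
  for (const v of values) {
    const bit = vocabulary.indexOf(v);
    if (bit >= 0) mask |= 1 << bit; // unknown values are ignored
  }
  return mask;
}

// "green OR blue" against a profile with blue eyes: any shared bit is a match.
const idealMask = encode(['green', 'blue'], EYE_COLORS);
const profileMask = encode(['blue'], EYE_COLORS);
console.log((idealMask & profileMask) !== 0); // true
console.log(mutualScore(0.25, 1));            // 0.5
```

The masks would be computed once when a profile is saved, so that matching itself never touches the strings.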

Other tips

Unfortunately, knowing the type alone isn't enough to perform fuzzy matching. For instance, if you wanted to select people by height, what is the difference between a search for height 5'10" with importance 4 and the same search with importance 1? You could try to apply a formula, such as accepting any height within plus or minus (5 - importance) x 2 inches of the target.

But then how do you apply this formula to eye color or hair color? You can't, of course. Each attribute must have its own matching system that fits that attribute.
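As a worked illustration, the height formula above gives a tolerance of 2 inches at importance 4 and 8 inches at importance 1, while eye color needs a completely different test. The `matchers` registry and both function names are hypothetical.

```javascript
// Height: tolerance of (5 - importance) * 2 inches around the wanted height.
function heightMatches(heightIn, wantedIn, importance) {
  const tolerance = (5 - importance) * 2;
  return Math.abs(heightIn - wantedIn) <= tolerance;
}

// Eye color: no notion of distance, only membership in the accepted list.
function eyeColorMatches(color, accepted) {
  return accepted.includes(color);
}

// One matcher per attribute, looked up by attribute name.
const matchers = {
  height: heightMatches,
  eyeColor: eyeColorMatches,
};

const wanted = 70; // 5'10" in inches
console.log(matchers.height(75, wanted, 4)); // false: 5 in off, tolerance 2 in
console.log(matchers.height(75, wanted, 1)); // true: 5 in off, tolerance 8 in
console.log(matchers.eyeColor('blue', ['blue', 'green'])); // true
```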

Some tips, though: ideally you want to filter out as many profiles as possible, as early as possible. If you apply filters in order of importance, you will most likely eliminate a good many people before you reach the attributes that aren't likely to filter much. However, this is not necessarily true! For example, if a woman is looking for a man with the highest importance and (heaven forbid) most people in your database are male, that filter removes few people despite its high importance.

So you should take this into consideration, as it would reduce search times tremendously if done correctly. It would therefore almost certainly be worth your while to keep statistical information on all your clients, since it will allow you to organize the most efficient searches.
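A minimal sketch of that idea, assuming you maintain for each attribute the fraction of profiles holding each value: estimate how many profiles survive each criterion and apply the most selective one first. The `stats` numbers below are invented for the example, and `passRate` and `orderBySelectivity` are hypothetical names.

```javascript
// Hypothetical statistics: fraction of profiles holding each value.
const stats = {
  gender: { male: 0.8, female: 0.2 },
  eyeColor: { blue: 0.1, green: 0.05, brown: 0.85 },
};

// Estimated fraction of profiles that pass a criterion (lower = more selective).
function passRate(attribute, accepted, distribution) {
  const dist = distribution[attribute] || {};
  return accepted.reduce((sum, value) => sum + (dist[value] || 0), 0);
}

// Most selective criterion first; the input array is not modified.
function orderBySelectivity(criteria, distribution) {
  return [...criteria].sort(
    (a, b) =>
      passRate(a.attribute, a.accepted, distribution) -
      passRate(b.attribute, b.accepted, distribution)
  );
}

const criteria = [
  { attribute: 'gender', accepted: ['male'], importance: 4 },
  { attribute: 'eyeColor', accepted: ['blue', 'green'], importance: 2 },
];
console.log(orderBySelectivity(criteria, stats).map((c) => c.attribute));
// [ 'eyeColor', 'gender' ]
```

In this made-up database the eye color filter is applied first even though gender has the higher importance, because it eliminates far more profiles.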

Licensed under: CC-BY-SA with attribution