Pergunta

I'm new to machine learning, but I have an interesting problem. I have a large sample of people and visited sites. Some people have indicated gender, age, and other parameters. Now I want to restore these parameters to each user.

Which way do I look for? Which algorithm is suitable to solve this problem? I'm familiar with Neural Networks (supervised learning), but it seems they don't fit.

Foi útil?

Solução

I had almost the same problem: 'restoring' age, gender, location for social network users. But I used users' ego-networks, not visited sites statistics. And I faced with two almost independent tasks:

  1. 'Restoring' or 'predicting' data. You can use a bunch of different technics to complete this task, but my vote is for simplest ones (KISS, yes). E.g., in my case, for age prediction, mean of ego-network users' ages gave satisfactory results (for about 70% of users error was less than +/-3 years, in my case it was enough). It's just an idea, but you can try to use for age prediction weighted average, defining weight as similarity measure between visited sites sets of current user and others.
  2. Evaluating prediction quality. Algorithm from task-1 will produce prediction almost in all cases. And second task is to determine, if prediction is reliable. E.g., in case of ego network and age prediction: can we trust in prediction, if a user has only one 'friend' in his ego network? This task is more about machine-learning: it's a binary classification problem. You need to compose features set, form training and test samples from your data with both right and wrong predictions. Creating appropriate classifier will help you to filter out unpredictable users. But you need to determine, what are your features set. I used a number of network metrics, and summary statistics on feature of interest distribution among ego-network.

This approach wouldn't populate all the gaps, but only predictable ones.

Outras dicas

There exist many possibilities for populating empty gaps on data.

  • Most repeated value: Fill the gaps with the most common value.
  • Create a distribution: Make the histogram and drop values according to that distribution.
  • Create a new label: Since you do not have information, do not assume any value and create another label/category to indicate that value is empty.
  • Create a classifier: Make a relation among the variable with empty gaps and the rest of the data and create a simple classifier. With this, populate the rest of the data.

There exist many others, but these are the most common strategies. My suggestion is not to populate and to keep unknown what is unknown.

Although adesantos has already given a good answer, I would like to add a little background information.

The name for the problem you are looking at is "imputation". As adesantos already said, one of the possibilities is to fit a distribution. For example, you could fit a multivariate Gaussian to the data. You will get the mean only from the samples you know and you calculate the covariances only from the samples you know. You can then use standard MVG results to impute the missing data linearly.

This is probably the simplest probabilistic method of imputation and it is already quite involved. If you are a neural networks, a recently proposed method that can do so are deep latent gaussian models by Rezende et al. However, understand the method will require a lot of neural net knowledge, quite some variational Bayes knowledge about Markov chains.

Another method, which I have hear to work well is to train a generative stochastic network (Bengio et al). This is done by training a denoising auto encoder on the data you have (neglecting missing values in the reconstruction loss). Say you have a reconstruction function f and a input x. Then you will reconstruct it via x' = f(x). You then reset the values of x' with those you know from x. (I.e. you only keep the values that were missing before reconstruction.) If you do so many times, you are guaranteed to sample from the distribution given the values you know.

But in either case, these methods require quite some knowledge about statistics and neural nets.

Licenciado em: CC-BY-SA com atribuição
scroll top