Question

I have a dataset of social media posts and want to predict the number of "thumbs up" (likes) each post will receive over time.

+---------+----------------+-----------+----------------+-----+-------+
| Post_id | Timestamp      | Follows   | Comments_count | ... | Likes |
+---------+----------------+-----------+----------------+-----+-------+
| 01      | 12-04-16 14:00 | 34        | 4              |     | 23    |
+---------+----------------+-----------+----------------+-----+-------+
| 01      | 12-04-16 14:35 | 35        | 7              |     | 34    |
+---------+----------------+-----------+----------------+-----+-------+
|         | ...            |           |                |     |       |
+---------+----------------+-----------+----------------+-----+-------+
| 02      | 12-04-16 14:02 | 134       | 5              |     | 36    |
+---------+----------------+-----------+----------------+-----+-------+
| 02      | 12-04-16 14:45 | 136       | 23             |     | 123   |
+---------+----------------+-----------+----------------+-----+-------+

The number of likes over time looks like f(x) = sqrt(x).

My approach is to create a multivariable polynomial regression for each post and somehow ensemble/average them.

Is this a good approach? Which ensemble technique is appropriate?


Solution

Overall classification performance is generally better when the decision rules of the component classifiers differ and provide complementary information.

So the question becomes: can you set up your component classifiers so that their decision rules differ and complement one another across the feature space? For example, does Post 1 have a significantly different feature space than Post 2? If so, the ensemble approach should be beneficial.

Which technique? If you can train each classifier thoroughly and make it an expert in a different region of the feature space, try one of these models:

  • mixture model
  • mixture distribution
  • gating subsystem
  • winner-take-all
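
As a rough illustration of the gating and winner-take-all ideas, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the two experts, their scale values, the follower-count "centers", and the softmax gate with its temperature are hypothetical and not taken from the question's data. Each expert predicts likes as scale * sqrt(minutes), following the sqrt shape mentioned in the question.

```python
import math

# Hypothetical experts: each predicts likes after `minutes` as scale * sqrt(minutes).
# The centers (typical follower counts) and scales are made-up illustration values.
EXPERTS = [
    {"center": 35.0, "scale": 4.0},    # expert for small accounts
    {"center": 135.0, "scale": 15.0},  # expert for larger accounts
]


def expert_predict(expert, minutes):
    """Prediction of a single expert for a post that is `minutes` old."""
    return expert["scale"] * math.sqrt(minutes)


def gate_weights(followers, temperature=25.0):
    """Gating subsystem: softmax over negative distance to each expert's center."""
    scores = [-abs(followers - e["center"]) / temperature for e in EXPERTS]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def mixture_predict(followers, minutes):
    """Mixture: weighted average of all experts, weights from the gate."""
    weights = gate_weights(followers)
    return sum(w * expert_predict(e, minutes) for w, e in zip(weights, EXPERTS))


def winner_take_all_predict(followers, minutes):
    """Winner-take-all: use only the expert with the largest gate weight."""
    weights = gate_weights(followers)
    best = max(range(len(EXPERTS)), key=lambda i: weights[i])
    return expert_predict(EXPERTS[best], minutes)
```

In practice the experts would be the per-post (or per-cluster) regressions and the gate would itself be learned from data; the fixed values here only show how the pieces fit together.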

OTHER TIPS

You could pick a few fixed time windows, measured from the moment of posting, and regress on the like count at each of them. Since the outcome is count data, the obvious choice is to model it as a Poisson counting process. Several models can do this natively: some generalized linear models, and also neural networks trained with a suitable loss function (for example, the Poisson negative log-likelihood).

Another option is to model the fraction of followers who will like the post. This is likely easier to generalize; however, a small portion of posts go 'viral' beyond their own followers. Those will be tricky to model in any case, and you could clip such instances to a fraction of 1.
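
To make the Poisson suggestion concrete, here is a minimal sketch of a Poisson regression with a log link, fitted by Newton's method in plain Python. It assumes a single covariate, the log of minutes since posting, so the model is log(mu) = a + b * log(t), which is a power law mu = exp(a) * t^b. The TIMES and LIKES values are synthetic toy data generated from 4 * sqrt(t), not values from the question's table, and the function names are hypothetical. A small helper for the clipped fraction-of-followers target is included as well.

```python
import math

# Synthetic toy data (an assumption for illustration): likes ~ 4 * sqrt(minutes).
TIMES = [1, 5, 10, 20, 40, 80, 160, 320]      # minutes since posting
LIKES = [4, 9, 13, 18, 25, 36, 51, 72]        # observed like counts


def fit_poisson_loglinear(x, y, iters=50):
    """Fit log(mu) = a + b*x by damped Newton steps on the Poisson log-likelihood.

    Assumes the x values are not all identical (otherwise the 2x2 system is singular).
    """
    a, b = math.log(sum(y) / len(y)), 0.0

    def loglik(a_, b_):
        # Poisson log-likelihood up to a constant that does not depend on a, b
        return sum(yi * (a_ + b_ * xi) - math.exp(a_ + b_ * xi)
                   for xi, yi in zip(x, y))

    for _ in range(iters):
        mu = [math.exp(a + b * xi) for xi in x]
        # Gradient of the log-likelihood
        ga = sum(yi - mi for yi, mi in zip(y, mu))
        gb = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix
        haa = sum(mu)
        hab = sum(mi * xi for xi, mi in zip(x, mu))
        hbb = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = haa * hbb - hab * hab
        # Newton direction: solve the 2x2 system by hand
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        # Halve the step if it would decrease the log-likelihood
        step, base = 1.0, loglik(a, b)
        while loglik(a + step * da, b + step * db) < base and step > 1e-8:
            step /= 2
        a, b = a + step * da, b + step * db
        if abs(da) + abs(db) < 1e-10:
            break
    return a, b


def predict_likes(a, b, minutes):
    """Expected like count at `minutes` since posting under the fitted model."""
    return math.exp(a + b * math.log(minutes))


def like_fraction(likes, followers):
    """Fraction of followers who liked the post, clipped to 1 for 'viral' posts."""
    if followers <= 0:
        raise ValueError("followers must be positive")
    return min(likes / followers, 1.0)


a, b = fit_poisson_loglinear([math.log(t) for t in TIMES], LIKES)
```

Because the toy counts were generated from 4 * sqrt(t), the fitted exponent b should land near 0.5 and exp(a) near 4. With real data you would add further covariates (followers, comment count) and would more likely use a library GLM than hand-written Newton steps.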

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange