Binary classification problem

https://datascience.stackexchange.com/questions/16268

16-10-2019
|

Question

Im new to ML, I have a data set for Music sales info for Vinyls, the data set contains:

Author
Album Title
Genre
Country
RevenueGenerated
AverageRevenueGenerated

My goal is to create a Model which I can help me understand which Music may generate a lot of revenue (Boolean). I created a field AverageRevenueGenerated which is the average of all Revenue generated for all artists. Im looking for a tool that can help me associate or generate insights based on input signals above. This cold be automatically or a specific guide that allows me to say for example if:

UK + Industrial
IT + Opera
Daft Punk + Electro

Will be potential high revenues.

I found house prices example: https://yalantis.com/blog/predictive-algorithm-for-house-price/ is it the same type of problem? I'm looking which input signals may be the highest revenue. Any insights or pointers will be helpful.

Solution

Yes, you can run a regression using the features you have, and predict the revenue but you will need more features to run an effective analysis. Maybe adding features for the genre and wether the music is trending or not. To turn it into a binary classification problem you will have to decide on a cut-off and label everything above that as 1 and below it as 0, the median might be a good cut-off but solely depends on your data. This entire problem will boil down to feature engineering, which in my view is one of the hardest thing to do. Also you will need a lot of data for this. You can create features wether the song was featuted on the billboard. See, try to think of it like a human and add such features to your model. Like as a person I think the music that will generate more revenue will be :

Popularity and Reputation of the artist.
Wether the genre is trending these days.
Has the song won any awards.

Stuff like this. So you wilk have to find features that quantify these things. Try to find some open source datatset for this kind of problem and work from there.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange