How to insert two features in a model when a feature only applies to a certain group in the model

https://datascience.stackexchange.com/questions/80335

13-12-2020
Question

I'm building a machine learning model in Python to predict soccer player values. Consider the following feature columns of the dataframe:

           [features]
position | goals | goals_conceded
---------|-------|---------------
Forward  | 23    | NaN
Defender | 2     | NaN
Defender | 4     | NaN
Keeper   | NaN   | 20
Keeper   | NaN   | 43

Since keepers rarely score, they will almost always have null values in the "goals" column. They can still have this statistic, though, so filling those NaNs with 0 would be fine. Line players, on the other hand, can never have "goals_conceded" stats, because this is a keeper-only stat, so they will always have null values in that column. How do I build a machine learning model that uses these two columns as features?

I thought about merging them into a single column, but that won't work: for a line player, the more goals he scores the better, while for a goalkeeper it's the opposite, the fewer goals he concedes the better. I also can't fill the columns with zeros, since that would distort the model's predictions. In the "goals_conceded" column, for example, 98% of the rows belong to line players.

This happens with many of the columns in my dataframe, such as "clean sheets" (only Keepers will have this stat) and "shots at target" (only line players will have this stat). How do I deal with them?

Solution

To me, the data in its current form seems wrong for training a single model for all players. It is like trying to tell whether an apple is better than a tennis ball: they have completely different characteristics.

What you could do instead is group players with similar features into separate sets and train a model per set to predict scores from the features relevant to that set. For example, goalkeepers would be compared against other goalkeepers and scored accordingly. Afterwards, you need to set a baseline value for each set and scale the scores of the different classes accordingly.
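The grouping idea could be sketched roughly as follows, assuming pandas and scikit-learn. The data and the "value" target column are made up for illustration, LinearRegression is just a placeholder model, and the baseline/scaling step across groups is left out.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up illustrative data; the "value" target column is an assumption.
df = pd.DataFrame({
    "position": ["Forward", "Defender", "Defender", "Forward",
                 "Keeper", "Keeper", "Keeper"],
    "goals": [23, 2, 4, 15, np.nan, np.nan, np.nan],
    "goals_conceded": [np.nan, np.nan, np.nan, np.nan, 20, 43, 31],
    "value": [50.0, 12.0, 15.0, 35.0, 18.0, 6.0, 11.0],
})

# Each group only uses the statistics that apply to it.
GROUP_FEATURES = {
    "keeper": ["goals_conceded"],
    "line_player": ["goals"],
}

# Assign every player to one of the two groups based on position.
df["group"] = np.where(df["position"] == "Keeper", "keeper", "line_player")

# Fit one model per group, using only that group's rows and features.
models = {}
for group, features in GROUP_FEATURES.items():
    subset = df[df["group"] == group]
    models[group] = LinearRegression().fit(subset[features], subset["value"])


def predict_values(frame):
    """Route each row to the model of its group and collect the predictions."""
    preds = pd.Series(np.nan, index=frame.index, dtype=float)
    for group, features in GROUP_FEATURES.items():
        mask = frame["group"] == group
        if mask.any():
            preds.loc[mask] = models[group].predict(frame.loc[mask, features])
    return preds


predictions = predict_values(df)
```

Because each model only ever sees rows from its own group, none of the NaNs need to be filled at all.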

OTHER TIPS

Remember that any model learns the relationships between the features (X) and the label (y), so your ideas are sound. Which one is better will depend on your data, so try both, and try more ideas if you can. The best model is the one that generalizes best to new data, so let the data drive the final decision.

Please note that you have at least three features: goals, goals_conceded and position. The last one will help with this problem.

I also suggest filling NaN not with 0 but with -1 (or any other impossible/invalid integer), because a new player will legitimately have 0 in these features, and you want to be able to tell the two cases apart.
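A minimal sketch of this encoding with pandas, using the rows from the question. Filling "goals" with 0 for keepers follows the question's own reasoning, while the -1 sentinel is used for "goals_conceded"; the one-hot encoding of position is an added assumption about how the position column would be fed to a model.

```python
import numpy as np
import pandas as pd

# The example rows from the question.
df = pd.DataFrame({
    "position": ["Forward", "Defender", "Defender", "Keeper", "Keeper"],
    "goals": [23, 2, 4, np.nan, np.nan],
    "goals_conceded": [np.nan, np.nan, np.nan, 20, 43],
})

SENTINEL = -1  # an impossible count, meaning "this stat does not apply"

features = df.copy()
# Keepers can score, so a missing "goals" value is treated as a real zero.
features["goals"] = features["goals"].fillna(0)
# Line players can never concede, so mark the value as not applicable
# instead of pretending they conceded zero goals.
features["goals_conceded"] = features["goals_conceded"].fillna(SENTINEL)
# Turn position into 0/1 indicator columns so the model can use it.
features = pd.get_dummies(features, columns=["position"], dtype=int)
```

A sentinel like this tends to suit tree-based models, which can split on it directly; a linear model would read -1 as a real quantity, so there the position indicators matter more.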

It's an interesting question. In this case I doubt whether the training data is even correct, because the two kinds of observations are not comparable. For an ML model to understand how two observations differ, they should be comparable. A value that is unknown is a different thing from a value that does not apply to the data.

Note: I don't know whether the following is valid, because I am not very knowledgeable about soccer. There may be a different way to look at this problem. A forward can also be assigned as a keeper (no one would do it, since he may not be good at it, but he can still play as a keeper). So in your data the forward simply never conceded a goal, and you may replace NaN with 0.

Licensed under: CC-BY-SA with attribution