Question

This is not the first time I have wondered about this: if I have a categorical variable that I later want to convert to dummy variables (cities in this case), should I delete the rows whose value occurs fewer than N times?

For example, New York occurs 400+ times, but some cities appear only once or twice.

What should I do with values that have appeared only once or twice?

print(df['cities'].value_counts())

Output:

city1         424
city2         107
city3          35
city4          33
city5          28
city6          24
city7          15
city8           7
city9           4
city10          3
city11          2
city12          1
city13          1
city14          1
city15          1
city16          1
city17          1
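
To make the two options concrete, here is a minimal sketch rather than a recommendation. It assumes the column is named `cities`, uses a small made-up frame that mirrors a few of the counts above, and picks a hypothetical threshold `MIN_COUNT = 3` for N. Option A deletes the rare rows, as asked in the question. Option B keeps every row and pools the rare cities into a single `"other"` level before creating the dummies.

```python
import pandas as pd

# Hypothetical data mirroring a few of the counts shown above (abridged).
counts = {"city1": 424, "city2": 107, "city9": 4,
          "city11": 2, "city12": 1, "city13": 1}
df = pd.DataFrame({"cities": [c for c, n in counts.items() for _ in range(n)]})

MIN_COUNT = 3  # assumed threshold N; tune it for your own data

# Find the categories that occur fewer than MIN_COUNT times.
freq = df["cities"].value_counts()
rare = freq[freq < MIN_COUNT].index
is_rare = df["cities"].isin(rare)

# Option A: drop the rows whose city is rare.
dropped = df[~is_rare]

# Option B: keep all rows, but replace rare cities with one "other" label.
pooled = df.assign(cities=df["cities"].where(~is_rare, "other"))

# Dummy-encode the pooled column: one column per frequent city plus "other".
dummies = pd.get_dummies(pooled["cities"])
print(dummies.columns.tolist())
```

Option A loses observations, while Option B keeps them at the cost of treating all rare cities as one group. Which trade-off is acceptable depends on the model and on how much data can be spared.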

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange