Question

I'm trying to do anomaly detection with Isolation Forests (IF) in sklearn. Apart from it being a great method for anomaly detection, I also want to use it because about half of my features are categorical (font names, etc.).

I have too many categories to use one-hot encoding (1000+ for a single feature, and that is just one of many features), and in any case I'm looking for a more robust way of representing the data.

Also, I want to experiment with other clustering techniques later on, so I don't necessarily want to use label encoding, as it will misrepresent the data in Euclidean space.
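To make the Euclidean-space concern concrete, here is a minimal sketch in plain Python. The four font names are hypothetical stand-ins, and the alphabetical code assignment is an assumption (any other assignment would be just as arbitrary). It compares distances between label-encoded fonts with distances between their one-hot vectors.

```python
import math

# Hypothetical categories, used only for illustration.
fonts = ["Arial", "Courier", "Georgia", "Verdana"]

# Label (ordinal) encoding: one integer per category, assigned alphabetically.
label_code = {font: i for i, font in enumerate(sorted(fonts))}

# One-hot encoding: one indicator vector per category.
one_hot = {
    font: [1.0 if j == i else 0.0 for j in range(len(fonts))]
    for i, font in enumerate(sorted(fonts))
}

def label_distance(a, b):
    """Distance between two fonts on the label-encoded axis."""
    return abs(label_code[a] - label_code[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two fonts."""
    return math.dist(one_hot[a], one_hot[b])

# Label encoding makes Verdana look three times as far from Arial as
# Courier is, purely because of alphabetical order.
print(label_distance("Arial", "Courier"))  # 1
print(label_distance("Arial", "Verdana"))  # 3

# One-hot encoding keeps every pair of distinct fonts equally far apart.
print(one_hot_distance("Arial", "Courier") == one_hot_distance("Arial", "Verdana"))  # True
```

The label-encoded distances carry no real meaning about the fonts, which is the misrepresentation a distance-based model would pick up.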

I thus have a two-part question:

  1. How will label encoding (i.e. ordinal numbers) affect tree-based methods such as the Isolation Forest? Seeing as they aren't distance-based, they shouldn't make assumptions about ordinal data, right?

  2. What other feature transformations can I consider for distance-based models?
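As a rough illustration of both parts, the sketch below uses plain Python rather than sklearn. It relies on one property of isolation trees: each split is a threshold on a single feature (x <= t versus x > t), so one split can only cut off values at the low or high end of the encoded range. The toy font counts, the alphabetical ordinal codes, and the helper names are all assumptions made for this example. Frequency encoding is shown as one commonly used alternative, not as a recommended answer.

```python
from collections import Counter

def ordinal_encode(values):
    """Map each category to an integer code in alphabetical order."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def frequency_encode(values):
    """Map each category to its relative frequency in the data."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

def isolable_with_one_split(categories, encoded):
    """Categories sitting at the minimum or maximum encoded value.

    A single threshold split can separate only these from the rest;
    any category encoded in the middle needs at least two splits.
    """
    lo, hi = min(encoded), max(encoded)
    return {c for c, x in zip(categories, encoded) if x in (lo, hi)}

# Hypothetical toy data: "Comic Sans" is the rare (anomalous) font.
fonts = ["Arial"] * 50 + ["Helvetica"] * 30 + ["Times"] * 19 + ["Comic Sans"]

ordinal_result = isolable_with_one_split(fonts, ordinal_encode(fonts))
frequency_result = isolable_with_one_split(fonts, frequency_encode(fonts))

# Alphabetical codes put the rare font in the middle (code 1 of 0..3),
# so the extremes are two common fonts.
print(sorted(ordinal_result))    # ['Arial', 'Times']

# Frequency encoding puts the rare font at the low extreme.
print(sorted(frequency_result))  # ['Arial', 'Comic Sans']
```

So although the trees are not distance-based, the order of the codes still determines which categories are easy to isolate. Note that frequency encoding maps categories with equal counts to the same value, so they become indistinguishable on that feature.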

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange