How to quantify these features so they can be analysed using Logistic Regression?

StackOverflow https://stackoverflow.com/questions/23346489

  •  11-07-2023

Question

I have a small question that has been baffling me for a while. I have a dataset with interesting features, but some of them are dimensionless quantities. I've tried using z-scores on them, but that made things worse. These are:

Timestamps (like YYYYMMDDHHMMSSMis): I am taking the last 9 characters of each.
User IDs (in hashed form): how do I extract meaning from them?
IP addresses (you know what those are): I only extract the first 3 characters.
City (has an ID like 1, 15, 72): how do I extract meaning from this?
Region (same as city): should I extract meaning from this or just leave it?

The rest of the features are prices, widths and heights, which I understand. Any help or insight would be much appreciated. Thank you.


Solution

  • Timestamps can be transformed into Unix timestamps, which are reasonable natural numbers.
  • User IDs, cities and regions are nominal values, which have to be encoded somehow. The most common approach is to create as many "dummy" dimensions as there are possible values. So if you have 100 cities, you create 100 dimensions and put a "1" only in the one representing a particular city (and 0 in the others).
  • IPs should either be removed or mapped to a small number of groups (based on the DNS/network identification), which are then converted from nominal to dummy values as above.
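
The three suggestions above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not a definitive implementation: the timestamp is assumed to be a 17-digit `YYYYMMDDHHMMSSmmm` string in UTC, and IPs are grouped by their /24 network prefix as a simple stand-in for the network identification mentioned in the answer. The helper names `to_unix_seconds`, `one_hot` and `ip_to_subnet` are hypothetical.

```python
from datetime import datetime, timezone
import ipaddress


def to_unix_seconds(stamp: str) -> float:
    """Convert a 'YYYYMMDDHHMMSSmmm' string to Unix seconds.

    Assumption: the timestamp is in UTC and the last 3 digits are milliseconds.
    """
    base = datetime.strptime(stamp[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    millis = int(stamp[14:17] or 0)
    return base.timestamp() + millis / 1000.0


def one_hot(values):
    """Dummy-encode nominal values: one 0/1 column per distinct value.

    Returns (categories, rows), where rows[i] has a single 1 in the
    column that corresponds to values[i].
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


def ip_to_subnet(ip: str, prefix_len: int = 24) -> str:
    """Collapse an IPv4 address to its enclosing network.

    Assumption: a fixed /24 prefix is a good enough grouping; the
    resulting subnet strings can then be passed to one_hot().
    """
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
```

For example, `one_hot([1, 15, 72, 15])` yields the categories `[1, 15, 72]` and one row per input, and `ip_to_subnet("192.168.1.77")` gives `"192.168.1.0/24"`, which can itself be dummy-encoded. The resulting numeric columns can be fed to a logistic regression alongside the prices, widths and heights.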
Licensed under: CC-BY-SA with attribution