Question

I am new to using random forest in R, and my goal is to identify the independent variables that have the highest impact on the dependent variable. I am looking at sales data, and sales is my dependent variable (1 vs. 0). I have other variables with several levels, such as professional status (retired, employed, unemployed), searching for (myself, parent, other), region (north, west, south), etc.

summary(data) tells me that the class of my variables is character (the dependent variable shows Min, 1st Qu., Median, so I assume R reads it as continuous?), and I believe a character variable needs to be converted to a factor before I can run the randomForest command. Is there a single command that transforms all character columns into factors?

My second question is whether I should remove the customer id from my imported table, or whether keeping it in the RF model will affect the results.


Solution

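You can check the class of the dependent variable with class(df$dependent). Given the summary output you describe, expect it to be numeric. Note that randomForest treats a numeric response as a regression target, so convert it with as.factor() if you want classification.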

To convert multiple columns to factors, you can do something like this

# Names of the columns to convert (replace with your own column names)
factor_cols <- c("col_1", "col_7")
df[factor_cols] <- lapply(df[factor_cols], as.factor)
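If you would rather not list the columns by hand, one option is to detect every character column and convert them all in one step. This is a minimal base R sketch; the data frame name df, the column names, and the toy values are made up for illustration.

```r
# Toy data standing in for the imported table (hypothetical columns and values)
df <- data.frame(
  id     = 1:4,
  status = c("retired", "employed", "unemployed", "employed"),
  region = c("north", "west", "south", "north"),
  sales  = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are character
char_cols <- vapply(df, is.character, logical(1))

# Convert only those columns to factors; the others are left untouched
df[char_cols] <- lapply(df[char_cols], as.factor)

# Convert the 0/1 response as well, so it is treated as a class label
df$sales <- as.factor(df$sales)

str(df)
```

Running str(df) afterwards is a quick way to confirm which columns ended up as factors.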

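If you keep the customer id, you will have a problem when applying your model to a new customer: the id says nothing general about sales, yet the forest can still split on it, and a new customer's id will not correspond to anything seen in training. It is safer to remove it before fitting.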
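Dropping the identifier and then asking the forest for variable importance could look roughly like the sketch below. The column names id and sales, and the simulated data, are assumptions for illustration only. The fit is guarded so the snippet still runs if the randomForest package is not installed.

```r
# Simulated data (hypothetical); in practice this is your imported table
set.seed(42)
n <- 200
df <- data.frame(
  id     = seq_len(n),
  status = factor(sample(c("retired", "employed", "unemployed"), n, replace = TRUE)),
  region = factor(sample(c("north", "west", "south"), n, replace = TRUE)),
  sales  = factor(sample(c(0, 1), n, replace = TRUE))
)

# Drop the customer id so the forest cannot split on it
model_df <- df[, setdiff(names(df), "id"), drop = FALSE]

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Factor response => classification forest; importance = TRUE stores importance measures
  rf <- randomForest::randomForest(sales ~ ., data = model_df, importance = TRUE)
  # Per-variable importance (mean decrease in accuracy and in Gini)
  print(randomForest::importance(rf))
}
```

Because the simulated response is random, the importance values here are not meaningful; with real data, higher values point to the variables the forest relied on most.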

Licensed under: CC-BY-SA with attribution