Question

I have a huge source data file that stores every variable value as text instead of a numerical ID. For example, I would like the variable gender to be coded as 1 and 2 instead of "female" and "male" written out. The same applies to 200 other variables, some of which have hundreds of distinct values.

Therefore, doing this manually is not really an option here.

Could anybody please point me to a solution or hint in R, SPSS or Python for assigning a numerical ID to each distinct value of a variable?

I thought this would be a problem other people face more commonly as well, but I have found nothing of this kind at all.

Thank you for any help!

Solution

SPSS has an AUTORECODE command that does the whole job in one step. For example:

AUTORECODE vr1 TO vr100 /INTO Kvr1 TO Kvr100 /PRINT.

This takes the text variables vr1 to vr100 and recodes them into new numerical variables Kvr1 to Kvr100. Each textual category in an old variable is automatically numbered in the new variable, and the original text is kept as the value label for that number.
The PRINT subcommand shows, in the output window, a list of the number codes chosen for the text categories of each variable.
Please note: the TO convention (as in "vr1 to vr100") only works when the variables are consecutively ordered in the file. If they are not, you have to name them individually.
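For readers without SPSS, the idea behind AUTORECODE can be sketched in plain Python. This is only an approximation of the behavior described above, not SPSS's implementation: the function name autorecode, the "K" prefix for the new columns, and the sample rows are all made up for illustration, and the codes are assumed to be assigned in ascending alphabetical order starting at 1.

```python
def autorecode(rows, columns, prefix="K"):
    """Add integer-coded copies of text columns to a list of dict rows.

    Hypothetical helper: returns {column: {code: original_text}}, which
    plays the role of SPSS value labels.
    """
    value_labels = {}
    for col in columns:
        # Distinct text values, sorted so that the codes are reproducible.
        distinct = sorted({row[col] for row in rows})
        mapping = {text: code for code, text in enumerate(distinct, start=1)}
        value_labels[col] = {code: text for text, code in mapping.items()}
        for row in rows:
            row[prefix + col] = mapping[row[col]]
    return value_labels


# Made-up sample data standing in for the real file.
rows = [
    {"gender": "male", "city": "tokyo"},
    {"gender": "female", "city": "paris"},
    {"gender": "male", "city": "amsterdam"},
]
value_labels = autorecode(rows, ["gender", "city"])
```

Here value_labels["gender"] maps 1 to "female" and 2 to "male", so the original text can always be recovered from the numeric code.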

Other tips

In Python, you can use LabelEncoder from scikit-learn's preprocessing module. Here is some example code from this page with my comments:

from sklearn import preprocessing

# Make a LabelEncoder instance
le = preprocessing.LabelEncoder()

# Show it the data it has to encode, i.e. your column
le.fit(["paris", "paris", "tokyo", "amsterdam"])

# Get a sorted list of all classes it found
list(le.classes_)  # ['amsterdam', 'paris', 'tokyo']

# Transform a column/list into integer codes (codes start at 0)
le.transform(["tokyo", "tokyo", "paris"])  # array([2, 2, 1])

# Transform the encoding back to the original text
list(le.inverse_transform([2, 2, 1]))  # ['tokyo', 'tokyo', 'paris']

In R, you convert your categorical column to a factor and then take its numeric codes:

dfr$id = as.numeric(factor(dfr$mycolumn))
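Since the question involves some 200 columns, here is a rough pandas counterpart of the R one-liner, applied to every non-numeric column at once. It is a sketch under assumptions: the DataFrame contents, the "_id" suffix and the value_labels dictionary are invented for the example, and it relies on pandas' category dtype, which sorts text categories much like factor() does in R.

```python
import pandas as pd

# Made-up sample data standing in for the real file.
df = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "city": ["tokyo", "paris", "amsterdam"],
    "age": [31, 45, 27],
})

# Every column that is not already numeric is treated as categorical text.
text_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

value_labels = {}
for col in text_cols:
    cat = df[col].astype("category")
    # cat.codes is 0-based; add 1 to get 1-based IDs as in R.
    # Note: missing values get code -1, so they end up as 0 here.
    df[col + "_id"] = cat.cat.codes + 1
    # Keep the code -> text mapping so that nothing is lost.
    value_labels[col] = dict(enumerate(cat.cat.categories, start=1))
```

The numeric column age is left untouched, and value_labels keeps the lookup table for each recoded column.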

Licensed under: CC-BY-SA with attribution