Question

I have some doubts about encoding categorical variables (I am not familiar with tasks like these) so that I can use them as features in a model such as logistic regression or an SVM. My dataset looks like the following:

Text                                   Symbol   Note     Account   Age   Label
There is a red car                     !        red      John      24    1
My bag was very expensive              ?        orange   Luke      36    0
Where are my keys?                     @        red      Red       58    1
I promise: I will never let you go!    ...      green    Aoife     28    0

Text stores comments from users in a community; Symbol is the symbol a user uses most; Note represents the user's level (green is more experienced, red is a new joiner); Account is the user name. Label gives me information about the user's trustworthiness: if 0, the user is not fake; if 1, the user might be a bot.

I would like to classify new users based on the current information (see columns above). My dataset includes more than 1000 rows and 400 users. Since classifiers need numeric inputs, I have to encode the categorical fields, and I tried the following, using a MultiColumnLabelEncoder built on scikit-learn's LabelEncoder:

MultiColumnLabelEncoder(columns = ['Text', 'Symbol', 'Note', 'Account']).fit_transform(df)

where df is my dataframe. However, I understand that a one-hot encoder should be preferable. I also included Account because there may be multiple comments from the same account: if I classify an account as fake and later receive a new comment from that same account, it could easily be detected as fake. The aim, as I mentioned earlier, is to classify new elements from a test set, with a certain accuracy, based on the information given (Symbol, Note, Age, Text), i.e. to look for correlations among these variables that allow me to say whether a new account is fake (1) or not (0).

The problem, as you can see, is that the classifier's inputs are not only numerical but also categorical.
For data preprocessing (removing stopwords and cleaning the data) I have used Python's nltk package. Regarding feature extraction (a key point, since it feeds the next step, i.e. using a classifier to predict the class, 1 or 0), I have had difficulty understanding what output I should expect from the encoding so that I can use the information above as inputs to my model (where the target is called Label and is a binary value). I am using logistic regression as a classifier, but also an SVM.

My expected output, for a user X with age 16, symbol #, text "Wonderful", and note red (new joiner), would be a classification as fake with a certain probability.

I would appreciate it if someone could explain to me, step by step, how to transform my dataset into one whose variables I can use in a logistic regression to determine the label (fake or not fake) of new users.

Was it helpful?

Solution

You will have to use a mix of text processing and one-hot encoding. The Text column should not be one-hot encoded, since that would create one new variable for every unique sentence in the dataset, which would be a lot (and not very helpful for learning). A text vectorizer instead summarizes the Text column based on the words/tokens that appear in it.

So you should first use a text vectorizer to process only the Text column. This gives you one data-frame (say A) whose columns correspond to the tokens/words in the dataset. So if the vectorizer picks up 100 unique words, you will have a 1000x100 data-frame. Note that these 100 columns are generated by the text vectorizer alone.

For Symbol and Note, you can use one-hot encoding, which gives you another data-frame (say B). Then join A and B on a common key to get your final input data-frame. The common key here is the row ID (though read the following comment on aggregating data at the user level).
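A sketch of the one-hot step and the join, using pandas (`get_dummies` for the encoding, the shared row index as the common key; column values are the question's toy data):

```python
import pandas as pd

# Toy reconstruction of the categorical and numeric columns
df = pd.DataFrame({
    "Symbol": ["!", "?", "@", "..."],
    "Note":   ["red", "orange", "red", "green"],
    "Age":    [24, 36, 58, 28],
})

# Data-frame B: one indicator column per category (e.g. Note_red, Symbol_!)
B = pd.get_dummies(df[["Symbol", "Note"]])

# Join on the shared row index; in practice you would also concat
# data-frame A from the text vectorizer here
X = pd.concat([df[["Age"]], B], axis=1)
print(X.columns.tolist())
```

The resulting `X` (together with the vectorized text columns) is what you feed to `LogisticRegression.fit(X, y)`, with `y` being the Label column.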

It is not clear whether the user name (Account) column is unique in the data. If there are 1000 rows but only 400 users, there can be more than one row per user. In that case, you can consider aggregating the data at the user level (for the Text column, you can simply concatenate all strings from the same user).
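A sketch of that user-level aggregation with a pandas `groupby` (the choice of `last` for Note/Age and `max` for Label is an assumption for illustration; pick whatever aggregation fits your data):

```python
import pandas as pd

# Toy data with two comments from the same account
df = pd.DataFrame({
    "Account": ["John", "John", "Luke"],
    "Text":    ["There is a red car", "Where are my keys?", "My bag was very expensive"],
    "Note":    ["red", "red", "orange"],
    "Age":     [24, 24, 36],
    "Label":   [1, 1, 0],
})

per_user = df.groupby("Account", as_index=False).agg({
    "Text": " ".join,   # concatenate all comments from the same user
    "Note": "last",     # keep the user's most recent level
    "Age": "last",
    "Label": "max",     # flag the user if any row was labeled fake
})
print(per_user.shape)  # one row per Account
```

You would then run the text vectorizer and one-hot encoding on `per_user` instead of the raw rows, so each training example describes one user rather than one comment.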

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange