String Values in a dataframe in Pandas

https://datascience.stackexchange.com/questions/13089

16-10-2019
Question

Suppose I have a dataframe like this:

Hospital_name    State    Employees    ......
Fortis           Delhi    5000         ......
AIIMS            Delhi    1000000      ......
SuperSpeciality  Chennai  1000         ......

Now I want to use this dataframe to build a machine learning model for predictive analysis. For that I must convert the strings to float values. Also, some entries in the Hospital_name and State columns contain NaN values. In such a case, how should I prepare my data for building a model in Keras?

Solution

To convert from string to float in pandas (assuming you want to convert Employees and you loaded the dataframe as df), you can use the following, assigning the result back to the column:

df['Employees'] = df['Employees'].apply(lambda x: float(x))

You have not given enough information about your input and expected output. Let us assume that the hospital name, or whatever else serves as input to your model, is NaN. In that case you would want to remove those rows from the dataset, because extracting features from NaN wouldn't make sense. If the NaN values only occur in peripheral features, it might be acceptable to keep the rows. In that case, if you wish to convert them into blanks, use df.replace(np.nan, ' ', regex=True). Otherwise, if you wish to remove those rows, you can first check for NaN, for example with df.isna().
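As a minimal sketch of the check-then-remove route described above: the sample rows below are hypothetical (they reuse the question's column names with missing values added), and the choice of Hospital_name as the model input is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample reusing the question's column names, with NaN added.
df = pd.DataFrame({
    "Hospital_name": ["Fortis", np.nan, "SuperSpeciality"],
    "State": ["Delhi", "Delhi", np.nan],
    "Employees": ["5000", "1000000", "1000"],
})

# Count the missing values in each column before deciding what to do.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop only the rows where the assumed model input (Hospital_name) is missing;
# rows with NaN in other columns are kept.
df_dropped = df.dropna(subset=["Hospital_name"])
print(df_dropped)
```

Restricting dropna to a subset keeps rows whose only gaps are in peripheral columns, which matches the distinction drawn above.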

Other tips

The best way to deal with types is to specify them when ingesting the file:

pandas.read_csv(file_name, dtype={"Employees": float})
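A small self-contained sketch of this, assuming the data arrives as CSV text with the question's column names. An in-memory string stands in for the file so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content built from the rows shown in the question.
csv_text = """Hospital_name,State,Employees
Fortis,Delhi,5000
AIIMS,Delhi,1000000
SuperSpeciality,Chennai,1000
"""

# dtype maps column names to types, so Employees is parsed as float on load.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Employees": float})
print(df.dtypes)
```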

What you do with the missing data in Keras is up to you. What's your plan?

A more direct way of converting Employees to float:

df.Employees = df.Employees.astype(float)

You didn't specify what you wanted to do with NaNs, but you can replace them with a different value (int or string) using:

df = df.fillna(value_to_fill)

If you want to drop rows that contain NaN: df = df.dropna()
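The two options can be compared side by side in a short sketch. The sample rows, the "Unknown" placeholder, and the use of the column median for Employees are all assumptions made for illustration. fillna also accepts a dict so that each column gets its own fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one gap per incomplete row.
df = pd.DataFrame({
    "Hospital_name": ["Fortis", "AIIMS", np.nan],
    "State": ["Delhi", np.nan, "Chennai"],
    "Employees": [5000.0, np.nan, 1000.0],
})

# Fill per column: a placeholder string for text, the median for numbers.
# Both fill values are illustrative choices, not a recommendation.
filled = df.fillna({
    "Hospital_name": "Unknown",
    "State": "Unknown",
    "Employees": df["Employees"].median(),
})

# Alternatively, drop every row that has at least one NaN.
dropped = df.dropna()

print(filled)
print(dropped)
```

Filling keeps all three rows, while dropping leaves only the one complete row, so the choice depends on how much data you can afford to lose.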

I don't understand why you would map the strings to floats. I would suggest using one-hot encoding, which represents each category as its own column holding 1 or 0.

In pandas this would be:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0

You can also pass dummy_na=True, as in pd.get_dummies(s, dummy_na=True), to add a separate indicator column for NaN values.
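Applied to the question's data, this might look like the sketch below. The sample rows are hypothetical, and combining the encoded State column with Employees into one float array is an assumption about what the Keras model would consume. Recent pandas versions return boolean dummy columns by default, so dtype=float is passed explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: State has one missing value.
df = pd.DataFrame({
    "State": ["Delhi", "Delhi", "Chennai", np.nan],
    "Employees": [5000.0, 1000000.0, 1000.0, 200.0],
})

# One column per state, plus an extra indicator column for NaN.
state_dummies = pd.get_dummies(
    df["State"], prefix="State", dummy_na=True, dtype=float
)

# Join the numeric column with the encoded columns.
features = pd.concat([df[["Employees"]], state_dummies], axis=1)

# A plain float32 array is a common input format for a neural network.
X = features.to_numpy(dtype="float32")
print(features)
print(X.shape)
```

Each row has exactly one 1 among the State columns, with the NaN row flagged in the last one, so no row has to be dropped because of a missing category.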

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange