Question

In a real-time dataset there are often many missing values, and we need to deal with them during data preprocessing. There are many ways to handle missing values at that stage.

So, can we fill them in using the mean, median, or standard deviation, or should we remove those records entirely?

Why do so many people advise against removing complete records from the dataset?

Solution

A missing value doesn't necessarily mean missing information. Sometimes a missing value is information in itself. For example, suppose we have a dataset with features such as pool area, number of rooms, and area, and pool area is missing for 90% of the rows. You can create a new column called is_pool that tells whether the house has a pool: if pool area is missing, set is_pool = 0, otherwise set it to 1.
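The is_pool idea can be sketched in pandas roughly as follows. This is only an illustration: the toy data and the column names (pool_area, n_rooms, area) are made up here, and it assumes that a missing pool area really does mean the house has no pool, which you would need to confirm for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values and column names are illustrative only.
houses = pd.DataFrame({
    "pool_area": [np.nan, 450.0, np.nan, np.nan, 300.0],
    "n_rooms": [3, 5, 2, 4, 6],
    "area": [1200, 2500, 900, 1600, 3100],
})

# 1 where a pool area was recorded, 0 where it is missing.
houses["is_pool"] = houses["pool_area"].notna().astype(int)

# Assumption: a missing pool area means "no pool", so an area of 0 is a fair fill.
houses["pool_area"] = houses["pool_area"].fillna(0.0)

print(houses)
```

Building the indicator before filling matters: once pool_area is filled with 0, the fact that the value was originally missing can no longer be read from that column.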

This is one basic example. In my experience, the most difficult thing in EDA is working out whether a missing value really means no information or whether it represents something else entirely. In short, understand why the value is missing.

Licensed under: CC-BY-SA with attribution