I want to fill missing values in my dataframe.

 In [1]: df = spark.createDataFrame([[1],[1],[2],[3],[3],[None],[3],[None],[3],[2],[None],[1],[4]], ['data'])
 In [2]: df.show()
 +----+
 |data|
 +----+
 |   1|
 |   1|
 |   2|
 |   3|
 |   3|
 |null|
 |   3|
 |null|
 |   3|
 |   2|
 |null|
 |   1|
 |   4|
 +----+

I know I can use the pyspark.ml Imputer to fill with the mean or median, or use this method to fill with the last valid value. These are fine options, but I would like to impute with a random sample drawn from the empirical distribution of the data. For example, given the data above, nulls would be filled according to these probabilities:

 P(1) = .3
 P(2) = .2
 P(3) = .4
 P(4) = .1

What would be the best way to fill these values from a random sample?
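For context, sampling uniformly from the list of non-null values already reproduces the probabilities above, since each value appears in the list in proportion to its frequency. A minimal plain-Python sketch of that idea (illustrative only, not a Spark solution; data, pool, and filled are hypothetical names):

```python
import random

data = [1, 1, 2, 3, 3, None, 3, None, 3, 2, None, 1, 4]

# The non-null values form the empirical distribution: a uniform draw from
# this pool selects each value with probability proportional to its count,
# i.e. P(1)=.3, P(2)=.2, P(3)=.4, P(4)=.1 for the data above.
pool = [v for v in data if v is not None]

random.seed(0)  # fixed seed so the fill is reproducible
filled = [v if v is not None else random.choice(pool) for v in data]
```

Presumably the same idea could be wired into PySpark by collecting the non-null column into such a pool and wrapping random.choice(pool) in a udf, then using coalesce to replace only the nulls, though collecting is only practical when the column's distinct values fit in driver memory.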
