Remove all columns where the entire column is null
10-12-2020
Question
I have a very dirty CSV file in which several columns contain only null values, and I would like to remove them. I am trying to select all columns where the count of null values in the column is not equal to the number of rows:
clean_df = bucketed_df.select([c for c in bucketed_df.columns if count(when(isnull(c), c)) not bucketed_df.count()])
However, I get this error:
File "<command-2213215314329625>", line 1
clean_df = bucketed_df.select([c for c in bucketed_df.columns if count(when(isnull(c), c)) not bucketed_df.count()])
^
SyntaxError: invalid syntax
If anyone could help me get rid of these dirty columns, that would be great.
Solution
[Updated]: Just realized it is about PySpark!
It is still simple! A concrete example (idea heavily borrowed from this answer):
Creating a dummy dataset
import pandas as pd
import numpy as np
import pyspark.sql.functions as sqlf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 100 rows: A-D hold random integers, E and F are entirely NaN
main = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
main["E"] = np.nan
main["F"] = np.nan
df = spark.createDataFrame(main)
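When the pandas frame is converted, the NaN values in E and F end up as nulls on the Spark side, so those two columns are entirely null while A through D are fully populated.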
Function to drop Null columns
def drop_null_columns(df):
    """
    Drops every column of df that contains only null values.

    :param df: a PySpark DataFrame
    :return: the DataFrame without the all-null columns
    """
    row_count = df.count()
    # Count the nulls in each column with a single pass over the data
    null_counts = df.select(
        [sqlf.count(sqlf.when(sqlf.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).collect()[0].asDict()
    to_drop = [k for k, v in null_counts.items() if v >= row_count]
    return df.drop(*to_drop)
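As an aside, the one-liner in the question raises a SyntaxError because "not" is not a comparison operator; the comparison needs "!=". Even with that fixed, count(when(...)) only builds an unevaluated Column expression rather than a number, so the counts have to be collected before they can be compared. A minimal sketch of that repaired approach, reusing df and sqlf from above:

row_count = df.count()
# Compute every column's null count in one job, then bring the single result row to the driver
null_counts = df.select(
    [sqlf.count(sqlf.when(sqlf.col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()
# Keep only the columns that are not entirely null
clean_df = df.select([c for c in df.columns if null_counts[c] != row_count])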
Outcome
df_dropped = drop_null_columns(df)
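Columns E and F should now be gone, which you can verify with:

print(df_dropped.columns)
# Expected: ['A', 'B', 'C', 'D']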
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange