Question

I have a very dirty CSV in which several columns contain only null values.

I would like to remove them. I am trying to select all columns where the count of null values in the column is not equal to the number of rows.

clean_df = bucketed_df.select([c for c in bucketed_df.columns if count(when(isnull(c), c)) not bucketed_df.count()])

However, I get this error:

SyntaxError: invalid syntax
  File "<command-2213215314329625>", line 1
    clean_df = bucketed_df.select([c for c in bucketed_df.columns if count(when(isnull(c), c)) not bucketed_df.count()])
                                                                                                             ^
SyntaxError: invalid syntax

If anyone could help me get rid of these dirty columns, that would be great.


Solution

[Updated]: Just realized it is about PySpark!

Two things are wrong with your one-liner. The SyntaxError comes from using "not" as if it were a binary comparison operator; Python expects "!=" there. But even with that fixed, the comprehension would still fail at runtime: count(when(isnull(c), c)) builds an unevaluated Column expression, not a number, so it cannot be compared to bucketed_df.count() inside a plain Python condition. The null counts have to be computed first, e.g. with collect(), and only then compared as ordinary integers.
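
A minimal repair of your one-liner along those lines (a sketch, assuming bucketed_df is the DataFrame from your question):

from pyspark.sql.functions import count, when, isnull

# Materialize the per-column null counts once, on the driver
null_counts = bucketed_df.select(
    [count(when(isnull(c), c)).alias(c) for c in bucketed_df.columns]
).collect()[0].asDict()

# Compare plain integers, keeping only columns that are not entirely null
row_count = bucketed_df.count()
clean_df = bucketed_df.select([c for c in bucketed_df.columns if null_counts[c] != row_count])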

Wrapping the same idea into a reusable helper is still simple. A concrete example (idea heavily borrowed from this answer):

Creating a dummy dataset

import pandas as pd
import numpy as np
import pyspark.sql.functions as sqlf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 100 rows of random integers in columns A-D, plus two all-NaN columns
main = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
main["E"] = np.nan
main["F"] = np.nan

df = spark.createDataFrame(main)
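
Before dropping anything, you can inspect the per-column null counts with the same expression the helper below uses. One caveat: Spark distinguishes NaN from null, so whether E and F arrive as nulls depends on the pandas conversion path (with Arrow-based conversion enabled, NaN becomes null):

df.select(
    [sqlf.count(sqlf.when(sqlf.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()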

Function to drop null columns

def drop_null_columns(df):
    """
    Drops every column that contains only null values.
    :param df: A PySpark DataFrame
    :return: The DataFrame without its all-null columns
    """
    # One pass over the data: per-column null counts, collected to the driver
    null_counts = df.select(
        [sqlf.count(sqlf.when(sqlf.col(c).isNull(), c)).alias(c) for c in df.columns]
    ).collect()[0].asDict()

    # Compute the row count once, not once per column
    row_count = df.count()
    to_drop = [k for k, v in null_counts.items() if v == row_count]

    return df.drop(*to_drop)

Outcome

df_dropped = drop_null_columns(df)
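
As a quick sanity check (assuming E and F arrived as nulls, as discussed above), only the four integer columns should remain:

print(df_dropped.columns)  # expect: ['A', 'B', 'C', 'D']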