Question

I have some data in a CSV that I want to run some analysis on to check the quality of the data. I have been using pandas because of how easy it makes loading data from a CSV.

I was wondering what the most effective method would be for comparing every value in a Series to see whether it exists within another list of values. I want to do this to check for any errors in the CSV; later I will use these values to try to clean the data. The data could potentially be very large.

For example:

I have a CSV that contains data on the suburbs that people have listed as where they live. Many of these have been entered manually and could be prone to typos, incorrect spellings, etc.

To check this I have a list that contains valid suburb names. I will iterate through each value in the Series and compare it to each value in the list of valid suburbs, then return all unique values which are not valid.

  1. Read in the values from the CSV:

    import pandas as pd
    df = pd.read_csv("user_address")
    
  2. Extract the Series I want to work with (suburb), and get all unique strings from the Series to reduce the number of comparisons I have to do:

    unique_suburbs = df['Suburb'].unique()  # note: .unique() returns a NumPy array, not a Series
    
  3. Iterate through each unique string to see if it matches any of the valid suburb names stored in a list:

    L = ...  # list of valid suburb names

    for value in unique_suburbs:
        if value not in L:
            print(value)  # will use value for something more in reality
    
  4. Return the strings which do not match any of the valid suburb names. (A runnable sketch of these steps follows below.)
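Putting the steps together, here is a minimal runnable sketch of the loop-based approach. The file name "user_address" and the list L come from the question; converting L to a set for faster membership tests is my own addition, and find_invalid_suburbs is a hypothetical helper name:

    import pandas as pd

    def find_invalid_suburbs(csv_path, valid_suburbs):
        """Return the unique suburb strings in the CSV that are not valid."""
        df = pd.read_csv(csv_path)
        unique_suburbs = df['Suburb'].unique()  # NumPy array of distinct values
        valid = set(valid_suburbs)  # set membership is O(1) per lookup
        return [value for value in unique_suburbs if value not in valid]

    # Usage (L is the list of valid suburb names from the question):
    # invalid = find_invalid_suburbs("user_address", L)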


Solution

The isin() method does this for you and is part of pandas. It compares a column to an array of values and returns a Boolean Series that is True where a value in the DataFrame column is in the array and False where it is not.

    values_not_in_array = df[~df.Suburb.isin(L)].Suburb
    values_in_array = df[df.Suburb.isin(L)].Suburb
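For example, with a small made-up DataFrame (the suburb names below are placeholders for illustration):

    import pandas as pd

    df = pd.DataFrame({'Suburb': ['Richmond', 'Richmnod', 'Carlton', 'Carltn']})
    L = ['Richmond', 'Carlton']  # valid suburb names

    mask = df.Suburb.isin(L)  # Boolean Series: True where the suburb is valid
    values_not_in_array = df[~mask].Suburb
    print(values_not_in_array.unique())  # ['Richmnod' 'Carltn']

Since you only need each invalid name once for cleaning, chaining .unique() onto the filtered column gives the distinct offending strings directly.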