Question

I have some data in a CSV that I want to run some analysis on to check the quality of the data. I have been using pandas because of how easy it makes loading data from a CSV.

I was wondering what the most effective method would be for comparing every value in a Series to see whether it exists within another list of values. I want to do this to check for any errors in the CSV; later I will use these values to try to clean the data. The data could potentially be very large.

For example:

I have a CSV that contains data on the suburbs that people have listed as where they live. Many of these have been entered manually and could be prone to typos, incorrect spellings, etc.

To check this I have a list that contains valid suburb names. I will iterate through each value in the Series and compare it to each value in the list of valid suburbs, then return all unique values which are not valid.

  1. Read in the values from the CSV:

    import pandas as pd
    df = pd.read_csv("user_address")
    
  2. Extract the Series I want to work with (suburb), and get all unique strings from the Series to reduce the number of comparisons I have to do:

    unique_suburbs = df['Suburb'].unique()  # note: .unique() returns a NumPy array, not a Series
    
  3. Iterate through each unique string to see if it matches any of the valid suburb names stored in a list:

    L = ...  # list of valid suburb names

    for value in unique_suburbs:
        if value not in L:
            print(value)  # will use value for something more in reality
    
  4. Return the strings which do not match any of the valid suburb names. (A runnable sketch of these steps follows below.)
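Putting the steps together, here is a minimal runnable sketch of the loop-based approach. The file name "user_address" and the list L come from the question; converting L to a set for faster membership tests is my own addition, and find_invalid_suburbs is a hypothetical helper name:

    import pandas as pd

    def find_invalid_suburbs(csv_path, valid_suburbs):
        """Return the unique suburb strings in the CSV that are not valid."""
        df = pd.read_csv(csv_path)
        unique_suburbs = df['Suburb'].unique()  # NumPy array of distinct values
        valid = set(valid_suburbs)  # set membership is O(1) per lookup
        return [value for value in unique_suburbs if value not in valid]

    # Usage (L is the list of valid suburb names from the question):
    # invalid = find_invalid_suburbs("user_address", L)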


Solution

The isin() method does this for you and is part of pandas. It compares a column to an array of values and returns a Boolean Series that is True where a value in the DataFrame column is in the array and False where it is not.

    values_not_in_array = df[~df.Suburb.isin(L)].Suburb
    values_in_array = df[df.Suburb.isin(L)].Suburb
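For example, with a small made-up DataFrame (the suburb names below are placeholders for illustration):

    import pandas as pd

    df = pd.DataFrame({'Suburb': ['Richmond', 'Richmnod', 'Carlton', 'Carltn']})
    L = ['Richmond', 'Carlton']  # valid suburb names

    mask = df.Suburb.isin(L)  # Boolean Series: True where the suburb is valid
    values_not_in_array = df[~mask].Suburb
    print(values_not_in_array.unique())  # ['Richmnod' 'Carltn']

Since you only need each invalid name once for cleaning, chaining .unique() onto the filtered column gives the distinct offending strings directly.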