Question

I am writing a reusable script to process CSV files. Right now I use the following code to normalize the column names so that all files end up with the same columns.

df = pd.read_csv('Crokis.csv', index_col=0, encoding = "ISO-8859-1", low_memory=False)

genCol=['Genus','genus','ngenus','genera',]
df.rename(columns={typo: 'Genus' for typo in genCol}, inplace=True)

spCol=['species', 'sp', 'Species']
df.rename(columns={typo: 'species' for typo in spCol}, inplace=True)

chromCol=['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome']
df.rename(columns={typo: 'chromosome' for typo in chromCol}, inplace=True)

del chromCol, spCol, genCol

It works fine, but there are two problems:

  1. Sometimes a column is missed because of upper/lower casing, or because extra characters were added at the start or end of its name. Is there a way to use regex or something similar to handle these variations?

  2. The code repeats the same pattern three times, so I think there should be a way to simplify it.


Solution

You can use Python's re module for this.

Below is an example that replaces any match of 'genus.*' with 'Genus', ignoring case. It will match and rename, for example, 'genUS', 'GENUS', and 'Genus_666'. Note that characters before the match are kept, so 'ngenus' becomes 'nGenus'.

import pandas as pd
import re

df = pd.read_csv('Crokis.csv', index_col=0, encoding = "ISO-8859-1", low_memory=False)

# 'Genus' column renaming
f = lambda x: re.sub('genus.*','Genus', x, flags = re.IGNORECASE)
df.rename(columns = f, inplace = True)
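The same idea can be extended to all three columns and to names with extra leading characters. The following is only a sketch: the PATTERNS dict and the normalize_column function are hypothetical names, and the patterns themselves are guesses based on the aliases in the question. They are deliberately loose (the 'Genus' pattern would also match a column called 'general notes', for instance), so tighten them to fit your files.

```python
import re

# Hypothetical patterns, one per canonical column name.
# Adjust them to the variants that actually occur in your files.
PATTERNS = {
    'Genus': re.compile(r'.*gen(us|era).*', re.IGNORECASE),
    'species': re.compile(r'\s*(species|sp)\W*', re.IGNORECASE),
    'chromosome': re.compile(r'.*(chromosome|cytology|2n).*', re.IGNORECASE),
}

def normalize_column(name):
    """Return the canonical name if a pattern matches the whole label,
    otherwise return the label unchanged."""
    for canonical, pattern in PATTERNS.items():
        if pattern.fullmatch(name):
            return canonical
    return name

# Quick check on plain strings; no DataFrame is needed for this part.
labels = ['ngenus', ' GENUS_666', 'Sp.', 'Chromosome count', 'locality']
print([normalize_column(label) for label in labels])
```

Because DataFrame.rename accepts a function for columns, the function can be applied with df.rename(columns=normalize_column, inplace=True). Using fullmatch instead of re.sub means the whole label is replaced, so leading characters such as the 'n' in 'ngenus' do not survive.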

Other tips

I would approach the problem this way:

# use a single dict to hold the mapping
name_map = {'Genus': ['Genus','genus','ngenus','genera'],
        'species':['species', 'sp', 'Species'],
        'chromosome':['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome']}

col_translate = {}

for c in df.columns:
    for canonical_name, alias_names in name_map.items():
        for alias_name in alias_names:
            # case-insensitive exact match
            if c.lower() == alias_name.lower():
                col_translate[c] = canonical_name
            # if you want to check prefix or suffix...
            elif c.startswith(alias_name) or c.endswith(alias_name):
                col_translate[c] = canonical_name
            # ... any additional, more complicated test

# apply the collected renames
df.rename(columns=col_translate, inplace=True)

This is more flexible in cases that would be too awkward to express with re.
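If only the case-insensitive exact match is needed, the nested loops can be replaced by a single lookup table. This is a sketch under that assumption; build_alias_lookup and translate_columns are hypothetical helper names, and stripping surrounding whitespace is an extra step not present in the code above.

```python
name_map = {'Genus': ['Genus', 'genus', 'ngenus', 'genera'],
            'species': ['species', 'sp', 'Species'],
            'chromosome': ['Chromosome count', 'chromosome', 'Cytology', '2n', 'Chromosome']}

def build_alias_lookup(name_map):
    """Invert the mapping: lower-cased alias -> canonical name."""
    return {alias.strip().lower(): canonical
            for canonical, aliases in name_map.items()
            for alias in aliases}

def translate_columns(columns, name_map):
    """Return {original column: canonical name} for every recognized column."""
    lookup = build_alias_lookup(name_map)
    translated = {}
    for col in columns:
        key = str(col).strip().lower()  # ignore case and surrounding spaces
        if key in lookup:
            translated[col] = lookup[key]
    return translated

print(translate_columns(['GENUS ', 'Sp', '2N', 'locality'], name_map))
```

The resulting dict can be passed straight to df.rename(columns=translate_columns(df.columns, name_map)). Columns that match no alias are left out of the dict, so rename keeps them as they are.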

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow