Question

I have a couple of thousand CSV files, most of which have the following columns:

threadSubject
bccList
sender_name
recipient_names
sender  
dateReceived
date    
recipients
subject 

Unfortunately, depending on the CSV file, each column (if it is present at all) might be at a different column number, which complicates parsing.

What I need to do is extract only these selected columns from the input CSV files and put them all into a single output file.

I'm new to Python and am sure there's a perfectly easy way to achieve this, but I can't figure it out. I'm not sure whether I should use Pandas or some other mechanism.

In logical code it should work more or less like this.

for file in (all files in current folder); do
  open file;
  get header and find out at which positions are interesting columns 
  #or match by column name;

  dump interesting columns into output file in the right order;
  close file;
done

The tricky part for me is getting the header...

Would any of you have any advice on how to do this in a smart, Pythonic way?

I thought about using bash and parsing it manually, but thought it would be a good idea to learn how to do it in Python with your help.

P.S. The background is that I need to go through all emails from the last 5 years and find out at what time the first and the last email were sent on each day. The CSVs I have were created from Thunderbird MSF files using the Mork tool. Once the CSV parsing is done, I'll need to find an easy way to get the time of the first and last email on the same day, but that is another story.

Thanks in advance for all advice.


Solution

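If the column names are the same in all the files, use csv.DictReader to do the job. It reads the header row for you and keys every record by column name, so the position of a column in a given file no longer matters.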

Python csv.DictReader Documentation

You can reference the field names directly rather than the column number.

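    import csv

    # newline='' lets the csv module handle line endings itself
    with open('Path_to_file', newline='') as csv_file:
        for record in csv.DictReader(csv_file):
            # Each record is a dict keyed by the header names
            print(record['Column_Name'])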
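
To cover the "put them all into a single output file" part of the question, the same idea can be paired with csv.DictWriter. The following is only a minimal sketch, assuming every input file has a header row and is readable with the default encoding; the function name `merge_csv_files` and the `WANTED` list are made up for illustration, with the column names copied from the question.

```python
import csv

# Column names copied from the question; adjust them to match your files.
WANTED = [
    "threadSubject", "bccList", "sender_name", "recipient_names",
    "sender", "dateReceived", "date", "recipients", "subject",
]


def merge_csv_files(input_paths, output_path, columns=WANTED):
    """Copy only `columns` from every input CSV into one output CSV.

    Hypothetical helper: assumes each input file starts with a header row.
    """
    with open(output_path, "w", newline="") as out:
        # extrasaction="ignore" drops columns that were not asked for;
        # restval="" fills in columns that a given input file lacks.
        writer = csv.DictWriter(out, fieldnames=columns,
                                extrasaction="ignore", restval="")
        writer.writeheader()
        for path in input_paths:
            with open(path, newline="") as src:
                # DictReader takes the field names from the first row,
                # so the column order in each file does not matter.
                for record in csv.DictReader(src):
                    writer.writerow(record)
```

It could be called with, for example, the list returned by `glob.glob('*.csv')` as `input_paths`. If the output file is written to the same folder, make sure it is excluded from that list on later runs.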

Hope this helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow