Question

I have multiple CSV files with the same number of columns, but the column order differs from file to file. I want to merge them and remove duplicates. The other solutions here don't take column order into account, so the merged output is incorrect. How can I do this from the Windows command line (e.g. LogParser) or in bash?

A Python script that achieves this would also do.


Solution

The following script works properly if:

  • the CSV files aren't too big (i.e. they can be loaded into memory)
  • the first row of each CSV contains the column names

You only have to fill in files and final_headers.

import csv

files = ['c1.csv', 'c2.csv', 'c3.csv']
final_headers = ['col1', 'col2', 'col3']

merged_rows = set()
for f in files:
    with open(f, newline='') as csv_in:
        csvreader = csv.reader(csv_in, delimiter=',')
        # Map each column name to its index in this particular file
        headers = {h: i for i, h in enumerate(next(csvreader))}
        for row in csvreader:
            # Reorder the values to match final_headers; the set removes duplicates
            merged_rows.add(tuple(row[headers[x]] for x in final_headers))

with open('output.csv', 'w', newline='') as csv_out:
    csvwriter = csv.writer(csv_out, delimiter=',')
    # Write the header row first, then the merged, de-duplicated rows
    csvwriter.writerow(final_headers)
    csvwriter.writerows(merged_rows)

Other tips

csvkit's csvjoin can do that.

csvjoin -c "Column 1,Column 2" --outer file1.csv file2.csv

Personally, I would separate the two tasks of merging files and removing duplicates. I would also recommend using a database instead of CSV files if that's an option, since managing columns in a database is easier.
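As a rough sketch of the database route (not from the original answer), Python's built-in sqlite3 module is enough; the table name, column names, and file names below are illustrative assumptions:

import csv
import sqlite3

files = ['c1.csv', 'c2.csv', 'c3.csv']      # illustrative file names
columns = ['col1', 'col2', 'col3']          # illustrative column names

conn = sqlite3.connect('merged.db')         # or ':memory:' for small data sets
conn.execute('CREATE TABLE IF NOT EXISTS merged (col1 TEXT, col2 TEXT, col3 TEXT)')

for f in files:
    with open(f, newline='') as csv_in:
        # DictReader keys each value by the column name from that file's header row,
        # so the physical column order of each file no longer matters
        reader = csv.DictReader(csv_in)
        conn.executemany(
            'INSERT INTO merged (col1, col2, col3) VALUES (?, ?, ?)',
            ([row[c] for c in columns] for row in reader))
conn.commit()

with open('output.csv', 'w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerow(columns)
    # SELECT DISTINCT drops duplicate rows during the export
    writer.writerows(conn.execute('SELECT DISTINCT col1, col2, col3 FROM merged'))
conn.close()

Inserting by column name makes each file's column order irrelevant, and SELECT DISTINCT handles the de-duplication.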

If you stick with plain CSV files, here is an example using Python, whose csv library is easy to use.

import csv

with open(srcPath, 'r', newline='') as srcCSV:
    csvReader = csv.reader(srcCSV, delimiter=',')

    # Use mode 'a' instead of 'w' to append to an existing destination file
    with open(destPath, 'w', newline='') as destCSV:
        csvWriter = csv.writer(destCSV, delimiter=',')

        for record in csvReader:
            # writerow expects a single sequence; list the fields in the order you want
            csvWriter.writerow([record[1], record[3], record[2]])  # ..., record[n]

This allows you to rewrite the columns in any order you choose. The destination CSV could be an existing file that you extend, or a new one with a better layout. Using the csv library also helps avoid the transcription errors that hand-rolled string handling tends to introduce.

Once the data is consolidated, you could use the same library to iterate over the single data file and identify records that are identical; a sketch of that pass follows the note below.

Note: this method reads and writes files a line at a time, so it can process files of any size. I used this method to consolidate 221 million records from files as large as 6 GB each.
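As a minimal sketch of that deduplication pass (the file names are illustrative), you can stream the consolidated file once and keep a set of the rows already seen:

import csv

seen = set()
with open('consolidated.csv', newline='') as src, \
        open('deduplicated.csv', 'w', newline='') as dest:
    reader = csv.reader(src)
    writer = csv.writer(dest)
    for record in reader:
        key = tuple(record)      # the whole row is the record's identity
        if key not in seen:      # write each distinct row only once
            seen.add(key)
            writer.writerow(record)

If even the set of seen rows would not fit in memory, you could store a hash of each row instead, or sort the consolidated file first and compare neighbouring lines.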

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow