How do I merge two CSV files based on a field and keep the same number of attributes on each record?

StackOverflow https://stackoverflow.com/questions/23343919

  •  11-07-2023

Question

I am attempting to merge two CSV files based on a specific field in each file.

file1.csv

id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"

file2.csv

id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False

This is the code I am using:

import csv
from collections import OrderedDict

with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader,None) # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}

with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader,None) # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)

result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)

with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)

I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:

1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure

file2 will not have a record for every id in file1. I'd like the output to have empty fields from file2 in the merged file. For example, id 1 would look like this:

1,True,7,Purple,,,

How can I add the empty fields to records that don't have data in file2 so that all of my records in the merged CSV have the same number of attributes?


Solution

If we're not using pandas, I'd refactor to something like

import csv
from collections import OrderedDict

filenames = "file1.csv", "file2.csv"
data = OrderedDict()  # id -> merged row dict, in first-seen order
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp: # python 2
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            # Merge this row into whatever we already have for its id
            data.setdefault(row["id"], {}).update(row)

# Drop the duplicate "id" header while preserving column order
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        # Fields missing for this id are written as empty strings
        writer.writerow([row.get(field, '') for field in fieldnames])

which gives

id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
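
The snippet above targets Python 2 (binary file modes, itervalues). As a rough sketch of the same idea on Python 3, the version below reads each file with csv.DictReader and lets csv.DictWriter pad missing fields through its restval argument. The function name merge_csv and its parameters are made up for illustration and are not part of the original answer.

```python
import csv


def merge_csv(paths, out_path, key="id"):
    """Merge CSV files on `key`, padding missing fields with ''."""
    merged = {}      # key value -> combined row; dicts keep insertion order
    fieldnames = []  # union of all headers, in first-seen order
    for path in paths:
        # newline="" is what the csv module docs recommend for file objects
        with open(path, newline="") as fp:
            reader = csv.DictReader(fp)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                # Fold this row into the record already stored for its key
                merged.setdefault(row[key], {}).update(row)

    with open(out_path, "w", newline="") as fp:
        # restval fills any field a row lacks with an empty string
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(merged.values())
```

Calling merge_csv(["file1.csv", "file2.csv"], "merged.csv") should give the same layout as the output shown above, assuming every input file has an id column.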

For comparison, the pandas equivalent would be something like

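import pandas as pd

# read_csv infers dtypes, so e.g. the 7 in attr2 is written back as 7.0;
# pass dtype=str to read_csv if the text must be preserved exactly
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)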

which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.

OTHER TIPS

You can use pandas to do this:

import pandas

csv1 = pandas.read_csv('file1.csv')
csv2 = pandas.read_csv('file2.csv')
# how='left' keeps every row of file1; the default (how='inner') would drop ids missing from file2
merged = csv1.merge(csv2, on='id', how='left')
merged.to_csv("output.csv", index=False)

The code is fairly self-explanatory. pandas.read_csv reads the two CSV files and the merge method joins them. The on parameter specifies which column is used as the "key", and how='left' keeps every record from file1 even when file2 has no matching id. Missing values are written as empty fields when the merged result is saved to output.csv.
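
To make the role of the key column concrete, here is a minimal plain-Python sketch of what joining on id means, and of how keeping or dropping unmatched ids changes the result. The helper join_on_key and the tiny in-memory tables are made up for illustration; this is not how pandas implements merge.

```python
def join_on_key(left, right, right_fields, keep_unmatched=True):
    """Join two {key: {field: value}} tables (hypothetical helper).

    keep_unmatched=True behaves like a left join: every key in `left`
    is kept and missing right-hand fields become empty strings.
    keep_unmatched=False behaves like an inner join: keys without a
    match in `right` are dropped.
    """
    joined = {}
    for key, left_row in left.items():
        if key not in right and not keep_unmatched:
            continue
        blanks = dict.fromkeys(right_fields, "")
        # Later mappings win, so real right-hand values replace the blanks
        joined[key] = {**left_row, **blanks, **right.get(key, {})}
    return joined


left = {"1": {"attr1": "True"}, "2": {"attr1": "False"}}
right = {"2": {"attr4": "python"}}

inner = join_on_key(left, right, ["attr4"], keep_unmatched=False)
left_join = join_on_key(left, right, ["attr4"])
print(inner)      # only id 2 survives
print(left_join)  # ids 1 and 2, with attr4 blank for id 1
```

For the question as asked, the second behaviour is the one that keeps the same number of attributes on every record.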

Use a dict of dicts, then update it. Like this (Python 2):

import csv
from collections import OrderedDict

with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    lines2 = list(reader)

with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    lines1 = list(reader)

dict1 = {row[0]: dict(zip(lines1[0][1:], row[1:])) for row in lines1[1:]}
dict2 = {row[0]: dict(zip(lines2[0][1:], row[1:])) for row in lines2[1:]}

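#merge
updatedDict = OrderedDict()
# Template with every attribute defaulting to an empty field
mergedAttrs = OrderedDict.fromkeys(lines1[0][1:] + lines2[0][1:], "")
for id, attrs in dict1.iteritems():
    d = mergedAttrs.copy()
    d.update(attrs)
    updatedDict[id] = d

for id, attrs in dict2.iteritems():
    # setdefault avoids a KeyError for ids that appear only in file2
    updatedDict.setdefault(id, mergedAttrs.copy()).update(attrs)

#out
with open('merged.csv', 'wb') as f: # python 2
    w = csv.writer(f)
    # ids are strings, so this sort is lexicographic ("10" sorts before "2")
    for id, rest in sorted(updatedDict.iteritems()):
        w.writerow([id] + rest.values())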
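
As a hedged Python 3 sketch of the same dict-of-dicts idea, the function below works on rows already parsed by csv.reader (header row first), pads every record from a template, and sorts ids numerically. The name merge_rows, the filler parameter, and the assumption that ids are integers are illustrative choices, not part of the answer above.

```python
def merge_rows(lines1, lines2, filler=""):
    """Merge two parsed CSV tables (header row first) on their first column."""
    header1, header2 = lines1[0], lines2[0]
    attr_names = header1[1:] + header2[1:]
    template = dict.fromkeys(attr_names, filler)

    merged = {}
    for header, lines in ((header1, lines1), (header2, lines2)):
        for row in lines[1:]:
            # setdefault creates a fully padded record the first time an id
            # appears, so ids found only in the second table are handled too
            record = merged.setdefault(row[0], dict(template))
            record.update(zip(header[1:], row[1:]))

    out = [[header1[0]] + attr_names]
    for row_id in sorted(merged, key=int):  # assumes integer ids
        out.append([row_id] + [merged[row_id][name] for name in attr_names])
    return out
```

The returned list of rows, header included, can be passed directly to csv.writer(f).writerows(...).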
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow