What is the most efficient way with Python to merge rows in a CSV which have a single duplicate field?

https://stackoverflow.com/questions/23156291

05-07-2023

Question

I have found somewhat similar questions, but the answers that I think could work are too complex for me to adapt to what I need. I could use some help figuring out how to accomplish the following in Python:

I have a CSV file which contains three columns of data. The first column contains values that are duplicated in other rows. I need to combine each set of such rows into a single row, along with specific data from columns two and three. The result should be another CSV.

In addition, for each set of rows with the same column-one value, there are a number of situations for the data in columns two and three that need to be combined. In other words, for the first instance of a column-one value, if the value in column two is not empty, grab it and place it in column two of a "final" row; otherwise, grab the value in column three and place it in column three of the "final" row. The rule I need to implement is: the first and last instances of a column-one value need to have whatever column two and three data they contain combined, keeping column-two data in column two and column-three data in column three. Also, no row in the source CSV ever has all three values filled.
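The rule above can be sketched for a single group of rows. This is only an illustration: the function name `combine_first_last` is hypothetical, and preferring the first row's value when both the first and last rows fill the same column is an assumption, since the question does not say what should happen in that case.

```python
def combine_first_last(rows):
    """Merge one group of rows that share the same column-one value.

    Only the first and last rows of the group are consulted. For columns
    two and three, the first row's value is used when it is non-empty,
    otherwise the last row's value (assumption: the first row wins if
    both are filled).
    """
    first, last = rows[0], rows[-1]
    merged = [first[0]]
    for col in (1, 2):
        merged.append(first[col] or last[col])
    return merged


group = [
    ["wp.xyz03.def02.01195.1", "wp03.xyz03-c01_lc08_m00", ""],
    ["wp.xyz03.def02.01195.1", "wp02.xyz03", ""],
    ["wp.xyz03.def02.01195.1", "", "wp01.def02"],
    ["wp.xyz03.def02.01195.1", "", "wp02.def02-c02_lc14_m00"],
]
print(",".join(combine_first_last(group)))
# wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00
```

The middle rows are ignored entirely, which matches the "first and last instance" wording of the rule.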

To better explain, here are examples of sets of rows in the source CSV that need to be combined:

Example 1: Here I have four rows with matching column-one data. As in all the examples, I need the result to be a row containing the column-one value followed by the values found in the first and last instances of that value.

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,
wp.xyz03.def02.01195.1,wp02.xyz03,
wp.xyz03.def02.01195.1,,wp01.def02
wp.xyz03.def02.01195.1,,wp02.def02-c02_lc14_m00

So the desired result for this group would be:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00

Example 2: Here I have three rows with matching column-one data. Again, I need the result to be a row containing the column-one value followed by the values found in the first and last instances of that value.

wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,
wp.atl21.lmn01.01193.2,wp02.atl21,
wp.atl21.lmn01.01193.2,,wp03.lmn01

So the desired result for this group would be:

wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01

Example 3: Here I have three rows with matching column-one data. Again, I need the result to be a row containing the column-one value followed by the values found in the first and last instances of that value. Note that in this example the first row has no value in column two; the desired value is in column three instead.

tp.ghi03.ghi05.02194.65,,tp05.ghi05:1
tp.ghi03.ghi05.02194.65,tp05.ghi03:2,
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,

So the desired result for this group would be:

tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1

Putting it all together:

This:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,
wp.xyz03.def02.01195.1,wp02.xyz03,
wp.xyz03.def02.01195.1,,wp01.def02
wp.xyz03.def02.01195.1,,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,
wp.atl21.lmn01.01193.2,wp02.atl21,
wp.atl21.lmn01.01193.2,,wp03.lmn01
tp.ghi03.ghi05.02194.65,,tp05.ghi05:1
tp.ghi03.ghi05.02194.65,tp05.ghi03:2,
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,

Needs to turn into this:

wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1

I've tried a number of things to accomplish this, but I cannot achieve the desired result without quickly getting into very unfamiliar territory.

This is my original attempt. It cuts off some of the necessary values: once it has three values it writes the row out, and it never notices that another row for the same column-one value might follow:

reader = csv.reader(open('parse_lur_luraz_clean_temp.csv', 'r'), delimiter=',')
final = ['-','-','-']
parselur = ['-']
lur_a = ""
lur_z = ""
for row in reader:
    if row[0] != parselur[0]:
        final = ['-','-','-']
        if row[1] != '': lur_a = row[1]
        if row[2] != '': lur_z = row[2]
        parselur[0] = row[0]
    elif row[0] == parselur[0]:
        if row[1] == '':
            lur_a = row[1]
        elif row[1] != '':
            lur_a = row[1]
        if row[2] == '':
            lur_z = row[2]
        elif row[2] != '':
            lur_z = row[2]
        parselur[0] = row[0]
    if parselur[0] != '' and parselur[0] not in final: final[0] = parselur[0]
    if lur_a != '': 
        if final[1] == '-' or '_lc' not in final[1]: final[1] = lur_a
        lur_a = ''
    if lur_z != '': 
        if final[2] == '-' or '_lc' not in final[2]: final[2] = lur_z
        lur_z = ''
    if len(final) == 3 and '-' not in final:
        fd = open('final_alu_nsn_temp.csv','a')
        writer = csv.writer(fd)
        writer.writerow((final))
        fd.close()
        final = ['-','-','-']
    else:
        parselur[0] = row[0]
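One way to avoid the cut-off described above is to postpone writing until the column-one value changes, so that the last row of each group has been seen before anything is emitted. The following is a minimal sketch of that idea, not a repair of the attempt itself. It assumes rows with the same column-one value are adjacent and that every row has three fields; `merge_adjacent` is a hypothetical name, and preferring the first row's value when both rows fill a column is an assumption.

```python
def merge_adjacent(rows):
    """Yield one merged row per run of adjacent rows sharing column one."""
    first = last = None
    for row in rows:
        if first is not None and row[0] != first[0]:
            # Column one changed: the previous group is complete, so emit it.
            yield [first[0], first[1] or last[1], first[2] or last[2]]
            first = None
        if first is None:
            # This row starts a new group.
            first = row
        last = row
    if first is not None:
        # Emit the final group, which no key change follows.
        yield [first[0], first[1] or last[1], first[2] or last[2]]


sample = [
    ["wp.atl21.lmn01.01193.2", "wp03.atl21-c06_lc14_m00", ""],
    ["wp.atl21.lmn01.01193.2", "wp02.atl21", ""],
    ["wp.atl21.lmn01.01193.2", "", "wp03.lmn01"],
    ["tp.ghi03.ghi05.02194.65", "", "tp05.ghi05:1"],
    ["tp.ghi03.ghi05.02194.65", "tp05.ghi03:2", ""],
    ["tp.ghi03.ghi05.02194.65", "tp05.ghi03-c06_lc11_m00", ""],
]
for merged in merge_adjacent(sample):
    print(",".join(merged))
```

Because the function accepts any iterable of rows, a `csv.reader` object could be passed in place of the `sample` list.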

Solution

Now's as good a time as any to learn about itertools.groupby, which groups consecutive rows that share a key:

import csv
from itertools import groupby

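# assuming Python 2 (the csv module expects binary-mode files there)
with open("source.csv", "rb") as fp_in, open("final.csv", "wb") as fp_out:
    reader = csv.reader(fp_in)
    writer = csv.writer(fp_out)
    # groupby yields one group per run of consecutive rows with the same column one
    grouped = groupby(reader, lambda x: x[0])
    for key, group in grouped:
        rows = list(group)
        # keep only the first and last row of the group
        rows = [rows[0], rows[-1]]
        # pair up column two of both rows, then column three
        columns = zip(*(r[1:] for r in rows))
        # the empty string sorts before any other string, so max() picks the filled value
        use_values = [max(c) for c in columns]
        new_row = [key] + use_values
        writer.writerow(new_row)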

produces

$ cat final.csv 
wp.xyz03.def02.01195.1,wp03.xyz03-c01_lc08_m00,wp02.def02-c02_lc14_m00
wp.atl21.lmn01.01193.2,wp03.atl21-c06_lc14_m00,wp03.lmn01
tp.ghi03.ghi05.02194.65,tp05.ghi03-c06_lc11_m00,tp05.ghi05:1
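The answer above targets Python 2. A rough Python 3 equivalent might look like the sketch below. The function name `merge_csv` and its parameters are hypothetical. Two things differ from the answer: the files are opened in text mode with `newline=""`, as the Python 3 `csv` documentation recommends, and `max()` is swapped for an explicit "first row's value, else last row's value" choice, which is an assumption about what should win when both rows fill the same column. The sketch also assumes the input has no blank lines.

```python
import csv
from itertools import groupby


def merge_csv(src_path, dst_path):
    """Write one merged row to dst_path per run of rows sharing column one."""
    with open(src_path, newline="") as fp_in, \
            open(dst_path, "w", newline="") as fp_out:
        reader = csv.reader(fp_in)
        writer = csv.writer(fp_out)
        # groupby only merges rows that are adjacent in the input.
        for key, group in groupby(reader, key=lambda row: row[0]):
            rows = list(group)
            first, last = rows[0], rows[-1]
            # zip pairs column two of both rows, then column three;
            # "a or b" falls back to the last row when the first is empty.
            merged = [a or b for a, b in zip(first[1:], last[1:])]
            writer.writerow([key] + merged)
```

It would be called as `merge_csv("source.csv", "final.csv")`. If rows with the same column-one value are not adjacent in the source file, `groupby` treats each run as a separate group, so the rows would need to be brought together first.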

Other tips

If I understand what you want to do, here is some pseudocode:

For each line in the file:
    Split by comma
    Add each section to a large list

Until the list is empty:
    For each value in the list:
        Write the value to the file, then write a comma
        Search the list and remove duplicate values

Does that seem like it? I can write you a Python program if this is what you intend.

Edit:

I wrote a program that follows the pseudocode above. It flattens the whole file into one list of values, removes the repeats, and writes what is left as a single comma-separated line:

# Read the whole input file
with open("Input.txt") as FileInput:
    EntireFile = FileInput.read()

# Turn line breaks into commas so that the last field of one line
# does not run into the first field of the next
EntireFile = EntireFile.replace("\r", "").replace("\n", ",")

# Split into a list
SplittedByComma = EntireFile.split(",")

# Go through the list and keep only the first occurrence of each value.
# Building a new list avoids removing items from a list while looping over it.
UniqueValues = []
for X in SplittedByComma:
    if X not in UniqueValues:
        UniqueValues.append(X)

# Join the values with commas (join adds no trailing separator)
Output = ",".join(UniqueValues)

# Write the output; the with block closes the file so it saves
with open("Output.txt", "w") as FileOutput:
    FileOutput.write(Output)

Feel free to ask if you have any questions

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow