Question

I have created two CSV lists. One is the original CSV file; the other is a de-duped version of that file. I have read each into a list, and for all intents and purposes they are in the same format. Each list item is a string.

I am trying to use a list comprehension to find out which items were deleted by the dedupe. The length of the original is 16939 and the length of the de-duped list is 15368. That's a difference of 1571, but my list comprehension's length is 368. Ideas?

# read in the de-duplicated account list
deduped = open('account_de_ex.csv', 'r')
deduped_data = deduped.read()
deduped.close()
deduped = deduped_data.split("\r")

#read in file with just the account names from the full account list
account_names = open('account_names.csv', 'r')
account_data = account_names.read()
account_names.close()
account_names = account_data.split("\r")

# Get all the accounts that were deleted in the dedupe - i.e. get the duplicate accounts
dupes = [ele for ele in account_names if ele not in deduped]

Edit: In response to some notes in the comments, here is a test of my list comp and of the lists themselves. Pretty much the same difference, 20 or so off, not the 1500 I need. Thanks!

print len(deduped)
deduped = set(deduped)
print len(deduped)

print len(account_names)
account_names = set(account_names)
print len(account_names)


15368
15368
16939
15387

Solution

Try running the code below and see what it reports. It requires Python 2.7 or newer for collections.Counter, but you could easily write your own counter code, or copy my example code from another answer: Python : List of dict, if exists increment a dict value, if not append a new dict
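
If you are stuck before 2.7, a dict-based stand-in is only a few lines (a minimal sketch; the helper name count_rows is mine, not from the linked answer):

def count_rows(rows):
    # plain-dict replacement for collections.Counter
    counts = {}
    for row in rows:
        counts[row] = counts.get(row, 0) + 1
    return counts

With that out of the way, here is the report: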

from collections import Counter

# read in original records (note: if your files use bare "\r" line endings,
# open them with mode "rU" on Python 2 so they actually split into lines)
with open("account_names.csv", "rt") as f:
    rows = sorted(line.strip() for line in f)

# count how many times each row appears
counts = Counter(rows)

# get a list of tuples of (count, row) that only includes count > 1
dups = [(count, row) for row, count in counts.items() if count > 1]
dup_count = sum(count-1 for count in counts.values() if count > 1)

# sort the list from largest number of dups to least
dups.sort(reverse=True)

# print a report showing how many dups
for count, row in dups:
    print("{}\t{}".format(count, row))

# get de-duped list
unique_rows = sorted(counts)

# read in de-duped list
with open("account_de_ex.csv", "rt") as f:
    de_duped = sorted(line.strip() for line in f)

print("List lengths: rows {}, uniques {}/de_duped {}, result {}".format(
        len(rows), len(unique_rows), len(de_duped), len(de_duped) + dup_count))

# lists should match since we sorted both lists
if unique_rows == de_duped:
    print("perfect match!")
else:
    # if lists don't match, find out what is going on
    uniques_set = set(unique_rows)
    deduped_set = set(de_duped)

    # find intersection of the two sets
    x = uniques_set.intersection(deduped_set)

    # print differences
    if x != uniques_set:
        print("Rows in original that are not in deduped:\n{}".format(sorted(uniques_set - x)))
    if x != deduped_set:
        print("Rows in deduped that are not in original:\n{}".format(sorted(deduped_set - x)))

Other tips

To see what you really have in each list, you can proceed by construction:

If you only had unique elements:

deduped = range(15368)
account_names2 = range(15387)
dupes2 = [ele for ele in account_names2 if ele not in deduped]  # len(dupes2) is 19

However, because you have repetitions of both removed and not-removed elements, you actually end up with something like:

account_names = account_names2 + dupes2*18 + dupes2[:7] + account_names2[:1571 - 368]
# 15387 + 19*18 + 7 + 1203 == 16939 elements in total
dupes = [ele for ele in account_names if ele not in deduped]  # dupes has 368 elements
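
Putting the construction together as a runnable check (the same synthetic lists as above; the set() is only there to keep the membership tests fast, it does not change the result):

deduped = list(range(15368))
account_names2 = list(range(15387))
deduped_set = set(deduped)  # set lookup; with the plain list the loops crawl

dupes2 = [ele for ele in account_names2 if ele not in deduped_set]  # 19 elements
account_names = account_names2 + dupes2*18 + dupes2[:7] + account_names2[:1571 - 368]
dupes = [ele for ele in account_names if ele not in deduped_set]

print(len(account_names))  # 16939, the length of the original list
print(len(dupes))          # 368, the length reported by the question's list comp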
Licensed under: CC-BY-SA with attribution