Python - Converting long list of addresses into list of strings and intersection of lists

https://stackoverflow.com/questions/23450986

15-07-2023

Question

I have two very long text files (thousands of e-mail addresses, one per line) and I'm looking for a way to compare them and output the addresses that appear in the first file or in the second file, but not in both (something like (A ∪ B) \ (A ∩ B) in set theory). It would be pretty easy if I could use lists of strings as input, like this:

input1=['address1','address2',...,'addressn']

but since my text files are long, with one address per line, I would have to put quotes around each address by hand. So I tried using a single string with all the addresses separated by spaces as input, and then converting it into a list of strings. This is what I've come up with:

import numpy as np
from StringIO import StringIO

def conv(data):
    array1=np.genfromtxt(StringIO(data),dtype="|S50")
    lista1=[]
    for el in array1:
        lista1.append(el)
    return lista1

input1='address1 address2 ... addressn'

And this is what I get when I call the function

>conv(input1)
>['address1', 'address2', 'addressn']

It works, but I have a problem: the input needs to be on a single line, so I cannot copy my addresses from the text file and paste them into a string, as I would get something like

input1="Davide
...:Michele
...:Giorgio
...:Paolo"

File "<ipython-input-4-6d70053fb94e>", line 1
  input1="Davide
             ^
SyntaxError: EOL while scanning string literal

How can I deal with this issue? Any suggestions to improve the code would be much appreciated. I know almost nothing about the StringIO module; I came across it for the first time today, and I'm sure it's possible to write a much more efficient program than mine. This is the whole program, by the way:

def scan(data1,data2): #Strings
    array1=np.genfromtxt(StringIO(data1),dtype="|S50")
    array2=np.genfromtxt(StringIO(data2),dtype="|S50")
    lista1=[]
    lista2=[]
    for el in array1:
        lista1.append(el)
    for el in array2:
        lista2.append(el) #lista1 and lista2 are lists containing strings
    num1,num2=len(lista1),len(lista2)
    shared=[]
    for el in lista1:
        if el in lista2:
            shared.append(el) #shared is the intersection of lista1 and lista2
    if len(shared)==0:
        print 'No shared elements'
        return lista1+lista2
    else:
        for el in shared:
            n1=lista1.count(el)
            for i in range(n1):
                lista1.remove(el) #Removes from lista1 the elements shared with lista2
            n2=lista2.count(el)   #as many times as they appear
            for j in range(n2):
                lista2.remove(el) #Removes from lista2 the elements shared with lista1
    result=lista1+lista2          #as many times as they appear
    print 'Addresses list 1:',num1
    print 'Addresses list 2:',num2
    print 'Useful Addresses:',len(list(set(result)))
    return (list(set(result)))

and this is an example of how it works:

data1="Davide John Kate Mary Susan"
data2="John Alice Clara Kate John Alex"
scan(data1,data2)
>Addresses list 1: 5
>Addresses list 2: 6
>Useful Addresses: 6
>['Alex', 'Susan', 'Clara', 'Alice', 'Mary', 'Davide']

Thanks for help :)

Solution

Use triple quotes around a string spanning multiple lines:

input1="""Davide
...:Michele
...:Giorgio
...:Paolo"""

They will then be separated by newlines ("\n"), so you could use input1.split('\n') to turn it into a list.
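As a minimal sketch of that splitting step (assuming Python 3; the variable names pasted, raw and names are illustrative only), note that split('\n') keeps empty strings for blank lines and for a trailing newline, which pasted text often has. Filtering them out avoids an empty "address" in the list:

```python
# Hypothetical pasted text with a blank line and a trailing newline.
pasted = """Davide
Michele

Giorgio
Paolo
"""

# split('\n') keeps empty strings for the blank line and the trailing newline.
raw = pasted.split('\n')

# splitlines() breaks on line boundaries; strip() removes stray whitespace,
# and the condition drops lines that are empty after stripping.
names = [line.strip() for line in pasted.splitlines() if line.strip()]

print(raw)    # ['Davide', 'Michele', '', 'Giorgio', 'Paolo', '']
print(names)  # ['Davide', 'Michele', 'Giorgio', 'Paolo']
```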

Using set objects, your operation becomes pretty simple. To get the elements in s1 that are not in s2, we can simply do s1 - s2. Union is just | and intersection is just &, so all told we have:

s1 = set(input1.split('\n'))
s2 = set(input2.split('\n'))
adresses_in_only_one_file = (s1 | s2) - (s1 & s2)
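Since the addresses already live in two text files, the copy-and-paste step can probably be skipped altogether. The following is a sketch under assumptions, not part of the original answer: it assumes Python 3 and one address per line, and the helper names read_addresses and addresses_in_only_one_file are made up for illustration.

```python
def read_addresses(path):
    """Return the set of non-empty, whitespace-stripped lines in a text file."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def addresses_in_only_one_file(path1, path2):
    """Return the addresses that appear in exactly one of the two files."""
    s1 = read_addresses(path1)
    s2 = read_addresses(path2)
    # Same expression as above: everything in either set, minus what is in both.
    return (s1 | s2) - (s1 & s2)
```

It could be called as addresses_in_only_one_file('list1.txt', 'list2.txt'), where both file names are placeholders for your own files.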

Other tips

Expanding upon @irh's answer, you can use sets to get the symmetric difference between the two lists (elements that are in list1 or list2 but not in both):

list1 = ['address1', 'address2', 'address3']

list2 = ['address5', 'address4', 'address3']

result = list(set(list1) ^ set(list2))

>>> print result
['address1', 'address2', 'address4', 'address5']     #note result might be jumbled but that shouldn't matter
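To illustrate that the ^ operator, the symmetric_difference method and the union-minus-intersection expression give the same set, here is a small sketch (assuming Python 3; the names by_operator, by_method and by_union_minus_intersection are illustrative). Sorting the result is one way to get a predictable order, since sets are unordered:

```python
list1 = ['address1', 'address2', 'address3']
list2 = ['address5', 'address4', 'address3']
s1, s2 = set(list1), set(list2)

by_operator = s1 ^ s2                                # operator needs two sets
by_method = s1.symmetric_difference(list2)           # method accepts any iterable
by_union_minus_intersection = (s1 | s2) - (s1 & s2)  # spelled out

# All three expressions produce the same set.
print(by_operator == by_method == by_union_minus_intersection)  # True

# sorted() returns a list in a stable, readable order.
result = sorted(by_operator)
print(result)  # ['address1', 'address2', 'address4', 'address5']
```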

shared =[]
for el in lista1:
    if el in lista2:
        shared.append(el) #shared is the intersection of lista1 and lista2

In [10]: lista1=[1,2,3,4,5,6,7,8,9]

In [11]: lista2=[1,2,3,10,11,12,13]

In [12]: lista1=set(lista1)

In [13]: shared = lista1.intersection(lista2) # same as your loop above

In [14]: shared
Out[14]: {1, 2, 3}

If you want a list, just use list(lista1.intersection(lista2)).

for el in shared:
    n1=lista1.count(el)
    for i in range(n1):
        lista1.remove(el) #Removes from lista1 the elements shared with lista2
    n2=lista2.count(el)   #as many times as they appear
    for j in range(n2):
        lista2.remove(el)
result=lista1+lista2

         lista1=set(lista1) 
In [15]: list(lista1.symmetric_difference(lista2))
Out[15]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13] # same as above.
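Putting these pieces together, the question's scan function might shrink to something like the sketch below. It assumes Python 3 (print as a function) and whitespace-separated input strings as in the question, so str.split() stands in for the numpy/StringIO conversion. One behavioural difference to be aware of: when nothing is shared, the original returns lista1+lista2 with duplicates intact, whereas this version always returns unique addresses.

```python
def scan(data1, data2):
    """Set-based sketch of the question's scan(); inputs are whitespace-separated strings."""
    lista1 = data1.split()
    lista2 = data2.split()
    # Addresses that occur in exactly one of the two lists, duplicates collapsed.
    result = set(lista1).symmetric_difference(lista2)
    print('Addresses list 1:', len(lista1))
    print('Addresses list 2:', len(lista2))
    print('Useful Addresses:', len(result))
    return sorted(result)


data1 = "Davide John Kate Mary Susan"
data2 = "John Alice Clara Kate John Alex"
print(scan(data1, data2))
# Addresses list 1: 5
# Addresses list 2: 6
# Useful Addresses: 6
# ['Alex', 'Alice', 'Clara', 'Davide', 'Mary', 'Susan']
```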

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow