find difference between lists and append difference to lists, but for 40 different lists - python

StackOverflow https://stackoverflow.com/questions/18146660

  •  24-06-2022

Question

Hi, it's difficult to explain this properly in the title, so let me start by explaining my data. I have 40 lists stored within a list, in a form such as this:

data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70]]
data[1] = [[value2,40],[value1 value2 value3,90]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20]]
   .
   .
   .

Now I am expecting an output such as this:

data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70],[value2,0],[value1 value2,0]]
data[1] = [[value2,40],[value1 value2 value3,90],[value1,0],[value1 value3,0],[value2 value3,0],[value1 value2,0]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20],[value1 value2 value3,0],[value2 value3,0],[value2,0]]    

I know this is a bit complicated to read, but I wanted to make sure a good demo of the data is there. Basically, every list needs to contain every combination of values that appears in any of the lists. If a combination isn't already present in a given list, it is added with a frequency (the second field) of 0.

Thanks for any help, please bear in mind this is the intersection of 40 different lists and thus needs to be fast and efficient. I'm not sure how best to do this...

EDIT: I also don't know all the 'values', I have just written 3 different values here (value1, value2, value3) for simplicity. In my project I have no idea what the values are or how many different ones there are (I know there are at least a few thousand)

EDIT 2: Here is some real input data, I don't have real output data but I will try and work it out:

data[0] = [['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']]


data[1] = [['syslog_priority:Info', '100'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']]


data[2] = [['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']]

Solution 2

Well, given your comments, I would use sets, as already suggested.

First, loop through your list to build a set of every possible string:

possible_strings = set()
for row in mydata:
    for item in row:
        # item[0] is the string, item[1] is its frequency
        possible_strings.add(item[0])

So possible_strings now holds every string that appears anywhere in your data.

Next, check each row against that set. Any string that the row does not contain has to be appended to the row with a frequency of 0:

my_new_data = []
for row in mydata:
    # strings already present in this row
    row_strings = set(item[0] for item in row)
    # strings seen elsewhere in the data but not in this row
    missing_strings = possible_strings - row_strings
    for item in missing_strings:
        # append each missing string with a frequency of 0
        row.append([item, 0])
    row.sort()  # optional; note that row is modified in place
    my_new_data.append(row)

The reason I would use sets is that the set difference gives you the missing strings directly, without a manual lookup loop, and the items are strings, so they can be members of a set. There are ways to speed this up (condense the code), but I like to lay things out so I can see clearly what I am doing. Unless I made a typo (and I have already corrected 3), this code worked on my computer.

Here are the unsorted results

newrow*************
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769']
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
newrow*************
['syslog_priority:Info', '100']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]
newrow*************
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]
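The two loops in this answer can be condensed into one helper. The following is only a sketch under a few assumptions: the name fill_missing and the sample data are made up for illustration, each item is a [string, frequency] pair, and the input should be left unmodified (the loops above change the rows in place instead).

```python
def fill_missing(data):
    """Return a copy of data in which every row contains every string seen anywhere."""
    # Pass 1: collect every distinct string (first field) across all rows.
    all_keys = set()
    for row in data:
        all_keys.update(item[0] for item in row)

    # Pass 2: copy each row and append [string, 0] for each string it lacks.
    filled = []
    for row in data:
        present = {item[0] for item in row}
        new_row = [list(item) for item in row]
        # sorted() is only used to make the order of the added items predictable.
        new_row.extend([key, 0] for key in sorted(all_keys - present))
        filled.append(new_row)
    return filled


# Hypothetical sample data in the same shape as the question's lists.
sample = [
    [["a b c", 80], ["a", 90]],
    [["b", 40], ["a b c", 90]],
]
result = fill_missing(sample)
for row in result:
    print(row)
```

Because membership tests on a set take constant time on average, the work grows roughly with the number of rows times the number of distinct strings, which should be manageable for 40 lists.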

Other suggestions

Sounds like you could use sets:

>>> {1, 2, 3, 4, 5} & {2, 3, 4, 5, 6, 7} & {3, 4, 5}
{3, 4, 5}

& is the intersection operator for sets. To get a set from a list, use set(mylist); this also removes duplicate elements.
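For more than a handful of lists, chaining & by hand gets unwieldy. As a small sketch (the variable names and sample numbers are made up for illustration), set.intersection accepts any number of sets, so all lists can be intersected in one call:

```python
# Hypothetical sample data; inner lists may contain duplicates.
lists = [[1, 2, 3, 4, 5, 5], [2, 3, 4, 5, 6, 7], [3, 4, 5]]

# Convert each list to a set (dropping duplicates), then intersect them all.
# Note: this needs at least one list, otherwise set.intersection raises TypeError.
common = set.intersection(*(set(lst) for lst in lists))
print(common)
```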

Edit: In the light of your comments, it seems what you need is some sort of union (the union operator being |), not an intersection. Here is a function that does what you wanted in your comment for 2 lists of lists:

def function(first, second):
    first_set = {tuple(i) for i in first}
    second_set = {tuple(i) for i in second}
    return (first_set | {(i[0], 0) for i in second_set},
            second_set | {(i[0], 0) for i in first_set})

>>> a = [(1,60),(3,90)]
>>> b = [(2,30),(4,50)]
>>> x, y = function(a, b)
>>> print(x)
{(2, 0), (3, 90), (1, 60), (4, 0)}
>>> print(y)
{(3, 0), (4, 50), (1, 0), (2, 30)}
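One thing to watch with the function above: if the same key appears in both lists, the result contains both the real pair and a zero pair for that key. A possible variant, sketched here under the assumption that every item is a (key, frequency) pair and with the hypothetical name pad_pair, adds zeros only for keys that are actually absent:

```python
def pad_pair(first, second):
    """Pad two lists of (key, frequency) pairs with zeros for keys only the other has."""
    first_keys = {key for key, _ in first}
    second_keys = {key for key, _ in second}
    # Only keys missing from a list get a (key, 0) entry in that list's result.
    padded_first = {tuple(i) for i in first} | {(k, 0) for k in second_keys - first_keys}
    padded_second = {tuple(i) for i in second} | {(k, 0) for k in first_keys - second_keys}
    return padded_first, padded_second


# Hypothetical data where key 1 is present in both lists.
a = [(1, 60), (3, 90)]
b = [(1, 30), (4, 50)]
x, y = pad_pair(a, b)
print(x)
print(y)
```

Here key 1 keeps its real frequency on both sides, and only keys 3 and 4 are padded with a zero.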

It sounds like you want dictionaries, and then you want to compare the keys, which are lists of "values" as you have them, but not the dictionary values, which are frequencies. Restructuring your data as dictionaries isn't necessary, of course, but it might make more sense.

Now, for an actual answer: make a new list/dictionary just to put together one full list of all the keys/"lists of values". Then, go through a second time and add the elements that are missing to the lists that are missing them. Each outer loop runs 40 times, once per list. The first outer loop is O(n**2), where n is the total number of unique keys, although I imagine the average case will be less than n**2. The second outer loop is O(n**2) as well.
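The two passes described above could look roughly like this. It is a sketch, not a definitive implementation: the helper names to_dicts and pad_with_zeros and the sample data are invented for illustration, and each item is assumed to be a [key, frequency] pair.

```python
def to_dicts(data):
    """Restructure each list of [key, frequency] pairs as a dictionary."""
    return [{key: freq for key, freq in row} for row in data]


def pad_with_zeros(dicts):
    """Give every dictionary every key, using 0 where the key was missing."""
    # Pass 1: build one master set of all keys.
    all_keys = set()
    for d in dicts:
        all_keys.update(d)
    # Pass 2: add the missing keys; existing frequencies are left untouched.
    for d in dicts:
        for key in all_keys:
            d.setdefault(key, 0)
    return dicts


# Hypothetical sample data.
data = [[["value1", 90]], [["value2", 40]]]
dicts = pad_with_zeros(to_dicts(data))
print(dicts)
```

Since sets and dictionaries are hash-based, each membership test is constant time on average, so this version may well do less work than the quadratic estimate given for plain lists.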

I hope that's not too brute forcey. At least it's better than comparing data[n] to data[n+m] for n 0-40... That'd be 40**2 for the outer loops... which is still a constant, but, obviously a bigger one than 80.

Correct me if I'm wrong, but I think the best solution to this involves a dictionary for each desired output, and a master set of keys. A set will basically store every value without allowing for duplicates. With your above example I would do this:

master_set = set()
for current_list in list_of_lists:
    # the right-hand side must be a set; |= with a list raises TypeError
    master_set |= {entry[0] for entry in current_list}

Where |= is effectively the union operator for sets.

Once you have that set, you're looking to construct a dictionary for each entry that either contains the relevant value, or a zero. First I would construct a dictionary, then I would just add results for absent items.

full_dictionary = {}
for entry in master_set:
    # use the frequency if current_list has this entry, otherwise 0
    full_dictionary[entry] = next(
        (thing[1] for thing in current_list if thing[0] == entry), 0)

And then just generate the full dictionary for each list you've got.
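Putting the steps of this answer together for every list might look like the sketch below. The function name full_dictionaries and the sample data are assumptions made for illustration; a per-list lookup dictionary with .get(key, 0) is used here so that each entry does not require a scan of the whole list.

```python
def full_dictionaries(list_of_lists):
    """Build one dictionary per list, with 0 for keys the list does not contain."""
    # Master set of every key seen in any list.
    master_set = set()
    for current_list in list_of_lists:
        master_set |= {entry[0] for entry in current_list}

    dictionaries = []
    for current_list in list_of_lists:
        lookup = {entry[0]: entry[1] for entry in current_list}
        # .get returns the stored frequency, or 0 when the key is absent.
        dictionaries.append({key: lookup.get(key, 0) for key in master_set})
    return dictionaries


# Hypothetical sample data in the question's [key, frequency] form.
list_of_lists = [
    [["value1", 90], ["value1 value3", 60]],
    [["value2", 40]],
]
dictionaries = full_dictionaries(list_of_lists)

# Convert back to [key, frequency] pairs if the nested-list form is needed.
as_lists = [[[key, freq] for key, freq in d.items()] for d in dictionaries]
```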

Alternately, if you have choice over how your data is coming in, or just want to restructure it reasonably I would suggest using a dictionary comprehension, which would just make this whole thing simpler:

new_dict = {value[0]: value[1] for value in current_list}

I'm also having a little trouble interpreting the question, but let me know if that's not accurate and I can revise it.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow