Python: eliminate duplicate nested lists

https://stackoverflow.com/questions/19717099

02-07-2022
|

Pregunta

I am building a tagging model for natural language processing. Initially, the words of a sentence are tagged as a part of speech (like NN for noun), then the rules are applied that divide them into trees which are represented as nested lists. This process iterates many times until you get one node at the top level. I have a master list of all potential trees and I need to eliminate duplicate trees or the whole thing blows up in memory. Here is a small sample of what a list consists of. I need to make sure that each list in the list is unique as each iteration creates many branches.

[[('NP', [('PRP', 'I')]), ('VBD', 'ate'), ('DT', 'a'), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')]
[('PRP', 'I'), ('VP', [('VBD', 'ate')]), ('DT', 'a'), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')]
[('PRP', 'I'), ('VBD', 'ate'), ('NP', [('DT', 'a')]), ('NN', 'steak'), ('IN', 'with'), ('DT', 'a'), ('NN', 'knife'), ('.', '.')]
...]

I thought of using a set but lists aren't hashable. I tried using recursion and it runs out of memory. I thought about converting the lists to strings, using the string as a dictionary key and the list as the value, then iterating over and turning it back into a list again (or keep it as a dictionary?). Does anyone have a less hackish solution? I'm relatively new to Python so please provide an explanation of how your solution works.

I should clarify: the nested lists can be indefinitely deep. The tree structure is not known ahead of time but is built on the fly. Trying to build something like this -http://jos.oxfordjournals.org/content/25/4/345/F23.large.jpg but in the form of a nested list.

Solución 2

Thanks to sdasdadas for pointing out the duplicate. I was able to (more or less) solve it by creating this function:

 def list_to_tuple(a_list):
    temp_list = []
    for item in a_list:
        if isinstance(item, list) or isinstance(item, tuple):
            temp_list.append(list_to_tuple(item))
        else:
            temp_list.append(item)
    return tuple(temp_list)

It takes an indefinitely deep nested list or tuples and gives back the equivalent thing in tuples. In another function, I pass this through a set to ensure unique values then it can go back to a list or whatever.

Thanks for your answers.

Otros consejos

You can hash tuples. This might solve some of your problem (although I am a little bit confused about your problem) This would let you store all existing tuples in a set (which is kind of a hashable list). That way when you try to create each new tupple you just call

if my_tuple in my_tuple_set:
    # the best way I could find to access the item in a tuple
    stored_tuple = my_tuple_set.pop(my_tuple) 
    my_tupple_set.add(stored_tuple)
    my_tuple = stored_tuple   # reset pointer to stored data

Kind of interesting having to deal with memory issues in python!

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow