How to change the items in a list of sublists based on certain rules and conditions of those sublists?

StackOverflow https://stackoverflow.com/questions/23350348

  •  11-07-2023

Question

I have a list of sublists, each made up of three items. Only the first and last items of each sublist matter, because I want to change the last item based on how often each last item occurs with the same first item across the list.

This is the list I have:

lst = [['A','abc','id1'],['A','def','id2'],['A','ghi','id1'],['A','ijk','id1'],['A','lmn','id2'],['B','abc','id3'],['B','def','id3'],['B','ghi','id3'],['B','ijk','id3'],['B','lmn','id'],['C','xyz','id6'],['C','lmn','id6'],['C','aaa','id5']]

For example, A appears most often with id1 rather than id2, so I'd like to replace every id2 that appears with A with id1. For B, id3 is the most common, so I'd like to replace anything else with id3, which means replacing 'id' with 'id3' only for B. For C, I'd like to replace 'id5' with 'id6', because 'id6' appears most often with C.

desired_list = [['A','abc','id1'],['A','def','id1'],['A','ghi','id1'],['A','ijk','id1'],['A','lmn','id1'],['B','abc','id3'],['B','def','id3'],['B','ghi','id3'],['B','ijk','id3'],['B','lmn','id3'],['C','xyz','id6'],['C','lmn','id6'],['C','aaa','id6']]

I should also mention that this will be done on a very large list, so speed and efficiency are important.

Solution

Treating this as straightforward data processing of the requirement above, I came up with the following three-step algorithm.

First sweep: collect frequency information for every key (i.e. 'A', 'B', 'C'):

def generate_frequency_table(lst):
    assoc = {}        # e.g. {'A': {'id1': 3, 'id2': 2}}
    # Iterate over the argument `lst`, not the builtin `list`
    for key, unused, val in lst:
        # Fetch the per-key counts, creating an empty dict on first sight
        freqs = assoc.setdefault(key, {})
        # Count this value, starting from 0 if it is new for this key
        freqs[val] = freqs.get(val, 0) + 1
    return assoc

>>> generate_frequency_table(lst)
{'A': {'id2': 2, 'id1': 3}, 'C': {'id6': 2, 'id5': 1}, 'B': {'id3': 4, 'id': 1}}

Then, find the most frequent value for each key (e.g. {'A': 'id1'}):

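def generate_max_assoc(assoc):
    maxtable = {}    # e.g. {'A': 'id1'}; named to avoid shadowing the builtin max()
    # .items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
    for key, freqs in assoc.items():
        curmax = ('', 0)    # (value, count) of the best candidate so far
        for val, freq in freqs.items():
            if freq > curmax[1]:
                curmax = (val, freq)
        maxtable[key] = curmax[0]
    return maxtable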

>>> maxtable = generate_max_assoc(generate_frequency_table(lst))
>>> print(maxtable)
{'A': 'id1', 'C': 'id6', 'B': 'id3'}

Finally, iterate through the original list and replace values using the table above:

>>> newlst = [[key, mid, maxtable[key]] for key, mid, _ in lst]
>>> print(newlst)
[['A', 'abc', 'id1'], ['A', 'def', 'id1'], ['A', 'ghi', 'id1'], ['A', 'ijk', 'id1'], ['A', 'lmn', 'id1'], ['B', 'abc', 'id3'], ['B', 'def', 'id3'], ['B', 'ghi', 'id3'], ['B', 'ijk', 'id3'], ['B', 'lmn', 'id3'], ['C', 'xyz', 'id6'], ['C', 'lmn', 'id6'], ['C', 'aaa', 'id6']]
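
For comparison, the three steps above can be sketched in one function using the standard library's collections.Counter and defaultdict. This is only an illustrative variant, not part of the original answer: the name replace_with_most_common and the shortened sample list are made up for the example, and ties are resolved by whatever order Counter.most_common returns equal counts in.

```python
from collections import Counter, defaultdict


def replace_with_most_common(rows):
    """Return new rows whose last item is the most frequent last item
    seen for that row's first item (hypothetical helper name)."""
    # Step 1: count last-item frequencies per key.
    counts = defaultdict(Counter)
    for key, _, val in rows:
        counts[key][val] += 1

    # Step 2: pick the most frequent value for each key.
    winners = {key: c.most_common(1)[0][0] for key, c in counts.items()}

    # Step 3: rebuild every row with the winning value for its key.
    return [[key, mid, winners[key]] for key, mid, _ in rows]


# Shortened sample, assumed for illustration only.
sample = [['A', 'abc', 'id1'], ['A', 'def', 'id2'], ['A', 'ghi', 'id1'],
          ['C', 'xyz', 'id6'], ['C', 'aaa', 'id5'], ['C', 'lmn', 'id6']]
result = replace_with_most_common(sample)
print(result)
```

Like the version above, this makes two passes over the list and leaves the input untouched; whether it is faster on a very large list would need to be measured.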

OTHER TIPS

This is pretty much the same solution as supplied by Santa, but I've combined a few steps into one, as we can scan for the maximum value while we are collecting the frequencies:

def fix_by_frequency(triple_list):
    freq = {}

    for key, _, value in triple_list:
        # Get existing data
        data = freq[key] = \
            freq.get(key, {'max_value': value, 'max_count': 1, 'counts': {}})

        # Increment the count
        count = data['counts'][value] = data['counts'].get(value, 0) + 1

        # Update the most frequently seen
        if count > data['max_count']:
            data['max_value'], data['max_count'] = value, count

    # Use the maximums to map the list
    return [[key, mid, freq[key]['max_value']] for key, mid, _ in triple_list]

This has been optimised a bit for readability (I think, be nice!) rather than raw speed. For example, you might want to avoid writing back to the dict when you don't need to, or maintain a separate max dict to avoid two key lookups in the list comprehension at the end.
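
A rough sketch of that second idea, under my own assumptions about how it might look: the maximums live in their own flat dicts, so the final list comprehension does a single lookup per row. The function name fix_by_frequency_flat and the (key, value) tuple keys are hypothetical choices for this example, and any speed gain is unmeasured.

```python
def fix_by_frequency_flat(triple_list):
    counts = {}      # (key, value) -> number of times seen so far
    max_count = {}   # key -> highest count seen so far for that key
    max_value = {}   # key -> value that first reached that highest count

    for key, _, value in triple_list:
        pair = (key, value)
        count = counts[pair] = counts.get(pair, 0) + 1
        # Touch the max tables only when a new highest count is reached.
        if count > max_count.get(key, 0):
            max_count[key] = count
            max_value[key] = value

    # One dict lookup per row instead of freq[key]['max_value'].
    return [[key, mid, max_value[key]] for key, mid, _ in triple_list]


lst = [['A', 'abc', 'id1'], ['A', 'def', 'id2'], ['A', 'ghi', 'id1'],
       ['A', 'ijk', 'id1'], ['A', 'lmn', 'id2'], ['B', 'abc', 'id3'],
       ['B', 'def', 'id3'], ['B', 'ghi', 'id3'], ['B', 'ijk', 'id3'],
       ['B', 'lmn', 'id'], ['C', 'xyz', 'id6'], ['C', 'lmn', 'id6'],
       ['C', 'aaa', 'id5']]
fixed = fix_by_frequency_flat(lst)
```

As in the function above, ties go to whichever value reaches the highest count first.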

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow