Fastest nested loops over a single list (with or without removing elements)
21-09-2019
Question
I am looking for advice about the fastest way to traverse a single list with two nested loops, avoiding len(list)^2 comparisons and avoiding duplicate files across groups.
More precisely: I have a list of 'file' objects, each of which has a timestamp. I want to group the files by their timestamp and a time offset. E.g. starting from a file X, I want to create a group with all the files whose timestamp < (timestamp(X) + offset).
For this, I did:
for file_a in list:
    temp_group = group()
    temp_group.add(file_a)
    list.remove(file_a)
    for file_b in list:
        if file_b.timestamp < (file_a.timestamp + offset):
            temp_group.add(file_b)
            list.remove(file_b)
    groups.add(temp_group)
(ok, the code is more complicated, but this is the main idea)
This obviously doesn't work, because I am modifying the list during the loops, and strange things happen :)
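A minimal, self-contained illustration of the "strange things" (the names here are just for the demo): removing the current element shifts its successors one position to the left, so the loop's internal index skips whichever element moves into the freed slot.

```python
# Removing items from a list while iterating over it skips elements:
# after 1 is removed, 2 shifts into index 0, but the loop moves on to index 1.
nums = [1, 2, 3, 4]
visited = []
for n in nums:
    visited.append(n)
    nums.remove(n)

print(visited)  # [1, 3] - every other element was skipped
print(nums)     # [2, 4] - and those were never removed either
```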
I thought I had to use a copy of 'list' for the loops, but this doesn't work either:
for file_a in list[:]:
    temp_group = group()
    temp_group.add(file_a)
    list.remove(file_a)
    for file_b in list[:]:
        if file_b.timestamp < (file_a.timestamp + offset):
            temp_group.add(file_b)
            list.remove(file_b)
    groups.add(temp_group)
Well... I know I can do this without removing elements from the list, but then I need to mark the ones that were already 'processed' and check that mark on every iteration, which is a speed penalty.
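For comparison, that "mark as processed" variant can be sketched like this (a hypothetical helper, working on plain timestamps for brevity; the per-element check is an O(1) set lookup, but every pair is still visited, so it remains O(n^2)):

```python
# Instead of removing grouped elements, remember their indices in a set.
def group_by_offset(timestamps, offset):
    processed = set()
    groups = []
    for i, ts_a in enumerate(timestamps):
        if i in processed:
            continue  # already assigned to an earlier group
        processed.add(i)
        grp = [ts_a]
        for j in range(i + 1, len(timestamps)):
            if j not in processed and timestamps[j] < ts_a + offset:
                processed.add(j)
                grp.append(timestamps[j])
        groups.append(grp)
    return groups

print(group_by_offset([1, 2, 10, 11, 30], 5))  # [[1, 2], [10, 11], [30]]
```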
Can anyone give me some advice about how this can be done in the fastest/best way?
Thank you,
Alex
EDIT: I have found another solution that doesn't exactly answer the question, but it is what I actually needed (my mistake for asking the question that way). I am posting it here because it may help someone looking into related issues with loops over lists in Python.
It may not be the fastest (considering the number of 'passes' through the list), but it was quite simple to understand and implement, and it does not require the list to be sorted.
The reason I avoid sorting is that it takes some extra time, and because after I make the first set of groups, some of them will be 'locked', while the unlocked groups will be 'dissolved' and regrouped using a different time offset. (And when dissolving groups, the file order may change, so they would require sorting again.)
Anyway, the solution was to control the loop index myself. If I delete a file from the list, I skip increasing the index (e.g. when I delete index 3, the element previously at index 4 is now at index 3, and I don't want to increment the loop counter, because then I would skip over it). If I don't delete any item in that iteration, the index is increased normally. Here's the code (with some extras; ignore all that 'bucket' stuff):
def regroup(self, time_offset):
    #create the list of files to be used for regrouping
    regroup_files_list = []
    if len(self.groups) == 0:
        #on the first 'regroup', we start with a copy of jpeg_list, so that we do not change it further on
        regroup_files_list = copy.copy(self.jpeg_list)
    else:
        #dissolve the unlocked groups, managing the index manually because we remove elements
        i = 0
        while True:
            try:
                group = self.groups[i]
            except IndexError:
                break
            if not group.is_locked:
                regroup_files_list.extend(group)
                self.groups.remove(group)
                continue    #reuse the old position; the following groups shifted to the left
            else:
                i += 1

    bucket_group = FilesGroup()
    bucket_group.name = c_bucket_group_name

    while len(regroup_files_list) > 0:    #we create groups until there are no files left
        file_a = regroup_files_list.pop(0)
        temp_group = FilesGroup()
        temp_group.start_time = file_a._iso_time
        temp_group.add(file_a)

        #manually manage the list index when iterating for file_b, because we're removing files
        i = 0
        while True:
            try:
                file_b = regroup_files_list[i]
            except IndexError:
                break

            timediff = file_a._iso_time - file_b._iso_time
            if timediff.days < 0 or timediff.seconds < 0:
                timediff = file_b._iso_time - file_a._iso_time

            if timediff < time_offset:
                temp_group.add(file_b)
                regroup_files_list.remove(file_b)
                continue    # :D we reuse the old position, because all elements were shifted to the left
            else:
                i += 1    #the index is increased normally

        #move the files to the bucket group if the temp group is too small,
        #otherwise keep it as a regular group
        if c_bucket_group_enabled and len(temp_group) < c_bucket_group_min_count:
            for file in temp_group:
                bucket_group.add(file)
        else:
            self.groups.append(temp_group)

    if len(bucket_group) > 0:
        self.groups.append(bucket_group)
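The delete-without-increment pattern used above can be isolated into a small standalone helper (a hypothetical name, not part of the original code): delete at index i and stay at i, because the following elements have shifted one position to the left.

```python
def extract_matching(items, predicate):
    """Remove and return all items for which predicate(item) is true."""
    matched = []
    i = 0
    while i < len(items):
        if predicate(items[i]):
            matched.append(items.pop(i))  # don't advance: the next item is now at index i
        else:
            i += 1
    return matched

items = [1, 2, 3, 4, 5, 6]
print(extract_matching(items, lambda x: x % 2 == 0))  # [2, 4, 6]
print(items)                                          # [1, 3, 5]
```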
Solution
A simple solution that works by sorting the list then using a generator to create groups:
def time_offsets(files, offset):
    files = sorted(files, key=lambda x: x.timestamp)
    group = []
    timestamp = 0
    for f in files:
        if f.timestamp < timestamp + offset:
            group.append(f)
        else:
            if group:    #don't yield the empty group created before the first file
                yield group
            timestamp = f.timestamp
            group = [f]    #start the new group with the file itself, not its timestamp
    if group:
        yield group

# Now you can do this...
for group in time_offsets(files, 86400):
    print(group)
And here's a complete script you can run to test:
class File:
    def __init__(self, timestamp):
        self.timestamp = timestamp

    def __repr__(self):
        return "File: <%d>" % self.timestamp

def gen_files(num=100):
    import random
    files = []
    for i in range(num):
        timestamp = random.randint(0, 1000000)
        files.append(File(timestamp))
    return files

def time_offsets(files, offset):
    files = sorted(files, key=lambda x: x.timestamp)
    group = []
    timestamp = 0
    for f in files:
        if f.timestamp < timestamp + offset:
            group.append(f)
        else:
            if group:
                yield group
            timestamp = f.timestamp
            group = [f]
    if group:
        yield group

# Now you can do this to group files by day (assuming the timestamp is in seconds)
files = gen_files()
for group in time_offsets(files, 86400):
    print(group)
OTHER TIPS
The best solution I can think of is O(n log n).
listA = getListOfFiles()
listB = stableMergesort(listA, lambda el: el.timestamp)
listC = groupAdjacentElementsByTimestampRange(listB, offset)
Note that groupAdjacentElementsByTimestampRange is O(n).
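The answer leaves groupAdjacentElementsByTimestampRange as pseudocode; one possible O(n) sketch (assuming the input is already sorted by timestamp, with a `key` parameter added here purely for illustration) is:

```python
def group_adjacent_by_range(sorted_items, offset, key=lambda x: x):
    """Single pass over a sorted list: each group starts at the first
    ungrouped element and absorbs neighbours within `offset` of that start."""
    groups = []
    current, start = [], None
    for item in sorted_items:
        if start is not None and key(item) < start + offset:
            current.append(item)
        else:
            if current:
                groups.append(current)
            start = key(item)
            current = [item]
    if current:
        groups.append(current)  # flush the last group
    return groups

print(group_adjacent_by_range([1, 2, 10, 11, 30], 5))  # [[1, 2], [10, 11], [30]]
```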
I'm not exactly sure what you are trying to do; it seems to me that the order of the list will affect the groupings. But your existing code can be modified to work like this:
# This is O(n^2)
while lst:
    file_a = lst.pop()
    temp_group = group()
    temp_group.add(file_a)
    while lst:
        file_b = lst[-1]
        if file_b.timestamp < (file_a.timestamp + offset):
            temp_group.add(lst.pop())
        else:
            break    # without this, the inner loop never terminates
    groups.add(temp_group)
Does the group have to start at file_a.timestamp?
# This is O(n)
from collections import defaultdict
groups = defaultdict(list)    # this is why you shouldn't use `list` as a variable name
for item in lst:
    groups[item.timestamp // offset].append(item)    # integer division buckets the timestamps

This is a much simpler way to chop the list up into groups with similar timestamps.