return two lists from a list comprehension -- performance

https://stackoverflow.com/questions/23216218

07-07-2023

Question

In my program, I collect all directories and files (walk), write them to a dictionary with file names as keys and paths as values, then take a keyword from the interface (tk.Entry) and return all matches as two lists. I then show them (tk.Listbox) and open the selected one (win32shell).

I used this approach to create two lists with one comprehension. A comment there says "Just running two separate list comprehensions is simpler and probably faster though.", which left me unsure which one to use. This program will run over ~3TB of data that I don't have right now, so I cannot run it and see which is faster.

This is my minimized code; I removed the interface and defined the keyword and path with the keywrd and folder variables, respectively.

import os
import sqlite3

audio_ext = [".mp3",".mp4","etc..."]
folder = "C:\\Users\\Lafexlos\\Music"
keywrd = "mo"  ##searching keyword which I normally get from user by Entry

conn = sqlite3.connect(":memory:")
data  = conn.cursor()
data.execute(" create table if not exists audio(path text,\
                filename text UNIQUE) ")

for roots ,dirs ,files in os.walk(folder):
    for item in os.listdir(roots):
        if "."+item.split(".")[-1].lower() in audio_ext:
        #Above line is not eye-friendly but it only checks the file's extension
            data.execute(" INSERT OR IGNORE into audio \
                (path, filename) VALUES (?,?)",(roots,item))

lines = {}
musics = data.execute("select * from audio")
[lines.update({row[1]:row[0]}) for row in musics]


# This is the option 1. Using zip to create two lists
results,paths = zip(*[(k,v) for k,v in lines.items() if keywrd in k])

# This is option 2. Running same list comprehension twice
results = [k for (k,v) in lines.items() if keywrd in k]
paths = [v for (k,v) in lines.items() if keywrd in k]

print ("Results: ", results)
print ("\n\nPaths: ", paths)

As I mentioned above, my question is: which one would be faster when working with a large amount of data?

Solution

Use zip():

results, paths = zip(*((k, v) for k, v in lines.items() if keywrd in k))

as this'll produce the two sequences (as tuples rather than lists) in one step. The alternative is to use one for loop:

results = []
paths = []
for (k,v) in lines.items():
    if keywrd in k:
        results.append(k)
        paths.append(v)

List comprehensions are great if you want to build one list; if you need multiple from the same loop, just use the loop.
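One practical difference between the two approaches above is easy to miss: zip(*...) hands back tuples, and unpacking its result into two names fails when nothing matches. The following is a minimal sketch of both variants with that case handled; the helper names split_with_zip and split_with_loop and the sample dictionary are made up for illustration and are not part of the original code.

```python
def split_with_zip(lines, keywrd):
    """One-pass zip(*...) variant; returns two tuples."""
    matches = [(k, v) for k, v in lines.items() if keywrd in k]
    if not matches:
        # zip(*[]) yields nothing, so unpacking into two names
        # would raise ValueError; return two empty tuples instead.
        return (), ()
    results, paths = zip(*matches)
    return results, paths


def split_with_loop(lines, keywrd):
    """Plain for-loop variant; returns two lists."""
    results, paths = [], []
    for k, v in lines.items():
        if keywrd in k:
            results.append(k)
            paths.append(v)
    return results, paths


# Made-up sample data (hypothetical file names and folders)
sample = {
    "mozart.mp3": "C:\\Music\\classical",
    "demo.mp3": "C:\\Music\\misc",
    "bach.mp3": "C:\\Music\\classical",
}

zip_results, zip_paths = split_with_zip(sample, "mo")
loop_results, loop_paths = split_with_loop(sample, "mo")
```

If real lists are needed downstream (for example to append to them later), wrap the zip output in list(); the loop variant needs no such step.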

However, since this data comes from a SQLite query, your best bet would be to have SQLite limit the rows to those that match:

data.execute("select * from audio where filename LIKE ?", ('%{}%'.format(keywrd),))

Your lines dictionary is far more efficiently built with a dictionary comprehension:

musics = data.execute("select * from audio")
lines = {row[1]: row[0] for row in musics}

or using a more specific query and a direct loop over the cursor:

data.execute("SELECT path, filename FROM audio WHERE filename LIKE ?",
             ('%{}%'.format(keywrd),))
paths, results = zip(*data)

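LIKE against a string with % wildcards on both sides produces much the same results as an in test in Python; if keywrd is contained in filename the row matches. Note that SQLite's LIKE is case-insensitive for ASCII characters by default, whereas in is case-sensitive.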

Now there is no need to create an intermediary dictionary either.
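To make the query-based approach concrete, here is a self-contained sketch under a few assumptions: the table layout matches the question, the rows are made-up sample data, and the function name search_audio is hypothetical. It also escapes % and _ in the keyword with an ESCAPE clause, which the original query does not do, so that a keyword containing those characters is matched literally, and it guards against an empty result before calling zip().

```python
import sqlite3


def search_audio(conn, keywrd):
    """Return (results, paths) for filenames containing keywrd."""
    # Escape the escape character first, then the LIKE wildcards,
    # so a keyword such as "100%" is treated as literal text.
    escaped = (keywrd.replace("\\", "\\\\")
                     .replace("%", "\\%")
                     .replace("_", "\\_"))
    rows = conn.execute(
        "SELECT filename, path FROM audio "
        "WHERE filename LIKE ? ESCAPE '\\'",
        ("%{}%".format(escaped),),
    ).fetchall()
    if not rows:
        # zip(*[]) cannot be unpacked into two names
        return [], []
    results, paths = zip(*rows)
    return list(results), list(paths)


# Demo setup with made-up rows (same schema as in the question)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "create table if not exists audio(path text, filename text UNIQUE)")
demo_conn.executemany(
    "INSERT OR IGNORE into audio (path, filename) VALUES (?,?)",
    [
        ("C:\\Music\\classical", "Mozart.mp3"),
        ("C:\\Music\\misc", "demo_track.mp3"),
        ("C:\\Music\\classical", "bach.mp3"),
    ],
)

results, paths = search_audio(demo_conn, "mo")
```

Because LIKE ignores ASCII case by default, "mo" also matches "Mozart.mp3" here; a plain Python in test would not.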

OTHER TIPS

Faster is to use a for-loop:

results = []; add_result = results.append
paths = []; add_path = paths.append
for k,v in lines.items():
    if keywrd in k:
        add_result(k)
        add_path(v)

Fastest is to use your in-memory-sqlite database to do the filtering.
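Since the real data set is not available yet, one way to compare the pure-Python variants is to time them on synthetic data. The sketch below uses timeit from the standard library; the function names, the generated file names, and the data size are all assumptions chosen for illustration, and the timings it prints will vary by machine and Python version, so treat it as a measuring harness rather than a verdict.

```python
import timeit


def with_zip(lines, keywrd):
    matches = [(k, v) for k, v in lines.items() if keywrd in k]
    if not matches:
        return (), ()
    return tuple(zip(*matches))


def with_two_comprehensions(lines, keywrd):
    results = [k for k, v in lines.items() if keywrd in k]
    paths = [v for k, v in lines.items() if keywrd in k]
    return results, paths


def with_loop(lines, keywrd):
    # Bound methods are cached in local names to skip the
    # attribute lookup on every iteration.
    results = []; add_result = results.append
    paths = []; add_path = paths.append
    for k, v in lines.items():
        if keywrd in k:
            add_result(k)
            add_path(v)
    return results, paths


# Synthetic stand-in for the real file index (hypothetical names)
synthetic = {
    "track{:06d}.mp3".format(i): "C:\\Music\\dir{}".format(i % 100)
    for i in range(20000)
}
VARIANTS = (with_zip, with_two_comprehensions, with_loop)


def time_variants(lines, keywrd, repeat=3):
    """Return the best wall-clock time in seconds per variant."""
    return {
        fn.__name__: min(
            timeit.repeat(lambda: fn(lines, keywrd), number=1, repeat=repeat))
        for fn in VARIANTS
    }


if __name__ == "__main__":
    for name, seconds in time_variants(synthetic, "99").items():
        print("{:<26} {:.6f} s".format(name, seconds))
```

Taking the minimum over several repeats reduces noise from other processes; for a fair comparison against the SQLite query, the same harness could wrap a function that runs the LIKE query instead.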

Licensed under: CC-BY-SA with attribution