Question

I have a huge list of files (20k). Each file has an identifier string in its first line, and that first line contains only the identifier. Across the list there are about n different identifiers, and each identifier has at least 500 files (though the number of files per identifier is not equal).

I need to randomly sample 500 files per identifier and copy them to another directory, so that I end up with a subset of the original list in which each identifier is represented by an equal number of files.

I know random.sample() can give me a random sample of a list, but that does not take care of the constraint in the first line, and shutil.copy() can copy files...
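To illustrate the pieces I mean (the file names and the target directory here are just placeholders):

import random
import shutil

files = ['a.txt', 'b.txt', 'c.txt']      # some candidate files
chosen = random.sample(files, 2)         # unweighted random sample, without replacement
for f in chosen:
    shutil.copy(f, '/tmp/subset')        # copies each chosen file into the target directory (assuming it exists)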

But how can I do this (efficiently) in Python while obeying the constraint of the identifier in the first line of each file?


Solution

From what you've described, you'll have to read the first line of every file in order to organize them by identifier. Something like this should do what you're looking for:

import os
import collections
import random
import shutil

def get_identifier(path):
    with open(path) as fd:
        return fd.readline().strip()  # assuming you don't want the trailing \n in the identifier

paths = ['/home/file1', '/home/file2', '/home/file3']  # the full list of ~20k files
destination_dir = '/tmp'

# group the file paths by the identifier found on their first line
identifiers = collections.defaultdict(list)
for path in paths:
    identifier = get_identifier(path)
    identifiers[identifier].append(path)

for identifier, group_paths in identifiers.items():
    # assumes every group has at least 500 files, as stated in the question
    sample = random.sample(group_paths, 500)
    for path in sample:
        file_name = os.path.basename(path)
        destination = os.path.join(destination_dir, file_name)
        shutil.copy(path, destination)
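In case it helps, here is a sketch of how the paths list might be built from a source directory; the source_dir name and the flat directory layout are assumptions, since the question doesn't say how the files are stored:

import glob
import os

source_dir = '/home/data'                           # assumed location of the 20k files
paths = glob.glob(os.path.join(source_dir, '*'))    # every entry directly inside the directory
paths = [p for p in paths if os.path.isfile(p)]     # keep regular files only

One more note: random.sample() raises ValueError if a group has fewer than 500 files, so if that guarantee ever changes you'd want to sample min(500, len(group_paths)) instead.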