بيثون فئة دمج فرز الملفات, كيف يمكن تحسينها ؟

https://stackoverflow.com/questions/1001569

05-07-2019
|

سؤال

الخلفية:

أنا التنظيف الكبيرة (يمكن عقد في الذاكرة) المفصول الملفات.وأنا تنظيف مدخلات الملف ، بناء قائمة في الذاكرة ؛ عندما يحصل على 1 ، 000 ، 000 الإدخالات (حوالي 1GB في الذاكرة) أنا من النوع هذا (باستخدام المفتاح الافتراضي أدناه) كتابة قائمة ملف.هذه الدرجة هو وضع فرز الملفات معا مرة أخرى.يعمل على ملفات لقد واجهت حتى الآن.بلدي أكبر قضية حتى الآن هو دمج 66 فرز الملفات.

الأسئلة:

هناك ثقوب في المنطق (أين هو الهشة)?
لقد نفذت نوع دمج الخوارزمية بشكل صحيح ؟
هل هناك أي تحسينات واضحة يمكن أن يكون ؟

البيانات المثال:

هذا هو التجريد من خط في واحد من هذه الملفات:

'hash_of_SomeStringId\tSome String Id\t\t\twww.somelink.com\t\tOtherData\t\n'

الوجبات الجاهزة هي التي تستخدم 'SomeStringId'.lower().replace(' ', '') كما فرز بلدي الرئيسية.

رمز الأصلي:

class SortedFileMerger():
    """ A one-time use object that merges any number of smaller sorted 
        files into one large sorted file.

        ARGS:
            paths - list of paths to sorted files
            output_path - string path to desired output file
            dedup - (boolean) remove lines with duplicate keys, default = True
            key - use to override sort key, default = "line.split('\t')[1].lower().replace(' ', '')"
                  will be prepended by "lambda line: ".  This should be the same 
                  key that was used to sort the files being merged!
    """
    def __init__(self, paths, output_path, dedup=True, key="line.split('\t')[1].lower().replace(' ', '')"):
        self.key = eval("lambda line: %s" % key)
        self.dedup = dedup
        self.handles = [open(path, 'r') for path in paths]
        # holds one line from each file
        self.lines = [file_handle.readline() for file_handle in self.handles]
        self.output_file = open(output_path, 'w')
        self.lines_written = 0
        self._mergeSortedFiles() #call the main method

    def __del__(self):
        """ Clean-up file handles.
        """
        for handle in self.handles:
            if not handle.closed:
                handle.close()
        if self.output_file and (not self.output_file.closed):
            self.output_file.close()

    def _mergeSortedFiles(self):
        """ Merge the small sorted files to 'self.output_file'. This can 
            and should only be called once.
            Called from __init__().
        """
        previous_comparable = ''
        min_line = self._getNextMin()
        while min_line:
            index = self.lines.index(min_line)
            comparable = self.key(min_line)
            if not self.dedup:                      
                #not removing duplicates
                self._writeLine(index)
            elif comparable != previous_comparable: 
                #removing duplicates and this isn't one
                self._writeLine(index)
            else:                                   
                #removing duplicates and this is one
                self._readNextLine(index)
            previous_comparable = comparable
            min_line = self._getNextMin()
        #finished merging
        self.output_file.close()

    def _getNextMin(self):
        """ Returns the next "smallest" line in sorted order.
            Returns None when there are no more values to get.
        """
        while '' in self.lines:
            index = self.lines.index('')
            if self._isLastLine(index):
                # file.readline() is returning '' because 
                # it has reached the end of a file.
                self._closeFile(index)
            else:
                # an empty line got mixed in
                self._readNextLine(index)
        if len(self.lines) == 0:
            return None
        return min(self.lines, key=self.key)

    def _writeLine(self, index):
        """ Write line to output file and update self.lines
        """
        self.output_file.write(self.lines[index])
        self.lines_written += 1
        self._readNextLine(index)

    def _readNextLine(self, index):
        """ Read the next line from handles[index] into lines[index]
        """
        self.lines[index] = self.handles[index].readline()

    def _closeFile(self, index):
        """ If there are no more lines to get in a file, it 
            needs to be closed and removed from 'self.handles'.
            It's entry in 'self.lines' also need to be removed.
        """
        handle = self.handles.pop(index)
        if not handle.closed:
            handle.close()
        # remove entry from self.lines to preserve order
        _ = self.lines.pop(index)

    def _isLastLine(self, index):
        """ Check that handles[index] is at the eof.
        """
        handle = self.handles[index]            
        if handle.tell() == os.path.getsize(handle.name):
            return True
        return False

تحرير: تنفيذ اقتراحات من براين لقد جاء مع الحل التالي:

تحرير الثاني: تحديث الشفرة في جون ماشين'اقتراح:

def decorated_file(f, key):
    """ Yields an easily sortable tuple. 
    """
    for line in f:
        yield (key(line), line)

def standard_keyfunc(line):
    """ The standard key function in my application.
    """
    return line.split('\t', 2)[1].replace(' ', '').lower()

def mergeSortedFiles(paths, output_path, dedup=True, keyfunc=standard_keyfunc):
    """ Does the same thing SortedFileMerger class does. 
    """
    files = map(open, paths) #open defaults to mode='r'
    output_file = open(output_path, 'w')
    lines_written = 0
    previous_comparable = ''
    for line in heapq26.merge(*[decorated_file(f, keyfunc) for f in files]):
        comparable = line[0]
        if previous_comparable != comparable:
            output_file.write(line[1])
            lines_written += 1
        previous_comparable = comparable
    return lines_written

الخام اختبار

باستخدام نفس ملفات الإدخال (2.2 GB من البيانات):

SortedFileMerger أخذت الطبقة 51 دقيقة (3068.4 ثانية)
براين'حل استغرق 40 دقيقة (2408.5 ثانية)
بعد إضافة جون ماشيناقتراحات الحل رمز استغرق 36 دقيقة (2214.0 ثانية)

المحلول

علما أنه في python2.6, heapq جديد دمج الوظيفة التي سوف نفعل ذلك لك.

للتعامل مع مفتاح مخصص الوظيفة ، يمكنك مجرد التفاف الملف مكرر مع ما يزين من ذلك أنه يقارن على أساس المفتاح ، وقطاع بها بعد ذلك:

def decorated_file(f, key):
    for line in f: 
        yield (key(line), line)

filenames = ['file1.txt','file2.txt','file3.txt']
files = map(open, filenames)
outfile = open('merged.txt')

for line in heapq.merge(*[decorated_file(f, keyfunc) for f in files]):
    outfile.write(line[1])

[عدل] حتى في الإصدارات السابقة من بيثون ، ربما من المفيد ببساطة أن تأخذ تنفيذ دمج في وقت لاحق من heapq وحدة.انها محض بيثون ، ويعمل معدلة في python2.5 لأنه يستخدم كومة نحصل على الحد الأدنى يجب أن تكون فعالة جدا عند دمج عدد كبير من الملفات.

يجب أن تكون قادرة على مجرد نسخ heapq.py من python2.6 تركيب نسخه إلى مصدر "heapq26.py" واستخدام "from heapq26 import merge"- لا توجد 2.6 الميزات الخاصة المستخدمة في ذلك.بدلا من ذلك يمكنك فقط نسخ دمج وظيفة (إعادة كتابة heappop الخ يدعو إلى مرجع python2.5 heapq وحدة).

نصائح أخرى

<< هذا "الجواب" هو التعليق على الأصل السائل هو الناتجة رمز >>

اقتراح:باستخدام eval() هو قلوب و ما تقومون به يقيد الطالب إلى استخدام امدا -- مفتاح الاستخراج قد تتطلب أكثر من بطانة واحدة ، وعلى أي حال لا تحتاج نفس الوظيفة الأولية النوع الخطوة ؟

حتى تحل محل هذا:

def mergeSortedFiles(paths, output_path, dedup=True, key="line.split('\t')[1].lower().replace(' ', '')"):
    keyfunc = eval("lambda line: %s" % key)

مع هذا:

def my_keyfunc(line):
    return line.split('\t', 2)[1].replace(' ', '').lower()
    # minor tweaks may speed it up a little

def mergeSortedFiles(paths, output_path, keyfunc, dedup=True):

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow