Question

I have several log files of events (one event per line). The logs can possibly overlap. The logs are generated on separate client machines from possibly multiple time zones (but I assume I know the time zone). Each event has a timestamp that was normalized into a common time (by instantiating each log parser's Calendar instance with the timezone appropriate to the log file and then using getTimeInMillis to get the UTC time). The logs are already sorted by timestamp. Multiple events can occur at the same time, but they are by no means equal events.

These files can be relatively large, as in, 500000 events or more in a single log, so reading the entire contents of the logs into a simple Event[] is not feasible.

What I am trying to do is merge the events from each of the logs into a single log. It is kind of like a mergesort task: each log is already sorted, I just need to bring them together. The second component is that the same event can be witnessed in each of the separate log files, and I want to "remove duplicate events" in the output log file.

Can this be done "in place", as in, sequentially working over small buffers of each log file? I can't simply read all the files into an Event[], sort the list, and then remove duplicates, but so far my limited programming capabilities only let me see that as the solution. Is there some more sophisticated approach I can use to do this as I read events from each of the logs concurrently?

Solution

  1. Read the first line from each of the log files

  2. LOOP

    a. Find the "earliest" line.

    b. Insert the "earliest" line into the master log file

    c. Read the next line from the file that contained the earliest line

You could check for duplicates between steps b and c, advancing the read pointer of every file whose current line is that duplicate.
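
Here is a minimal sketch of these steps in Python (the language of the example further down), assuming a caller-supplied timestamp_of(line) helper that extracts the normalized timestamp from a line, and assuming that a duplicated event shows up as an identical line in each log:

def merge_logs(files, out, timestamp_of):
    # Step 1: read the first line from each already-sorted log file
    current = {f: f.readline() for f in files}
    current = {f: line for f, line in current.items() if line}
    current_ts, seen_at_ts = None, set()
    while current:
        # Step 2a: find the file whose current line is "earliest"
        source = min(current, key=lambda f: timestamp_of(current[f]))
        line = current[source]
        ts = timestamp_of(line)
        if ts != current_ts:
            current_ts, seen_at_ts = ts, set()
        # Duplicate check between steps b and c: the same event witnessed
        # by several clients has the same timestamp and identical text
        if line not in seen_at_ts:
            out.write(line)                      # step 2b
            seen_at_ts.add(line)
        # Step 2c: read the next line from the file that supplied it
        nxt = source.readline()
        if nxt:
            current[source] = nxt
        else:
            del current[source]

Only the current line of each file (plus the lines already written for the current timestamp) is ever held in memory, so the size of the logs is not a problem.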

OTHER TIPS

Sure - open every log file. Read in the first line for each into an array of 'current' lines. Then repeatedly pick the line with the lowest timestamp from the current array. Write it to the output, and read a new line from the appropriate source file to replace it.

Here's an example in Python, but it makes good pseudocode, too:

def merge_files(files, key_func):
    # Populate the current array with the first line from each file,
    # skipping any file that is empty to begin with
    current = [file.readline() for file in files]
    files = [f for f, line in zip(files, current) if line]
    current = [line for line in current if line]
    while len(current) > 0:
        # Find the index of the row with the lowest key according to key_func
        min_idx = min(range(len(files)), key=lambda x: key_func(current[x]))
        yield current[min_idx]
        new_line = files[min_idx].readline()
        if not new_line:
            # EOF, remove this file from consideration
            del current[min_idx]
            del files[min_idx]
        else:
            current[min_idx] = new_line
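
For example, it could be driven like this (the file names are illustrative, and the key function assumes each line starts with a fixed-width normalized timestamp; note that merge_files itself does not drop duplicates):

logs = [open(path) for path in ("client_a.log", "client_b.log", "client_c.log")]
try:
    with open("master.log", "w") as out:
        for line in merge_files(logs, key_func=lambda l: l.split(None, 1)[0]):
            out.write(line)
finally:
    for f in logs:
        f.close()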

Check out this link: http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194

  • Use a min-heap keyed on the event timestamp (it can be backed by an array). The number of elements in this heap/array will be equal to the number of log files you have.

  • Read the first records from all the files and insert them into your heap.

  • Loop until the heap is empty

      > remove the earliest (top) element from the heap
      > write it to the output
      > read the next record from the file to which that element belonged
          if there are no more records in that file
              remove it from the file list
              continue
      > if the new record is not the same as the element just written, add it to the heap

Now you have all your events in one log file, they are sorted, and there are no duplicates. The time complexity of the algorithm is O(n log k), where n is the total number of records and k is the number of log files.
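
For reference, a sketch of this idea in Python, where the standard library's heapq.merge does the k-element heap bookkeeping (the key argument needs Python 3.5+; timestamp_of is an assumed helper that extracts the normalized timestamp from a line, and duplicates are assumed to appear as identical lines):

import heapq

def merge_logs_heap(paths, out_path, timestamp_of):
    # heapq.merge keeps one record per input in a k-element heap and
    # streams the merged result in timestamp order
    readers = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            current_ts, seen = None, set()
            for line in heapq.merge(*readers, key=timestamp_of):
                ts = timestamp_of(line)
                if ts != current_ts:
                    # New timestamp: forget the events written for the previous one
                    current_ts, seen = ts, set()
                if line not in seen:
                    out.write(line)
                    seen.add(line)
    finally:
        for r in readers:
            r.close()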

You should use buffered reader and buffered writer objects when reading from and writing to the files, to minimize the number of disk reads and writes and so optimize for time.
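
That advice is phrased in Java terms (BufferedReader/BufferedWriter). In the Python sketches above, open() already buffers by default, and you can enlarge the buffer explicitly, for example (file names are placeholders):

# Larger buffers mean fewer underlying read()/write() system calls
log = open("client_a.log", "r", buffering=1024 * 1024)   # 1 MiB read buffer
out = open("master.log", "w", buffering=1024 * 1024)     # 1 MiB write buffer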

We needed to chronologically merge several log files that have multiple lines per log entry (Java applications do this often; stack traces are the usual case). I decided to implement a simple shell+Perl script, and it covers our needs. If you are interested, follow this link: http://code.google.com/p/logmerge/

Read only one line at a time from both source files. Compare the lines and write the older one to the output file (and advance to the next line in that source). Do this until you have reached the end of both files and the merge is complete.

And make sure to remove duplicates :)

I guess this code in C# may illustrate the approach:

        // Readers over the two source log files (the paths are placeholders)
        StreamReader fileStream1 = new StreamReader("log1.txt");
        StreamReader fileStream2 = new StreamReader("log2.txt");
        Event eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
        Event eventCursorFile2 = Event.Parse(fileStream2.ReadLine());

        while (!(fileStream1.EndOfStream && fileStream2.EndOfStream))
        {
            if (eventCursorFile1.TimeStamp < eventCursorFile2.TimeStamp)
            {
                WriteToMasterFile(eventCursorFile1);
                eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
            }
            else if (eventCursorFile1.TimeStamp == eventCursorFile2.TimeStamp)
            {
                // Same timestamp in both files: write once and advance both readers
                // (treats equal timestamps as the same event)
                WriteToMasterFile(eventCursorFile1);
                eventCursorFile1 = Event.Parse(fileStream1.ReadLine());
                eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
            }
            else
            {
                WriteToMasterFile(eventCursorFile2);
                eventCursorFile2 = Event.Parse(fileStream2.ReadLine());
            }
        }

The break condition isn't exactly right, as this is just quick'n'dirty (when one file ends before the other, ReadLine will start returning null), but it should look similar.

Or you could borrow a log merge utility from AWStats, which is an open source website statistics tool.

logresolvemerge.pl is a Perl script that can merge multiple log files; it can even use multiple threads to do the merge (Perl 5.8 is needed for multi-thread use). Why not try a readily available tool instead of building one?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow