Question

I have to read through an extremely large amount of network data from various log files and compile relevant information about that data to perform statistical analysis on it (the top communicators, the IP addresses that send the largest packets on average, the average packet size, etc.). I decided to make a matrix where each index in the outer array corresponds to a sorted list of IPs in increasing order.

To give an idea of my data set: a log file is generated each day containing information about all the communication that occurred within the network that day. I have been generating log files since the middle of May, and each one has about 5 million lines, so the more efficient this program is, the better.

My question is the following:

When I am compiling data from all the log files into this matrix, should I keep the outer array sorted by IP while I'm still putting the data together, or should I just append new IPs to the end of the list as I find them and then sort the list afterwards? Is there a better option here that I haven't thought of? Which way would be the most efficient? I am going to be using Python 2.7, if that makes a difference. Also, due to restrictions on what I can do, I cannot install any new modules on the machine this code will run on, so unless I can create a database with Python natively, that isn't an available option for me.

Solution

Insertion sort works best when you are inserting a new value into something that is already sorted. So what I would do is use Quicksort to sort the original dataset that you have, then when additional log entries come in, add them one by one into the already sorted set.

With Quicksort being O(n * log(n)) and insertion into an already sorted set being cheap, the total time for everything will be O(a * log(a) + b), where a is the size of the original data set and b is the number of additional log entries you insert afterwards.
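As a rough sketch of that approach (the list name and the sample IP values are made up for illustration; in real code you would probably key on a parsed, numeric form of each IP so the ordering is numeric rather than lexicographic):

import bisect

# Hypothetical starting data: the IPs pulled from the existing log files, unsorted.
known_ips = ['10.0.0.9', '10.0.0.2', '10.0.0.7']

# One-off O(n * log(n)) sort of the original data set.
known_ips.sort()

# As new log entries arrive, place each IP at its sorted position so the
# list never needs a full re-sort; skip IPs that are already present.
for new_ip in ['10.0.0.5', '10.0.0.1']:
    index = bisect.bisect_left(known_ips, new_ip)
    if index == len(known_ips) or known_ips[index] != new_ip:
        known_ips.insert(index, new_ip)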

OTHER TIPS

Appending everything takes no comparisons, and sorting it afterwards is in O(n * log(n)), so appending and then sorting is in O(n * log(n)).

Inserting one element into a sorted list so that it stays sorted takes a logarithmic number of comparisons in the size of the list, so inserting n elements one by one is in O(n * log(n)), since the list grows as you keep inserting elements.

The two approaches therefore have similar complexity in terms of comparisons, but inserting an element into a Python list is much more costly than appending (linear in the size of the list instead of constant, because everything after the insertion point has to be shifted). The most efficient approach is therefore to append everything and then sort.

Here is some sample code to compare the performance of the two approaches on 10,000 random numbers:

import random
import timeit
import bisect


def append_then_sort(data):
    """Append every element, then sort the list once at the end."""
    result = []

    for element in data:
        result.append(element)

    result.sort()
    return result


def insert_sorted(data):
    """Insert every element at its sorted position, keeping the list sorted."""
    result = []

    for element in data:
        bisect.insort_left(result, element)

    return result

if __name__ == '__main__':
    numbers = [random.randint(0, 3000000) for _ in xrange(10000)]
    print('Append then sort:')
    print(timeit.timeit('append_then_sort(numbers)', setup='from __main__ import append_then_sort, numbers', number=10))
    print('Insert sorted:')
    print(timeit.timeit('insert_sorted(numbers)', setup='from __main__ import insert_sorted, numbers', number=10))

And the results on my machine:

Append then sort:
0.0509831905365
Insert sorted:
0.27309513092

You might also be interested in looking into the sqlite3 module, which is part of the standard Python library and so should be available on your system. I'm not sure how well SQLite handles large databases, however...
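For example, a minimal sketch with sqlite3 might look like the following (the packets table, its columns, and the sample rows are invented for illustration; the real schema would mirror whatever fields your log files actually contain):

import sqlite3

# A file-backed database survives between runs; ':memory:' also works for testing.
conn = sqlite3.connect('traffic.db')
conn.execute('CREATE TABLE IF NOT EXISTS packets (src_ip TEXT, dst_ip TEXT, size INTEGER)')

# In the real program these rows would be parsed out of the daily log files.
sample_rows = [('10.0.0.1', '10.0.0.2', 1500),
               ('10.0.0.1', '10.0.0.3', 60),
               ('10.0.0.4', '10.0.0.2', 900)]
conn.executemany('INSERT INTO packets VALUES (?, ?, ?)', sample_rows)
conn.commit()

# The database handles the sorting and aggregation, e.g. average packet size
# per sender, largest first.
for src_ip, avg_size in conn.execute(
        'SELECT src_ip, AVG(size) FROM packets '
        'GROUP BY src_ip ORDER BY AVG(size) DESC'):
    print('%s: %.1f' % (src_ip, avg_size))

conn.close()

That keeps the per-IP statistics out of Python lists entirely, so nothing has to be kept sorted by hand.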

Licensed under: CC-BY-SA with attribution