Need to compare very large files around 1.5GB in python

Question 1

Another possible (system-admin) way, which avoids a database and SQL queries, plus a whole lot of runtime process and hardware requirements.

Update 20/04: added more code and simplified the approach.

1. Convert the timestamp to seconds (from Epoch) and use UNIX sort on the email and this new field (that is: sort -t, -k2,2 -k7,7n < converted_input_file > output_file, where field 2 is the email and field 7 is the appended seconds)
2. Initialize 3 variables: EMAIL, PREV_TIME and COUNT
3. Iterate over each line. If a new email is encountered, add "1,0 day" and update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email
4. Next line: 3 possible scenarios
   - a) If same email, different timestamp: calculate days, increment COUNT, update PREV_TIME, add "COUNT, Difference_in_days"
   - b) If same email, same timestamp: increment COUNT, add "COUNT, 0 day"
   - c) If new email, start from 3.

Alternative to 1. is to add a new field TIMESTAMP and remove it upon printing out the line.
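
If you would rather do the conversion in step 1 in Python than in awk, a minimal sketch could look like the following. It is written for Python 3; the function name add_epoch_field and the date_col parameter are made up for illustration, and it assumes the date is the fourth field in DDMONYYYY form and that midnight UTC is the intended time of day.

```python
import calendar
import csv
import io
import time


def add_epoch_field(lines, date_col=3):
    """Yield each CSV row with the date's epoch seconds (midnight UTC) appended."""
    for row in csv.reader(lines):
        # e.g. "26JUL2010"; month-name parsing assumes an English/C locale
        parsed = time.strptime(row[date_col], "%d%b%Y")
        yield row + [str(calendar.timegm(parsed))]


sample = io.StringIO(
    '"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"\n'
)
rows = list(add_epoch_field(sample))
print(rows[0][-1])
```

The appended value is only a sort key, so it can be dropped again when the final line is printed.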

Note: If 1.5GB is too large to sort in one go, split it into smaller chunks, using email as the split point. You can run these chunks in parallel on different machines.
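
One way to read "using email as the split point" is to partition the rows by a hash of the email before sorting, so that every row for a given email ends up in the same chunk. The sketch below does that; it is an assumption about how you might split, not the only way, and the names bucket_for, split_by_email and the output file naming are hypothetical.

```python
import csv
import zlib


def bucket_for(email, n_buckets):
    """Map an email to a stable bucket number (same result on every machine)."""
    # crc32 is used instead of hash() because str hashes vary between processes
    return zlib.crc32(email.upper().encode("utf-8")) % n_buckets


def split_by_email(in_path, out_prefix, n_buckets=8, email_col=1):
    """Write each row of in_path to one of n_buckets files, chosen by email."""
    paths = ["%s_%d.csv" % (out_prefix, i) for i in range(n_buckets)]
    files = [open(p, "w", newline="") for p in paths]
    try:
        writers = [csv.writer(f, quoting=csv.QUOTE_ALL) for f in files]
        with open(in_path, newline="") as src:
            for row in csv.reader(src):
                if row:  # skip blank lines
                    writers[bucket_for(row[email_col], n_buckets)].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths
```

Each resulting file can then be converted, sorted and processed on its own, and the per-chunk results simply concatenated, because no email spans two chunks.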

/usr/bin/gawk -F'","' '
BEGIN {
    # build the month-name -> month-number lookup once
    split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ");
    for (i=1; i<=12; i++) mdigit[month[i]]=i;
}
{
    # $4 looks like 26JUL2010; append its epoch seconds as a new last field
    print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00")
}' < input.txt | /usr/bin/sort -t, -k2,2 -k7,7n > output_file.txt

output_file.txt:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000
...

You then pipe the output to a Perl, Python or AWK script to process steps 2. through 4.
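
A minimal Python 3 sketch of steps 2. through 4. is below. It assumes the sorted lines have the email in the second field and the epoch seconds in the seventh, as in output_file.txt above; the function name annotate and its parameters are made up for illustration. The "same email, same timestamp" case needs no special branch, because the day difference simply comes out as 0.

```python
import csv
import io

SECONDS_PER_DAY = 86400


def annotate(sorted_lines, email_col=1, ts_col=6):
    """Yield each row minus its timestamp, plus a running count and the
    gap in days since the previous row of the same email."""
    email, prev_time, count = None, None, 0  # step 2: the three variables
    for row in csv.reader(sorted_lines):
        ts = int(row[ts_col])
        if row[email_col] != email:
            # step 3 / case c): new email, restart the counters
            email, count, days = row[email_col], 1, 0
        else:
            # cases a) and b): same email, gap is 0 when the timestamp repeats
            count += 1
            days = (ts - prev_time) // SECONDS_PER_DAY
        prev_time = ts
        yield row[:ts_col] + [count, days]


sample = io.StringIO(
    '"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1280102400\n'
    '"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439",1291852800\n'
    '"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",1292112000\n'
    '"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006",1292976000\n'
)
results = list(annotate(sample))
for r in results:
    print(r[1], r[-2], r[-1])
```

Because it only keeps three values between lines, this runs in constant memory regardless of file size; in real use you would pass sys.stdin instead of the sample.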

Question 2

Make sure you have pandas 0.11, and read these docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and these recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore (esp. the 'merging on millions of rows' one).

Here is a solution that seems to work. Here is the workflow:

Read data from your csv by chunks and append it to an hdfstore
Iterate over the store, creating another store that does the combining

Essentially we are taking a chunk from the table and combining with a chunk from every other part of the file. The combiner function does not reduce, but instead calculates your function (the diff in days) between all elements in that chunk, eliminating duplicates as you go, and taking the latest data after each loop. Kind of like a recursive reduce almost.
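
To make the per-email calculation concrete, here is a small pure-Python 3 sketch of what the combiner produces for each email: duplicates dropped, dates sorted, a running count, and the difference in days from that email's first date. It works on an in-memory list, so it only illustrates the calculation, not the chunked HDFStore workflow; the function name combine is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def combine(records):
    """records: iterable of (email, 'DDMONYYYY') pairs.
    Returns a list of (email, date, count, diff_days) tuples."""
    by_email = defaultdict(set)  # a set drops duplicate (email, date) pairs
    for email, raw in records:
        by_email[email].add(datetime.strptime(raw, "%d%b%Y"))
    out = []
    for email in sorted(by_email):
        dates = sorted(by_email[email])
        for count, d in enumerate(dates, start=1):
            # diff is measured from the earliest date seen for this email
            out.append((email, d, count, (d - dates[0]).days))
    return out


records = [
    ("0001HARISH@GMAIL.COM", "22DEC2010"),
    ("0001HARISH@GMAIL.COM", "12DEC2010"),
    ("0001HARISH@GMAIL.COM", "09DEC2010"),
    ("000AYUSH@GMAIL.COM", "25JAN2013"),
    ("000AYUSH@GMAIL.COM", "28NOV2012"),
    ("000AYUSH@GMAIL.COM", "29NOV2012"),
    ("000AYUSH@GMAIL.COM", "29NOV2012"),
    ("000AYUSH@GMAIL.COM", "08DEC2012"),
    ("000AYUSH@GMAIL.COM", "12DEC2012"),
]
combined = combine(records)
for email, d, count, diff in combined:
    print(count, d.date(), diff, email)
```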

The chunked approach should be O(num_of_chunks**2) in memory and calculation time. In your case, chunksize could be, say, 1m (or more).

Output:

processing [0] [datastore.h5]
processing [1] [datastore_0.h5]
    count                date  diff                        email
4       1 2011-06-24 00:00:00     0           0000.ANU@GMAIL.COM
1       1 2011-06-24 00:00:00     0          00000.POO@GMAIL.COM
0       1 2010-07-26 00:00:00     0           00000000@11111.COM
2       1 2013-01-01 00:00:00     0         0000650000@YAHOO.COM
3       1 2013-01-26 00:00:00     0       00009.GAURAV@GMAIL.COM
5       1 2011-10-29 00:00:00     0          0000MANNU@GMAIL.COM
6       1 2011-11-21 00:00:00     0    0000PRANNOY0000@GMAIL.COM
7       1 2011-06-26 00:00:00     0  0000PRANNOY0000@YAHOO.CO.IN
8       1 2012-10-25 00:00:00     0          0000RAHUL@GMAIL.COM
9       1 2011-05-10 00:00:00     0            0000SS0@GMAIL.COM
12      1 2010-12-09 00:00:00     0         0001HARISH@GMAIL.COM
11      2 2010-12-12 00:00:00     3         0001HARISH@GMAIL.COM
10      3 2010-12-22 00:00:00    13         0001HARISH@GMAIL.COM
14      1 2012-11-28 00:00:00     0           000AYUSH@GMAIL.COM
15      2 2012-11-29 00:00:00     1           000AYUSH@GMAIL.COM
17      3 2012-12-08 00:00:00    10           000AYUSH@GMAIL.COM
18      4 2012-12-12 00:00:00    14           000AYUSH@GMAIL.COM
13      5 2013-01-25 00:00:00    58           000AYUSH@GMAIL.COM

Code (Python 2, written against pandas 0.11):

import pandas as pd
import StringIO
import numpy as np
from time import strptime
from datetime import datetime

# your data
data = """
"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
"""


# read in and create the store
data_store_file = 'datastore.h5'
store = pd.HDFStore(data_store_file,'w')

def dp(x, **kwargs):
    return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]

chunksize=5
reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
                     header=0,usecols=['email','date'],parse_dates=['date'],
                     date_parser=dp, chunksize=chunksize)

for i, chunk in enumerate(reader):
    chunk['indexer'] = chunk.index + i*chunksize

    # create the global index, and keep it in the frame too
    df = chunk.set_index('indexer')

    # need to set a minimum size for the email column
    store.append('data',df,min_itemsize={'email' : 100})

store.close()

# define the combiner function
def combiner(x):

    # given a group of emails (the same), return a combination
    # with the new data

    # sort by the date
    y = x.sort('date')

    # calc the diff in days (stored as a float)
    y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))
    y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')  
    
    return y

# reduce the store (and create a new one by chunks)
in_store_file = data_store_file
in_store1 = pd.HDFStore(in_store_file)

# iter on the store 1
for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
    print "processing [%s] [%s]" % (chunki,in_store_file)

    out_store_file = 'datastore_%s.h5' % chunki
    out_store = pd.HDFStore(out_store_file,'w')

    # iter on store 2
    in_store2 = pd.HDFStore(in_store_file)
    for df2 in in_store2.select('data',chunksize=chunksize):

        # concat & drop dups
        df = pd.concat([df1,df2]).drop_duplicates(['email','date'])

        # group and combine
        result = df.groupby('email').apply(combiner)
            
        # remove the mi (that we created in the groupby)
        result = result.reset_index('email',drop=True)
            
        # only store those rows which are in df2!
        result = result.reindex(index=df2.index).dropna()

        # store to the out_store
        out_store.append('data',result,min_itemsize={'email' : 100})
    in_store2.close()
    out_store.close()
    in_store_file = out_store_file

in_store1.close()

# show the reduced store
print pd.read_hdf(out_store_file,'data').sort(['email','diff'])

Question 3

Use the built-in sqlite3 database: you can insert the data, sort and group as necessary, and there's no problem using a file which is larger than available RAM.
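
As a rough sketch of what that could look like in Python 3: load (email, date) pairs into a table, let SQLite do the de-duplication and sorting, and compute the count and day difference while streaming over the ordered rows. The table name bookings, the function name summarize, and the choice to measure the difference from each email's first date are assumptions made for illustration.

```python
import sqlite3
from datetime import date, datetime
from itertools import groupby


def summarize(rows, db_path=":memory:"):
    """rows: iterable of (email, 'DDMONYYYY') pairs.
    Yields (email, iso_day, count, diff_days) in email/date order."""
    # pass a real file path as db_path when the data is larger than RAM
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS bookings (email TEXT, day TEXT)")
        conn.executemany(
            "INSERT INTO bookings VALUES (?, ?)",
            # ISO dates (YYYY-MM-DD) sort correctly as plain text
            ((e, datetime.strptime(d, "%d%b%Y").date().isoformat()) for e, d in rows),
        )
        conn.commit()
        cur = conn.execute(
            "SELECT DISTINCT email, day FROM bookings ORDER BY email, day"
        )
        for email, group in groupby(cur, key=lambda r: r[0]):
            first = None
            for count, (_, day) in enumerate(group, start=1):
                current = date.fromisoformat(day)
                if first is None:
                    first = current
                yield (email, day, count, (current - first).days)
    finally:
        conn.close()


sample_rows = [
    ("0001HARISH@GMAIL.COM", "22DEC2010"),
    ("0001HARISH@GMAIL.COM", "12DEC2010"),
    ("0001HARISH@GMAIL.COM", "09DEC2010"),
    ("00000000@11111.COM", "26JUL2010"),
]
summary = list(summarize(sample_rows))
for line in summary:
    print(line)
```

For the full file you would feed summarize from a csv.reader over the input instead of a list, so that only the database, not Python, holds all the rows.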