Question

I am trying to write a Python filter for a log file like this:

 Thu Oct  4 23:14:40 2012 [pid 16901] CONNECT: Client "66.249.74.228"
 Thu Oct  4 23:14:40 2012 [pid 16900] [ftp] OK LOGIN: Client "66.249.74.228", anon     password "googlebot@google.com"
 Thu Oct  4 23:17:42 2012 [pid 16902] [ftp] FAIL DOWNLOAD: Client "66.249.74.228",   "/pub/10.5524/100001_101000/100039/Assembly-2011/Pa9a_assembly_config4.scafSeq.gz",  14811136 bytes, 79.99Kbyte/sec
 Fri Oct  5 00:04:13 2012 [pid 25809] CONNECT: Client "66.249.74.228"
 Fri Oct  5 00:04:14 2012 [pid 25808] [ftp] OK LOGIN: Client "66.249.74.228", anon password "googlebot@google.com"
 Fri Oct  5 00:07:16 2012 [pid 25810] [ftp] FAIL DOWNLOAD: Client "66.249.74.228", "/pub/10.5524/100001_101000/100027/Raw_data/PHOlcpDABDWABPE/090715_I80_FC427DJAAXX_L8_PHOlcpDABDWABPE_1.fq.gz", 14811136 bytes, 79.99Kbyte/sec
 Fri Oct  5 00:13:19 2012 [pid 27354] CONNECT: Client "1.202.186.53"
 Fri Oct  5 00:13:19 2012 [pid 27353] [ftp] OK LOGIN: Client "1.202.186.53", anon password "mozilla@example.com"
 Fri Oct  5 00:13:33 2012 [pid 27355] [ftp] FAIL DOWNLOAD: Client "1.202.186.53", "/pub", 0.00Kbyte/sec
 Fri Oct  5 00:26:04 2012 [pid 341] [ftp] OK DOWNLOAD: Client "210.72.156.68", "/pub/10.5524/100001_101000/100030/RNA-Seq/Mgo_2.fq.gz", 1985229528 bytes, 85.87Kbyte/sec
 Fri Oct  5 00:55:45 2012 [pid 2766] CONNECT: Client "157.82.250.217"
 Fri Oct  5 00:55:45 2012 [pid 2765] [ftp] OK LOGIN: Client "157.82.250.217", anon password "mozilla@example.com"
 Fri Oct  5 00:56:05 2012 [pid 2767] [ftp] FAIL DOWNLOAD: Client "157.82.250.217", "/pub/10.5524/100001_101000/100036/Gene_catalogue/Gene_catalogue.pep", 1638400 bytes, 81.81Kbyte/sec
 Fri Oct  5 00:57:27 2012 [pid 3056] CONNECT: Client "157.82.250.217"
 Fri Oct  5 00:57:27 2012 [pid 3055] [ftp] OK LOGIN: Client "157.82.250.217", anon password "-wget@"

The log file contains some robot access records. How can I use a Python filter to keep only the access records of real people? I have already built a filter that extracts the last week's records, so can you help me add the robot filter to it?

import time
f= open("/opt/CLiMB/Storage1/log/vsftp.log")
def OnlyRecent(line):
    if  time.strptime(line.split("[")[0].strip(),"%a %b %d %H:%M:%S %Y")>  time.gmtime(time.time()-(60*60*24*7)): 
        return True
    return False
filename= time.strftime('%Y%m%d')+'.log'
f1= open(filename,'w')
for line in f:
    if OnlyRecent(line):
        print(line)
        f1.write(line)
f.close()
f1.close()

Solution

If you identify a robot client by its password (googlebot@google.com looks like an actual robot), you can split the line and check whether the second part contains a robot e-mail:

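# Add additional robot e-mails here
robot_emails = ["googlebot@google.com"]

def isRobotRecord(line):
    # Look only at the text after "Client"; a line without it is not a robot record
    parts = line.split("Client", 1)
    if len(parts) < 2:
        return False
    for email in robot_emails:
        if email in parts[1]:
            return True
    # Return False only after every e-mail has been checked
    return False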
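
To add this check to the weekly filter from the question, the two tests can be combined into one predicate. The following is only a sketch: the names `is_recent`, `is_robot_record`, `keep_line` and `filter_log` are made up for illustration, and it assumes the log timestamps are in the server's local time (it uses `time.mktime`, whereas the question compares against `time.gmtime`).

```python
import time

# Assumed list of robot passwords; extend it with whatever you see in your log.
ROBOT_EMAILS = ["googlebot@google.com"]
WEEK_SECONDS = 60 * 60 * 24 * 7
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def is_recent(line, now=None):
    """Return True if the line's timestamp lies within the last seven days."""
    now = time.time() if now is None else now
    stamp = line.split("[")[0].strip()
    try:
        parsed = time.strptime(stamp, STAMP_FORMAT)
    except ValueError:
        return False  # blank or malformed line
    # Assumption: the timestamp is local time, so mktime() is appropriate.
    return time.mktime(parsed) > now - WEEK_SECONDS


def is_robot_record(line):
    """Return True if the text after "Client" contains a known robot e-mail."""
    parts = line.split("Client", 1)
    return len(parts) == 2 and any(email in parts[1] for email in ROBOT_EMAILS)


def keep_line(line, now=None):
    """Keep lines that are recent and do not carry a robot password."""
    return is_recent(line, now) and not is_robot_record(line)


def filter_log(source_path, target_path):
    """Copy the lines accepted by keep_line() from one file to another."""
    with open(source_path) as source, open(target_path, "w") as target:
        for line in source:
            if keep_line(line):
                target.write(line)
```

Calling `filter_log("/opt/CLiMB/Storage1/log/vsftp.log", time.strftime("%Y%m%d") + ".log")` would then reproduce the loop from the question. Note that this drops only the lines that contain the password; the CONNECT and DOWNLOAD lines of the same client are kept.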

OTHER TIPS

You can group events by some identifier. I thought about the pid, but it seems every line in your log has a different pid. You can use the IP address for each group and start a new group whenever you find CONNECT: Client "[IP]", but this will fail if clients from one IP address have several sessions open at the same time. Without a session identifier it is hard to decide which lines belong to one session (group).

Once the events are grouped, check each group for a "sign" left by a bot, such as: anon password "googlebot@google.com"
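
A rough sketch of this grouping idea follows. The names `group_sessions`, `human_sessions` and `BOT_SIGNS` are hypothetical, and the code inherits the limitation described above: overlapping sessions from the same IP address end up merged or split incorrectly.

```python
import re

# Matches the quoted client address, e.g. Client "66.249.74.228"
CLIENT_RE = re.compile(r'Client "([^"]+)"')

# Assumed bot signatures; add other passwords you consider robots.
BOT_SIGNS = ["googlebot@google.com"]


def group_sessions(lines):
    """Split lines into per-IP sessions; a CONNECT line opens a new session."""
    sessions = []   # every session, in order of first appearance
    current = {}    # ip -> the session currently open for that IP
    for line in lines:
        match = CLIENT_RE.search(line)
        if match is None:
            continue  # no client address on this line, so it cannot be grouped
        ip = match.group(1)
        if "CONNECT:" in line or ip not in current:
            current[ip] = []
            sessions.append(current[ip])
        current[ip].append(line)
    return sessions


def human_sessions(lines):
    """Return only the sessions in which no line carries a bot signature."""
    return [
        session
        for session in group_sessions(lines)
        if not any(sign in line for line in session for sign in BOT_SIGNS)
    ]
```

Unlike a per-line check, this removes the CONNECT and DOWNLOAD lines of a robot session as well, because the whole group is discarded once one of its lines shows the sign.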

Licensed under: CC-BY-SA with attribution