Question

which is the best way to parse a log file like this using R?

- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769

I must consider border cases like having 2 IP's in a single line (internal and external).

Thanks!

Was it helpful?

Solution

For this example it suffices to replace the leading dashes with two NA's and the commas with spaces. You can then parse with read.table()

datlog <- readLines(textConnection('- - - [20/Nov/2011:01:16:29 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 279
- - - [20/Nov/2011:01:16:29 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:16:29 +0100] "GET /IDEE-ServicesSearch/ServicesSearch.html?locale=es HTTP/1.1" 200 1665
- - - [20/Nov/2011:01:16:29 +0100] "GET /search/indexLayout.jsp?PAGELANGUAGE=es HTTP/1.1" 200 9874
- - - [20/Nov/2011:01:16:29 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.1" 200 12058
- - - [20/Nov/2011:01:16:30 +0100] "POST /csw/servlet/cswservlet HTTP/1.1" 200 258038
- - - [20/Nov/2011:01:17:09 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769
- - - [20/Nov/2011:01:17:33 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "POST /csw/?locale=es HTTP/1.0" 200 2536
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /DescargaFenomenos/index.jsp HTTP/1.0" 200 11769
192.168.69.10, 62.97.81.202 - - [20/Nov/2011:01:17:34 +0100] "GET /clientesIGN/wmsGenericClient/index.html?lang=ES HTTP/1.0" 200 12058
- - - [20/Nov/2011:01:17:39 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:17:46 +0100] "GET //csw/servlet/cswservlet?request=GetCapabilities&service=CSW&version=2.0.2 HTTP/1.1" 200 8867
- - - [20/Nov/2011:01:18:10 +0100] "GET //show.do?to=pideep_pidee.ES HTTP/1.1" 200 26647
- - - [20/Nov/2011:01:19:01 +0100] "GET //DescargaFenomenos/index.jsp HTTP/1.1" 200 11769'))
 datlog <- gsub("^-", "NA NA", datlog)
 datlog <- sub("\\,", "   ", datlog)
 datlog<-read.table(text=datlog, fill=TRUE)
 datlog

Spacedman asked about datetime parsing:

datlog[['dtime']] <- as.POSIXct( paste( sub("\\[", "", datlog[[5]]), 
                                         sub("\\]", "", datlog[[6]]) ),
                                 format="%d/%b/%Y:%H:%M:%S %z")
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top