Question

I'm trying to parse my access log file. To do this, I read the file line by line, extract the useful info from each line, and finally add it to a database.

For example, a line would look like this.

124.99.152.202 - naveen [22/Nov/2013:10:41:17 +1300] "GET /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,IneedThisInteger/12.txt HTTP/1.1" "200" "3" "-" "Mozilla/5.0" "-"

I only know how to extract the IP address (using this).

I want to extract

  1. the request value: GET /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,IneedThisInteger/12.txt HTTP/1.1

  2. the integer value inside that request: IneedThisInteger

  3. the status: 200

  4. the byte count: 3

Sometimes the last portion of the request URL changes:

"GET /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,IneedThisInteger/FOLDER/12.txt HTTP/1.1"
"GET /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,IneedThisInteger/FOLDER/ANOTHER FOLDER/12.txt HTTP/1.1"
"GET /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,IneedThisInteger/FOLDER/ANOTHER FOLDER/HEREIS-ANOTHER-FOLDER-AND-SO-ON/12.txt HTTP/1.1"

So I really need a stable way to get those values from each line. How do I do this?


Solution

This should do the trick:

^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*?"(.*?/p/.*?,\d+,(\d+).*?)" "(\d+)" "(\d+)".*$

Here's a fiddle to demonstrate: http://www.rexfiddle.net/3sDwWut

I replaced your "I NEED THIS INTEGER" with an actual number for testing purposes, and also randomized the "bytes" and IP addresses a little bit. These are the captures, in order:

  1. The IP
  2. The request (e.g. GET xxx HTTP/1.1)
  3. The integer from the URL you wanted
  4. The HTTP status
  5. The byte count
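
As a rough illustration of how those five captures could be read out line by line, here is a minimal Python sketch. The language, the `parse_line` helper, the dictionary keys, and the stand-in integer `4711` are all assumptions made for the example; the question does not say which language is being used.

```python
import re

# The pattern from the answer above, split across two string literals for readability.
LOG_PATTERN = re.compile(
    r'^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*?'
    r'"(.*?/p/.*?,\d+,(\d+).*?)" "(\d+)" "(\d+)".*$'
)

def parse_line(line):
    """Return the five captures as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, request, number, status, size = match.groups()
    return {
        "ip": ip,
        "request": request,
        "number": int(number),   # the integer embedded in the URL
        "status": int(status),
        "bytes": int(size),
    }

# Sample line from the question, with 4711 standing in for "IneedThisInteger".
sample = (
    '124.99.152.202 - naveen [22/Nov/2013:10:41:17 +1300] '
    '"GET /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,4711/FOLDER/ANOTHER FOLDER/12.txt HTTP/1.1" '
    '"200" "3" "-" "Mozilla/5.0" "-"'
)

print(parse_line(sample))
```

Returning `None` for non-matching lines lets the caller decide whether to skip or log them before anything is written to the database.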

OTHER TIPS

Assuming the requests are always GET requests, this should do the trick:

"(GET /.*?/.*?,\d+,(\d+)/.*?)"\s"(\d+)"\s"(\d+)"

See regex101.com for an explanation of the expression.
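
A short Python sketch of applying this expression to several lines might look like the following. Python, the `extract_get_fields` helper, and the sample lines (with `4711` standing in for the integer) are assumptions for illustration only. Because the pattern is not anchored at the start of the line, it is applied with `search` rather than `match`, and lines that are not GET requests are skipped.

```python
import re

# The GET-only pattern from this answer: request, integer, status, byte count.
GET_PATTERN = re.compile(r'"(GET /.*?/.*?,\d+,(\d+)/.*?)"\s"(\d+)"\s"(\d+)"')

def extract_get_fields(lines):
    """Yield (request, number, status, size) for every line the pattern matches."""
    for line in lines:
        match = GET_PATTERN.search(line)
        if match is None:
            continue  # not a GET request in the expected shape, so skip it
        request, number, status, size = match.groups()
        yield request, int(number), int(status), int(size)

# Hypothetical sample lines; the second is a POST and is therefore ignored.
lines = [
    '124.99.152.202 - naveen [22/Nov/2013:10:41:17 +1300] '
    '"GET /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,4711/FOLDER/12.txt HTTP/1.1" '
    '"200" "3" "-" "Mozilla/5.0" "-"',
    '124.99.152.203 - naveen [22/Nov/2013:10:41:18 +1300] '
    '"POST /p/V4ZkA5d074CTy_vbFa7nLw,1385070078,4711/12.txt HTTP/1.1" '
    '"200" "3" "-" "Mozilla/5.0" "-"',
]

for row in extract_get_fields(lines):
    print(row)
```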

Licensed under: CC-BY-SA with attribution