Question

I have the following regular expression (updated):

([0-9]{2}/[a-zA-Z]{3}/[0-9]{4})(.+)(GET|POST)\s(http://|https://)([a-zA-Z-.][a-zA-Z0-9+\.[a-zA-Z0-9-.]+)(\.)([a-zA-Z0-9]+)([\.:/\s]).+?"\s200

I also have the following lines, excerpted from a long server log (updated), which it does not handle correctly:

218.5.192.147 - - [14/Mar/2004:02:31:06 -0500] "GET http://searchanytime.com" 200 - "-" "-"
202.101.150.100 - - [12/Mar/2004:21:18:55 -0500] "GET http://nationalwholesalellc.com" 200 114887 "-" "-"

It works as planned for these lines:

220.173.17.142 - - [09/Mar/2004:23:32:13 -0500] "POST http://www.canada44.ca/ HTTP/1.1" 200 27095 "http://www.so123.com" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
212.160.136.163 - - [10/Mar/2004:01:01:46 -0500] "GET http://www.6seconds.org/ HTTP/1.0" 200 51937 "http://www.helavasearch.com/cgi-bin/search.cgi?username=amundii&keywords=parenting" "Mozilla/4.0 (compatible; MSIE 4.0; Windows 98)"
218.72.85.59 - - [10/Mar/2004:01:05:13 -0500] "GET http://hpcgi1.nifty.com/trino/ProxyJ/prxjdg.cgi HTTP/1.1" 200 2221 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"

In almost every case, group 7 is the top-level domain (com, cn, org, etc.). However, for lines that have .com" instead of .com HTTP/1.1", it doesn't work: group 7 comes back as "searchanytime" instead of "com".

I've been using www.regexr.com


Solution

The Regex

From what I can see of what you are trying to do, I came up with the following:

(\d{2}/\w{3}/\d{4})(.+)(GET|POST)\s(http://|https://)(\w+)?\.?([\w\d]+)\.(\w+).*?200
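As a quick check, here is a minimal Python sketch that runs the pattern above over sample lines from the question and collects group 7. It assumes Python's re module (other regex flavours may behave slightly differently), and the names PATTERN, LOG_LINES and top_level_domains are only for this illustration.

```python
import re

# The pattern from the answer, split over two string literals for readability.
PATTERN = re.compile(
    r'(\d{2}/\w{3}/\d{4})(.+)(GET|POST)\s(http://|https://)'
    r'(\w+)?\.?([\w\d]+)\.(\w+).*?200'
)

# Sample lines copied from the question.
LOG_LINES = [
    '218.5.192.147 - - [14/Mar/2004:02:31:06 -0500] "GET http://searchanytime.com" 200 - "-" "-"',
    '220.173.17.142 - - [09/Mar/2004:23:32:13 -0500] "POST http://www.canada44.ca/ HTTP/1.1" 200 27095 "http://www.so123.com" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"',
    '218.72.85.59 - - [10/Mar/2004:01:05:13 -0500] "GET http://hpcgi1.nifty.com/trino/ProxyJ/prxjdg.cgi HTTP/1.1" 200 2221 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"',
]

def top_level_domains(lines):
    """Return group 7 for every line the pattern matches."""
    tlds = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            tlds.append(m.group(7))
    return tlds

print(top_level_domains(LOG_LINES))  # ['com', 'ca', 'com']
```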

The Breakdown

I'll break the regex down so that, if it's not 100% what you're looking for, it will hopefully put you on your way.

group1

(\d{2}/\w{3}/\d{4})

Captures the date of the log entry in DD/MMM/YYYY format (e.g. 14/Mar/2004).
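If the captured date needs to be used as a real date rather than a string, something like the following sketch could work. It assumes Python and English month abbreviations; DATE_RE and parse_log_date are names made up for this example. Since \w{3} also accepts things that are not month names, strptime would raise ValueError on such input.

```python
import re
from datetime import datetime

# Same sub-pattern as group 1 of the full regex.
DATE_RE = re.compile(r'(\d{2}/\w{3}/\d{4})')

def parse_log_date(line):
    """Return the entry's date as a datetime.date, or None if no date is found."""
    m = DATE_RE.search(line)
    if m is None:
        return None
    # %b expects an abbreviated month name in the current locale (English assumed).
    return datetime.strptime(m.group(1), '%d/%b/%Y').date()

line = '218.5.192.147 - - [14/Mar/2004:02:31:06 -0500] "GET http://searchanytime.com" 200 - "-" "-"'
print(parse_log_date(line))  # 2004-03-14
```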

group2

(.+)

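Captures the filler between this and the next group. For your first example, this will match :02:31:06 -0500] " (up to and including the opening quote). Note that .+ is greedy, so if POST or GET appears more than once on a line, this group runs up to the last occurrence from which the rest of the pattern can still match.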
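Because .+ is greedy, the filler group can swallow more than expected when GET or POST occurs again later in the line. The sketch below compares it with a lazy .+? on a made-up fragment (the example.com URL is hypothetical and not from the log); only the first three groups of the pattern are used to keep it short.

```python
import re

# Hypothetical fragment in which "GET" appears twice: as the method and inside the URL.
fragment = '[14/Mar/2004:02:31:06 -0500] "GET http://example.com/GET stuff'

# Greedy filler: backtracks from the end of the line, so it stops at the LAST usable GET.
greedy = re.search(r'(\d{2}/\w{3}/\d{4})(.+)(GET|POST)\s', fragment)
# Lazy filler: grows one character at a time, so it stops at the FIRST usable GET.
lazy = re.search(r'(\d{2}/\w{3}/\d{4})(.+?)(GET|POST)\s', fragment)

print(repr(greedy.group(2)))  # ':02:31:06 -0500] "GET http://example.com/'
print(repr(lazy.group(2)))    # ':02:31:06 -0500] "'
```

For the sample lines in the question the two behave the same, since each contains the method only once.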

group3

(GET|POST)

Pretty self-explanatory: the request method, either GET or POST.

filler

\s

Matches a single whitespace character that we don't care about.

group4

(http://|https://)

Also pretty straightforward: the URL scheme.

group5

This is where I think your regex broke down.

(\w+)?\.?

This is meant to match the www or hpcgi1 portion of the log entry. Note the ? characters, which make both this group and the dot after it optional. This is for cases such as

[14/Mar/2004:02:31:06 -0500] "GET http://searchanytime.com" 200 - "-" "-"

group6

([\w\d]+)

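The middle portion (e.g. canada44, nifty). The intent is for this group to hold the first portion when there is no subdomain (e.g. searchanytime), but as written, backtracking leaves group 5 with searchanytim and this group with e; group 7 is still com.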

group7

\.(\w+)

The end portion (e.g. com, org), preceded by a literal dot that is matched but not captured.
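To see how groups 5 to 7 actually split a host name, the following Python sketch compares the pattern as written with a hypothetical variant in which the subdomain and its dot are optional together, (?:(\w+)\.)?. The variant is an assumption on my part, not part of the original regex; PREFIX, ORIGINAL, VARIANT and host_parts are illustrative names.

```python
import re

PREFIX = r'(\d{2}/\w{3}/\d{4})(.+)(GET|POST)\s(http://|https://)'

# Host part as written in the answer.
ORIGINAL = re.compile(PREFIX + r'(\w+)?\.?([\w\d]+)\.(\w+).*?200')
# Hypothetical variant: the subdomain and its trailing dot are optional as a unit.
VARIANT = re.compile(PREFIX + r'(?:(\w+)\.)?([\w\d]+)\.(\w+).*?200')

def host_parts(pattern, line):
    """Return groups 5, 6 and 7 as a tuple, or None when the line does not match."""
    m = pattern.search(line)
    return m.group(5, 6, 7) if m else None

# First sample line from the question (no subdomain).
bare = '218.5.192.147 - - [14/Mar/2004:02:31:06 -0500] "GET http://searchanytime.com" 200 - "-" "-"'
# Sample line with a subdomain; referrer and user agent shortened to "-" for brevity.
www = '212.160.136.163 - - [10/Mar/2004:01:01:46 -0500] "GET http://www.6seconds.org/ HTTP/1.0" 200 51937 "-" "-"'

print(host_parts(ORIGINAL, bare))  # ('searchanytim', 'e', 'com')
print(host_parts(VARIANT, bare))   # (None, 'searchanytime', 'com')
print(host_parts(VARIANT, www))    # ('www', '6seconds', 'org')
```

Either way group 7 is the top-level domain; the variant only matters if groups 5 and 6 are used as well. Neither version handles hyphens in host names, since \w does not match them.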

filler

.*?

Any characters (as few as possible) between the 'com', 'org', etc. and the 200. If you want to reference any of this, you should capture it.

the end

200

Matches a literal 200. Note that because the filler above is lazy (the ? after .*), this will be the first 200 the match encounters after group 7.
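One consequence of matching the first 200 after the host is that a 200 inside the URL path also counts. The sketch below illustrates this on a made-up log entry (example.com and the 404 status are hypothetical) and compares it with a stricter ending borrowed from the question's own pattern, "\s200, plus a trailing \s that I added as an assumption. LOOSE and STRICT are illustrative names.

```python
import re

HEAD = (r'(\d{2}/\w{3}/\d{4})(.+)(GET|POST)\s(http://|https://)'
        r'(\w+)?\.?([\w\d]+)\.(\w+)')

# Ending as in the answer: any first "200" after the host.
LOOSE = re.compile(HEAD + r'.*?200')
# Stricter ending: closing quote, whitespace, then 200 followed by whitespace.
STRICT = re.compile(HEAD + r'.*?"\s200\s')

# Hypothetical entry: the status is 404, but "200" occurs inside the path.
not_ok = '10.0.0.1 - - [10/Mar/2004:01:05:13 -0500] "GET http://www.example.com/200/index.html HTTP/1.1" 404 512 "-" "-"'
# First sample line from the question, which really has status 200.
ok = '218.5.192.147 - - [14/Mar/2004:02:31:06 -0500] "GET http://searchanytime.com" 200 - "-" "-"'

print(bool(LOOSE.search(not_ok)))   # True: matched the 200 in the path
print(bool(STRICT.search(not_ok)))  # False
print(bool(STRICT.search(ok)))      # True
```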

Disclaimer

I have not actually tested this regex on your log messages outside of an online regex tool. I am not sure of the grouping you want/need, but hopefully this helps a little.

Licensed under: CC-BY-SA with attribution