Question

This is an example of a 'valid' line in my log file:

194.81.31.125 - - [129/Dec/2013:22:03:09 -0500] "GET http://www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12 HTTP/1.0" 200 558 "http://Afrique" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"

I've got this re.findall expression:

(GET|POST)\s(http://|https//)[a-zA-Z]+.+?"\s200

This expression contains all the rules for a valid line, but doesn't extract the domain.

I want to count the top-level domains, in this case "hu", for each date and dump the count for each domain into an organized log file. I also want to extract the non-valid lines into a different log file.

Ideally, the output is:

12/Dec/2013[tab]as:1[tab]ab:2[tab]hu:4

13/Dec/2013[tab]as:4[tab]br:7[tab]cd:8


Solution

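Of course it doesn't extract the domain: the part of the pattern that matches the domain isn't wrapped in parentheses, so it isn't a capturing group. (Separately, https// in your pattern is missing its colon, so it can never match an https URL; it should be https://.)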

So, the first thing to do is to add the parentheses:

r'(GET|POST)\s(http://|https://)([a-zA-Z]+.+?)"\s200'

But that's still not right, as it will capture the entire www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12 HTTP/1.0, not just the www.firemaiden.hu. That's because you only have one group of letters followed by anything at all up to a quote. You want just letters and dots (which isn't actually correct for DNS, but let's ignore that for the moment). Like this:

r'(GET|POST)\s(http://|https://)([a-zA-Z\.]+).+?"\s200'

And now you get www.firemaiden.hu.

But you wanted just the hu, right? So, what you really need is as many letters and dots as possible, up to a final group of just letters after a dot:

r'(GET|POST)\s(http://|https://)[a-zA-Z\.]+\.([a-zA-Z]+).+?"\s200'
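To see the difference between the three patterns, here is a small sketch that runs each one against the sample line from the question. The variable names are made up for illustration, and the scheme group is written as (https?://) on the assumption that https// without a colon was a typo.

```python
import re

line = ('194.81.31.125 - - [129/Dec/2013:22:03:09 -0500] '
        '"GET http://www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12 HTTP/1.0" '
        '200 558 "http://Afrique" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"')

# Group 3 is the new capturing group in each pattern.
whole_rest = re.compile(r'(GET|POST)\s(https?://)([a-zA-Z]+.+?)"\s200')
host_only = re.compile(r'(GET|POST)\s(https?://)([a-zA-Z\.]+).+?"\s200')
tld_only = re.compile(r'(GET|POST)\s(https?://)[a-zA-Z\.]+\.([a-zA-Z]+).+?"\s200')

rest = whole_rest.search(line).group(3)
host = host_only.search(line).group(3)
tld = tld_only.search(line).group(3)

print(rest)  # www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12 HTTP/1.0
print(host)  # www.firemaiden.hu
print(tld)   # hu
```

The last pattern works because the greedy [a-zA-Z\.]+ first swallows the whole host and then backtracks to the final dot, leaving only the last label for the capturing group.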

However, you will want to read the rules on DNS names, which are actually up to each root server, in theory. But anything under the standard world roots follows the LDH rule: letters, digits, hyphens. So, you want [a-zA-Z0-9.-] (with the hyphen last, so it can't be read as a range), right?

But many servers will also accept underscores and treat them as hyphens, and some servers will decode IDNA (punycode) names to Unicode for logging, so even that may not be right.
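As a rough sketch of what the LDH version looks like, and where it stops working, something like the following could be used. The names HOST_TLD and tld_of are hypothetical, and the pattern deliberately ignores the underscore and IDNA cases mentioned above.

```python
import re

# LDH host: letters, digits, hyphens, with dots between labels.
# Group 1 is the last label (the top-level domain).
HOST_TLD = re.compile(r'https?://[a-zA-Z0-9.-]+\.([a-zA-Z0-9-]+)')

def tld_of(url):
    """Return the lowercased last host label, or None if the host is not LDH."""
    m = HOST_TLD.match(url)
    return m.group(1).lower() if m else None

print(tld_of('http://www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12'))  # hu
print(tld_of('http://my-site.example.com:8080/x'))                       # com
print(tld_of('http://my_site.com/'))                                     # None
```

The underscore case returns None, which is exactly the kind of line you would have to decide whether to count as valid.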

All that being said, I think that, rather than use a regexp you didn't know how to write and may not understand, you should go with a simpler regexp that gets just the URL (which you already know how to do), and then use a dedicated URL parser to crack it:

r'(GET|POST)\s(\S+)\s.*?"\s200'

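Then, for each tuple match that re.findall returns (match[0] is the method, match[1] is the URL):

import urllib.parse

p = urllib.parse.urlparse(match[1])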

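Now p.scheme is your 'http' or 'https', p.netloc is 'www.firemaiden.hu' (which you can easily call .split('.')[-1] on), etc. If a URL might carry a port or credentials, use p.hostname rather than p.netloc, since netloc includes them.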
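Putting the pieces together, a minimal sketch of the whole job (counting top-level domains per date and setting invalid lines aside) might look like this. Everything here is an assumption layered on the question: the names LINE_RE, date_and_tld, summarize and format_report are made up, the date is taken from the bracketed timestamp, and "valid" is taken to mean a GET or POST of an http/https URL with a dotted host and a 200 status.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Date inside [...], then method and URL, then the closing quote and a 200 status.
LINE_RE = re.compile(r'\[([^:\]]+):[^\]]*\]\s"(GET|POST)\s(\S+)\s[^"]*"\s200\b')

def date_and_tld(line):
    """Return (date, tld) for a valid line, or None for anything else."""
    m = LINE_RE.search(line)
    if not m:
        return None
    try:
        p = urlparse(m.group(3))
        # hostname is lowercased and excludes any port or credentials
        host = (p.hostname or '').rstrip('.')
    except ValueError:  # urlparse rejects some malformed hosts
        return None
    if p.scheme not in ('http', 'https') or '.' not in host:
        return None
    return m.group(1), host.rsplit('.', 1)[-1]

def summarize(lines):
    """Count TLDs per date; collect lines that fail the validity checks."""
    counts = defaultdict(Counter)
    invalid = []
    for line in lines:
        found = date_and_tld(line)
        if found is None:
            invalid.append(line)
        else:
            date, tld = found
            counts[date][tld] += 1
    return counts, invalid

def format_report(counts):
    """One tab-separated line per date, with TLDs in alphabetical order."""
    return ['\t'.join([date] + [f'{tld}:{n}' for tld, n in sorted(c.items())])
            for date, c in counts.items()]
```

With the log opened as a file object f, counts, invalid = summarize(f) followed by writing format_report(counts) to one file and invalid to another would give the two outputs the question asks for. Note that a host given as an IP address would be counted under its last octet, so filter those out if that matters to you.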

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow