Of course it doesn't extract the domain; you didn't put that in a capturing group by wrapping it in parentheses.
So, the first thing to do is to add the parentheses:
r'(GET|POST)\s(http://|https://)([a-zA-Z]+.+?)"\s200'
But that's still not right, as it will capture the entire www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12 HTTP/1.0, not just the www.firemaiden.hu. That's because you only have one group of letters followed by anything at all up to a quote. You want just letters and dots (which isn't actually correct for DNS, but let's ignore that for the moment). Like this:
r'(GET|POST)\s(http://|https://)([a-zA-Z\.]+).+?"\s200'
And now you get www.firemaiden.hu.
But you wanted just the .hu, right? So, what you really need is as many letters and dots as possible, up to a final group of just letters after a dot, and only that final group gets captured:
r'(GET|POST)\s(http://|https://)[a-zA-Z\.]+\.([a-zA-Z]+).+?"\s200'
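To see the difference between the three patterns, here is a minimal sketch that runs each of them against one log line. The log line is hypothetical (I'm assuming a common-log-style format around the request you quoted), and the scheme alternation is written as https:// with the colon.

```python
import re

# Hypothetical log line; only the quoted request and the 200 status matter.
line = ('192.0.2.1 - - [10/Oct/2013:13:55:36 +0000] '
        '"GET http://www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12 HTTP/1.0" '
        '200 1234')

patterns = {
    # Letters, then anything up to the closing quote: captures far too much.
    'everything': r'(GET|POST)\s(http://|https://)([a-zA-Z]+.+?)"\s200',
    # Letters and dots only: captures the whole host name.
    'host': r'(GET|POST)\s(http://|https://)([a-zA-Z\.]+).+?"\s200',
    # Only the last dot-separated label is inside the group.
    'tld': r'(GET|POST)\s(http://|https://)[a-zA-Z\.]+\.([a-zA-Z]+).+?"\s200',
}

# In all three patterns the part of interest is the third group.
results = {name: re.search(pattern, line).group(3)
           for name, pattern in patterns.items()}

for name, captured in results.items():
    print(name, '->', captured)
```

Note that the last pattern captures hu without the leading dot, because the \. sits outside the parentheses.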
However, you will want to read the rules on DNS names, which are actually up to each root server, in theory. But anything under the standard world roots follows the LDH rule: letters, digits, hyphens. So, you want [a-zA-Z0-9-\.], right?
But many servers will also accept underscores and treat them as hyphens, and some servers will decode IDNA (punycode) names to Unicode for logging, so even that may not be right.
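As a rough sketch of what the LDH version buys you, compare the letters-and-dots pattern with an LDH one on a host name that contains a digit and a hyphen. The host name and request line are made up for illustration, and I've put the hyphen last in the character class so it can't be read as part of a range. This still ignores underscores and decoded IDNA names.

```python
import re

# Letters and dots only, as in the pattern above.
letters_only = re.compile(
    r'(GET|POST)\s(http://|https://)[a-zA-Z\.]+\.([a-zA-Z]+).+?"\s200')

# LDH: letters, digits, hyphens, plus the dot that separates labels.
ldh = re.compile(
    r'(GET|POST)\s(http://|https://)[a-zA-Z0-9.-]+\.([a-zA-Z0-9-]+).+?"\s200')

# Hypothetical request whose host has a digit and a hyphen in it.
line = '"GET http://my-site2.example.org/index.html HTTP/1.0" 200 512'

letters_match = letters_only.search(line)  # no match: stops at the hyphen
ldh_match = ldh.search(line)
tld = ldh_match.group(3)

print(letters_match)
print(tld)
```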
All that being said, rather than use a regexp you didn't know how to write and may not understand, I think you should go with a simpler regexp to get just the URL (which you already know how to do), and then use a dedicated URL parser to crack it:
r'(GET|POST)\s(\S+)\s.*?200'
Then:
p = urllib.parse.urlparse(match[1])
Now p.scheme is your 'http' or 'https', p.netloc is 'www.firemaiden.hu' (which you can easily call .split('.')[-1] on), etc.
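Put together, the whole thing might look something like the sketch below. The log line is hypothetical, and I'm assuming you use re.search here, in which case the URL is group 2 of the match object (group 1 is the method). I've also used p.hostname rather than p.netloc for the split, since netloc would include a port number if the URL had one.

```python
import re
from urllib.parse import urlparse

# Hypothetical log line; only the quoted request and the 200 status matter.
line = ('192.0.2.1 - - [10/Oct/2013:13:55:36 +0000] '
        '"GET http://www.firemaiden.hu/cgi-bin/top/topsites.cgi?an12 HTTP/1.0" '
        '200 1234')

# The simple regexp: the method, then the URL as a run of non-whitespace.
m = re.search(r'(GET|POST)\s(\S+)\s.*?200', line)
method, url = m.group(1), m.group(2)

# Let the URL parser do the hard part.
p = urlparse(url)
tld = p.hostname.split('.')[-1]

print(method, p.scheme, p.netloc, p.path, p.query, tld)
```

This way the regexp never has to know what characters are legal in a host name; that question is left to the parser.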