Fetch a particular part of the URL in Python

Question 1

Here is my solution. At the end, domains holds the list of domains you expect.

import urllib.parse as urlparse  # Python 3; on Python 2 use: import urlparse
urls = [
    'https://www.google.com',
    'http://stackoverflow.com',
    'http://www.google.co.in',
    'http://domain.com',
    ]
hostnames = [urlparse.urlparse(url).hostname for url in urls]
hostparts = [hostname.split('.') for hostname in hostnames]
# Skip a leading 'www' label; otherwise take the first label
domains = [p[1] if p[0] == 'www' else p[0] for p in hostparts]
print(domains) # ==> ['google', 'stackoverflow', 'google', 'domain']

Discussion

First, we extract the host names from the list of URLs using urlparse.urlparse(). The hostnames list looks like this:

[ 'www.google.com', 'stackoverflow.com', ... ]

In the next line, we break each hostname into parts, using the dot as the separator. The hostparts list looks like this:

[ ['www', 'google', 'com'], ['stackoverflow', 'com'], ... ]

The interesting work is in the next line. This line says, "if the first part before the dot is www, then the domain is the second part (p[1]). Otherwise, the domain is the first part (p[0])." The domains list looks like this:

[ 'google', 'stackoverflow', 'google', 'domain' ]

My code does not know how to handle a host such as login.gmail.com.hk, where the wanted part is neither the first label nor the one after www. I hope someone else can solve this problem, as I am late for bed.

Update: Take a look at tldextract by John Kurkowski, which should do what you want.
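For a host like login.gmail.com.hk, one possible approach is to recognize a multi-label suffix first and then take the label just to its left. The sketch below is only an illustration under stated assumptions: KNOWN_SUFFIXES and registered_name are hypothetical names, and the suffix set is a tiny hand-written sample, not the full Public Suffix List that a tool like tldextract works from.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked sample of two-label suffixes; a real list is far longer.
KNOWN_SUFFIXES = {"com.hk", "co.in", "co.uk"}

def registered_name(url):
    """Return the label just left of the suffix, or None if there is none."""
    # hostname is None when the URL has no scheme, so fall back to "".
    labels = (urlparse(url).hostname or "").split(".")
    # Treat the last two labels as the suffix if they are in the sample set;
    # otherwise assume a single-label suffix such as "com".
    suffix_len = 2 if ".".join(labels[-2:]) in KNOWN_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return None  # nothing is left once the suffix is removed
    return labels[-suffix_len - 1]

print(registered_name("http://login.gmail.com.hk"))
print(registered_name("http://www.google.co.in"))
print(registered_name("http://stackoverflow.com"))
```

Because it counts labels from the right, this sketch does not need a special case for www. It is only as good as the suffix set you give it.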

Question 2

Why can't you just do this:

from urllib.parse import urlparse as ue  # Python 3; on Python 2 use: from urlparse import urlparse as ue
urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []
for url in urls:
    decoded = ue(url).hostname
    if decoded.startswith('www.'):
        decoded = ".".join(decoded.split('.')[1:])
    parsed.append(decoded.split('.')[0])
# parsed is now your list of domain names: ['google', 'stackoverflow']

Also, you might want to change the if statement in the for loop, because some hostnames might start with other prefixes that you would want to get rid of.
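One way to generalize that if statement is to keep the unwanted leading labels in a set and strip them in a loop. This is a minimal sketch, not a complete solution: SKIP_PREFIXES and first_label are made-up names, and which prefixes to drop is an assumption you would adjust for your own URLs.

```python
from urllib.parse import urlparse

# Assumed set of leading labels to discard; extend as needed.
SKIP_PREFIXES = {"www", "m", "login"}

def first_label(url):
    """Return the first hostname label that is not a skipped prefix."""
    # hostname is None when the URL has no scheme, so fall back to "".
    labels = (urlparse(url).hostname or "").split(".")
    # Drop unwanted leading labels, but always keep at least two labels
    # so a host like "m.com" is not reduced to its TLD.
    while len(labels) > 2 and labels[0] in SKIP_PREFIXES:
        labels.pop(0)
    return labels[0]

urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = [first_label(url) for url in urls]
print(parsed)
```

Like the loop above, this still assumes a single-label suffix such as .com, so a host ending in .co.in would need extra handling.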

Question 3

What about using a set of predefined top-level domains?

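import re
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse

# Fake top-level domains, e.g. co.uk, co.in, co.cc
# The dots are escaped so they only match a literal "."
TOPLEVEL = [r"\.co\.[a-zA-Z]+", r"\.fake\.[a-zA-Z]+"]

def TLD(rgx, host, max_len=4):  # 4 = longest middle label accepted, e.g. "fake"
    # Return the suffix at the end of host that matches rgx, or False if none does
    match = re.search("(%s)$" % rgx, host, re.IGNORECASE)
    if match and len(match.group(1).split(".")[1]) <= max_len:
        return match.group(1)
    return False

parsed = []
urls = ["http://www.mywebsite.xxx.asd.com", "http://www.dd.test.fake.uk/asd"]
for url in urls:
    h = urlparse(url).hostname
    for rgx in TOPLEVEL:
        TL = TLD(rgx, h)
        if TL:
            # Cut the matched suffix off the end, then take the label before it
            parsed.append(h[:-len(TL)].split(".")[-1])
            break
    else:
        # No fake top-level domain matched: take the label before the real one
        parsed.append(h.split(".")[-2])

print(parsed)  # ==> ['asd', 'test']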

It's a bit hacky, and maybe cryptic for some, but it does the trick, and nothing more has to be done :)