Question

I am using Python and trying to extract a particular part of a URL, as below:

from urlparse import urlparse as ue

url = "https://www.google.co.in"
img_url = ue(url).hostname

Result

www.google.co.in

case1:

Actually, I will have a number of URLs (stored in a list or somewhere else). What I want is to find the domain name in each URL, as above, and fetch the part after www. and before .co.in, that is, the string that starts after the first dot and ends before the second dot. In the present scenario that gives only google.

So if the URL given is www.gmail.com, I should fetch only gmail. Whatever URL is given, the code should fetch the part that starts after the first dot and ends before the second dot.

case2:

Also, some URLs may be given directly, like domain.com or stackoverflow.com, without www. In those cases it should fetch only domain and stackoverflow.

Finally, my intention is to fetch the main name from the URL, such as gmail, stackoverflow, or google.

If I had just one URL I could use list slicing to fetch the string, but I will have a number of URLs, so I need to fetch the wanted part dynamically, as described above.

Can anyone please let me know how to achieve this?


Solution 3

Here is my solution. At the end, domains holds the list of names you expect.

import urlparse
urls = [
    'https://www.google.com', 
    'http://stackoverflow.com',
    'http://www.google.co.in',
    'http://domain.com',
    ]
hostnames = [urlparse.urlparse(url).hostname for url in urls]
hostparts = [hostname.split('.') for hostname in hostnames]
domains = [p[1] if p[0] == 'www' else p[0] for p in hostparts]
print domains # ==> ['google', 'stackoverflow', 'google', 'domain']

Discussion

  1. First, we extract the host names from the list of URLs using urlparse.urlparse(). The hostnames list looks like this:

    [ 'www.google.com', 'stackoverflow.com', ... ]

  2. In the next line, we break each host into parts, using the dot as the separator. Each item in the hostparts looks like this:

    [ ['www', 'google', 'com'], ['stackoverflow', 'com'], ... ]

  3. The interesting work is in the next line. This line says, "if the first part before the dot is www, then the domain is the second part (p[1]). Otherwise, the domain is the first part (p[0])." The domains list looks like this:

    [ 'google', 'stackoverflow', 'google', 'domain' ]

  4. My code does not know how to handle login.gmail.com.hk: it would return login. I hope someone else can solve this problem, as I am late for bed. Update: take a look at tldextract by John Kurkowski, which should do what you want.
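Steps 1 to 3 above can be folded into a single function. The sketch below is only one way to do it: the helper name main_name is made up for illustration, and it assumes that any input without a scheme is a bare host such as domain.com (case 2 in the question). For such input urlparse returns no hostname, so the sketch prefixes "//" first. The import is written so that it runs on Python 3, where urlparse lives in urllib.parse, as well as on Python 2.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as in the answer above


def main_name(url):
    """Return the host label after an optional leading 'www' (hypothetical helper)."""
    # Assumption: input without a scheme is a bare host such as 'domain.com'.
    # Prefixing '//' makes urlparse read it as a network location, not a path.
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    parts = urlparse(url).hostname.split('.')
    return parts[1] if parts[0] == 'www' and len(parts) > 1 else parts[0]


urls = ['https://www.google.co.in', 'www.gmail.com', 'stackoverflow.com', 'http://domain.com']
print([main_name(u) for u in urls])  # ['google', 'gmail', 'stackoverflow', 'domain']
```

Like the list-comprehension version, this still returns login for login.gmail.com.hk.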

OTHER TIPS

Why can't you just do this:

from urlparse import urlparse as ue
urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []
for url in urls:
    decoded = ue(url).hostname
    if decoded.startswith('www.'):
        decoded = ".".join(decoded.split('.')[1:])
    parsed.append(decoded.split('.')[0])
# parsed is now your list of extracted names: ['google', 'stackoverflow']

Also, you might want to change the if statement in the for loop, because some host names start with other prefixes that you would also want to strip.
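One possible way to generalise that if statement is to keep a set of prefixes to discard. This is a rough sketch, not part of the original answer: the set STRIP_PREFIXES and the function name first_meaningful_label are hypothetical, and which prefixes belong in the set depends entirely on your data.

```python
# Hypothetical set of leading labels to discard; extend it to suit your data.
STRIP_PREFIXES = {'www', 'm', 'login'}


def first_meaningful_label(hostname):
    """Drop known leading labels, then return the first remaining label."""
    labels = hostname.lower().split('.')
    # Keep at least two labels so a host such as 'm.com' is not emptied.
    while len(labels) > 2 and labels[0] in STRIP_PREFIXES:
        labels.pop(0)
    return labels[0]


for host in ['www.google.com', 'stackoverflow.com', 'm.login.example.org']:
    print(first_meaningful_label(host))  # google, stackoverflow, example
```

The function takes a host name, so it would be applied to ue(url).hostname inside the loop above.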

What about using a set of predefined top-level domains?

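import re
from urlparse import urlparse

# Multi-part suffixes to treat as one top-level domain, e.g. co.uk, co.in, co.cc.
# Dots are escaped and each pattern is anchored to the end of the host name,
# so that a host such as www.scott.com does not match the "co" pattern by accident.
TOPLEVEL = [r"\.co\.[a-zA-Z]+$", r"\.fake\.[a-zA-Z]+$"]

def TLD(rgx, host, max_len=4):
    # Return the matched suffix (e.g. ".co.uk"), or None if rgx does not match
    # or the first label of the suffix is longer than max_len characters.
    match = re.search(rgx, host, re.IGNORECASE)
    if match and len(match.group(0).split(".")[1]) <= max_len:
        return match.group(0)
    return None

parsed = []
urls = ["http://www.mywebsite.xxx.asd.com", "http://www.dd.test.fake.uk/asd"]
for url in urls:
    h = urlparse(url).hostname
    for rgx in TOPLEVEL:
        TL = TLD(rgx, h)
        if TL:
            # Strip the suffix and keep the label just before it.
            parsed.append(h[:-len(TL)].split(".")[-1])
            break
    else:
        # No known suffix matched: take the label before the last dot.
        parsed.append(h.split(".")[-2])

print parsed  # ==> ['asd', 'test']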

It's a bit hacky, and maybe cryptic for some, but it does the trick, and nothing more has to be done :)
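The same idea can be sketched without regular expressions by comparing the last two labels of the host name against a set of known two-part suffixes. The set KNOWN_SUFFIXES and the function name_before_suffix below are illustrative assumptions only: a hand-written set like this will always be incomplete, which is why libraries such as tldextract, mentioned earlier in this thread, rely on the much larger Public Suffix List instead.

```python
# Illustrative, hand-picked two-part suffixes; a real list would be far longer.
KNOWN_SUFFIXES = {'co.in', 'co.uk', 'com.hk'}


def name_before_suffix(hostname):
    """Return the label immediately before the (possibly two-part) suffix."""
    labels = hostname.lower().split('.')
    # A known two-part suffix such as 'co.in' pushes the name one label left.
    if len(labels) >= 3 and '.'.join(labels[-2:]) in KNOWN_SUFFIXES:
        return labels[-3]
    # Otherwise assume a one-part suffix such as 'com'.
    if len(labels) >= 2:
        return labels[-2]
    return labels[0]


for host in ['login.gmail.com.hk', 'www.google.co.in', 'stackoverflow.com']:
    print(name_before_suffix(host))  # gmail, google, stackoverflow
```

Because it counts labels from the right, this version does not need to know about www or any other subdomain prefix.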

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow