Question

I have URL data in a table. I would like to create a view that shows the second-level domain (SLD) and top-level domain (TLD) as well as the subdomain. How can I extract these in ANSI SQL? The database I am using supports only ANSI SQL and doesn't have convenient functions such as REVERSE.

Here are the parts I want to extract:

  TLD =  -- The top-level domain (.com, .org, .info, .us)
  SLD =  -- The second-level domain (twitter, yahoo, facebook, google), the second part of the URL
  SUBDOMAIN = -- The subdomain (www, search.google, search.espn), the first part of the URL // tricky

Here is the logic I am using, but I am unable to get the subdomain properly. I would like to reverse the string and take the remainder after extracting the TLD and SLD, but Vertica doesn't support a REVERSE function.

Here is the query and sample data (note: SPLIT_PART splits the string on the given delimiter and returns the part at the given position):

select COALESCE(SPLIT_PART(URL, '.', 3), SPLIT_PART(URL, '.', 2)) as tld,
       SPLIT_PART(URL, '.', 2) as sld,
       SPLIT_PART(URL, '.', 1) as subdomain
from URL_table

The table has two columns, date and URL. Here are the example URLs:

search.mywebsearch.com   (TLD = com, SLD = mywebsearch, subdomain = search)
search.earthlink.net     
topix.com
main.welcomescreen.intrepid.com
ad.yieldmanager.com
google.com
news.google.com

Solution

This is a really hard thing to do right, especially if your data is noisy, as is the case with big data.

Can you ever get http:// as a prefix? What about sites like www.sub.dom.com? Is everything after the TLD (paths, query strings) already scrubbed out?

For these reasons, we were wary about trying to implement the splitting in SQL. Instead, we used Vertica's UDTF (user-defined transform function) feature and wrote a splitter in C++. We would rather not have had to do that, but we just don't trust SQL to be robust enough for this.
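
As a rough illustration of the kind of logic such a splitter needs, here is a minimal standalone C++ sketch. It is not the UDTF described above and it does not use the Vertica SDK. The names HostParts and split_host are hypothetical, and it assumes the TLD is always a single label, so multi-label suffixes such as co.uk would be split incorrectly.

```cpp
#include <string>

// Hypothetical result type; the names are illustrative only.
struct HostParts {
    std::string subdomain;  // everything left of the SLD, may be empty
    std::string sld;        // second-level domain
    std::string tld;        // top-level domain (assumed to be a single label)
};

// Naive split: the last label is taken as the TLD, the one before it as the
// SLD, and whatever remains on the left as the subdomain.
HostParts split_host(std::string url) {
    const std::string::size_type npos = std::string::npos;

    // Drop a scheme prefix such as "http://", but only if "://" appears
    // before any path, query, or fragment character.
    std::string::size_type scheme = url.find("://");
    if (scheme != npos && url.find_first_of("/?#") > scheme) {
        url.erase(0, scheme + 3);
    }

    // Drop everything after the host: port, path, query, or fragment.
    std::string::size_type end = url.find_first_of("/?#:");
    if (end != npos) {
        url.erase(end);
    }

    HostParts parts;
    std::string::size_type last = url.rfind('.');
    if (last == npos) {
        // No dot at all: there is no TLD to separate, keep the whole host.
        parts.sld = url;
        return parts;
    }
    parts.tld = url.substr(last + 1);

    // Find the dot before the SLD, guarding against a leading dot.
    std::string::size_type prev = (last == 0) ? npos : url.rfind('.', last - 1);
    if (prev == npos) {
        parts.sld = url.substr(0, last);  // e.g. "topix.com": no subdomain
    } else {
        parts.sld = url.substr(prev + 1, last - prev - 1);
        parts.subdomain = url.substr(0, prev);
    }
    return parts;
}
```

With this sketch, split_host("main.welcomescreen.intrepid.com") yields subdomain main.welcomescreen, SLD intrepid, and TLD com, while topix.com yields an empty subdomain. Searching from the right-hand end with rfind is what stands in for the missing REVERSE function.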

Licensed under: CC-BY-SA with attribution