Question

Extracting an accurate representation of the top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to make up its own policies regarding how domains are issued and what subdomains are defined. As there doesn't appear to be any standards body coordinating these or establishing standards, this has made determining the actual TLD a somewhat complicated affair.

Since web browsers assign cookies only to registered domains, and for security reasons must be vigilant about ensuring cookies cannot be assigned on a broader level, these browsers typically contain a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:

http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat

I have two specific questions:

  • Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.

  • Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.

If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.

Here's my take on converting this file into a usable Regexp in Ruby:

TLD_SPEC = Regexp.new(
  '[^\.]+\.(' + %q[
// ***** BEGIN LICENSE BLOCK *****
// ... (Rest of file)
  ].split(/\n/).collect do |line|
    line.sub(%r[//.*], '').sub(/\s+$/, '')
  end.reject(&:blank?).collect do |s|
    Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')
  end.join('|') + ')$'
)
Was it helpful?

Solution

You might want to look into using Addressable to see if that has what you need. It's got a lot more features than Ruby's default URI library. In particular, its template ability might help you.

From the docs:

Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to the relevant RFCs and adds support for IRIs and URI templates. Additionally, it provides extensive support for URI templates.

With the recent opening of the new TLDs, it's going to be a nightmare for a while. Check out the related list to the right to see how many people are trying to find a solution. Regex to match Domain.CCTLD recommends using a function to break it down into smaller steps and is what I'd do. Trying to do this with a regex assumes you can do it all in one expression, which starts to smell like using regex to parse XML or HTML. The target is too wiggly for a single pattern, or at least for a single maintainable pattern.

That answer mentions the public TLD list. Using the information there you could quickly use Ruby's Regexp.escape and Regexp.union methods to build a reasonably good regex on the fly. It'd be nice if we had Perl's Regexp::Assemble module available to us, but we don't so union will have to do. (See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for a way to work around this.)

OTHER TIPS

There is another flat-file db here at http://guava-libraries.googlecode.com/svn-history/r42/trunk/src/com/google/common/net/TldPatterns.java

Perhaps you could combine the 2, and upload it to somewhere like OData.org, github, sourceforge, etc.

There's a gem called public-suffix-list which provides access to a more formalized version of the Mozilla listing.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top