Question

I have a list of a million URLs. I need to extract the TLD for each URL and create a separate file for each TLD. For example, collect all URLs with .com as the TLD and dump them into one file, another file for the .edu TLD, and so on. Further, within each file, I have to sort the URLs alphabetically by domain and then by subdomain, etc.

Can anyone give me a head start for implementing this in perl?

Solution

  1. Use URI to parse the URL.
  2. Use its host method to get the host.
  3. Use Domain::PublicSuffix's get_root_domain method to parse the host name.
  4. Use the tld or suffix method to get the real TLD or the pseudo TLD.

use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI                  qw( );

my $dps = Domain::PublicSuffix->new();

for (qw(
   http://www.google.com/
   http://www.google.co.uk/
)) {
   my $url = $_;

   # Treat relative URLs as absolute URLs with missing http://.
   $url = "http://$url" if $url !~ /^\w+:/;

   my $host = URI->new($url)->host();
   $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

   $dps->get_root_domain($host)
      or die $dps->error();

   say $dps->tld();     # com  uk
   say $dps->suffix();  # com  co.uk
}
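
That covers extracting the TLD. For the rest of the question (one file per TLD, sorted by domain and then by subdomain), a minimal sketch along the same lines might look like the following. It reads URLs from standard input, groups them in memory (a million URLs should fit comfortably), and writes to hypothetical files named urls.<suffix>.txt; the file naming and the skip-on-error policy are assumptions, so adjust to taste.

use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI                  qw( );

my $dps = Domain::PublicSuffix->new();

my %by_tld;  # suffix => [ [ sort_key, url ], ... ]

while (my $url = <STDIN>) {
   chomp $url;
   next if $url !~ /\S/;

   # Treat schemeless URLs as absolute URLs with missing http://.
   $url = "http://$url" if $url !~ /^\w+:/;

   my $host = URI->new($url)->host();
   $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

   $dps->get_root_domain($host)
      or next;  # Skip hosts D::PS can't classify.

   my $suffix = $dps->suffix();  # e.g. "com", "co.uk"

   # Sort key: host labels in reverse order ("com.example.www"),
   # so URLs sort by domain first, then by subdomain.
   my $key = join '.', reverse split /\./, $host;

   push @{ $by_tld{$suffix} }, [ $key, $url ];
}

for my $suffix (keys %by_tld) {
   open my $fh, '>', "urls.$suffix.txt"
      or die "Can't open urls.$suffix.txt: $!";

   say $fh $_->[1]
      for sort { $a->[0] cmp $b->[0] } @{ $by_tld{$suffix} };

   close $fh;
}

Usage would be something like

perl split_by_tld.pl < urls.txt

Reversing the host labels is what makes a plain string sort order the entries by domain before subdomain; if the list ever outgrows memory, write the per-TLD files unsorted first and sort each file separately.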