Question

In the need for finding e-mail addresses and host names, we would like to improve an existing regex to seach only for existing public TLD's.

We would like one bash command where we can copy and paste its output into our regular expression.

We already had trial with (co|com) which matches only 'co' and doesn't match the full 'com' tld for .com domains, that is why the tld list need to be sorted with longest TLD's first.

Can someone supply a one line "copy and paste" bash command that outputs the most recent list of TLD's sorted and formatted?

Was it helpful?

Solution

With the help of @Alex_Volkov his answer in Regular expression to match DNS hostname or IP Address? we were pointed to http://data.iana.org/TLD/tlds-alpha-by-domain.txt is source for TLD's.

With the help of @thiton his answer in Sorting lines from longest to shortest the output could be sorted so that the longest TLD's are listed first.

Resulting in this one liner:

$ curl -s http://data.iana.org/TLD/tlds-alpha-by-domain.txt | sed '1d; s/^ *//; s/ *$//; /^$/d' | awk '{print length" "$0}' | sort -rn | cut -d' ' -f2- | tr '\n' '|' | tr '[:upper:]' '[:lower:]' | sed 's/\(.*\)./\1/'

that nicely outputs the desired TLD regex part:

xn--clchc0ea0b2g2a9gcd|xn--hlcj6aya9esc7a|xn--hgbk6aj7f53bba|xn--xkc2dl3a5ee0h|xn--mgberp4a5d4ar|xn--11b5bs3a9aj6g|xn--xkc2al3hye2a|xn--80akhbyknj4f|xn--mgbc0a9azcg|xn--lgbbat1ad8j|xn--mgbx4cd0ab|xn--mgbbh1a71e|xn--mgbayh7gpa|xn--mgbaam7a8h|xn--9t4b11yi5a|xn--ygbi2ammx|xn--yfro4i67o|xn--fzc2c9e2c|xn--fpcrj9c3d|xn--ogbpf8fl|xn--mgb9awbf|xn--kgbechtv|xn--jxalpdlp|xn--3e0b707e|xn--s9brj9c|xn--pgbs0dh|xn--kpry57d|xn--kprw13d|xn--j6w193g|xn--h2brj9c|xn--gecrj9c|xn--g6w251d|xn--deba0ad|xn--80ao21a|xn--45brj9c|xn--0zwm56d|xn--zckzah|xn--wgbl6a|xn--wgbh1c|xn--o3cw4h|xn--fiqz9s|xn--fiqs8s|xn--90a3ac|xn--p1ai|travel|museum|post|name|mobi|jobs|info|coop|asia|arpa|aero|xxx|tel|pro|org|net|mil|int|gov|edu|com|cat|biz|zw|zm|za|yt|ye|ws|wf|vu|vn|vi|vg|ve|vc|va|uz|uy|us|uk|ug|ua|tz|tw|tv|tt|tr|tp|to|tn|tm|tl|tk|tj|th|tg|tf|td|tc|sz|sy|sx|sv|su|st|sr|so|sn|sm|sl|sk|sj|si|sh|sg|se|sd|sc|sb|sa|rw|ru|rs|ro|re|qa|py|pw|pt|ps|pr|pn|pm|pl|pk|ph|pg|pf|pe|pa|om|nz|nu|nr|np|no|nl|ni|ng|nf|ne|nc|na|mz|my|mx|mw|mv|mu|mt|ms|mr|mq|mp|mo|mn|mm|ml|mk|mh|mg|me|md|mc|ma|ly|lv|lu|lt|ls|lr|lk|li|lc|lb|la|kz|ky|kw|kr|kp|kn|km|ki|kh|kg|ke|jp|jo|jm|je|it|is|ir|iq|io|in|im|il|ie|id|hu|ht|hr|hn|hm|hk|gy|gw|gu|gt|gs|gr|gq|gp|gn|gm|gl|gi|gh|gg|gf|ge|gd|gb|ga|fr|fo|fm|fk|fj|fi|eu|et|es|er|eg|ee|ec|dz|do|dm|dk|dj|de|cz|cy|cx|cw|cv|cu|cr|co|cn|cm|cl|ck|ci|ch|cg|cf|cd|cc|ca|bz|by|bw|bv|bt|bs|br|bo|bn|bm|bj|bi|bh|bg|bf|be|bd|bb|ba|az|ax|aw|au|at|as|ar|aq|ao|an|am|al|ai|ag|af|ae|ad|ac

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top