Question

Need a function like:

function isGoogleURL(url) { ... }

that returns true iff URL belongs to Google. No false positives; no false negatives.

Luckily there's this as a reference:

.google.com .google.ad .google.ae .google.com.af .google.com.ag .google.com.ai .google.am .google.it.ao .google.com.ar .google.as .google.at .google.com.au .google.az .google.ba .google.com.bd .google.be .google.bg .google.com.bh .google.bi .google.com.bn .google.com.bo .google.com.br .google.bs .google.co.bw .google.com.by .google.com.bz .google.ca .google.cd .google.cg .google.ch .google.ci .google.co.ck .google.cl .google.cn .google.com.co .google.co.cr .google.com.cu .google.cz .google.de .google.dj .google.dk .google.dm .google.com.do .google.dz .google.com.ec .google.ee .google.com.eg .google.es .google.com.et .google.fi .google.com.fj .google.fm .google.fr .google.ge .google.gg .google.com.gh .google.com.gi .google.gl .google.gm .google.gp .google.gr .google.com.gt .google.gy .google.com.hk .google.hn .google.hr .google.ht .google.hu .google.co.id .google.ie .google.co.il .google.im .google.co.in .google.is .google.it .google.je .google.com.jm .google.jo .google.co.jp .google.co.ke .google.com.kh .google.ki .google.kg .google.co.kr .google.kz .google.la .google.li .google.lk .google.co.ls .google.lt .google.lu .google.lv .google.com.ly .google.co.ma .google.md .google.mn .google.ms .google.com.mt .google.mu .google.mv .google.mw .google.com.mx .google.com.my .google.co.mz .google.com.na .google.com.nf .google.com.ng .google.com.ni .google.nl .google.no .google.com.np .google.nr .google.nu .google.co.nz .google.com.om .google.com.pa .google.com.pe .google.com.ph .google.com.pk .google.pl .google.pn .google.com.pr .google.pt .google.com.py .google.com.qa .google.ro .google.ru .google.rw .google.com.sa .google.com.sb .google.sc .google.se .google.com.sg .google.sh .google.si .google.sk .google.sn .google.sm .google.st .google.com.sv .google.co.th .google.com.tj .google.tk .google.tl .google.tm .google.to .google.com.tr .google.tt .google.com.tw .google.co.tz .google.com.ua .google.co.ug .google.co.uk .google.com.uy .google.co.uz .google.com.vc .google.co.ve .google.vg .google.co.vi .google.com.vn .google.vu .google.ws .google.rs .google.co.za .google.co.zm .google.co.zw .google.cat

Any ideas how to do this elegantly?

Some Clarifications:

  • I need this for a greasemonkey script I wrote that currently only works for google.com (and should work for all other TLDs as well). Here is the script (it modifies Google Reader to work on wide screens better).
  • It should work on URLs that belong to the above domains (not blogger.com, etc.).
Was it helpful?

Solution

Here is an updated version of Prestaul's answer which solves the two problems I mentioned in the comment there.

var GOOGLE_DOMAINS = ([
    '.google.com',
    '.google.ad',
    '.google.ae',
    '.google.com.af',
    '.google.com.ag',
    '.google.com.ai',
    '.google.am',
    '.google.it.ao',
    '.google.com.ar',
    '.google.as',
    '.google.at',
    '.google.com.au',
    '.google.az',
    '.google.ba',
    '.google.com.bd'
]).join('\n');

function isGoogleUrl(url) {
    // get the 2nd level domain from the url
    var domain = /^https?:\/\/[^\///]*(google\.[^\/\\]+)\//i.exec(url);
    if(!domain) return false;

    domain = '.'+domain[1];
    // create a regex to check to see if the domain is supported
    var re = new RegExp('^' + domain.replace(/\./g, '\\.') + '$', 'mi');
    return re.test(GOOGLE_DOMAINS);
}

alert(isGoogleUrl('http://www.google.ba/the/page.html')); // true
alert(isGoogleUrl('http://some_mal_site.com/http://www.google.ba/')); // false
alert(isGoogleUrl('https://google.com.au/')); // true
alert(isGoogleUrl('http://www.google.com.some_mal_site.com/')); // false
alert(isGoogleUrl('http://yahoo.com/')); // false

OTHER TIPS

All the domains end in either "google.xx", "google.co.xx", or "google.com.xx" except "google.it.ao" and "google.com", so if you just look at the domain, this regular expression should work for most cases (it's not perfect, but it accepts all the listed domains, and rejects most other valid domains that happen to include "google"):

/^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com)$/i

As a function you could do something like this:

function isGoogleUrl(url) {
    url = url.replace(/^https?:\/\//i, ''); // Strip "http://" from the beginning
    url = url.replace(/\/.*/, ''); // Strip off the path
    return /^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com)$/i.test(url);
}

You could simplify it if you use window.location.hostname:

function isGoogleUrl() {
    return /^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com)$/i.test(window.location.hostname);
}

The only way this should allow a false positive is if there is a "google.(some other TLD)". For example, "google.tv", is not on the list (it redirects to google.com), but it would pass.

Edit: Like Wimmel pointed out, it also accepts invalid domains like "google.com.fr" which are not listed. It will basically accept any "google.whatever" domain name.

Do you count other Google properties as "belonging to Google"? FeedBurner, Blogger etc?

Can I ask what the purpose of this is? There may be a better way of doing what you want... and if it's reasonable I can ask internally for you.

If you don't need the test to be 100% accurate, this simple regex would do for all the domains you posted above:

"(http://)?([\w]+)?\.google\.([\w]{2,3})"

Just testing the presence of ".google." would suffice in most cases, although it could easily be fooled by adding a "google" domain in the url (not so easy though, nor quickly done).

Or just wait for google to buy their own google TLD.

I agree that you probably shouldn't do this... However, if you are going to do it (and you aren't content with the previously offered solutions that just check for a google-like pattern) then this is how I would approach it:

var GOOGLE_DOMAINS = ([
    '.google.com',
    '.google.ad',
    '.google.ae',
    '.google.com.af',
    '.google.com.ag',
    '.google.com.ai',
    '.google.am',
    '.google.it.ao',
    '.google.com.ar',
    '.google.as',
    '.google.at',
    '.google.com.au',
    '.google.az',
    '.google.ba',
    '.google.com.bd'
]).join('\n');

function isGoogleUrl(url) {
    var url = 'http://www.google.ba/the/page.html';

    // get the domain from the url
    var domain = /\.google\.[^\/\\]+/i.exec(url) + '';
    if(!domain) return false;

    // create a regex to check to see if the domain is supported
    var re = new RegExp('^' + domain.replace(/\./g, '\\.') + '$', 'mi');
    return re.test(GOOGLE_DOMAINS);
}

This creates a regex based on the domain your url and uses it to test the list of domains.

Note: The GOOGLE_DOMAINS variable is just a string that holds the contents returned from the url you posted. There is no way for you to retrieve that string via AJAX or iframe because you cannot make such a request across domains. You'll have to hard code it or make a request server-side to retrieve that list.

A regular expression may be what you need. An example is:

<script>
var elem = document.getElementById("a");
var regex = new RegExp("(http://)?(www\\.)?google\\.com");

elem.innerHTML = regex.test(elem.innerHTML);
</script>

This would get the content of a span element "a", and would change it to "true" if google.com, and "false" otherwise. Note that it doesn't consider all other URLs(although the regex could easily be modified to do so), and "pages.google.com", for example, wouldn't match.

Also, your URLs all have a "." before them(".google.com" instead of "google.com"). Does this have any reason or is it just a mistake?

You could use a regular expression like....

^https?://[-A-Za-z0-9\.]+(\.google\.com|\.google\.ad|\.google\.ae|\.google\.com\.af|\.google\.com\.ag|\.google\.com\.ai|\.google\.am|\.google\.it\.ao|\.google\.com\.ar|\.google\.as|\.google\.at|\.google\.com\.au|\.google\.az|\.google\.ba|\.google\.com\.bd|\.google\.be|\.google\.bg|\.google\.com\.bh|\.google\.bi|\.google\.com\.bn|\.google\.com\.bo|\.google\.com\.br|\.google\.bs|\.google\.co\.bw|\.google\.com\.by|\.google\.com\.bz|\.google\.ca|\.google\.cd|\.google\.cg|\.google\.ch|\.google\.ci|\.google\.co\.ck|\.google\.cl|\.google\.cn|\.google\.com\.co|\.google\.co\.cr|\.google\.com\.cu|\.google\.cz|\.google\.de|\.google\.dj|\.google\.dk|\.google\.dm|\.google\.com\.do|\.google\.dz|\.google\.com\.ec|\.google\.ee|\.google\.com\.eg|\.google\.es|\.google\.com\.et|\.google\.fi|\.google\.com\.fj|\.google\.fm|\.google\.fr|\.google\.ge|\.google\.gg|\.google\.com\.gh|\.google\.com\.gi|\.google\.gl|\.google\.gm|\.google\.gp|\.google\.gr|\.google\.com\.gt|\.google\.gy|\.google\.com\.hk|\.google\.hn|\.google\.hr|\.google\.ht|\.google\.hu|\.google\.co\.id|\.google\.ie|\.google\.co\.il|\.google\.im|\.google\.co\.in|\.google\.is|\.google\.it|\.google\.je|\.google\.com\.jm|\.google\.jo|\.google\.co\.jp|\.google\.co\.ke|\.google\.com\.kh|\.google\.ki|\.google\.kg|\.google\.co\.kr|\.google\.kz|\.google\.la|\.google\.li|\.google\.lk|\.google\.co\.ls|\.google\.lt|\.google\.lu|\.google\.lv|\.google\.com\.ly|\.google\.co\.ma|\.google\.md|\.google\.mn|\.google\.ms|\.google\.com\.mt|\.google\.mu|\.google\.mv|\.google\.mw|\.google\.com\.mx|\.google\.com\.my|\.google\.co\.mz|\.google\.com\.na|\.google\.com\.nf|\.google\.com\.ng|\.google\.com\.ni|\.google\.nl|\.google\.no|\.google\.com\.np|\.google\.nr|\.google\.nu|\.google\.co\.nz|\.google\.com\.om|\.google\.com\.pa|\.google\.com\.pe|\.google\.com\.ph|\.google\.com\.pk|\.google\.pl|\.google\.pn|\.google\.com\.pr|\.google\.pt|\.google\.com\.py|\.google\.com\.qa|\.google\.ro|\.google\.ru|\.google\.rw|\.google\.com\.sa|\.google\.com\.sb|\.google\.sc|\.google\.se|\.google\.com\.sg|\.google\.sh|\.google\.si|\.google\.sk|\.google\.sn|\.google\.sm|\.google\.st|\.google\.com\.sv|\.google\.co\.th|\.google\.com\.tj|\.google\.tk|\.google\.tl|\.google\.tm|\.google\.to|\.google\.com\.tr|\.google\.tt|\.google\.com\.tw|\.google\.co\.tz|\.google\.com\.ua|\.google\.co\.ug|\.google\.co\.uk|\.google\.com\.uy|\.google\.co\.uz|\.google\.com\.vc|\.google\.co\.ve|\.google\.vg|\.google\.co\.vi|\.google\.com\.vn|\.google\.vu|\.google\.ws|\.google\.rs|\.google\.co\.za|\.google\.co\.zm|\.google\.co\.zw|\.google\.cat)

and I'd imagine generating that in JavaScript (or whatever language you choose) from an array or some other data set would be relatively easy.

I wouldn't do this client-side.

The list of Google domains doesn't change so frequently, so you could store a list server-side and then dynamically generate the .js to check it.

Without a regex to individually match each and every TLD, there isn't really an 'elegant way of doing it'.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top