Question

My SPA uses the Backbone.js router with pushState, falling back to hash-based URLs. I intend to follow Google's suggestion for making an AJAX web app crawlable: render my site into static .html snapshots with PhantomJS and serve them to Google via URLs of the form:

mysite.com/?_escaped_fragment_=key=value.

Keep in mind that the site does not serve static pages for end users (it only works in a JavaScript-enabled browser). If you navigate to mysite.com/some/url, the .htaccess file is set up to always serve mysite.com/index.php, and the Backbone router reads the URL in order to display the JavaScript-generated content for that URL.
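
For reference, the catch-all rule looks roughly like this (a simplified sketch, not my exact rules):

    RewriteEngine On
    # Serve real files and directories (assets, images, etc.) untouched
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # Route every other URL to the single entry point of the SPA
    RewriteRule ^ index.php [L]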

Furthermore, so that Google will index my entire site, I plan on creating a sitemap listing hashbang URLs. The URLs must be hashbanged so that Google knows to crawl each page via its corresponding _escaped_fragment_ URL.

Soooo....

(1) Will this approach work?

and

(2) Since Backbone.js does not use hashbang URLs, how can I convert the hashbang URL to the pushState URL when a user arrives via Google?

reference: https://stackoverflow.com/a/6194427/1102215


Solution

I ended up stumbling through the implementation much as I outlined in my question. So...

(1) Yes, the approach seems to work rather well. The only downside is that even though the app works without hashbangs, my sitemap.xml is full of hashbang URLs. This is necessary to tip off Google that it should query the _escaped_fragment_ URL when crawling these pages. So when the site appears in Google search results there is a hashbang in the URL, but that's a small price to pay.
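
To illustrate, a sitemap entry roughly like this one (the path is just a placeholder) is what signals the scheme to Google, which then requests the corresponding _escaped_fragment_ URL:

    <url>
        <loc>http://mysite.com/#!/some/page</loc>
    </url>
    <!-- Googlebot then fetches: http://mysite.com/?_escaped_fragment_=/some/page -->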

(2) This part was a lot easier than I had imagined. It only required one line of code before initializing the Backbone.js router...

window.location.hash = window.location.hash.replace(/#!/, '#');

var AppRouter = Backbone.Router.extend({...

After the hashbang is replaced with a plain hash, the Backbone router will automatically remove the hash in browsers that support pushState. Furthermore, those two URL state changes are not saved in the browser's history, so if the user clicks the back button there are no unexpected redirects.
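
For completeness, here is roughly how that line fits into the router setup (a sketch; the catch-all route and handler names are placeholders of mine, not the actual app code):

    // Rewrite a crawler-style hashbang (#!/some/page) to a plain hash (#/some/page)
    // before the router starts, so Backbone routing can take over.
    window.location.hash = window.location.hash.replace(/#!/, '#');

    var AppRouter = Backbone.Router.extend({
        routes: {
            '*path': 'show'   // hypothetical catch-all route
        },
        show: function (path) {
            // look up and render the view for `path`
        }
    });

    var router = new AppRouter();

    // In pushState-capable browsers Backbone transparently upgrades the
    // #/some/page URL to /some/page; older browsers keep using the hash.
    Backbone.history.start({ pushState: true });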

UPDATE: A better approach

It turns out that there is a dead-simple approach that completely does away with hashbangs. Via BromBone:

If your site is using hashbang (#!) URLs, then Google will crawl your site by replacing #! with ?_escaped_fragment_=. When you see ?_escaped_fragment_=, you'll know the request is from a crawler. If you're using HTML5 pushState, then you look at the "User-Agent" header to determine if the request is from a bot.

This is a modified version of BromBone's suggested .htaccess rewrite rules:

    RewriteEngine On
    # Skip image requests
    RewriteCond $1 !\.(gif|jpe?g|png)$ [NC]
    # Skip files and directories that actually exist on disk
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # Only rewrite for known crawler user agents
    RewriteCond %{HTTP_USER_AGENT} .*Googlebot.* [OR]
    RewriteCond %{HTTP_USER_AGENT} .*Bingbot.* [OR]
    RewriteCond %{HTTP_USER_AGENT} .*Baiduspider.* [OR]
    RewriteCond %{HTTP_USER_AGENT} .*iaskspider.*
    # Hand the requested path to snapshot.php, which serves the pre-rendered snapshot
    RewriteRule ^(.*)$ snapshot.php/$1 [L]

OTHER TIPS

Let me summarize something I spent about 10 pages on in my upcoming book on SPAs. Google wants a classic version of your site. This is also an advantage, because obsolete browsers really can't do SPAs effectively anyway. Serve the spiders and old browsers a core site.

I get the term from the Guardian newspaper: http://vimeo.com/channels/smashingconf.

In the browser, check whether the browser "cuts the mustard"; here is my script for doing this:

<script>

    // Feature-detect: a browser that lacks any of these does not "cut the mustard"
    // and gets the core (non-SPA) site instead.
    if (!('querySelector' in document)
        || !('localStorage' in window)
        || !('addEventListener' in window)
        || !('matchMedia' in window)) {

        if (window.location.href.indexOf("#!") > 0) {
            // Hashbang URL: send the old browser to the _escaped_fragment_ version
            window.location.href = window.location.href.replace("#!", "?_escaped_fragment_=");
        } else if (window.location.href.indexOf("?_escaped_fragment_=") < 0) {
            // Otherwise append the flag so the server serves the core site
            window.location.href = window.location.href + "?_escaped_fragment_=";
        }

    } else {

        // Capable browser: if it landed on a core-site URL, bounce it back to the SPA
        if (window.location.href.indexOf("?_escaped_fragment_=") >= 0) {
            window.location.href = window.location.href.replace("?_escaped_fragment_=", "#!");
        }
    }

</script>

On the server you need some mechanism to check for the presence of the _escaped_fragment_ query string. If it is present, you need to serve the core site. The core site uses only simple CSS and little or no JavaScript. I have a SPAHelper library for ASP.NET MVC you can check out to see some of the things I implemented around this: https://github.com/docluv/spahelper.
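
The check itself is trivial; here is a minimal sketch of the idea in Node.js, purely for illustration (the same test applies in PHP, ASP.NET MVC, or any other server framework, and both render helpers below are hypothetical):

    var http = require('http');
    var url = require('url');

    // Hypothetical helpers: one returns the pre-rendered core page,
    // the other returns the JavaScript single-page app shell.
    function renderCoreSite(path) {
        return '<html><body><!-- server-rendered content for ' + path + ' --></body></html>';
    }
    function renderAppShell() {
        return '<html><body><script src="/app.js"></script></body></html>';
    }

    http.createServer(function (req, res) {
        var parsed = url.parse(req.url, true);

        res.writeHead(200, { 'Content-Type': 'text/html' });

        if ('_escaped_fragment_' in parsed.query) {
            // Crawler or mustard-failing browser: serve the core site
            res.end(renderCoreSite(parsed.pathname));
        } else {
            // Everyone else gets the SPA
            res.end(renderAppShell());
        }
    }).listen(8080);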

The real issue is that most server-side web frameworks, like ASP.NET and PHP, are not designed to support a single view system for both the client and the server, so you are somewhat stuck maintaining two views. Again, I wrote about 10 pages on this topic for my book, which should be ready sometime next week.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow