A website or web page may be reachable through several different addresses. For a variety of reasons, including branding and search engine ranking, it is important that visitors see and use only one of these addresses whenever possible. The address selected for this purpose is known as the canonical address, hostname, or URL. The correct way to ensure that visitors are presented with the canonical name is to use the web server's RewriteEngine or equivalent, so that a visitor who arrives via a non-canonical alias has the address rewritten to the canonical form. But server administrators often deny their customers the ability to use the rewrite capability. When that is the case, it is still possible to use JavaScript such as that presented here to redirect most visitors to the correct address.
The chief problem with alias names, where several versions of a hostname or page address actually reach the same page, is that search engines treat them as separate pages. See Matt Cutts, URL Canonicalization (2006) (mattcutts.com/blog/seo-advice-url-canonicalization retrieved Jul. 2008). If some incoming links use one URL and some use other URLs, the search engine may not know they refer to the same page and may index the URLs separately. This penalizes the page, because none of its URLs receives the PageRank benefit of all the incoming links. In fact, when several different URLs lead to the same page, the page can be further penalized or even banned, because it can look like an attempt to spam the search engine index. See Google, Webmaster Guidelines (google.com/support/webmasters/bin/answer.py?answer=35769 retrieved Jul. 2008), particularly under Quality Guidelines about duplicate content.
A common solution to enforce the use of a canonical name involves using the Apache server's mod_rewrite (RewriteEngine) in the .htaccess file to permanently redirect addresses to the preferred canonical address. But mod_rewrite is too complex for typical customers. See Apache.org, mod_rewrite (httpd.apache.org/docs/1.3/mod/mod_rewrite.html retrieved Jul. 2008). Hosting companies often disable it rather than incur the customer support calls it creates. See jdMorgan, Reply to bcrbcr re: Host Supports .htaccess but not Mod Rewrite (Apr. 16, 2007) (webmasterworld.com/apache/3312276.htm retrieved Jul. 2008). For microsites and personal home pages, upgrading to a hosting package that includes mod_rewrite support may not be economical. That leaves a lot of websites with unresolved canonical name problems that penalize them in search engine indexes.
JavaScript offers a partial solution, but there may be undesirable side effects. A simple script can check the URL used to access a page and, if this is not the preferred URL, replace the page using the canonical URL. If JavaScript is unavailable, disabled, or too old to have the getElementsByTagName function that the script tests for, the redirect will fail and the non-canonical version of the page will load. This is generally the case for search engine robots and old browsers. But for better than 80% of human visitors, the redirect to the canonical page will work. And that means that when they print, bookmark, or copy the URL, they will have the canonical version. I hypothesize that this should greatly reduce the use of non-canonical aliases, and that keeping people from seeing the wrong URL will make it less likely for search engine indexes to be contaminated with non-canonical URLs.
This workaround may have unanticipated side effects. Examples, and the solutions applied, include:
The use of JavaScript redirects might harm search engine rankings or even result in being banned from indexes. JavaScript redirects are extensively used by spammers to present different content to search engines than to human visitors. Search engine algorithms are known to attempt to detect this practice and downrate or ban pages using it. See K. Chellapilla and A. Maykov, A Taxonomy of JavaScript Redirection Spam (May 8, 2007) (airweb.cse.lehigh.edu/2007/papers/paper_115.pdf, 401 kilobytes, retrieved Jul. 2008). However, in this case, we're presenting identical content to search engine robots and human visitors. Thus, the use is not inappropriate and should not result in penalties. Nevertheless, most search engine algorithms are not publicly known. This means that there can be no assurance that search engines will not penalize you or ban you for using this JavaScript. You use it at your peril.
// w-gregg.juneau.ak.us; July 2008; make webpage reload with canonical url.
// Insert in the HTML head, replacing with the page's canonical url:
// <base href="http://example.com/extensionless_page">
// <script type="text/javascript" src="mkcanonical.js">//</script>
// Below:
// Set alias array count to total hostnames including canonical hostname.
// Set alias[0] to desired canonical hostname.
// Set alias[1] to first alias hostname and add all known non-canonical hostnames.
// See w-gregg.juneau.ak.us/2008g25-canonical-url-javascript.
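var alias = new Array(2);
alias[0] = 'example.com';
alias[1] = 'www.example.com';
var base;
var basehref;
var baseget;
var hostname;
var i;
if (document.getElementsByTagName) {
  // Do nothing if the page has no base element to supply the canonical url.
  base = document.getElementsByTagName('base')[0];
  if (base && base.href) {
    basehref = base.href;
    // Address actually used, without query string or fragment.
    baseget = location.protocol + '//' + location.hostname + location.pathname;
    hostname = location.hostname.toLowerCase();
    for (i = 0; i < alias.length; i++) {
      if ((hostname === alias[i]) && (basehref !== baseget)) {
        // The query string must come before the fragment in a url.
        location.replace(basehref + location.search + location.hash);
        break;
      }
    }
  }
}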
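The redirect decision can be tried outside a browser by moving it into a function that takes the location as an argument. The following is only a sketch under stated assumptions: canonicalTarget and the fake location object are hypothetical names introduced here for illustration and are not part of mkcanonical.js. The sketch appends the query string before the fragment, which is the order a URL requires.

```javascript
// Hypothetical helper, not part of the original script: returns the url to
// redirect to, or null when no redirect is needed.
function canonicalTarget(loc, canonicalHref, aliases) {
  var current = loc.protocol + '//' + loc.hostname + loc.pathname;
  var host = loc.hostname.toLowerCase();
  var i;
  if (current === canonicalHref) {
    return null; // Already on the canonical url.
  }
  for (i = 0; i < aliases.length; i++) {
    if (host === aliases[i]) {
      // Keep the visitor's query string and fragment, in that order.
      return canonicalHref + loc.search + loc.hash;
    }
  }
  return null; // Unknown hostname: leave the page alone.
}

// A plain object standing in for window.location (assumed values).
var fake = {
  protocol: 'http:',
  hostname: 'www.example.com',
  pathname: '/extensionless_page.html',
  search: '?id=7',
  hash: '#top'
};
var target = canonicalTarget(fake, 'http://example.com/extensionless_page',
                             ['example.com', 'www.example.com']);
```

Here the visitor arrived through the www alias and with a file extension, so target holds the canonical address followed by the original query string and fragment. In a page, the same function could be called with the real location object and the result passed to location.replace when it is not null.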
There are some advantages in making the canonical URL for linking to a page lack a file type extension. This way, you can freely change the document type, for example from HTML to PHP, without breaking incoming links. By default, most servers require the extension, but many servers have a means to make it optional. With Apache, if your hosting company supports this, you merely need a /.htaccess file that includes 'Options +MultiViews' (without the quotes). This very often works even with hosts that don't allow customers to use the RewriteEngine.
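To illustrate why an extensionless address survives a change of document type, the sketch below reduces several forms of a path to the same linkable address. The function name stripExtension and the sample paths are assumptions made for this illustration only; the server itself does the real work when the extension is optional.

```javascript
// Hypothetical helper: drop a trailing file type extension from a path.
function stripExtension(pathname) {
  // Matches ".ext" only at the very end, so dots in directory names stay.
  return pathname.replace(/\.[A-Za-z0-9]+$/, '');
}

// The HTML and PHP versions of a page share one extensionless address.
var examples = [
  stripExtension('/articles/page.html'),
  stripExtension('/articles/page.php'),
  stripExtension('/articles/page'),
  stripExtension('/v1.2/page')
];
```

The first three entries all come out as /articles/page, and the last is left unchanged because its only dot is in a directory name.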