The big question is this: which is the better course of action when you don’t want a web page indexed in Google search, blocking the page in the robots.txt file or adding a noindex meta tag to the page? The answer is, it depends. And to be completely honest, not many people actually know the answer to this question. In this post, I’ll simply explore the issue without giving any hard answers. If you read what I have to write and then come to a sensible conclusion, please share it down below, because I’m just as interested in learning more about this as you are.
I’ll give you an easy one. Let’s say you’ve got a website with a link in the top navigation bar that says Newest Posts. Actually, let’s use this very forum website as an example. Do you see the What’s New link up top? That link leads to a bunch of other pages that display the most recent threads and posts created on this website. There can be hundreds or thousands of these pages because they (for some reason) get spun off into duplicates based on user sessions. The only route to those pages is through this one What’s New link up top. We know we don’t want any pages inside this directory crawled or indexed, so what do we do? The easy and correct answer is to block the /whats-new/ directory in the robots.txt file. If we were to keep this directory unblocked, let Google crawl all the pages inside, and merely add a noindex meta tag to each page, we’d be doing two things wrong. First, we’d be wasting Google’s resources, which could be better spent on other parts of this website, and second, we’d be allowing Google to crawl many very low value pages, and we don’t quite know the consequences of allowing that to happen. From personal experience, I can tell you that allowing Google to crawl low value pages, even if they contain the noindex meta tag, is no good. I’ve seen website rankings slowly slide year over year because of this type of crawling.
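Just for reference, the block itself is only a couple of lines in the robots.txt file. This is a sketch using the /whats-new/ path from the example above; swap in whatever directory your platform actually uses:

    # Keep crawlers out of the session-duplicated "What's New" pages
    User-agent: *
    Disallow: /whats-new/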
The reason blocking this directory is appropriate is that, once blocked, only one blocked page (the entry URL) will ever appear if someone does a “site:” command in Google search. We wouldn’t necessarily want hundreds or thousands of these pages to show up that way, but since there’s only one route in, this really is the most appropriate solution.
Which brings me to a very important question: do pages that are blocked in the robots.txt file count against a website? Do they accumulate and get to a point of having some sort of negative effect? In my experience, I haven’t seen that, except in certain circumstances. Let’s say you’ve got five links on a web page that lead to other pages and you block each of those pages in the robots.txt file. I would suggest that’s a fine thing to do; I have never seen any negative consequences from it. I also haven’t seen any negative consequences from blocking an entire directory like the one I described above. The only time I’ve witnessed rankings fall after blocking something in the robots.txt file is when pages in that directory had already been crawled and indexed before the block went in. The reason is that before the directory was blocked, the pages contained therein had gained pagerank and some value, albeit very little. After the route to those pages is cut off, they begin to wither away and eventually die, because pagerank can no longer flow to them. They become orphans that still count toward the overall pagerank of the website, and Google seriously doesn’t like withering pages like this. So, if you ever plan on blocking a whole bunch of already indexed pages in this fashion, be aware that your site rankings will suffer in the short term until those pages are completely deindexed by the search engines. That can take months or years, depending on the size of the website.
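And for the five-links scenario, blocking individual pages works the same way as blocking a directory; you just list each path on its own Disallow line. These paths are made up purely for illustration:

    User-agent: *
    Disallow: /print-preview.html
    Disallow: /email-this-page.html
    Disallow: /login.html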
Let’s say you’ve got an article website that, for each article page, has a printer friendly version of what’s written. You obviously don’t want that printer page indexed, so do you block it in the robots.txt file or add a noindex meta tag to it? I’ve heard that some folks will add a canonical tag to those pages, but I’ll venture to say that I absolutely cannot stand canonical tags. I’ve yet to see them applied consistently the way they’re supposed to be. Also, I’ve tried adding noindex meta tags to these types of pages and have failed. There’s never been a good result when I’ve used noindex meta tags, actually. So in this case, I’d recommend hard blocking these pages in the robots.txt file as well.
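For anyone who hasn’t run into these tags before, here’s what the two alternatives I just mentioned look like in the head of a printer friendly page. The article URL is made up; the canonical tag points back at the regular version of the article:

    <!-- Option 1: tell search engines not to show this page in results -->
    <meta name="robots" content="noindex">

    <!-- Option 2: point search engines at the regular article as the preferred version -->
    <link rel="canonical" href="https://www.example.com/articles/some-article/">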
I’m going to make a long story short here. I really don’t like noindex meta tags. I’ve never seen them help rankings in any way, shape, or form, and I also really don’t like the canonical tag, for the same reasons. The only technique I’ve ever seen help a website recover from a ranking drop or penalty is the robots.txt blocking technique. Everything else has failed miserably, and it almost seems like folks have talked themselves into believing some theoretical solution as opposed to a more empirically proven one, one that’s based in reality.

I mean, the phrase “noindex” says it all, right? No, it doesn’t. Noindex merely means don’t show the page in search results. It doesn’t mean the search engine won’t crawl the page, and it doesn’t mean Google won’t evaluate what it finds there. It also doesn’t mean that the search engine won’t count that page against you if it’s thin. We need to come to grips with this fact. The same is true for canonical tags used on pages. These things only work half of the time, and even when they seem to, Google was probably going to consolidate the pages in question anyway, whether or not the canonical tag was used. We need to get over this tag once and for all. The only thing that works 100% of the time is blocking the bad pages in the robots.txt file. Either that, or removing the page in its entirety, or password protecting the page so crawling it becomes impossible.
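On the password protection point, the simplest version I know of is HTTP basic authentication. This is a minimal sketch assuming an Apache server, with an .htaccess file dropped into the directory you want sealed off; the file paths here are made up:

    # .htaccess in the directory to protect; crawlers can't get past the login prompt
    AuthType Basic
    AuthName "Private area"
    AuthUserFile /home/example/.htpasswd
    Require valid-user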
But still, the question remains, does having too many pages blocked in the robots.txt file harm a website? That’s the big one, so if you have a perspective on this, please share below. Actually, if you have any perspective, please share below.
I always thought of it this way: if you have a page with lots of linked-to pages that are only one level away from it, and you don’t want all those pages indexed (or counted toward your website’s overall value for the Panda algorithm), use the noindex meta tag. At least the link juice will flow to those closely linked pages and count for something. If you’ve got a website with only one link on it and that link is a doorway that leads to a whole bunch of other pages, such as a blog or a forum discussion board, then block that link’s destination in the robots.txt file. By blocking it, you’ll save tons of crawl budget, and Google won’t waste its time crawling a bunch of URLs it’s got no business crawling in the first place. Only one page will show as blocked by robots.txt in Google Search Console, too. To me, this seems like a reasonable approach, but really, no one has any idea which is the better route to take when it comes to these things. Trial and error, I suppose.
I think the best thing to do might be to block awful pages in the robots.txt file and then remove the links to those pages, if possible. After Panda hit in 2012, I began noticing many larger sites, such as eBay, removing those types of links and using JavaScript instead.
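I don’t know exactly how those bigger sites implemented it, so this is just a rough sketch of the general idea: replace the plain crawlable anchor with an element that navigates via JavaScript (the path and id here are made up). Keep in mind that Google has gotten better at rendering JavaScript over the years, so this is not a bulletproof substitute for the robots.txt block itself:

    <!-- Instead of a crawlable link like <a href="/whats-new/">What's New</a>... -->
    <button id="whats-new">What's New</button>
    <script>
      document.getElementById('whats-new').addEventListener('click', function () {
        window.location.href = '/whats-new/';
      });
    </script>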