Question: My SEO team recently located a large number of pages on my website that needed to be blocked from Google crawling, so they suggested that I add those pages and directories to my website’s robots.txt file. The pages not only need to be blocked, but also removed from Google’s index. I’ve already done the blocking, so now I’m wondering how long it will take for Google to remove the pages. Any ideas?
Answer: The short answer is, it’ll take a while. Months, most likely. My advice: since you’ve already blocked the pages in your robots.txt file, go on with your life and forget about it. The worst thing you can do is harp on the situation. Once the pages have been blocked by your robots.txt file, there’s nothing else you can do about the removal process. It’s out of your hands. Both Google and Bing crawl at a snail’s pace.
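For reference, blocking a directory and an individual page in robots.txt looks like the lines below. The paths here are placeholders, not anything from your actual site:

User-agent: *
# Block every URL under this directory (hypothetical path)
Disallow: /private-directory/
# Block one specific page (hypothetical path)
Disallow: /old-page.html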
Long answer: there’s a whole bunch of crap on the internet about this topic. You can throw most of it out the window. What people mostly seem to be confused about is the difference between removing something from Google’s index and stopping it from appearing in Google’s search results. These are two very different things.
Allow me to educate you and anyone else who reads this post. If every single one of your website’s pages were allowed to be crawled by Googlebot, meaning no page was blocked by the robots.txt file, Googlebot would try to crawl as much as it could. It might even crawl every single one of your pages. If Google then liked everything it crawled, it would index everything. Every single one of your pages would be included in Google’s index.
Now, if you added meta noindex to half of your website’s pages, Googlebot would still crawl your site, but after encountering the noindex on some of your pages, Google would stop showing those pages in its results. The pages would still be indexed, but they would no longer be shown in search results. This is very important. By adding noindex to a page on your website, you aren’t removing that page from Google’s index. You are merely removing it from the search results. The page is still indexed and it is still taken into account when Google applies a ranking to the website as a whole. By adding this code to a page, you haven’t really done anything with regard to your website’s ranking ability. The page is still there and it’s still in Google’s index. Please prove me wrong on this concept. I have over 20 years of experience with SEO and I have never improved one of my websites’ rankings by adding noindex meta code to a page.
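For anyone who hasn’t seen it, this is the meta code I’m talking about. It goes inside the page’s <head> section:

<meta name="robots" content="noindex">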
To really drive my point home, let’s say that your website has one good, well written, and valuable page. It also has 50,000 thin, worthless pages that include the meta noindex code. Do you honestly think that Google won’t take those 50,000 pages into account? What about all of the pagerank those pages will carry? Yes, even pages that include noindex hold pagerank. Everything holds pagerank. When Google decides the value of your website, it will include every single one of your 50,001 pages. And the results won’t be good.
Moving on. If you allowed Googlebot to crawl half of your website’s pages, but blocked the other half in your robots.txt file, something else entirely would occur. The crawlable pages would be crawled and the blocked pages wouldn’t be. If the crawlable pages are good and Google likes them, they’ll be included in its index. As for the pages that can’t be crawled, they’ll be indexed too, as long as Google can see the URLs leading to them. If all of the blocked pages are contained inside a directory and their URLs are hidden from Googlebot, they’ll be ignored completely because Googlebot will never see them. As for the URLs it does see, they’ll likely start off with the link anchor text showing in the Google search results, but over time, that text will disappear and a bare URL will remain. Sometimes, though, the link anchor text will remain forever if the page stays linked to. Any URL that Google sees, whether it’s blocked in the robots.txt file or contains the noindex code, will be indexed and will be counted towards something. Remember that.
Now here’s the answer to the question you asked. I’m guessing that Googlebot has already seen the pages that you would like removed, which means they’re in its index. If you block the pages but they’re still linked to, and Googlebot can still see those links, the pages’ content will be removed from Google’s index over time, but the link text will remain in the search results. The reason for this is that pagerank is still flowing to the pages. Blocked pages actually hold value and pagerank.
If you block the pages with your robots.txt file and they’re contained inside a directory (or if you simply stop linking to them), and Googlebot is no longer able to see any links that lead to those pages (essentially making them orphaned), then over time, any pagerank they ever held will eventually drain away, removing the pages from Google’s index. Finally, the pages will be gone. They won’t be counted towards the value of the overall website and they won’t be contained in Google’s index. The big question is, how long will it take for that pagerank to drain away? Usually, it takes a few months at a minimum. It all depends on your website’s crawl rate and how those pages were queued up in Googlebot’s crawl schedule. If Googlebot tries to crawl those pages very soon and sees that they can’t be accessed anymore, the process may be swift. If Googlebot couldn’t care less about those pages and doesn’t want to crawl them in the least, it’ll take a while. I once blocked a directory with 20,000 previously crawled pages and it took an entire year for those pages to be removed from Google’s index. It can be painfully slow at times.
Remember, there’s a difference between crawling and indexing. Googlebot does the crawling and Google does the indexing. The robots.txt file can block crawling, but it doesn’t control indexing. The noindex meta code can stop a page from being returned in the search results, but it can’t stop the page from being indexed. The creators of that code chose their syntax very poorly when they came up with the term “noindex.” It’s misleading.
So really, once you block a page and hide it from Googlebot, you’re not waiting for Google to simply remove the page from its index because it’s blocked. You’re in fact waiting for Google to reduce the page’s pagerank until it’s worthless. At that point, because it’s worthless, Google will remove the page from its index.
PS – Oh yeah – don’t even get me going on the Google page removal tool. People out there love to spout misinformation about this tool. They claim that if you remove one of your website’s pages with this tool, the page will be removed from Google’s index. It won’t be. What you will actually do is stop the page from being returned in Google’s search results. That’s it. The page will still be contained in the index. It will only be removed from the index if its pagerank drops to zero or if the page has been physically removed from the website and a 404 or 410 status code is returned when Googlebot accesses it. Both of these cases can take a lot of time.
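If you do go the physical removal route, it’s worth confirming what status code your removed pages actually return before you wait around for Google. Here’s a quick Python sketch that prints the status code for a URL; the URL below is just a placeholder, so swap in one of your own removed pages:

# Prints the HTTP status code a crawler would receive for a URL.
import urllib.request
import urllib.error

url = "https://www.example.com/removed-page.html"  # placeholder URL
try:
    with urllib.request.urlopen(url) as response:
        print(response.status)  # 200 means the page is still being served
except urllib.error.HTTPError as error:
    print(error.code)  # 404 or 410 is what you want Googlebot to see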