How to Fix Index Bloat

CraigHardy · Jan 15, 2022

Index bloat is bad on so many levels. Unless you're a webmaster who dwells on these kinds of things, you may not even know your website has a problem. You most likely launched your site a while back and have been adding to it ever since. It began to rank for some keywords a few months to a few years later, and you've been coasting since then. Did you know that your site may have the potential to rank much, much better than it currently does? Did you know that you may have an issue that we in the webmaster and SEO world refer to as index bloat?

What is index bloat, anyway? I'll explain it this way. The index I'm referring to here is Google's. It could be Bing's or any other search engine's, but for now, let's focus on Google, since that's really the only one that matters. If your website consists of a homepage and four other pages, then you've got a five-page site. Those five pages are all Google should know about and have indexed. Let's say though that you've also got some additional pages on the site that you don't want to count. They don't really matter to search engines, but they help the website operate the way it's supposed to. Every page except for the homepage sorts five different ways. Right there, you've got 20 extra pages. Now let's say that you've got some additional pages that 301 redirect to those four pages. Okay, add on eight more there. Now let's say that you've got a contact page that spins off an additional page every time someone uses it. There's an endless number of pages there. Basically, what I'm trying to say is that, depending on which content management system you're using, you can have a huge number of pages beyond those you know about polluting Google's index. You may not be aware of it, but these extra pages are dragging down your entire site's ranking.
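To make the arithmetic in that example concrete, here's a minimal sketch in Python. The variable names are made up for illustration, and the open-ended contact pages are left out because they have no fixed count.

```python
# Counts taken from the five-page example above.
intended_pages = 5                    # homepage + four other pages
sortable_pages = intended_pages - 1   # every page except the homepage
sort_orders = 5                       # each of those sorts five ways
redirecting_urls = 8                  # extra URLs that 301 to real pages

sort_variants = sortable_pages * sort_orders                      # 20
crawlable_urls = intended_pages + sort_variants + redirecting_urls

print(f"{crawlable_urls} crawlable URLs for {intended_pages} real pages")
# prints "33 crawlable URLs for 5 real pages"
```

Even before the contact page starts spinning off URLs, the site exposes more than six times as many URLs as it has real pages.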

I'd like to get something out of the way right now. Blocking pages in the robots.txt file stops Google from crawling them, but it doesn't reliably keep them out of the index. Blocked pages that are already indexed may fall out in a few months to a few years, unless they're linked to from a prominent position on the site. The thing is, there's no guarantee that they'll disappear, and it's a gamble keeping these pages around. It's not good practice. The same is true for pages that have a noindex meta tag on them, for 301 redirects, for pages with the canonical tag on them, and for thin and junk pages. All of these types of pages need to go. They need to disappear from your site in order to regain your rankings or allow them to flourish in the first place.

I've been working in SEO for over a decade and I can't tell you how many times I've seen a supposed SEO professional tell a client that they need to place a noindex meta tag on a thin page in order to remove it from Google's index. This is complete garbage advice. The truth is, if you'd like to remove a page from Google's index, you should delete the page or block it via authentication. Also, you can forget about the canonical tag and 301 redirects for the sake of search engines. They don't need them. I've seen pages consolidate into one automatically within minutes of creation, and I've seen pages that were supposed to merge via a 301 redirect never merge in the index, even after 10 years. Yes, use 301 redirects, but only for users. Don't count on Google or the other search engines obeying them at all. Again, if you've got these types of redirects, you'll need to change your website's architecture so search engines can't see them. If you don't, you'll have a bad case of duplicate content and you don't want that.

If you're running WordPress, the big sources of bloat are author pages, tag pages, and the pages inside the /page/ directory. Those last ones are the numbered pagination pages linked to from the bottom of the homepage: 1, 2, 3, and so on. There are plugins that remove the /page/ directory pages and the author pages. As for the tag pages, don't use tags at all. You'll only get yourself in trouble. And as for those individual image pages that get spun off from every image you place in your posts, there are plugins to deal with them as well.

My big point in this post is this: in order to reduce index bloat, you'll need to delete pages. And to do this, you'll need to alter your website's architecture. This isn't an easy task, but it's necessary. In order to find out whether you've got index bloat, you can use one of the many SEO services out there to analyze your site (Moz, Botify, SEMrush, etc.). You can also do a scan yourself with an application such as Xenu Link Sleuth. This is a very handy and free program that I've used a lot in the past. It crawls your site and makes you aware of things you never thought you'd see. Only after you have all the necessary information will you be able to determine the best course of action.
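Once a crawler has given you a list of URLs, a quick tally by URL pattern shows where the bloat is coming from. Here's a rough sketch in Python, not a finished tool. The function names, the example.com URLs, and the bucket labels are all made up for illustration, and the /page/, /tag/, and /author/ patterns assume WordPress-style pretty permalinks, so adjust them to your own site.

```python
from collections import Counter
from urllib.parse import urlsplit


def classify(url: str) -> str:
    """Put a crawled URL into a rough bloat bucket (hypothetical labels)."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if parts.query:
        return "parameter"      # sort orders, filters, session IDs, ...
    if "/page/" in path:
        return "pagination"
    if "/tag/" in path:
        return "tag"
    if "/author/" in path:
        return "author"
    return "content"            # everything else: the pages you meant to publish


def summarize(urls):
    """Count how many crawled URLs fall into each bucket."""
    return Counter(classify(u) for u in urls)


# Made-up crawl output for illustration.
crawled = [
    "https://example.com/",
    "https://example.com/widgets/",
    "https://example.com/widgets/?sort=price",
    "https://example.com/page/2/",
    "https://example.com/tag/seo/",
    "https://example.com/author/craig/",
]

for bucket, count in summarize(crawled).most_common():
    print(f"{bucket}: {count}")
```

If the "content" bucket is a small share of the total, that's a strong hint the architecture is generating pages you never meant to publish.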

If you've got any questions about any of this, please ask. As I said, I've got lots of experience so I may be able to help out.

EmeraldHike · Jan 15, 2022

I have a website that was launched all the way back in 2004. It's a classifieds site. Back then, each ad page had a "Contact Seller" link that led to a contact page. Each of these pages had a different URL. So basically, there were thousands of these pages through the years. Hundreds of thousands. If a site visitor wasn't logged into their account, they wouldn't land on the actual contact page. They'd land on a login page that was identical to the overall website login page, except for the URL. The thing is, every "login" page had a different URL. The same as the contact page. Essentially, it worked like this: if user logged in, then go to unique contact seller page. If user not logged in, go to unique contact seller page, but show login form instead of contact form. It's pretty standard for these types of sites.

For a few years, I had these contact pages blocked in the robots.txt file, which worked well. Before that though, I allowed Google to crawl each and every one of them. Since none of them appeared in the search results, I assume they were being canonicalized with the site login page. There was no problem with this.

Just recently, I made a change to the site so that neither the login page nor any of these contact seller pages exist anymore. Now, all of them are returning 404 status codes. They're dead pages. The weird thing is, as these pages are being deleted, I'm seeing some from all the way back in 2004 when I use the site: command. I know this because they used to have a very distinct page title that I haven't used in over a decade.

I guess I'm writing this post just to say that Google has a very long memory. If you think a page is long gone just because you haven't seen it in a while, it's likely not gone. This is why I prefer to either block pages in the robots.txt file that I don't want to return in the search results or kill them off completely so they return either a 404 or 410 code. I don't like redirects or canonicalizations. They only cause issues in the future.

Cameron · Jan 15, 2022

I feel like it's extremely difficult to remove pages from my website. It's got terrible bloat and I know there are thousands and thousands of pages that need to get deleted. I have that part all sorted out, so that's not the problem. Some of the pages are returning 404 status codes, some 410, and some 403. Eventually all the pages will be dropped from Google's index.

What I have the most trouble with is patience. When I make a change, I look through all my statistics obsessively all the time. I expect things to happen immediately and of course they never do. Also, each time I begin deleting large numbers of thin pages, I feel as though the site's rankings drop. Actually, I know they do, which makes me lose my nerve and do something stupid like link to the pages again and block them in the robots.txt file or something. I know what I do is wrong, but I just don't want to lose even more traffic than I already have. I'll need to have someone tie my hands together so I don't touch anything after I change it.

I have a theory as to why a website's rankings fall even further after it's first been hit with a Google penalty, such as Panda or whatever. Basically, even thin pages have PageRank, and when they get deleted en masse and the links to them disappear, that's a lot for the search engine to digest. Let's say a page has a fraction of a percent of PageRank when it's unlinked and set to return an error status. That page's PageRank will continue to fall before it's actually removed from the index, because Google might not know about the removal or error code for months, until the next time it tries to crawl the page. It's crazy, and having tons of pages with weakening PageRank all over your website is sure to reduce the overall ranking. From what I've been reading, when deleting large numbers of pages, it's best to keep your links to them so the search engines think they're still there until they try crawling them again. At that point, the search engine sees that the page is dead and will eventually remove it from the index. I personally watch my stats in the graphs in Google Search Console. It's a painfully slow process, but like anything that has to do with SEO, patience is a virtue. You definitely need patience to play this game.

CampFireJack · Jan 15, 2022

Does a page that's linked to and blocked in the robots.txt file count as an indexed thin page to Google? That's my big question. I've had a lot of experience with blocking pages, rarely to good effect. Usually what happens is that I find a group of pages that I consider thin, but that have been crawled and indexed by Google, so I block them in the robots.txt file. Over the months, these pages pile up, and in some cases I've seen dramatic ranking drops afterwards. I can only suspect that blocking the pages had something to do with that. In other cases, I've seen my rankings skyrocket months after blocking a group of pages. The difference, I think, is that some pages were still linked to and some weren't. The ones that weren't sat in a directory with only one path in, and that path was blocked too. The pages that weren't linked to anymore dropped from the Google index, while the ones that were linked to seemed to hang around forever. So keep that in mind if you're trying to get rid of pages due to Panda or some other penalty. Ultimately though, I think all pages that are blocked by the robots.txt file will eventually drop from the index. It just may take a while.

To deal with index bloat, I like to periodically scan my website's log files. I'll see exactly what Google is crawling and make adjustments if necessary. I'll also check the Valid area of the Coverage section inside of Google Search Console. That gives me a good indication of whether or not Google is indexing pages it shouldn't be.
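For anyone who wants to try the log scan, here's a minimal sketch in Python. It assumes your server writes the common "combined" access log format used by default in Apache and Nginx. The function name and the sample lines are made up for illustration. Keep in mind that the user-agent string can be spoofed, so this counts requests that claim to be Googlebot rather than verified ones.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user-agent fields of a
# combined-format access log line.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count (path, status) pairs for requests whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and "Googlebot" in m.group("agent"):
            hits[(m.group("path"), m.group("status"))] += 1
    return hits


# Made-up sample lines; in practice you'd pass an open log file instead.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [15/Jan/2022:10:00:00 +0000] "GET /contact/123 HTTP/1.1" 404 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [15/Jan/2022:10:05:00 +0000] "GET /contact/123 HTTP/1.1" 404 512 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [15/Jan/2022:10:06:00 +0000] "GET /ads?sort=price HTTP/1.1" 200 9000 "-" "{GOOGLEBOT}"',
    '203.0.113.9 - - [15/Jan/2022:10:07:00 +0000] "GET / HTTP/1.1" 200 4000 "-" "Mozilla/5.0"',
]

for (path, status), count in googlebot_hits(sample).most_common():
    print(count, status, path)
```

Paths that show up here but that you never meant to publish are the ones worth chasing down.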

I like to actually remove pages I don't want indexed anymore by having the server return a 404 status code for them, but in the past, if I was dealing with parameter URLs, I'd just block the "?" in the robots.txt file. It's important to note that any links to those parameter pages must be removed as well because, as I stated above, the pages will hang around forever if you don't. I'm not sure if applying a nofollow link attribute counts as removing the link, so if you know, please update me down below.
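For reference, blocking the "?" usually means a wildcard rule like the one in the sketch below. The Python around it is a simplified illustration of the wildcard matching that Google documents for robots.txt, where "*" matches any run of characters and a trailing "$" anchors the end. It is not a full parser: it ignores Allow rules, user-agent groups, and rule precedence, and the helper names and example.com URLs are made up.

```python
import re
from urllib.parse import urlsplit

# The rule itself: block any URL whose path is followed by a query string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*?
"""


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow pattern into a regex ('*' wildcard, optional '$' anchor)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(url: str, disallow_rules) -> bool:
    """True if the URL's path plus query matches any Disallow pattern."""
    parts = urlsplit(url)
    target = parts.path or "/"
    # Note: urlsplit drops a bare trailing "?" with no parameters,
    # so that edge case is not handled here.
    if parts.query:
        target += "?" + parts.query
    return any(rule_to_regex(r).match(target) for r in disallow_rules)


# Pull the Disallow patterns out of the sample file above.
rules = [
    line.split(":", 1)[1].strip()
    for line in ROBOTS_TXT.splitlines()
    if line.lower().startswith("disallow:")
]

print(is_blocked("https://example.com/ads?sort=price", rules))  # True
print(is_blocked("https://example.com/ads", rules))             # False
```

A rule like this only stops crawling. It says nothing about links that still point at the parameter URLs, which is why those have to be removed separately.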
