
Google's Allotted Index Size Per Website

  • Thread starter LukeLewis
LukeLewis

Member
Joined
May 7, 2021
Messages
134
Reaction Score
0
Points
21
  • #1
I've got a website that's definitely been hit by Google Panda. The rankings hit the floor after a recent update. I know I've got tons of thin and junk pages in the index, so I'm currently working on getting them all out. It's been a long road, but I've made the necessary changes, so I'm confident things will recover. I did want to report a few areas of interest, though. It's been a few months since I began removing pages, and some odd things are occurring.

First off, Google doesn't like to let go of pages. If you're looking to lift your site out of a thin-content penalty, you'd better start soon. Don't listen to everyone online who tells you to 301 redirect the old pages to new ones. Just delete the old pages. Force them to return 404 or 410 status codes. It doesn't matter which one. Just get rid of the pages and make sure they return errors.
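For anyone who wants to do the same, here's a minimal sketch of the idea, assuming a Python/Flask app and a made-up list of removed paths (your CMS or server config will differ): once a thin page is deleted, its old URL answers with a 410 so crawlers treat it as permanently gone.

```python
# Minimal sketch (Flask app and REMOVED_PATHS are hypothetical): deleted
# thin pages answer their old URLs with a 410 "Gone" so crawlers stop
# expecting them to come back.
from flask import Flask, abort

app = Flask(__name__)

# Hypothetical URLs that used to hold thin/junk content.
REMOVED_PATHS = {"/tag/old-junk", "/archive/2009/thin-page"}

@app.route("/<path:page>")
def serve(page):
    if f"/{page}" in REMOVED_PATHS:
        abort(410)  # permanently gone; 404 works too, per the post above
    return f"Real content for /{page}"

if __name__ == "__main__":
    app.run()
```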

What I'm noticing as I delete pages is that once one page is gone, it sort of opens a spot for a better one to appear in the index. Here's my theory. I'm still working this out in my head, so I hope it makes sense. And if you have any sort of follow-up on this, please let me know below. I'd appreciate it.

I think Google crawls an entire website and then makes a decision about how many pages to allow into its index. So say you've got a website with a total of 10,000 pages, both good and bad. Some of the pages are worthy of being returned in the search results and some are so thin that they're just junk. Google crawls all of the pages and, in this case, decides that the "bucket size" it creates for the site will be about 5,000 pages. This is based both on the number of pages (the size of the website) and on PageRank. Google knows about all 10,000 pages and has put some of them in a holding tank, but it'll only return 5,000 of those pages in search. Now remember that some of the pages Google includes in this "search returnable" group of 5,000 are thin and bad. Conversely, some of the good pages will be placed in the holding tank. The reserve, if you will. Pages in this reserve don't show in the search results.

I think Google rates websites by how many of their pages fall outside the "returnable" group, and that's what hits you with a Panda penalty. So if a 4,000 page website gets crawled and Google deems its bucket to hold 3,900 pages, that means most of the pages are good enough to show in the results and the site has a fairly high overall PageRank. If a 90,000 page website gets crawled and Google deems it to have a bucket size of only 900 search returnable pages, then that's a horrible website that needs to be cleaned up tremendously.
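To make that ratio concrete, here's a quick back-of-the-envelope calculation in Python using the example numbers above. The "bucket" figures are just my hypothetical ones; Google doesn't publish anything like this.

```python
# Purely hypothetical illustration of the ratio described above -- Google
# publishes no such number. A site whose "bucket" covers nearly all of its
# crawled pages looks healthy; one whose bucket covers a sliver looks thin.
sites = {
    "small clean site": {"crawled": 4_000, "bucket": 3_900},
    "large messy site": {"crawled": 90_000, "bucket": 900},
}

for name, counts in sites.items():
    ratio = counts["bucket"] / counts["crawled"]
    print(f"{name}: {ratio:.1%} of crawled pages deemed returnable")

# small clean site: 97.5% of crawled pages deemed returnable
# large messy site: 1.0% of crawled pages deemed returnable
```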

The reason I make all these claims is that I have a website with approximately 8,000 good pages. On top of that, it's got about 30,000 junk orphan pages from some previous software that need to be removed, and it's also got about 50,000 pages that have been blocked with robots.txt and other methods. What I'm seeing is that Google only shows about 6,000 pages in its index, and that number had been steady for a while. But as the bad pages get removed by Google, that 6,000 number appears to be rising, as if the overall quality score is increasing and Google is deeming this website's bucket larger and larger. I'm also seeing pages return to the index that I haven't seen in months. I'll use the site: command to find pages, and ones I thought had been removed a long time ago are now being revealed. It's almost as if the more the website shrinks, the more good pages are shown. I know, it's strange, but it seems to be true.

One more critical observation. When I blocked pages with the robots.txt file in the past, it seems that Google counted those blocked URLs as pages in the index. Meaning, it would bump out good pages and show these blocked pages instead. So by blocking the pages with the robots.txt file, I was essentially shrinking the bucket. Once I unblocked the pages and had them return 403, 404, and 410 HTTP status codes, the good pages began getting crawled and included in the index again. It's so odd.
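If you want to sanity-check a cleanup like this, here's a rough Python sketch (example.com and the paths are hypothetical) that checks whether each URL is still blocked by robots.txt and what status code it now returns.

```python
# Rough audit sketch (example.com and the paths are hypothetical): for each
# cleaned-up URL, check whether robots.txt still blocks Googlebot and what
# HTTP status the URL now returns (we want crawlable + 403/404/410).
import urllib.error
import urllib.request
import urllib.robotparser

SITE = "https://example.com"                 # hypothetical domain
PATHS = ["/old-junk-page", "/orphaned/123"]  # hypothetical cleaned-up URLs

robots = urllib.robotparser.RobotFileParser(f"{SITE}/robots.txt")
robots.read()

for path in PATHS:
    url = SITE + path
    crawlable = robots.can_fetch("Googlebot", url)
    try:
        status = urllib.request.urlopen(url).status
    except urllib.error.HTTPError as err:
        status = err.code                    # 403/404/410 land here
    print(f"{path}: crawlable={crawlable}, status={status}")
```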

If you have any input or experience with this, please let me know down below. I'd love to learn more about it.
 
Newman

Member
Joined
May 11, 2021
Messages
106
Reaction Score
0
Points
23
  • #2

Room in Google's Index?

I was watching a Google Webmasters video a few days ago when I heard something interesting. If memory serves, the person in the video was discussing how web pages canonicalize into one another. Basically, they were explaining how important it is to send Google clear signals so it can exclude pages that don't need to be in the index and merge pages that are duplicates. The person really emphasized how critical it is to allow Google to merge similar and duplicate pages, because a website's page count will go down, which will help the website rank overall. Then, and this is the interesting part, the person said that if a site's pages are allowed to be merged, or canonicalized, more pages will be allowed into the index. As if there's a cap on the number of pages each website is allowed and, by removing and merging the cruft, more good pages can fit into the container. I found that little statement fascinating.
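For what it's worth, one of those "clear signals" is the rel="canonical" hint, which can go in the HTML or in an HTTP header. Here's a small sketch of the header form, assuming a Python/Flask app with made-up URLs; it's just to illustrate the merging signal, not a quote from the video.

```python
# Sketch of one explicit canonical signal (Flask app and URLs are made up):
# a duplicate URL variant points at the preferred URL via a rel="canonical"
# Link header, which Google accepts as an alternative to the <link> tag.
from flask import Flask, make_response

app = Flask(__name__)
CANONICAL = "https://example.com/widgets/blue-widget"  # hypothetical

@app.route("/widgets/blue-widget")
@app.route("/widgets/blue-widget/print")  # duplicate variant to fold in
def blue_widget():
    resp = make_response("Blue widget page")
    resp.headers["Link"] = f'<{CANONICAL}>; rel="canonical"'
    return resp
```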

This whole thing got me thinking. Perhaps we're not considering things correctly. We like to talk a lot about crawl budget, where Googlebot only visits a certain number of our web pages per day, but what if we're focusing on the wrong thing? What if, instead of crawl budget, we should be thinking about "container size," or how many pages, good or bad, we've currently got in the index versus how many we'd like to get into the index? What if, because we've got so many thin and bad pages currently in the index, our container is full? And because of that, Googlebot simply doesn't see the need to crawl very many of our other pages, so we're stuck with a low crawl rate.

If this is the case, I'd say that the primary goal of any webmaster should be to keep the cruft out of Google's index. Make sure duplicate and similar pages fold into one another, stay away from thin pages, stay away from noindex, stay away from blocking pages in robots.txt, and if at all possible, don't let Google know about any bad pages at all.

What's your opinion on this? I know this is probably old news, but I think I've been pondering things a bit backwards for years.
 
WendyMay

Member
Joined
May 11, 2021
Messages
142
Reaction Score
0
Points
21
  • #3
Newman said:

Room in Google's Index?

I was watching a Google Webmasters video a few days ago when I heard something interesting. …
So you're saying that pages that are blocked by robots.txt fill up the index just like thin pages do? Can you please elaborate on that? Also, how do 301 redirects play into this?
 
Newman

Member
Joined
May 11, 2021
Messages
106
Reaction Score
0
Points
23
  • #4
WendyMay said:
So you're saying that pages that are blocked by robots.txt fill up the index just like thin pages do? Can you please elaborate on that? Also, how do 301 redirects play into this?
Pretty much. A while ago, I read on Google's website that they don't recommend blocking duplicate content with robots.txt. I suspect that's because each and every URL that's blocked still counts for something. Those URLs can't work for you, and I can't imagine they're good for a site. Now, if you have a huge directory filled with pages that you don't want indexed and there's only one path to those pages, then yes, blocking that directory with robots.txt is the way to go. Otherwise, if every single page you'd like to block has an individual path to it, you might want to figure out a way to remove those paths and somehow delete those pages. Examples of these individual pages would be user account pages on classifieds and forum sites. Each and every post has a link to these pages, and Google doesn't need to know about any of them. It shouldn't be crawling them. So if you placed a noindex meta tag on them and allowed Google to crawl them, that would waste crawl budget. And if you blocked them in robots.txt, you'd be filling up your index "bucket" with blocked pages. No good either way. The best thing to do in this case is to remove the links and then have those pages return an error, such as a 403 status code, so when Google crawls the link, it'll remove the page from the index. And since the links will be hidden from anyone who isn't logged in (which includes search engine crawlers), new users' pages will never even be discovered.
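Here's a rough sketch of what I mean, assuming a Python/Flask site and a made-up session check: profile pages return 403 to anyone who isn't logged in (which includes crawlers), so any stray link Google finds leads to an error and the page falls out of the index.

```python
# Sketch of the approach above (Flask and the session check are hypothetical):
# member profile pages return 403 unless the visitor is logged in, so crawlers
# hitting an old link get an error and the page drops out of the index --
# no noindex crawl budget spent, no robots.txt entry filling the "bucket".
from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "replace-me"  # hypothetical

@app.route("/members/<username>")
def member_profile(username):
    if not session.get("logged_in"):  # crawlers are never logged in
        abort(403)
    return f"Profile for {username}"
```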

As for 301 redirects, I think they're terrible. Search engines follow them all the time but only canonicalize them with the target pages some of the time. I can't stand redirects, though I understand they're a necessary evil. If your site has 301 redirects that can be removed or dealt with another way, then focus on that. No website should have permanent 301 redirects baked into its site structure. If you have any more questions, please ask. Also, here's a video for you from Matt Cutts. Listen closely to how he suggests that pages fold into one another. That's the ultimate goal.

 