EmeraldHike
If you look at the top navigation bar on this website, you'll see a link that says What's New. If you click that link and then look up in the address bar of your browser, you'll see a URL like this:
https://gaulard.com/forum/whats-new/
That's fine. That's the URL that's supposed to be there. Right below that link, you'll see a few others. One of them says New Posts, and its URL is /whats-new/posts/ followed by a number. Now here's the tricky part. If you take note of that number, click around the site a bit, and then come back to that page, you'll see that the number has changed. It changes all the time, and it keeps increasing. I believe this has something to do with how the software handles user sessions. The What's New page is built per visitor, depending on who's logged in and what's new to that user.
The problem is that, even though there's a noindex tag on the page, Google is treated as a new user every time it crawls it. So what starts off as /whats-new/posts/10/ quickly turns into /whats-new/posts/2049586/. Do you see the issue with this? Yes, the page does have the noindex tag, but Google seems to crawl these URLs very aggressively anyway. I have a XenForo site that made it all the way to 20,000 before I stopped the madness.
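To make the "same page, different URL" point concrete, here's a rough Python sketch that collapses the numbered variants back to one path and counts how many URLs pile up behind it. The regular expression and the helper names (`canonical_path`, `count_variants`) are my own assumptions for illustration; they aren't part of XenForo.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape: anything ending in /whats-new/posts/<number>/
_NUMBERED = re.compile(r"^(?P<base>.*/whats-new/posts)/\d+/?$")


def canonical_path(url: str) -> str:
    """Strip the trailing number so every variant maps to one path."""
    path = urlparse(url).path
    match = _NUMBERED.match(path)
    return match.group("base") + "/" if match else path


def count_variants(urls):
    """Count how many distinct URLs share each canonical path."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_path(url), set()).add(url)
    return {base: len(variants) for base, variants in groups.items()}


if __name__ == "__main__":
    crawled = [
        "https://gaulard.com/forum/whats-new/posts/10/",
        "https://gaulard.com/forum/whats-new/posts/2049586/",
    ]
    print(count_variants(crawled))
```

If you feed it a list of URLs from your own crawl stats, any path with a large count is a sign that one page is being crawled under many addresses.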
Since these pages obviously shouldn't be crawled by search engines, they need to be blocked in the robots.txt file of the website in question. The reason is that each and every version of the page is considered "new" because it's got a new number attached to the end of it. They really are the same page, but since they've got unique URLs, search engines treat them as distinct. And because all of these pages are distinct but identical, they're considered duplicates. The noindex tag makes no difference here. It keeps a page out of the index, but it doesn't stop the page from being crawled. These pages are using up your website's bandwidth, and they're also using up your website's crawl budget with Google. Beyond that, search engines consider them low-value pages, and they can really take a toll on your website's rankings.
My advice to you is to block this /whats-new/ directory at all costs in your robots.txt file. This is critical. If you've got any questions, please ask.
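For anyone who wants to check their rule before relying on it, here's a minimal sketch using Python's standard `urllib.robotparser`. The two robots.txt lines in it are an assumption based on this site, where the forum lives under /forum/. If your forum sits at the root of your domain, the path would be /whats-new/ instead. The `is_crawlable` helper is just a name I made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt rules for a forum installed under /forum/.
# Adjust the path to match where your own forum lives.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forum/whats-new/
"""


def is_crawlable(url: str, robots_txt: str = ROBOTS_TXT,
                 agent: str = "Googlebot") -> bool:
    """Return True if the given rules allow this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://gaulard.com/forum/whats-new/",
        "https://gaulard.com/forum/whats-new/posts/10/",
        "https://gaulard.com/forum/whats-new/posts/2049586/",
        "https://gaulard.com/forum/",
    ):
        print(url, "allowed" if is_crawlable(url) else "blocked")
```

Python's parser does simple prefix matching, so treat this as a sanity check on the path rather than a guarantee of how any particular search engine will read the file.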