
Best Way to Set Up robots.txt on Xenforo Forum & Improve SEO

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I'm looking to set up my XenForo robots.txt file as well as possible to improve my website's SEO, improve my Google Search Console results (Valid Pages & Excluded Pages), and optimize Google's crawling, my crawl budget, and my indexed pages.

I've made a bunch of posts on this topic on Xenforo.com (all the problems I've been having)...and I seem to keep getting the same advice from the "experts"...which is to keep the robots.txt file short & simple. This doesn't seem to have worked for my site...since in Google Search Console...my site's average position keeps getting lower & lower...even though my site has lots of good content.

One of the issues I have is a really, really high "By Response"..."Other Client Error (4xx)" value in Google Search Console.

If you're not familiar with this...it can be found in Google Search Console >> Settings >> Crawl Stats (click on "Open Report") >> By Response >> Other Client Error (4xx).

I'm told this value should be around 5% (I guess ideally it would be 0%). The value for my site has been over 40%...and currently it's at 31%.

I know if I scan my site for errors...many, many of the 4xx errors are 403 errors...and many of them come from member accounts. I'm thinking that all of these member account 403 errors may be getting crawled by Google...and making my site look bad.

Here's a screenshot from when I scan XenForo.com (below). As can be seen...these are 403 errors...and most of them are related to member account info. If I scan my site...I get exactly the same thing:

Screen Shot 2021-12-24 at 11.47.43 AM.png

I saw your thread post (on XenForo.com) where you listed what your robots.txt file looks like now. Do you think if I set up my robots.txt like yours...it would keep Google from crawling these areas of my site...and get rid of these member account 403 errors?

Thanks much:)
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi there - From what I've seen, Google hates 403 errors. I used to have a lot of them too until I began blocking them in my robots.txt file. There is absolutely no reason Google should crawl them (/members/, /attachments/). In my experience, every time I allowed Google to crawl them, both my crawl rate and valid pages dropped like a rock. When I block them from being crawled, both of those things go back up again. As far as I can tell, allowing those pages to be crawled tells Google that your website is very low quality. It also tells Google that the site isn't worth crawling much because 40% of the pages aren't available. I blocked them and it seems to be having a positive impact.
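Just to illustrate the kind of rules I'm talking about, here's a minimal sketch (assuming XenForo is installed in your root directory; adjust the paths if yours lives in a subdirectory):

User-agent: *
Disallow: /members/
Disallow: /attachments/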

As you can see from my robots.txt file, I also blocked a bunch of other pages. Basically, if I don't want a page to appear in the search results, I don't want it crawled either. That's a waste of crawl budget. I also noticed that since I blocked all of these pages, Google is now crawling pages 2, 3, 4, and so on of both forums and threads. It never used to do that.

I've also seen the advice from the "experts" on XenForo's board as well. Many of them say, "Well, it's just a problem with Google and it will go away by itself." Yeah right.

If you decide to block the pages in your robots.txt file, understand that it'll take months to notice a difference and that your ranking will likely go down in the meantime. It's a huge change to make, but, in my opinion, a necessary one. Here are a few other XenForo forum robots files for you to look at. These are successful sites:

https://advrider.com/robots.txt

https://forums.tomshardware.com/robots.txt

https://forum.wordreference.com/robots.txt

There are varying degrees of blocking. For those sites that don't have things blocked, I'm at a loss as to how they are ranking so high. Maybe they've got tons of links or something. I just don't know.

Let me know your thoughts.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Hello Jay...thanks very much for the help & info.

If interested...here are a couple of XenForo threads I posted previously on this topic. The first one is rather long & detailed...but could be interesting reading. There are lots of responses from other XF members...so there could be something helpful in there. Looks like you posted in the 2nd one.:)

https://xenforo.com/community/threads/traffic-down-since-vb-to-xf-migration.190434/

https://xenforo.com/community/threa...ole-valid-pages-dropping.188031/#post-1479353

...Google hates 403 errors. I used to have a lot of them too until I began blocking them in my robots.txt file. There is absolutely no reason Google should crawl them (/members/, /attachments/).

Yes I have loads & loads of these...and almost all of them seem to be related to member accounts (exactly like the screenshot I included earlier).

I would like to take care of these ASAP. Can you please tell me exactly what I need to add to my robots.txt file to keep Google from crawling this stuff (please)?:)

As you can see from my robots.txt file, I also blocked a bunch of other pages. Basically, if I don't want a page to appear in the search results, I don't want it crawled either. That's a waste of crawl budget.

Would you suggest that I make my robots.txt file look exactly like your robots.txt...and that will block everything I don't want Google crawling?

Or do you have a bunch of stuff there that may not apply?

I also noticed that since I blocked all of these pages, Google is now crawling pages 2, 3, 4, and so on of both forums and threads. It never used to do that.

I'm not an expert...but learning more all the time. Can you explain what you mean here please?

What I mean is...when you say "Google is now crawling pages 2, 3, 4..."...how do you find this information out? What tool can I use to find this info? Can I find it in Google Search Console?

I'd love to check to see what Google is crawling for my site currently.

If you decide to block the pages in your robots.txt file, understand that it'll take months to notice a difference and that your ranking will likely go down in the meantime. It's a huge change to make, but, in my opinion, a necessary one.

I hear ya. I'm willing to "take the pain" for a while...if it means better things down the road!:)

Here are a few other XenForo forum robots files for you to look at. These are successful sites:

https://advrider.com/robots.txt

https://forums.tomshardware.com/robots.txt

https://forum.wordreference.com/robots.txt

Thanks for the links. I always like to see examples from successful sites. If they're successful...they must be doing something right!:)

There are varying degrees of blocking. For those sites that don't have things blocked, I'm at a loss as to how they are ranking so high. Maybe they've got tons of links or something. I just don't know.

I hear you!:)

Maybe the improvement from blocking more or less in the robots.txt varies by site. Maybe some of these sites are run by people really skilled in SEO...really skilled in coding...or they make tons of revenue each month...and don't worry too much about it.

I think if I was making lots of revenue each month...I'd probably be afraid to change anything. "If it ain't broke...don't fix it!";)

Really appreciate all the help,:)

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi Nick,

Thank you for the detailed response. I actually enjoy discussing the technical aspects of XenForo as well as SEO and it's nice when someone takes things as seriously as I do.

Okay, it would be helpful to know the URL of your website. I can't recommend any sort of robots.txt setup until I know what your directory structure looks like. So, if you wouldn't mind, send that over.

Also, I read the first page of the thread you sent over (the previous one you wrote). I see that you had a huge traffic drop when switching over to XenForo. From first glance, I noticed that you had a ton of 403 errors, which means you may have had your member pages being crawled on your old site and not allowed to be crawled (or actually crawled, but dead pages) on the new one (hence the 403 errors). That right there will throw Google for a loop. By doing that, you essentially told Google that half of your site went missing. When it comes to forums, member pages should rarely be crawled because they're such low quality. But whatever was happening, having all those 403 responses was no good. Hopefully, after blocking them, all of those pages will drop from Google's index and you won't have to worry about them anymore.

When I say Google is crawling pages 2, 3, 4, etc...I mean Google is now crawling paginated pages. Take a look at this page:

https://xenforo.com/community/forums/styling-and-customization-questions.47/

See up on top where it goes from page 1 to 681? That's what I'm talking about. When low quality pages are being crawled by Google, those extra pages in the series rarely, if ever, get crawled (and then, by extension, the thread pages that are linked to from the forum pages don't get crawled either). The same is true for the paginated series for threads. See how there are pages 1 and 2 for this thread?

https://xenforo.com/community/threa...tars-in-topic-view-without-blurriness.156674/

The way I see these pages being crawled is by scanning my sites' log files. I can see Googlebot hitting those pages. I do this by hand. I download the log files and search "google" in the text file and then click through all the results. You can contact your host to see how to download the log files or you can use an analyzer. I'm no expert in this department either. I just know the one way to do it on my server.
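If picking through the raw file by hand gets tedious, here's a rough sketch of a small script that does the same thing - it pulls out the Googlebot lines and tallies the requested paths by response code. This assumes a standard combined-format access log saved locally as access.log; your host's file name and format may differ:

import re
from collections import Counter

# Tally the paths Googlebot requested, grouped by HTTP status code.
# Assumes a combined-format access log saved locally as "access.log".
hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        match = re.search(r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]*" (\d{3})', line)
        if match:
            path, status = match.groups()
            hits[(status, path)] += 1

# The 403s on member and attachment pages should jump right out.
for (status, path), count in hits.most_common(30):
    print(status, count, path)

It won't tell you anything the manual search doesn't, but it saves a lot of clicking.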

So if you want, send your site over and I can take a look at your structure. I won't tell you that you should do exactly what I did, but I can point you in the right direction. I would hate to suggest something that would create a negative effect that you couldn't live with. Really though, I'd love to fix all these crawling issues once and for all so we all can move on and build our sites out like we should be doing. I can also offer you advice and give you my rationale behind offering it to you. If you agree and if what I say makes sense, you can go ahead with the suggestion. If you think I'm making a mistake, we can discuss why and we both become smarter in the process.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I actually enjoy discussing the technical aspects of XenForo as well as SEO and it's nice when someone takes things as seriously as I do.

I'm usually a pretty detail oriented person...and like to get to the bottom of a problem (or even improve things further when things are going well).:)

Okay, it would be helpful to know the URL of your website. I can't recommend any sort of robots.txt setup until I know what your directory structure looks like. So, if you wouldn't mind, send that over.

Here's what my robots.txt used to look like previously:

User-agent: *
Crawl-delay: 5
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /forums/members/
Disallow: /members/
Disallow: /forums/member.php
Disallow: /member.php
Disallow: /forums/calendar.php
Disallow: /calendar.php

Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /login/
Disallow: /members/
Disallow: /admin.php
Allow: /

Since then (at least the past 6 months)...I've seriously thinned down my robots.txt...based on feedback I got from some XF folks...and what some other XF sites' robots.txt files look like.

Here's what my robots.txt looks like now (and probably has looked like for at least the last 6 months):

User-agent: *

Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /wp-admin/
Disallow: /readme.html

Disallow: /login/
Disallow: /admin.php
Allow: /

My forum is located in the root directory of my server. I don't know if any robots.txt for my site has been set up properly or not since I migrated from vB4 to XF about 18 months ago.

As a start to improving things...can you tell me what I need to add to my robots.txt to make sure the Google Crawler doesn't read member account info (so I can get rid of all these 403 errors) please?

Anything else you can suggest to add to my robots.txt I will also try. Clearly what I'm doing now is not working. Lol.

Also, I read the first page of the thread you sent over (the previous one you wrote). I see that you had a huge traffic drop when switching over to XenForo. From first glance, I noticed that you had a ton of 403 errors, which means you may have had your member pages being crawled on your old site and not allowed to be crawled (or actually crawled, but dead pages) on the new one (hence the 403 errors). That right there will throw Google for a loop.

Exactly. That's why I would like to set up my robots.txt so that this sort of info is NOT crawled by Google.:)

The way I see these pages being crawled is by scanning my sites' log files. I can see Googlebot hitting those pages. I do this by hand. I download the log files and search "google" in the text file and then click through all the results. You can contact your host to see how to download the log files or you can use an analyzer. I'm no expert in this department either. I just know the one way to do it on my server.

I've done something similar to this before. I'm definitely no expert...I did this by following exactly what someone else told me to do. Then with some help from my server host...they ran the server log query...and put the info into a file that I could scan & review.

I was asking how you were doing this...just in case there was an easier way for a non-expert to uncover this data.

I would hate to suggest something that would create a negative effect that you couldn't live with.

No worries. I think anything suggested can only help (based on what I mentioned above what my robots.txt looks like now)...and how my Google Search Console statistics are dropping. If I don't understand something...I will certainly ask before changing anything.

Thanks,:)

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi - I just made an update to my earlier post. You may want to check it out.

https://xenforo.com/community/threads/decrease-in-google-indexed-pages.140231/post-1554160

Also, I'm not sure why you're blocking Wordpress directories in the root directory if you've got XenForo installed in the root directory.

Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /wp-admin/
Disallow: /readme.html

If you've got XenForo installed in the root directory and if you want your robots.txt file to look just like mine, this is what yours should look like:

User-agent: *
Disallow: /account/
Disallow: /admin.php
Disallow: /attachments/
Disallow: /conversations/
Disallow: /find-threads/
Disallow: /forums/*/create-thread
Disallow: /forums/*/post-thread
Disallow: /goto/
Disallow: /job.php
Disallow: /login/
Disallow: /logout/
Disallow: /lost-password/
Disallow: /members/
Disallow: /misc/
Disallow: /online/
Disallow: /posts/
Disallow: /profile-posts/
Disallow: /register/
Disallow: /search/
Disallow: /threads/*/add-reply
Disallow: /threads/*/approve
Disallow: /threads/*/draft
Disallow: /threads/*/latest
Disallow: /threads/*/post
Disallow: /threads/*/reply
Disallow: /threads/*/unread
Disallow: /whats-new/

Currently, you've essentially got nothing blocked. That's causing major problems, in my opinion. Just having the /whats-new/ pages crawled is chewing up tons of your crawl budget. That's a "spider trap". Those pages replicate to infinity. By using the robots.txt I gave you above, Google should be blocked from crawling any useless 403 pages and all other useless pages you don't want in the index. And hopefully, within a month or two, you should see an uptick in valid pages in Google Search Console.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Hi - I just made an update to my earlier post. You may want to check it out.

https://xenforo.com/community/threads/decrease-in-google-indexed-pages.140231/post-1554160

Just checked it out. Looks like your valid pages are increasing...and blocked pages are also increasing.
On the surface this is not a trend most folks would think was possible (both valid & blocked pages increasing at the same time). Most folks would probably say...if one goes up...the other must go down. Lol

  • Of course blocked pages are increasing due to how the robots.txt is setup now (versus previously).
  • And it would seem due to the complexities of how Google crawls websites...and how Google determines valid pages...your valid pages is increasing as well.

It would seem to confirm your theory: if Google sees fewer "junk"/"thin content" pages (blocked by robots.txt)...the Google crawler will reward a site with more valid pages.:)

Also, I'm not sure why you're blocking Wordpress directories in the root directory if you've got XenForo installed in the root directory.

I'm not 100% sure either. My site's WordPress install is in a directory called "blog". I didn't set things up...someone else did the vBulletin to XenForo migration for me. It's possible they put some redirects in the .htaccess file...which maybe makes everything work OK. But maybe the robots.txt file is not set up properly for my site's WordPress install.

If my WordPress install is in the directory "blog"...should the robots.txt look like this?:

Allow: /blog/wp-content/uploads/
Disallow: /blog/wp-content/plugins/
Disallow: /blog/wp-admin/
Disallow: /readme.html

If you've got XenForo installed in the root directory and if you want your robots.txt file to look just like mine, this is what yours should look like:

User-agent: *
Disallow: /account/
Disallow: /admin.php
Disallow: /attachments/
Disallow: /conversations/
Disallow: /find-threads/
Disallow: /forums/*/create-thread
Disallow: /forums/*/post-thread
Disallow: /goto/
Disallow: /job.php
Disallow: /login/
Disallow: /logout/
Disallow: /lost-password/
Disallow: /members/
Disallow: /misc/
Disallow: /online/
Disallow: /posts/
Disallow: /profile-posts/
Disallow: /register/
Disallow: /search/
Disallow: /threads/*/add-reply
Disallow: /threads/*/approve
Disallow: /threads/*/draft
Disallow: /threads/*/latest
Disallow: /threads/*/post
Disallow: /threads/*/reply
Disallow: /threads/*/unread
Disallow: /whats-new/

I'm definitely going to give this a try.:)

Currently, you've essentially got nothing blocked. That's causing major problems, in my opinion. Just having the /whats-new/ pages crawled is chewing up tons of your crawl budget. That's a "spider trap".

Yes I know my robots.txt at the moment is essentially not blocking much. I did this based on the advice of someone who was supposed to be a XenForo expert...so I thought I would give this person's "less is more" robots.txt approach a try.

After about 6 months I don't think this is working...which is why I wanted to revise my robots.txt to either what my robots.txt was before...or giving your robots.txt setup a try.

Here's what my robots.txt used to look like (maybe 6 months ago):

User-agent: *
Crawl-delay: 5
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /forums/members/
Disallow: /members/
Disallow: /forums/member.php
Disallow: /member.php
Disallow: /forums/calendar.php
Disallow: /calendar.php

Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /login/
Disallow: /members/
Disallow: /admin.php
Allow: /

You'll notice some of the lines of my old robots.txt include a directory called "forums". My server doesn't actually have a directory called "forums". The person who did my XenForo migration & install said that this "forums" directory is invisible in some way (something about how XF is written).

There's also this post by "djbaxter" (who I think most folks recognize as a pretty knowledgeable XenForo person):

https://xenforo.com/community/threa...vb-to-xf-migration.190434/page-5#post-1509974

Here it is for easy reference:

User-agent: *
Disallow: /forums/whats-new/
Disallow: /forums/posts/
Disallow: /forums/tags/
Disallow: /forums/members/
Disallow: /forums/member.php
Disallow: /forums/calendar.php
Disallow: /forums/account/
Disallow: /forums/attachments/
Disallow: /forums/goto/
Disallow: /login/
Disallow: /forums/members/
Disallow: /forums/admin.php
Allow: /

Sitemap: http://{yourdomain.com}/forums/sitemap.xml


I didn't know if you knew anything about this. If you do...I'd be interested in hearing what your understanding is about this "invisible" forums directory for XF.

Thanks again for the help Jay!:)

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi Nick,

I wanted to let you know something about a rising "blocked by robots.txt" number in the graphs. From what I've learned over the years, once Google deems a page blocked, and if it's not linked to, that page will generally drop out of the index (if it was crawled and indexed) in about 90 days. So if none of my blocked pages had any links to them, that graph would go down rather quickly and suddenly. Just as suddenly as it went up. I do link to many of those pages though, so while some of them may disappear from the chart after 3 months, some won't. Also, Google decides to block pages at its own pace, so it can take some time.

I blocked the /whats-new/ directory in October of 2019 and it took until January of 2021 for most of the crawled pages to disappear from that chart. There were tens of thousands of them. That's why I say that directory is very bad. I also lost rankings due to blocking the pages, but it needed to get done. They all had a bit of pagerank and I basically cut them off, leaving them with no pagerank at all. Google didn't like that too much. But once Google cleared them out of its index and crawling routine, things got better. Also, (and this is debatable) pages with noindex on them get indexed. They just don't appear in the search results. I absolutely hate those pages and never let them get crawled. I have in the past on other sites and they've never benefited me in any way.

As far as the /forums/ directory - I've never heard of that. If you've got XenForo installed in the root directory, you shouldn't need to block that at all. I have a few installs in the root and I just checked the source code. The only /forums/ directory is the actual /forums/ directory that contains the nodes. There is no /forums/members/ or anything like that.

I hope this helps.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I'm always looking for more information...scouring the internet. There just isn't much out there that gives guidance on how to set up a robots.txt for forums (for maximizing SEO). Which is why it's great to hear from other XF users what they're doing.:)

  • If you wanted to see something interesting...check out google.com's robots.txt (very complex)!;)
  • Then take a look at Craigslist's robots.txt (very simple by comparison).
  • Then again...look at Facebook's robots.txt. Very interesting how Facebook has individual sections for each bot/crawler/spider it's trying to manage.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
When it comes to setting up the robots.txt file, I ask a few questions:

1. Do I want this page to appear in the search results? If no, block.
2. If I block this page, will I be blocking the only paths (links) to other pages I do want to appear in search results? If yes, don't block.
3. Is this page thin, system generated, or useless to search engines? If yes, block.
4. Is having this page crawled a waste of crawl budget? If yes, block.

You get the idea. If a page isn't going to appear in the search results and if it's not linking out to other essential pages, there's no reason in the world to have it crawled. People talk a lot about pagerank flow and all that. I would suggest that they concern themselves more with getting their rankings out of the basement first and then sculpting and optimizing their pagerank later on.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
2. If I block this page, will I be blocking the only paths (links) to other pages I do want to appear in search results? If yes, don't block.

Can you explain this one further please?

What would be an example of a page I wouldn't want to block? I know you mentioned pages that may have the only paths (links) to other pages.

For example:

If "Page A" contains the only link to "Page B"...then I think what you're saying is we wouldn't want to block "Page A" in the robots.txt. Is this correct?

Do we run into this sort of thing in a typical stock Xenforo installation?

Thanks.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi,

I would say that the pages I'm describing could be forum pages. The homepage links to forum pages and then the forum pages link to thread pages. We wouldn't want to block the forum pages. Nothing in my robots.txt file blocks any pages that link to others like what I'm referring to above.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Remember earlier in the thread I mentioned how some other "experts" told me that the "forums" directory for Xenforo is invisible or hidden (or something like that).

Looking at your robots.txt...there are quite a few directories listed that don't show up in the root directory (public_html directory) on my server either.

For example, none of the directories listed below are visible in my server's root directory (public_html directory):

Disallow: /account/
Disallow: /attachments/
Disallow: /conversations/
Disallow: /find-threads/
Disallow: /forums/
Disallow: /goto/
Disallow: /login/
Disallow: /logout/
Disallow: /lost-password/
Disallow: /members/
Disallow: /misc/
Disallow: /online/
Disallow: /posts/
Disallow: /profile-posts/
Disallow: /register/
Disallow: /search/
Disallow: /threads/
Disallow: /whats-new/

I'm assuming I have a typical/stock XF install. Are these directories visible in the root directory (public_html directory)...on your server...or are these directories "invisible"?

Thanks
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi - These directories are created on the fly by XenForo. You won't see them as actual folders on the server.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Ohh really...very interesting. Is this sort of thing common among websites...common for modern forum software...or something only Xenforo uses?

Thanks,

Nick
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Regarding your robots.txt file and what's listed in it: I noticed a small difference between what was posted on XenForo...and what was mentioned in this thread.

In this thread the line:

Disallow: /forum/search-forums/location

...was not included. Did you remove this from your robots.txt...or should I also add it to mine?

Also...on the Xenforo thread for your robots.txt...you mentioned the lines below. Each path starts with "forum". Above in this conversation you mentioned "forums" in the beginning of some of the paths. Which is correct..."forum" or "forums"?

Disallow: /forum/account/
Disallow: /forum/admin.php
Disallow: /forum/attachments/
Disallow: /forum/conversations/
Disallow: /forum/find-threads/
Disallow: /forum/forums/*/create-thread
Disallow: /forum/forums/*/post-thread
Disallow: /forum/goto/
Disallow: /forum/job.php
Disallow: /forum/login/
Disallow: /forum/logout/
Disallow: /forum/lost-password/
Disallow: /forum/members/
Disallow: /forum/misc/
Disallow: /forum/online/
Disallow: /forum/posts/
Disallow: /forum/profile-posts/
Disallow: /forum/register/
Disallow: /forum/search/
Disallow: /forum/search-forums/location
Disallow: /forum/threads/*/add-reply
Disallow: /forum/threads/*/approve
Disallow: /forum/threads/*/draft
Disallow: /forum/threads/*/latest
Disallow: /forum/threads/*/post
Disallow: /forum/threads/*/reply
Disallow: /forum/threads/*/unread
Disallow: /forum/whats-new/

Thanks,

Nick
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
One more question.

Looking at "djbaxter's" robots.txt I mentioned above...here are some lines not included in your robots.txt. Would be be a good idea to add these to both of our robots.txt?:

Disallow: /forums/tags/
Disallow: /forums/member.php
Disallow: /forums/calendar.php


I also noticed that many websites have the following as the last line in their robots.txt (it was in my previous robots.txt)...but I didn't see it in your robots.txt. Should we add this line?

Allow: /

Thanks,

Nick
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Apologies for so many questions. I just want to get my robots.txt as "right as possible" this time around.

Since it may take months for the results to finally be seen in the Google Search Console...I wanted to get started on the best foot possible now.:)

Thanks,

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Ohh really...very interesting. Is this sort of thing common among websites...common for modern forum software...or something only Xenforo uses?

Thanks,

Nick
Most content management systems don't use actual directories for browsing. Even with WordPress, you won't find hard directories (actual folders) for categories and posts, even though it looks like there are directories in the address bar at the top of your browser. Take a look at the top of the page these posts are written on. There's no "threads" directory to be found on the server itself. It's only after the URLs have been "rewritten" that the threads directory appears.
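If you're curious what makes that possible, it's the rewrite rules that ship with the software. Roughly speaking (this is a generic front-controller sketch, not XenForo's exact .htaccess), they look something like this:

# If the request doesn't match a real file or folder on disk,
# hand it to index.php, which builds /threads/..., /forums/...,
# /members/... and so on on the fly.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ index.php [NC,L]

That's why you see "directories" in the URL that never exist as folders on the server.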
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Regarding your robots.txt file and what's listed in it: I noticed a small difference between what was posted on XenForo...and what was mentioned in this thread.

In this thread the line:

Disallow: /forum/search-forums/location

...was not included. Did you remove this from your robots.txt...or should I also add it to mine?

Also...on the Xenforo thread for your robots.txt...you mentioned the lines below. Each path starts with "forum". Above in this conversation you mentioned "forums" in the beginning of some of the paths. Which is correct..."forum" or "forums"?

Disallow: /forum/account/
Disallow: /forum/admin.php
Disallow: /forum/attachments/
Disallow: /forum/conversations/
Disallow: /forum/find-threads/
Disallow: /forum/forums/*/create-thread
Disallow: /forum/forums/*/post-thread
Disallow: /forum/goto/
Disallow: /forum/job.php
Disallow: /forum/login/
Disallow: /forum/logout/
Disallow: /forum/lost-password/
Disallow: /forum/members/
Disallow: /forum/misc/
Disallow: /forum/online/
Disallow: /forum/posts/
Disallow: /forum/profile-posts/
Disallow: /forum/register/
Disallow: /forum/search/
Disallow: /forum/search-forums/location
Disallow: /forum/threads/*/add-reply
Disallow: /forum/threads/*/approve
Disallow: /forum/threads/*/draft
Disallow: /forum/threads/*/latest
Disallow: /forum/threads/*/post
Disallow: /forum/threads/*/reply
Disallow: /forum/threads/*/unread
Disallow: /forum/whats-new/

Thanks,

Nick
Regarding the Disallow: /forum/search-forums/location line in the first robots.txt file I shared, you can ignore that. I set up some search forums for one of my sites and they're unique to that site, so you don't need to worry about that.

Regarding the /forum/ at the beginning of the first robots.txt file I shared, if you've got your XenForo install in the root directory of your server, you can leave that part out. I installed mine in the /forum/ directory that I physically created on my server, so that's why that's there.

For instance, if you have your discussion board installed in the root, you should see something like:

https://mysite.com/threads/mythread.123/ (or something like that)

and

https://mysite.com/forums/myforum.123/ (or something like that)

What's important here is that after the .com, your pages begin without an additional directory.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
One more question.

Looking at "djbaxter's" robots.txt I mentioned above...here are some lines not included in your robots.txt. Would be be a good idea to add these to both of our robots.txt?:

Disallow: /forums/tags/
Disallow: /forums/member.php
Disallow: /forums/calendar.php


I also noticed that many websites have the following as the last line in their robots.txt (it was in my previous robots.txt)...but I didn't see it in your robots.txt. Should we add this line?

Allow: /

Thanks,

Nick
djbaxter likes to block his tags. Many people do that, but many others like to have them crawled. I actually see most people allowing them to be crawled and it seems to be working out for them. Even the guys at XenForo have their tag pages being indexed by Google and I see them appear in search results quite often. If we let them be crawled? Something terrible will probably happen. I've added a few tags on one of my sites to see what will happen - whether or not they'll get indexed. We'll see. But if you don't use tags on your site, you shouldn't even need to worry about this addition to the robots.txt file.

If you have your URLs being rewritten in the Setup > Options > Search Engine Optimization (SEO) area (under "Use Full Friendly URLs") of your admin panel, I don't believe you need to block the member.php and calendar.php files. I may be wrong here, but I haven't seen those two in any page source I've viewed.

Also, I have no idea why people use the generic and broad Allow: / directive. That says "allow everything," which is set by default. If you'd like to allow certain URLs in a particular directory that you've already blocked, then using the Allow directive is appropriate. But that's more advanced and I'm pretty sure you don't need to worry about that.
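Just so the pattern is clear, here's a made-up example of the one case where an Allow line earns its keep - carving a specific path back out of a directory you've otherwise blocked (the /members/staff/ path is purely hypothetical):

User-agent: *
Disallow: /members/
Allow: /members/staff/

Everything under /members/ stays blocked except that one sub-path. Outside of something like that, the directive isn't doing anything for you.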
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Most content management systems don't use actual directories for browsing. Even with WordPress, you won't find hard directories (actual folders) for categories and posts, even though it looks like there are directories in the address bar at the top of your browser. Take a look at the top of the page these posts are written on. There's no "threads" directory to be found on the server itself. It's only after the URLs have been "rewritten" that the threads directory appears.
Very interesting...thanks for explaining. Didn't realize this was possible.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I installed mine in the /forum/ directory that I physically created on my server, so that's why that's there.
Thanks. I did figure this one out...after registering here. Your server actually has a "hard"/actual "forum" folder...compared to mine, which does not.:)
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
djbaxter likes to block his tags. Many people do that, but many others like to have them crawled. I actually see most people allowing them to be crawled and it seems to be working out for them. Even the guys at XenForo have their tag pages being indexed by Google and I see them appear in search results quite often. If we let them be crawled? Something terrible will probably happen.
Kinda sounds like some sites block tags...some don't.

I've added a few tags on one of my sites to see what will happen - whether or not they'll get indexed. We'll see.
If you see any impact...would love to hear the results (good or not good).

If you have your URLs being rewritten in the Setup > Options > Search Engine Optimization (SEO) area (under "Use Full Friendly URLs") of your admin panel, I don't believe you need to block the member.php and calendar.php files. I may be wrong here, but I haven't seen those two in any page source I've viewed.
I'll have to check how I have things setup.

Also, I have no idea why people use the generic and broad Allow: / directive. That says "allow everything," which is set by default.
Hmmm. I wonder if this is more "old-school" thinking...and no longer necessary if it's already the default.

I guess it really doesn't hurt to have it in the robots.txt...since all the "disallows" are in there already. But if it is in there...sounds like it's kind of redundant.

I'll have to get things rolling on my end with an updated robots.txt...then "wait the long wait"...until the site is re-crawled a bunch of times...and see how things change in Google Search Console.

Thanks for all the help.:)
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
I guess it really doesn't hurt to have it in the robots.txt...since all the "disallows" are in there already. But if it is in there...sounds like it's kind of redundant.
I would actually not put the Allow: / directive in the robots.txt. The reason is that search engines read directives in a certain order. It's entirely possible that you essentially say, "Hey, please don't crawl these pages." in a bunch of statements and then say after that, "But ignore what I just said above and go ahead and crawl everything." That is, if you add the Allow: / (allow everything) in the incorrect spot. Again, unless you need to allow a specific page or subdirectory that's contained within a directory that's been previously blocked, it isn't necessary.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
I also wanted to mention one other thing I was thinking of this morning that might reinforce my claims about Google punishing websites that are allowing many garbage pages to be crawled. Back in November of 2019, after reviewing the huge number of links that XenForo includes on virtually every single one of their pages, I decided it was time to strip some of them out. After all, on WordPress blog pages, there are only a handful of links, making crawling and indexation very efficient. On XenForo pages, there are hundreds of links, everything from 301 redirects to the most recent posts, member profiles, attachments, more 301 redirects, reactions, etc... I think I counted a single page's links one day and for a page that included 100 crawlable forum links, there were a total of 650 additional links that didn't need to be crawled. If Google divides its pagerank between links per page, I'd much rather have that pagerank divided between 100 links and not 750. Anyway, that's a story for another time. If you ever become interested in pagerank flow optimization and website link structure optimization, please ask. I can go on and on about this. We'll need to start a new thread though.

What I did was allow everything to be crawled by blocking nothing in my robots.txt file. Google had free rein of the site. I also removed all the non-essential links from every page of the website. In the beginning, Google continued to crawl all the good pages (homepage, forum pages, thread pages) as well as the bad (previously crawled member, attachment pages, etc...most of those bad pages were returning 403 errors). My Valid Pages and Crawl Rate stats were terrible. From the graphs in the Search Console, I could see that many of the good pages were being crawled, but that they weren't making it into the actual index as valid pages that would appear in the search results. At the time, I didn't think much of it. It wasn't until a few months later (February of 2020) that I noticed Google growing tired of those undesirable pages I no longer linked to. It reduced its crawl rate of those bad pages and began increasing its crawl rate of the good pages. Simultaneously, as more of the good pages began getting crawled, the Valid Pages graph in the Google Search Console began rising. It continued to rise weekly for about a month until I became impatient and added those bad links back to the page templates. One week later, my Valid Pages graph dropped like a rock.

I suspect the same thing is occurring now. The only difference is that I'm leaving many of the bad links, but am physically blocking their crawling via the robots.txt file. I'm not sure which is a better method - removing the links and allowing everything to be crawled or keeping the links and blocking them. It's a balance between the features XenForo offers to website visitors and SEO. Either way, it does appear that Google does not like crawling bad pages and refuses to include good pages in its index when it does.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Lots of details there...very interesting...thanks for sharing.

You would think that if each XenForo page has all these crawlable links on it (some good & some not so good)...and if the developers of XenForo are concerned about SEO...maybe they could do more about this (optimizing things).

But like you said...maybe it's a balance between Xenforo features & SEO.

With what you said about all of the links on each page for Xenforo (good & bad links getting crawled)...is the robots.txt setup we've been talking about (above in this conversation) taking any of this into consideration? Or is blocking these bad links from being crawled something that would need to be added to the robots.txt in the future (if it makes sense to block them)?

Thanks
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Lots of details there...very interesting...thanks for sharing.

You would think that if each XenForo page has all these crawlable links on it (some good & some not so good)...and if the developers of XenForo are concerned about SEO...maybe they could do more about this (optimizing things).

But like you said...maybe it's a balance between Xenforo features & SEO.

With what you said about all of the links on each page for Xenforo (good & bad links getting crawled)...is the robots.txt setup we've been talking about (above in this conversation) taking any of this into consideration? Or is blocking these bad links from being crawled something that would need to be added to the robots.txt in the future (if it makes sense to block them)?

Thanks
I think the best way to go about things is to keep the default templates the way they are and just block certain pages in the robots.txt file the way we've been discussing. That tactic is apparently tried and true by some of the largest and most successful websites out there that are using XenForo. Really, some of us have no choice as far as blocking goes because we've been seeing our valid pages dropping.

Over time, once the valid pages begin increasing (hopefully), it would most likely be beneficial to begin analyzing which links aren't being clicked on at all, or very much, by website guests and then begin pruning them. I have done some of this and know each and every link that can be removed. I have the template code and everything. The way I did it is to code the template to say, "If user isn't logged in, remove link. If user is a registered member and is logged in, show the link." It's pretty easy and most of these links I'm referring to are never clicked on by anyone.
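If you ever do want to try it, the conditional itself is short. As a rough sketch only (this is XenForo 2 template syntax as I understand it, wrapped around whatever link markup you want hidden from guests):

<xf:if is="$xf.visitor.user_id">
    <!-- link markup that only logged-in members (and not crawlers) will see -->
</xf:if>

Guests and Googlebot get the page without those links; logged-in members see everything as usual.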

For you, I'd say just keeping the templates as they are right now is best. Just update your robots.txt file. Later on, you can get more sophisticated about your pagerank sculpting. But to answer your question more directly, the robots.txt file I suggested does block all of these bad links, whether they're removed from the templates or not.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
For you, I'd say just keeping the templates as they are right now is best. Just update your robots.txt file. Later on, you can get more sophisticated about your pagerank sculpting. But to answer your question more directly, the robots.txt file I suggested does block all of these bad links, whether they're removed from the templates or not.
Yeah modifying templates is not something I want to do at this point. That's really getting technical...and I certainly wouldn't want to mess up any of the template code...start getting server errors...and not know what's up.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I would actually not put the Allow: / directive in the robots.txt. The reason is that search engines read directives in a certain order. It's entirely possible that you essentially say, "Hey, please don't crawl these pages." in a bunch of statements and then say after that, "But ignore what I just said above and go ahead and crawl everything." That is, if you add the Allow: / (allow everything) in the incorrect spot. Again, unless you need to allow a specific page or subdirectory that's contained within a directory that's been previously blocked, it isn't necessary.
I was looking into this a bit further.

Here's a "Google Search Central" link (and screenshot)...which seems to indicate how the "Allow: /" command in the robots.txt can be used to define/control just the single robots.txt line before it.

Thus I'm assuming, in theory...multiple "Allow:" lines could be used in the same robots.txt.

https://developers.google.com/search/docs/advanced/robots/create-robots-txt

Screen Shot 2022-01-03 at 4.10.47 PM.png
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Yes, I've seen multiple Allows in a single robots.txt file. I usually use the directive to allow "User-agent: Mediapartners-Google" to crawl the entire site to scan the page text to show the most relevant ads. I then use a different set of directives for the regular Googlebot crawler.

The thing is, the default state is to allow all crawlers to access everything on a website. You don't even need a robots.txt file for that. In the typical XenForo setup scenario, the goal is to restrict crawling, so you would add directives to do that. You can't "more allow" a crawler to access a website. But yes, I totally get that using the Allow: / directive is good for slicing and dicing certain parts of a website for certain crawlers. I just think the average user would likely screw things up and do more damage than good. I happen to know of one XenForo forum that has about 20 restrictive commands in their robots.txt file and none of them work because the syntax was written incorrectly. I don't have the heart to tell them about it because it's just not my place.
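For what it's worth, here's roughly what that two-group setup looks like (a sketch only - the second group's Disallow list would be your full list, trimmed here for brevity):

User-agent: Mediapartners-Google
Allow: /

User-agent: *
Disallow: /members/
Disallow: /attachments/
Disallow: /whats-new/

Crawlers follow the most specific group that matches them, so Mediapartners-Google gets the whole site while everything else gets the restricted set.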
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Yes, I've seen multiple Allows in a single robots.txt file.
Cool...thanks for verifying. I've never seen a robots.txt with multiple "Allows" (not that I've looked super hard);)...so good to know (in theory) that it can be done (if done correctly).
The thing is, the default state is to allow all crawlers to access everything on a website. You don't even need a robots.txt file for that. In the typical XenForo setup scenario, the goal is to restrict crawling, so you would add directives to do that. You can't "more allow" a crawler to access a website.
This is what I understood as well. And like you mentioned previously...if the default state is to allow everything...it's kind of confusing why many robots.txt files have a single "Allow: /" as the last line. Even XenForo does this.

Maybe one of us should post this question at XF (Why is there an Allow: / at the end of their robots.txt?). Or maybe this question..."Do we need an Allow: / at the end of robots.txt...and if so...why?"

Maybe someone out there has a good reason/explanation for it.:)

Thanks
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Maybe one of us should post this question at XF (Why is there an Allow: / at the end of their robots.txt?). Or maybe this question..."Do we need an Allow: / at the end of robots.txt...and if so...why?"
That's not a bad idea. I'd love to know what they say. I have a hunch it's sort of a follow-the-leader type of situation, but I may be wrong.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
UPDATE: Google updated its Search Console graphs today and here is what I have to report. The numbers have improved for every one of my sites. I'll use one site as an example.

Valid Pages continues to climb.

valid-pages-google-search-console.gif

Blocked by Robots continues to climb. Virtually all blocked pages are attachments and members. I expect this graph to begin decreasing in a few months as the URLs fall out of the index.

blocked-by-robots-google-search-console.gif

Crawled - Currently Not Indexed continues to fall.

crawled-currently-not-indexed-google-search-console.gif

Discovered - Currently Not Indexed continues to fall.

discovered-currently-not-indexed-google-search-console.gif

It seems like everything is heading in the right direction.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
That's really awesome information Jay...thanks for sharing!:)

I'm assuming the "blocked by robots" statistic initially increases as the Google crawler initially identifies them (especially when a newly modified robots.txt is implemented). Then as the Google crawler performs future crawls...see's the same pages blocked by robots.txt...the Google system drops those pages...and this is why the "blocked by robots.txt statistic gets smaller & smaller over time.

Does this sound correct?

Thanks

p.s. How do I find the "blocked by robots" statistic in Google Search Console?

Update: I may have found it...but not 100% sure.

Google Search Console >> Coverage >> click on "Valid with Warning" >> click on "blocked by robots.txt" warning in the "Details" list.

Is this the correct location?
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I finally updated my robots.txt today...using what was suggested way up in post #6. I also removed the "Allow: /" to see if this helps as well. Hopefully I see similar results across the board!

I think another useful statistic from Google Search Console is the "Average Position" statistic. With all our SEO efforts...the net result we want is to land higher & higher in Google search results (Page #1 ideally...and #1 search position the ultimate goal)!:)

To find it...go to:

Google Search Console >> Performance >> Average Position

This statistic may or may not be useful for all websites (small vs. medium vs. large). But it can be something to keep an eye on, especially as a site gets bigger & sees more traffic!:)
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
I'm assuming the "blocked by robots" statistic initially increases as the Google crawler initially identifies them (especially when a newly modified robots.txt is implemented). Then as the Google crawler performs future crawls...see's the same pages blocked by robots.txt...the Google system drops those pages...and this is why the "blocked by robots.txt statistic gets smaller & smaller over time.
Spot on. Through the years, Google identified a whole bunch of 403 pages (members/attachments) and just recorded them somewhere. Now that they're blocked, those pages are being identified as such. From past experience, Google seems to give up on blocked pages and that's why the graph begins to trail downward.

To find these graphs, log into your Google Search Console. Then, click on Coverage. After that, click on the Valid tab to the right and the Excluded tab to the right of that. It's under the Excluded tab you'll find the bottom three graphs I posted above.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
I finally updated my robots.txt today...using what was suggested way up in post #6. I also removed the "Allow: /" to see if this helps as well. Hopefully I see similar results across the board!

I think another useful statistic from Google Search Console is the "Average Position" statistic. With all our SEO efforts...the net result we want is to land higher & higher in Google search results (Page #1 ideally...and #1 search position the ultimate goal)!:)

To find it...go to:

Google Search Console >> Performance >> Average Position

This statistic may or may not be useful for all websites (small vs. medium vs. large). But it can be something to keep an eye on, especially as a site gets bigger & sees more traffic!:)
Yeah, hopefully something will happen. If there is a move in Average Position, it'll most likely occur after some sort of an update from Google. I haven't seen anything good happen yet. It's maddening, but it could take months.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
To find these graphs, log into your Google Search Console. Then, click on Coverage. After that, click on the Valid tab to the right and the Excluded tab to the right of that. It's under the Excluded tab you'll find the bottom three graphs I posted above.
Awesome thanks...and thanks for confirming. Poking around I almost found it on my own. The only part I was missing was clicking the "Excluded" tab to the right.

I'm going to screen shot this ASAP...since as you know from my old robots.txt...it basically wasn't blocking anything! Lol
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
If you want to see something sort of "funny"...here's a screenshot of what my "blocked by robots.txt" from Google Search Console looks like at the moment.

As we expected (looking at my old robots.txt)...it really wasn't blocking anything. Which of course this graph proves!;)

Looking at the file date of my old robots.txt...it was put in place in July 2021 (call it about 5-6 months ago). As can be seen from the graph...it was still showing close to 300 pages blocked in early October...but then it tapered off to almost nothing by the end of October (3 months)!

Thus as you've been saying...looks like it took the Google crawler about 3 months to fully "do its thing"...following what was in my old robots.txt.

Screen Shot 2022-01-04 at 6.25.32 PM.png
 
  • Like
Reactions: JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
If you want to see something sort of "funny"...here's a screenshot of what my "blocked by robots.txt" from Google Search Console looks like at the moment.

As we expected (looking at my old robots.txt)...it really wasn't blocking anything. Which of course this graph proves!;)

Looking at the file date of my old robots.txt...it was put in place in July 2021 (call it about 5-6 months ago). As can be seen from the graph...it was still showing close to 300 pages blocked in early October...but then it tapered off to almost nothing by the end of October (3 months)!

Thus as you've been saying...looks like it took the Google crawler about 3 months to fully "do its thing"...following what was in my old robots.txt.

View attachment 285
And what's weird is that the graph didn't begin to change for me until almost a month after I started blocking everything. It actually continued going down until it started going up weeks later. I thought something was wrong. Like I said though, almost all of the pages being blocked on my sites now are member pages and attachment pages. Both were showing 403 header response codes.

In the same "Excluded" area, there's another graph called Blocked Due to Access Forbidden (403). This is what it looks like for one of my sites.

blocked-due-to-access-forbidden-google-search-console.gif

I wonder what yours looks like. Since your pages were being crawled, your graph should be increasing.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
...your graph should be increasing.
That's exactly what it looks like.

My graph for "Blocked due to access forbidden (403)"...is the exact opposite trend compared to yours. The number of pages is increased...with the highest number of pages right now (latest crawls).
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
I was thinking about your site and your graphs last night. I figured that your 403 pages should be increasing, along with your Crawled but Not Indexed and Discovered but Not Indexed counts, and that your valid pages should be dropping. But then I remembered that you moved your site over from a different forum software, so your graphs may not be as clear as mine are. I've been running XenForo from the beginning. Your site is going through different things because of all those redirects from one to the other.

If you want, you're welcome to post any chart (graph - whatever they're called) here as a baseline. Then, through the months, you'll have something to look back on to see if there have been changes. In the Search Console, they only give you three months to look back on.
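If you'd rather log numbers than rely on screenshots, even a tiny script that appends whatever you read off the reports to a CSV file will give you a history the console won't keep for you. This is only a sketch - the file name and the fields are made up, so swap in whatever you actually track:

import csv
import os
from datetime import date

# Numbers typed in by hand from the Search Console reports.
row = {
    "date": date.today().isoformat(),
    "valid_pages": 1234,
    "blocked_by_robots": 800,
    "blocked_403": 4500,
}

log_file = "gsc-history.csv"
write_header = not os.path.exists(log_file)

with open(log_file, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row.keys()))
    if write_header:
        writer.writeheader()
    writer.writerow(row)

Run it each time you check the console and you'll end up with more history than the three months Google shows you.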
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
But then I remembered that you moved your site over from a different forum software, so your graphs may not be as clear as mine are.
Your site is going through different things because of all those redirects from one to the other.
My site was migrated to XF 2+ years ago. But mistakes on my part...and maybe by the person who did the migration...are why things are messed up at the moment.

After the site was migrated...traffic went down about 50% overnight. Not sure what the person did (or didn't do)...but something wasn't right from the start. Then efforts on my part to turn things around may have led to additional issues.:(
If you want, you're welcome to post any chart (graph - whatever they're called) here as a baseline. Then, through the months, you'll have something to look back on to see if there have been changes. In the Search Console, they only give you three months to look back on.
Sure will. I did notice Search Console only goes back 3 months (too bad). Going back at least a year would be nice.

I'll be taking screenshots along the way...so I should have some sort of graph history to fall back on.:)

Thanks.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Just checked my "Blocked by robots.txt" statistic/graph in Google Search Console (4 days since implementing the massively reworked robots.txt). It appears the site must have been crawled the very same day...since the first "blip" on the chart was also Tuesday, January 4th (the same day the new robots.txt was put in place):

Screen Shot 2022-01-08 at 4.59.55 PM.png

I'm assuming/expecting this number will continue to grow.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Wow, what a jump. You obviously have a large site. Yes, that should grow tons in the next few months, until you run out of pages to block.

Here's one of my tiny sites (they're all tiny) from today:

valid-pages.gif

blocked-by-robots.gif

Just last night, and only for this site, I unblocked all the 301 redirected links that are contained within. I just have to know which one is causing the issues. Is it the 403 pages or the 301 redirects? Or both? I'll leave it like this for a month or so to see the trajectory.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Yes, that should grow tons in the next few months, until you run out of pages to block.
From what you've seen with your sites...what should the process/progression be for these robots.txt blocked pages over time?

Will Google crawl these pages again (plus more the next crawl)...or does Google say "Ok these 800+ pages are blocked by robots...I'll skip these...and start where I left off".

Then maybe next update in Google search console the 800 blocked pages may grow to 1200 (for example)?

When you say "run out of blocked pages". Does this mean when Google has finally crawled the whole site...has found each & every page that's supposed to blocked by robots.txt...and this is theoretically the end for total robots.txt blocked pages (will see the highest number of blocked pages on the graph)?

If this is true...then what happens:

1. Does this peak value for blocked pages stay there forever (unless something gets modified)?
2. Or does the blocked pages number start to decrease as Google says..."Ok, I've crawled or noted these blocked pages enough now...and will no longer track them"?

Am I even using the proper terminology? Lol

I just have to know which one is causing the issues. Is it the 403 pages or the 301 redirects? Or both? I'll leave it like this for a month or so to see the trajectory.
What issues are you referring to? If corrected...what will improve? Just asking to understand things better.:)

I've always heard 4xx errors are more important to fix. Here's how 403 errors are described (from an internet search via this link):

https://www.howtogeek.com/357785/what-is-a-403-forbidden-error-and-how-can-i-fix-it/

"A 403 Forbidden Error occurs when a web server forbids you from accessing the page you’re trying to open in your browser. Most of the time, there’s not much you can do. But sometimes, the problem might be on your end."

I got a lot of these 403 errors...and was hoping a lot of them would get fixed with my updated robots.txt.

I'm reading that 301 redirects are when an old page has moved to a new/updated URL. Does this happen on our sites (and we need to write a redirect to correct things)...or are 301's mostly when we have a link on our sites...and when the link is clicked a visitor is sent to the wrong/old/bad URL? Or am I confused as far as 301's are concerned?;)
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
From what you've seen with your sites...what should the process/progression be for these robots.txt blocked pages over time?

Will Google crawl these pages again (plus more the next crawl)...or does Google say "Ok these 800+ pages are blocked by robots...I'll skip these...and start where I left off".

Then maybe next update in Google search console the 800 blocked pages may grow to 1200 (for example)?
Okay, just for some background, I've run classifieds websites since 2004. Classifieds are notorious for having oodles and oodles of junk pages as well as lots of options for users to take advantage of. They're sort of like forums. Tons of great stuff (links and pages) for web surfers, but a real mess for search engines to try to figure out. It's this experience I use to guide me today. I have done things that have caused my rankings to utterly collapse and then to scream back better than they ever were. I've dealt with this sort of thing way too much. Just so you know where I'm coming from.

I can tell you with certainty that I have never had luck allowing pages that are thin (respond to seller, print friendly, member pages, etc...) or with noindex on them to be crawled. Some people say, "Yeah, put the noindex attribute on it and it should be okay." If you're dealing with duplicate content and that's what you're suffering from, that's one thing. In our cases, that's not the problem. We're dealing with not only crawl budget issues, we're also dealing with crawl demand issues. Google seems to have plenty of budget for our websites. The problem is, when it does crawl many of XenForo's pages, it says, "Yuck, I'll keep crawling slowly, but I'm not going to include much in my actual index." That's what those two graphs in the Search Console are telling us. The Crawled but Not Indexed and Discovered but Not Indexed. Yes, Googlebot has the willingness to crawl all those pages (albeit very lazily), but it's simply not indexing them due to other reasons. Those reasons seem to include low quality pages, such as ones that return 403 and perhaps 301 responses. I'm sure of the 403 and I'm still testing the 301.

To answer your first question, there are a few different factors at play. If you're running the default template, you've likely got a link at the top that says What's New. The pages contained in that directory contain the noindex tag, so you won't see them in any search results. On some of my sites, Googlebot never had any interest in crawling any of them. On another, it crawled over 200,000 of them. 200,000 you ask? Yes. The pages contained within that /whats-new/ directory replicate based on user sessions (or simple clicks). If you go in there and click the New Posts link and then click out somewhere else and then go back in and click that New Posts link again, you'll see the URL change. There's a new number in it. Now, pretend you're a search engine clicking and clicking and clicking. Googlebot goes in once and sees one URL and then goes back and sees another, thinking it's unique. This goes on forever. It'll crawl those URLs all day long. It's a wonder it has time to crawl anything else. This directory is absolutely the most important to be blocked.

There is one link pointing to this directory that's got all the crawled pages contained within it. Google knows how many pages it crawled. In my case, it was around 200,000. That's a shame because in actuality, there were only a handful of unique pages. Anyway, once that directory is blocked, Googlebot will stop crawling it, and the URLs it did crawl will appear on your Blocked by robots.txt graph. Over time (let's pretend that's the only directory you blocked), that graph will grow as Google sees each URL it has already crawled on its scheduler. When a URL appears on the scheduler and it's blocked in the robots.txt file, it'll end up on the graph and it won't need to be crawled again (or appear on the scheduler again). It'll just sit there in stasis. After about 3 months, it'll fall out of the graph. I've seen this time and time again and it does seem to be around 90 days. So, for example, if every single crawled page in that directory appears on your graph tomorrow, the graph would jump to 200,000 pages, not grow any more from that, and then, after three months, fall back to zero. The pages would be gone. History, as if they were deleted. The thing is, the only reason those URLs would fall out of the graph is because they're not linked to anymore. The only one that's linked to is the What's New link in the header of the site. So the moral of this story is, when pages aren't linked to (when Googlebot can't get to a link), the page will eventually fall out of the graph.

Other links, such as the member pages and the attachment pages (if you're using them) are still linked to, so they may stay in the graph forever. Googlebot will see those links and always want to crawl them, so they'll hang around. Although, from what I've been seeing in recent years, they do seem to be disappearing like the others do. So yes, you'll see that graph rise as far as it can go, meaning, Google will block as many pages as you're telling it to block (it can only block so many - you have a finite number of pages) and then the graph should plateau and then begin to decrease. That is, unless you've got an insanely busy forum that is creating large numbers of new pages to block everyday.

Also, yes, it'll say, "I've already seen that these pages are blocked, so I'll move onto other ones." The process seems to accelerate and then decelerate, depending on the day. Some of mine are being blocked at a rate of 10% growth some days and then 100% others. I guess it depends where they are on the scheduler. Also, as time is going on, my crawl rate is increasing, so I think the scheduler is picking up steam.
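To put the blocking part in one place, a bare-bones robots.txt along the lines we've been talking about might look something like this. I'm assuming the forum runs in the web root (prepend your /forum style path if it doesn't), and the sitemap URL is just a placeholder:

User-agent: *
Disallow: /whats-new/
Disallow: /members/
Disallow: /attachments/
Disallow: /account/

Sitemap: https://www.example.com/sitemap.xml

That only covers the 403-type directories and the /whats-new/ mess. Whether to block the redirect URLs too is a separate decision I'll get to below.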

When you say "run out of blocked pages". Does this mean when Google has finally crawled the whole site...has found each & every page that's supposed to blocked by robots.txt...and this is theoretically the end for total robots.txt blocked pages (will see the highest number of blocked pages on the graph)?
Yes.

If this is true...then what happens:

1. Does this peak value for blocked pages stay there forever (unless something gets modified)?
2. Or does the blocked pages number start to decrease as Google says..."Ok, I've crawled or noted these blocked pages enough now...and will no longer track them"?
I talked about this above. Pages that are contained within a sealed directory like the /whats-new/ directory will eventually disappear in their entirety while other accessible URLs may hang around a lot longer. The very first Pagerank algorithm stated that pages that are blocked won't be counted in the website's overall Pagerank score though, so that's good. Who knows if that still holds today.

I've always heard 4xx errors are more important to fix. Here's how 403 errors are described (from an internet search via this link):
In your website's permissions, you've most likely got it set so guests (folks who aren't logged in, such as search engines) can't see member pages and attachment pages. When someone clicks on a member avatar or link and tries to see their profile, they're sent to a "You Must Log In to Do This" page as opposed to the member's profile page. The status code of that login page is a 403 as opposed to a 200. All that means is that the user needs authentication to see the content of the page. Also, when a user tries to click on an image and they're not logged in, they'll go to the same login page. More 403s. In Google's eyes, 403, 401, 410, and 404 are all the same thing. They're dead pages that can't be indexed. These are the ones that reduce crawl demand. If you ran a previous forum that didn't use these 403 pages, your crawl rate may have been 10,000 pages per day. Now that you've got 50% of your crawled pages showing 403 status codes, your crawl rate is probably about a tenth of that. Error status codes kill crawl demand. Not crawl budget - your budget (possible pages crawled) is big, but the demand is small. You're essentially linking to pages that aren't there and Googlebot doesn't like it at all.

Now, 301 response codes mean that the page has permanently moved. Google treats these just fine. It'll see a link to a page that's been moved, follow the 301 redirect, and then visit the new URL. It's perfectly normal. The old URL is like a link to the new one. The problem is, XenForo forums use tons of these redirects all throughout their websites for various reasons.

Take a look at these four links that are on my website here:

https://gaulard.com/forum/threads/91/
https://gaulard.com/forum/threads/91/latest
https://gaulard.com/forum/threads/91/post-330
https://gaulard.com/forum/goto/post?id=329

Go ahead and click on them and you'll see they go to the same page. The URL may have a hashtag in it, but Google ignores those. In the case of these 301 redirects, Google doesn't have a problem with them. The problem we're having is that because there are so many of them, Googlebot is crawling them instead of crawling our good canonical URLs. Basically, we're burning through our allotted crawl budget because of these redirects.
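If you want to check those response codes yourself, a few lines of Python (standard library only) will print the status and the Location header for each URL. Just a quick sketch - swap in whatever URLs you're curious about:

import http.client
from urllib.parse import urlparse

urls = [
    "https://gaulard.com/forum/threads/91/",
    "https://gaulard.com/forum/threads/91/latest",
    "https://gaulard.com/forum/threads/91/post-330",
    "https://gaulard.com/forum/goto/post?id=329",
]

for url in urls:
    parts = urlparse(url)
    connection = http.client.HTTPSConnection(parts.netloc, timeout=10)
    path = parts.path + ("?" + parts.query if parts.query else "")
    # http.client does not follow redirects, so a 301 shows up as a 301
    # along with the URL it points to in the Location header.
    connection.request("HEAD", path)
    response = connection.getresponse()
    print(response.status, response.getheader("Location"), url)
    connection.close()

The redirecting URLs should show their target in the Location column; that target is the canonical thread URL Googlebot eventually ends up on.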

Now, what I'm testing currently is whether the crawl demand will pick back up because the 403 pages are being blocked. Will I get back to a healthy demand and will those 301 redirects not really have an impact? Time will tell.

I'm reading that 301 redirects are when an old page has moved to a new/updated URL. Does this happen on our sites (and we need to write a redirect to correct things)...or are 301's mostly when we have a link on our sites...and when the link is clicked a visitor is sent to the wrong/old/bad URL? Or am I confused as far as 301's are concerned?
All the 403 and 301 URLs we are seeing are caused by the software. They're meant to be there. They improve functionality. It's not server related at all. With your updated robots.txt file you have now, all crawling for all of them should stop. You should see those two graphs begin to decline within weeks. The only thing that concerns me has to do with Pagerank flow because of blocking those 301s. We can discuss that in another thread though. Or here. If you're interested in it, just ask.

I hope this helps. If you need anything clarified further, just let me know. Talk about long winded!
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
On some of my sites, Googlebot never had any interest in crawling any of them. On another, it crawled over 200,000 of them. 200,000 you ask? Yes. The pages contained within that /whats-new/ directory replicate based on user sessions (or simple clicks). If you go in there and click the New Posts link and then click out somewhere else and then go back in and click that New Posts link again, you'll see the URL change. There's a new number in it. Now, pretend you're a search engine clicking and clicking and clicking. Googlebot goes in once and sees one URL and then goes back and sees another, thinking it's unique. This goes on forever. It'll crawl those URLs all day long. It's a wonder it has time to crawl anything else. This directory is absolutely the most important to be blocked.
This is great to know. The /whats-new/ directory is definitely in my robots now!
There is one link pointing to this directory that's got all the crawled pages contained within it. Google knows how many pages it crawled. In my case, it was around 200,000. That's a shame because in actuality, there were only a handful of unique pages. Anyway, once that directory is blocked, Googlebot will stop crawling and those pages it did crawl and the URLs will appear on your Blocked by robots.txt graph. Over time, (let's pretend that's the only directory you blocked) that graph will grow as Google sees each URL it has already crawled on its scheduler. When a URL appears on the scheduler and it's blocked in the robots.txt file, it'll end up on the graph and it won't need to be crawled again (or appear on the scheduler again). It'll just sit there in stasis. After about 3 months, it'll fall out of the graph. I've seen this time and time again and it does seem to be around 90 days. So, for example, if every single crawled page in that directory appears on your graph tomorrow, the graph would jump to 200,000 pages, not grow any more from that, and then, after three months, fall back to zero. The pages would be gone. History, as if they were deleted. The thing is, the only reason those URLs would fall out of the graph is because they're not linked to anymore. The only one that's linked to is the What's New link in the header of the site. So the moral of this story is, when pages aren't linked to (when Googlebot can't get to a link), the page will eventually fall out of the graph.
I believe you explained portions of this previously...now that all this info is in one paragraph...it's easier to get the full picture.

I think what's a shame (or confusing) for folks who aren't experts at this: after the 3 months when pages start dropping off the "Blocked by robots.txt" graph...let's say 6, 9 or 12 months later someone who runs a site (but doesn't check Google Search Console "GSC" very often)...goes to GSC...then goes to their "Blocked by robots.txt" graph...and it shows zero pages (or maybe a very small number).

Let's say they have a good sized site...they know they have a good robots.txt that blocks the important areas...and GSC "blocked by robots.txt" is small. The first thing the person might think is..."What the heck...is my robots.txt file not blocking anything...or blocking very little?"

Just a theory. Maybe all the actual historical data is still available at Google somewhere...but because Google only gives us 3 months of historical data in the GSC...we no longer see the older data when there were a lot of blocked pages (and then they were slowly de-linked). I think having access to more historical data would help alleviate some confusion (especially among site owners with less expertise).

Maybe Google limits the GSC data to 3 months...and purposely crawls sort of slowly (like you said above, Google has plenty of crawl budget). Maybe Google is concerned that with more than 3 months of data in the GSC (or if Google crawled sites more quickly)...some smart folks out there would be able to run more "experiments" (tweaking settings on their website/websites)...they'd gain more insight into how the Google crawler works...then of course do better in the rankings & search results.

Then of course (as I'm sure you know)...Google keeps "moving the cheese"...with the constant minor & major algorithm updates!:( Just when you think you got it figured out...Google changes something!;)
Other links, such as the member pages and the attachment pages (if you're using them) are still linked to, so they may stay in the graph forever. Googlebot will see those links and always want to crawl them, so they'll hang around. Although, from what I've been seeing in recent years, they do seem to be disappearing like the others do. So yes, you'll see that graph rise as far as it can go, meaning, Google will block as many pages as you're telling it to block (it can only block so many - you have a finite number of pages) and then the graph should plateau and then begin to decrease. That is, unless you've got an insanely busy forum that is creating large numbers of new pages to block everyday.

Also, yes, it'll say, "I've already seen that these pages are blocked, so I'll move onto other ones." The process seems to accelerate and then decelerate, depending on the day. Some of mine are being blocked at a rate of 10% growth some days and then 100% others. I guess it depends where they are on the scheduler. Also, as time is going on, my crawl rate is increasing, so I think the scheduler is picking up steam.
Sounds like lots of factors go into this. And of course on a forum that's busy...lots of new stuff being created...and older stuff dropping off as Google crawls.

I'm thinking about internet forum websites compared to WordPress websites...and then their relationship with the Google Crawler.

With forums...a busy forum can generate lots of new pages each day for Google to crawl (maybe not always the best/"thin" content). Versus a WordPress blog website...where an active blogger might add 3 pages/week...or very active blogger maybe 1 new page/day. The WordPress pages are probably "better/richer" content...but fewer.

I wonder how the Google crawler sees this (like/dislike). 3 new pages/week with a WordPress blog site...versus potentially 100's of new pages/week with a busy forum. I know everyone says "content is king". I'm assuming if there were some way to ensure higher quality forum discussion pages...this would be a lot better for forums.
In your website's permissions, you've most likely got it set so guests (folks who aren't logged in, such as search engines) can't see member pages and attachment pages. When someone clicks on a member avatar or link and tries to see their profile, they're sent to a "You Must Log In to Do This" page as opposed to the member's profile page. The status code of that login page is a 403 as opposed to a 200. All that means is that the user needs authentication to see the content of the page. Also, when a user tries to click on an image and they're not logged in, they'll go to the same login page. More 403s.
Ok...good deal. This probably explains why when I scan a forum website for errors...many many 403 errors are generated. And many of these are totally member account related (member avatars, member profile area). My issue is...if the Google crawler can "see" these items when it crawls a forum...this has got to be "super-bad" for the site (since 400 errors are bad)!
In Google's eyes, 403, 401, 410, and 404 are all the same thing. They're dead pages that can't be indexed. These are the ones that reduce crawl demand. If you ran a previous forum that didn't use these 403 pages, your crawl rate may have been 10,000 pages per day. Now that you've got 50% of your crawled pages showing 403 status codes, your crawl rate is probably about a tenth of that. Error status codes kill crawl demand. Not crawl budget - your budget (possible pages crawled) is big, but the demand is small. You're essentially linking to pages that aren't there and Googlebot doesn't like it at all.
I know in the GSC Crawl Stats >> Reports area, Google breaks out 404 errors from other 400 errors (it calls them "Other client error (4xx)"). Thus, since Google breaks out 404 errors separately from the other "4xx" errors...it seems Google considers 404 errors to be the worst of the 400 errors.

Regarding the 403 errors. If for example...many many 403 errors on a forum are being generated from member avatar or member account links...and Google sees these...that sounds like it should be really really bad for a forum. If Google is seeing all these 403 errors...it's probably thinking (as you stated)..."This website stinks...I'm not going to crawl this site as much...or maybe as deeply!"

Is it possible to block these member account pages that generate the 403 errors...so a forum doesn't get penalized for them? Is this taken care of in the robots.txt....with the /members/ or /account/ lines?

Now, 301 response codes mean that the page has permanently moved. Google treats these just fine. It'll see a link to a page that's been moved, follow the 301 redirect, and then visit the new URL. It's perfectly normal. The old URL is like a link to the new one. The problem is, XenForo forums use tons of these redirects all throughout their websites for various reasons.

Take a look at these four links that are on my website here:

https://gaulard.com/forum/threads/91/
https://gaulard.com/forum/threads/91/latest
https://gaulard.com/forum/threads/91/post-330
https://gaulard.com/forum/goto/post?id=329

Go ahead and click on them and you'll see they go to the same page. The URL may have a hashtag in it, but Google ignores those. In the case of these 301 redirects, Google doesn't have a problem with them. The problem we're having is that because there are so many of them, Googlebot is crawling them instead of crawling our good canonical URLs. Basically, we're burning through our allotted crawl budget because of these redirects.
Ahh I see. So 301 errors aren't necessarily bad...but since XF uses a lot of them...we're using up a lot of crawl budget with them.

Since my site was migrated from vBulletin to XF...I'm thinking I may have a lot of redirecting going on. On top of this...I believe my site's URL structure was changed as well. Previously with vBulletin...the forum was in the mysite.com/forums/ directory...now with XF the forum is in the root directory (mysite.com).

I know an add on product was installed during the migration that's supposed to make this transition easier (vB to XF)...but not sure if this has anything to do with how Google sees things...and 301 errors.

I know there's a bunch of redirects written in the .htaccess file as well (not sure if this would create even more 301 errors).
Now, what I'm testing currently is whether the crawl demand will pick back up because the 403 pages are being blocked. Will I get back to a healthy demand and will those 301 redirects not really have an impact? Time will tell.
When you say 403 errors are being blocked...is this via the robots.txt file? If so...and if my robots is similar to your robots...then I should be blocking these 403's too?
All the 403 and 301 URLs we are seeing are caused by the software. They're meant to be there. They improve functionality. It's not server related at all.
Ok I see. But if 403 errors are not good (are bad)...yet the software is generating them in order to improve functionality (I'm assuming the XF developer folks know this)...and if XF is supposed to be built with SEO in mind...why would the XF developers purposely add functionality that they know will generate 403 errors...and hurt website SEO?

Or is it simply a trade-off between the Pro's of functionality...versus the Con's of reduced SEO?

With your updated robots.txt file you have now, all crawling for all of them should stop. You should see those two graphs begin to decline within weeks.
Good deal...I think that answers one of my questions above.
The only thing that concerns me has to do with Pagerank flow because of blocking those 301s.
Is this something we're blocking in our robots.txt? If so...what robots.txt line/lines are blocking 301's?
I hope this helps. If you need anything clarified further, just let me know. Talk about long winded!
I guess I asked/covered a lot of stuff in my thread post #47 above. Had a lot of thoughts/questions....and if I didn't get them all out there...I might have forgotten them!;)

Thanks super super much for answering in such detail!:)

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
I think what's a shame (or confusing) for folks who aren't experts at this: after the 3 months when pages start dropping off the "Blocked by robots.txt" graph...let's say 6, 9 or 12 months later someone who runs a site (but doesn't check Google Search Console "GSC" very often)...goes to GSC...then goes to their "Blocked by robots.txt" graph...and it shows zero pages (or maybe a very small number).
I have a feeling this number would actually still be pretty large, especially if the forum is busy. New pages would become blocked just as fast as old ones would drop out, so I don't think the average person would be able to tell the difference.

Just a theory. Maybe all the actual historical data is still available at Google somewhere...but because Google only gives us 3 months of historical data in the GSC...we no longer see the older data when there were a lot of blocked pages (and then they were slowly de-linked). I think having access to more historical data would help alleviate some confusion (especially among site owners with less expertise).
They used to give more data and then some privacy law was passed that said they could only hold onto it for a certain length of time. I think it's two years in analytics. I'm not sure about GSC.

With forums...a busy forum can generate lots of new pages each day for Google to crawl (maybe not always the best/"thin" content). Versus a WordPress blog website...where an active blogger might add 3 pages/week...or very active blogger maybe 1 new page/day. The WordPress pages are probably "better/richer" content...but fewer.

I wonder how the Google crawler sees this (like/dislike). 3 new pages/week with a WordPress blog site...versus potentially 100's of new pages/week with a busy forum. I know everyone says "content is king". I'm assuming if there were some way to ensure higher quality forum discussion pages...this would be a lot better for forums.
On forums, many of those new pages are supposed to consolidate with one another via 301 redirects. This doesn't always happen and can be very sloppy. With blogs, like you said, things are simple. Especially on one author blogs. For every new post, there's only one new page. If I sign up for a forum, without even doing anything I already created a new page (member profile page). Then, if I make a post, I just created three URLs (two will redirect to the canonical). And if I respond to a post using the "Reply" feature (roll over your name in orange directly above), I just created another URL (that will redirect to the canonical). That's a lot of pages. Just in this thread we're writing in right now there are many URLs. It should only be two member account URLs and one thread URL.

Is it possible to block these member account pages that generate the 403 errors...so a forum doesn't get penalized for them? Is this taken care of in the robots.txt....with the /members/ or /account/ lines?
Yes, by blocking /members/ in your robots.txt file, Google will never know the 403 errors exist (on a new site). Since it's already crawled many on our sites, they now need to wither away and die.

Since my site was migrated from vBulletin to XF...I'm thinking I may have a lot of redirecting going on. On top of this...I believe my site's URL structure was changed as well. Previously with vBulletin...the forum was in the mysite.com/forums/ directory...now with XF the forum is in the root directory (mysite.com).

I know an add on product was installed during the migration that's supposed to make this transition easier (vB to XF)...but not sure if this has anything to do with how Google sees things...and 301 errors.

I know there's a bunch of redirects written in the .htaccess file as well (not sure if this would create even more 301 errors).
In many cases, 301 redirects are perfect for the situation. In your case, when moving from one CMS to another, you need those redirects. They shouldn't be blocked. It's only when the current CMS is pumping out brand new 301s like crazy that they *may* need to be blocked. Depending on my testing, I may unblock mine in the future. You certainly wouldn't want to remove any of the 301s that are redirecting your old site to the new. And just as a reminder, 301s aren't errors. They're perfectly normal and are used everywhere. They're just like links, but they physically send a web surfer from one URL to another without them knowing it.
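For anyone following along, this is roughly what one of those migration redirects looks like when it's written in .htaccess. It's a made-up example - a vBulletin-to-XenForo redirect add-on normally generates the real rules, and they depend on your old URL scheme:

RewriteEngine On
# Old vBulletin thread URL:  /forums/showthread.php?t=123
# New XenForo thread URL:    /threads/123/
RewriteCond %{QUERY_STRING} (?:^|&)t=(\d+)
RewriteRule ^forums/showthread\.php$ /threads/%1/? [R=301,L]

The visitor (or Googlebot) asks for the old address, gets a 301 pointing at the new one, and lands on the current page. That's exactly the kind of redirect you want to leave alone.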

When you say 403 errors are being blocked...is this via the robots.txt file? If so...and if my robots is similar to your robots...then I should be blocking these 403's too?
Yes.

Ok I see. But if 403 errors are not good (are bad)...yet the software is generating them in order to improve functionality (I'm assuming the XF developer folks know this)...and if XF is supposed to be built with SEO in mind...why would the XF developers purposely add functionality that they know will generate 403 errors...and hurt website SEO?

Or is it simply a trade-off between the Pro's of functionality...versus the Con's of reduced SEO?
They could have done it another way. They could have simply sent people to a login page with one common URL as opposed to a whole heck of a lot of different URLs that show 403 response codes. I'm no developer though, so I can't comment on why something was done as opposed to something else. I guess it's our responsibility to manage the SEO side. The software is very good. It just needs tweaking in certain places.

Is this something we're blocking in our robots.txt? If so...what robots.txt line/lines are blocking 301's?
These are the lines:
Disallow: /forum/goto/
Disallow: /forum/threads/*/latest
Disallow: /forum/threads/*/post

In yours, it would be:
Disallow: /goto/
Disallow: /threads/*/latest
Disallow: /threads/*/post

You can always unblock these for now and just keep the 403 pages blocked if you want. By blocking the redirects, we're blocking a lot of URLs, depending on how many responses a post receives. There's a lot to talk about in this regard. Pagerank flow and the nofollow attribute. We can discuss that if you'd like.
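If you ever want to sanity-check which URLs those wildcard lines actually catch, here's a rough way to approximate it in Python. fnmatch is not a real robots.txt parser (Google does prefix matching with * and $), so treat this as a ballpark check only. The paths are the same kind of URLs as my examples above, minus the /forum part since your forum runs in the root:

from fnmatch import fnmatch

# Trailing * approximates Google's prefix-style matching.
rules = ["/goto/*", "/threads/*/latest*", "/threads/*/post*"]
paths = [
    "/threads/91/",
    "/threads/91/latest",
    "/threads/91/post-330",
    "/goto/post?id=329",
]

for path in paths:
    blocked = any(fnmatch(path, rule) for rule in rules)
    print("blocked" if blocked else "allowed", path)

With those three rules, the /latest, /post, and /goto/ URLs come back blocked and the plain thread URL stays allowed, which is the behavior we're after.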

Happy to help!

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
In your case, when moving from one CMS to another, you need those redirects. They shouldn't be blocked. It's only when the current CMS is pumping out brand new 301s like crazy that they *may* need to be blocked. Depending on my testing, I may unblock mine in the future. You certainly wouldn't want to remove any of the 301s that are redirecting your old site to the new. And just as a reminder, 301s aren't errors. They're perfectly normal and are used everywhere. They're just like links, but they physically send a web surfer from one URL to another without them knowing it.
If the robots.txt lines below are blocking 301's...and if my site was migrated from vBulletin to XF...do you think I should unblock these?:

Disallow: /goto/
Disallow: /threads/*/latest
Disallow: /threads/*/post

Don't know if this makes a difference...but the site was migrated over 2 years ago.

Thanks,

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
If the robots.txt lines below are blocking 301's...and if my site was migrated from vBulletin to XF...do you think I should unblock these?:

Disallow: /goto/
Disallow: /threads/*/latest
Disallow: /threads/*/post
The URLs that are being blocked in the above quote are unique to XenForo software. No URLs that are being used to forward your old site to the new are being blocked.

Thinking about it though, you might want to unblock those above - just to make it more of a stepped approach into what we're trying to accomplish. The primary issue you're dealing with is the 403 pages. That's been dealt with. Give it a few months and if you need to tighten the screws further, then you can block these 301s.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I think my main goal through all of this is to prevent the Google Crawler from wasting resources crawling "website junk" (pages it doesn't need to crawl & pages 99.9% of website visitors are not interested in)...and at the same time not accidentally blocking important content.

I want to be sure the Google Crawler is crawling & indexing the "good stuff"...the stuff folks are looking for via search engines. I know the site has this...I'm just not 100% sure (in the past) the Google Crawler was ever getting to the "good stuff"...due to all of the unimportant stuff.

I'm hoping the recent major overhaul of the robots.txt will go a long way to accomplishing this. If there's anything else I should do now (since it can sometimes take months for the improvements to show up)...please suggest.:)

Thanks,

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Sounds like a plan. I'd say wait a month/two months to see if those valid pages begin to rise. You can reassess as time goes on. That crawler is definitely crawling a bunch of stuff it doesn't need to, so the valid pages should increase. By the way, look what's happening to my 200 OK crawl rate. It's rising because I think Google is now valuing the site more. The traffic is slowly falling though, but I suspect that's because the old crud is being removed. It'll definitely take a Google update to see any positive changes. Hopefully.

crawl-rate-200.gif
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
We're actually continuing this particular conversation regarding crawl requests in this thread:

https://gaulard.com/forum/threads/166/
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
djbaxter likes to block his tags. Many people do that, but many others like to have them crawled. I actually see most people allowing them to be crawled and it seems to be working out for them. Even the guys at XenForo have their tag pages being indexed by Google and I see them appear in search results quite often. If we let them be crawled? Something terrible will probably happen. I've added a few tags on one of my sites to see what will happen - whether or not they'll get indexed. We'll see.
I think I may have come up with a reason to block tags in robots.txt.

I've been painstakingly correcting lots & lots of link errors of all sorts on my site...and some errors that are being spotted have "tags" URLs...and are coming back as 404 errors.

As an example...here's the current tag cluster/tag map for XenForo:

Screen Shot 2022-02-05 at 5.55.54 PM.png

I think in many forum software packages there's a setting for how many tags to retain in the tag map (top 50, top 100, etc). As frequency drops for one keyword (tag)...it may get kicked off the map & be replaced by a new word.

The 404 error URLs I'm finding are for tag "words" that no longer show up on the tag map (and I get the XF "Oops Page").

As an example...if one of these URLs is visited...it will display a site's "tag map" (depending on how the directories are set up for a site):

https://xenforo.com/community/tags/

Here's an example of a "tag" URL that comes back as 404 (at least on my site):

https://xenforo.com/community/tags/apple

If the word "apple" no longer appears on a sites tag map...and a "tag" URL contains "apple"...but "apple" no longer appears on the tag map...then it would seem that this apple tag URL comes back with a 404 error.

Thus I'm thinking this could be a reason to block tags in a sites robots.txt.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
This is very interesting. I have a love/hate relationship with tags. I see that many XenForo sites use them, but I don't have a lot of luck with them. Google likes to index and then deindex them every so often (in my case anyway). That could be because of the whole "low valid pages" thing we're dealing with, but I've even had issues with them before XenForo. Let's just say they've never done me any good. They've never ranked for anything. But again, I see XenForo getting ranked for their tags all the time when I look something up in Google.

Anyway, are your tags in a state of flux? Meaning, are they used and then removed from some posts and then added back later on? If you visit your admin panel and then Content > Tags and then click on a tag, you'll see a Permanent check box. It says "Making a tag permanent prevents it from being removed when it is no longer in use." below it. I'm guessing that you don't remove the tags in question, which makes this situation very curious.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Anyway, are your tags in a state of flux? Meaning, are they used and then removed from some posts and then added back later on? If you visit your admin panel and then Content > Tags and then click on a tag, you'll see a Permanent check box. It says "Making a tag permanent prevents it from being removed when it is no longer in use." below it. I'm guessing that you don't remove the tags in question, which makes this situation very curious.
Thanks for mentioning this. I checked a bunch of settings to double check what sort of "tag" settings I have set up.

1. I went to Setup > Options > Content Tagging...once there...at the top there's a check box for "Enable content tagging". This box is checked (active).

2. Further down the page there's a check box for "Enable tag cloud with up to X tags:". This check box is checked...thus active.

3. Tag Cloud is set for 100 tags at the moment.

4. I went to Content > Tags (as you suggested)...then clicked on a handful of the tags (one by one)...then examined if the "Permanent" check box was checked for any of them. In all cases none of the "Permanent" check boxes was checked (all un-checked).

What do you think optimal settings should be?

"Tag Cloud" is enabled...but no individual tag is checked as "Permanent"...thus I think this means the 100 tags that show up in the 'Tag cloud" are dynamic (tags listed in the tag cloud change as specific words get used more or less often).

Also keeping in mind when doing a bad-link/dead-link scan of the site...I'm finding some individual 404 error URLs for individual tags.

Thanks:)

p.s. I still have a lot of bad-link/dead-links to correct. I've only come across a couple of these "tag" 404 error URLs so far. I'll have to see if they are super common or not. If there's only a small amount...maybe it's not something to be concerned about. But if there are lots of them...I'm thinking maybe I either need to adjust Admin Panel settings...or block tags in robots.txt.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
What do you think optimal settings should be?
I think you probably already have the optimal settings. The only reason you would need to check the Permanent box is if your tags are being removed from threads and then added back to the site. That doesn't seem to be the case with many forums that have permanent content in stable threads. I'm not sure why tags would be added and then deleted. Even if they were removed for some reason, a person probably shouldn't have the empty page hanging around. Google doesn't like "placeholder" pages (empty pages like this). If I had to guess, removed tags would be the result of two situations: 1. threads being removed (thus having the existing tag applied to nothing, therefore essentially being deleted), and 2. threads being merged (thus having one thread's tags removed because the thread has taken on the tags, or no tags, of the thread it's been merged into). If you are doing neither of these things, having an existing tag page show as empty is something you'd need to investigate further. If the tag is active and it's currently being applied to a thread, but is returning a 404 page, there's a different problem that needs to be discovered.

You may want to look around at other XenForo sites to see how valuable tags even are. It's somewhat common knowledge that tag pages can cause duplicate content and "penalties," which would lead to a lower crawl rate by Google.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
I'm not sure why tags would be added and then deleted. Even if they were removed for some reason, a person probably shouldn't have the empty page hanging around. Google doesn't like "placeholder" pages (empty pages like this). If I had to guess, removed tags would be the result of two situations: 1. threads being removed (thus having the existing tag applied to nothing, therefore essentially being deleted), and 2. threads being merged (thus having one thread's tags removed because the thread has taken on the tags, or no tags, of the thread it's been merged into). If you are doing neither of these things, having an existing tag page show as empty is something you'd need to investigate further. If the tag is active and it's currently being applied to a thread, but is returning a 404 page, there's a different problem that needs to be discovered.
My understanding of how a "Tag Cloud" works...it's composed of the top "tags" or words used on a forum. If in the Admin Panel an admin chooses the "Tag Cloud" to contain 100 tags...then the tag cloud contains the top 100 words used on the forum by members (excluding words like (and, it, if, the, of, they them, I, me)...etc etc.

Also the font size of a tag/word found in the "tag cloud" indicates how often a tag/word is used relative to other tags in the tag cloud (the larger the tag's font size...the more it is used...and the smaller the tag's font size...the less it is used).

As mentioned (in this example)...the tag cloud contains the top 100 words based on frequency of use by forum members. The tag cloud is dynamic (tags can & will be added...and other tags as their frequency drops...will be removed from the tag cloud). The tags/words at the bottom of this list of 100 are going to be the most volatile.

Thus my theory is...when tags fall off the tag cloud (when they are not in the top 100 tags)...this (in my case)...may be why I'm finding "tag URLs" with a 404 error. They would be tags that at one point were in the tag cloud...but then dropped out of the tag cloud when they started to be used less frequently (and replaced by tags/words used more frequently).

As an example. In December the tag/word "Christmas" might become part of the "Tag Cloud". But in many other months where the word "christmas" is not used very often...then "Christmas" would fall out/off of the tag cloud.

I think (I'm not 100% sure)...when each tag is created (or maybe only when a tag is part of the "tag cloud")...a "tag URL" is also created. Maybe as long as a tag remains in the tag cloud...this tag URL is valid (no 404 error). But if a tag falls out/off of the tag cloud...then maybe this "tag URL" (since it's no longer in the tag cloud)...gives a 404 error.

I'm thinking...if a specific tag falls out of the top 100 tags (off of the tag cloud)...such as the word "Christmas"...then the "tag URL" for "Christmas" either should be auto-deleted somehow (so it doesn't throw a 404 error)...or it gets "stored" somewhere where Google won't crawl it (so it doesn't throw a 404 error).

In my case (when I scan the site for bad/dead links...and the scanner is told to follow the robots.txt rules)...I think the scanner is finding tag URLs that are not part of the current "tag cloud" (top 100 words used on the forum)...and showing these tag URLs as a 404 error. Why I don't know. Not sure if this is the way it's supposed to work or not.
You may want to look around at other XenForo sites to see how valuable tags even are. It's somewhat common knowledge that tag pages can cause duplicate content and "penalties," which would lead to a lower crawl rate by Google.
I'm thinking the average XF forum website has far more tags that are not in the tag cloud than are in it (based on the number of tags a site admin decides should be in the tag cloud). In this example...we're talking about the 100 most popular tags being in the tag cloud.

If someone goes to the Admin panel > Content > Tags...they will see the total list of tags. This number of total tags (in many cases)...should exceed the number of tags/words in the tag cloud. Maybe a site has 750 total tags in the tag list...but only 100 of them are in the tag cloud at any one time.

This could mean 100 tag URLs are represented in the tag cloud...and as many as 650 "tag URLs" are floating around as potential "tag URL" 404 errors.

If anything I mentioned above makes sense/is true...then this is where I was thinking maybe a disallow for tags could be added to the robots.txt...to prevent any 404 errors from happening due to these tag URLs. This assumes of course that the benefits of tags being "crawlable" are less than the penalties of some tag URLs coming back as 404 errors.

Thanks:)

p.s. Again I'm not 100% sure how this works exactly. What I mentioned is just a theory...since when my site is scanned...it's finding tag URLs throwing a 404 error. I'm not exactly sure yet how many there are...since I haven't completed a full scan of the site. If what was mentioned above is true...then there should be a lot of them. Thus back to the idea of maybe adding disallow tags to the robots.txt.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi Nick,

Unfortunately, that's not how it works. A tag cloud is merely a visual thing. It's not even necessary, as individual tags are listed on the thread pages. Tag pages are actual pages. Those pages don't get deleted when a tag isn't popular enough to make it into the cloud. You can test this. Go to your tag cloud and click on a tag that's contained inside of it. Test the URL response code in the header checker I sent you. Then, set your tag cloud to show only one tag. Then test the URL in the header checker again. It should still show a 200 response code.
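Or, if it's easier than hunting down a web-based header checker, a short Python snippet does the same job. The tag URLs here are placeholders - drop in the ones your link scanner flagged:

import urllib.request
import urllib.error

tag_urls = [
    "https://www.example.com/tags/apple",
    "https://www.example.com/tags/christmas",
]

for url in tag_urls:
    try:
        # urlopen follows redirects, so a healthy tag page reports 200.
        with urllib.request.urlopen(url, timeout=10) as response:
            print(response.getcode(), url)
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise HTTPError; the code attribute is the status.
        print(error.code, url)

A tag that's actually attached to at least one thread should print 200; the leftovers you're worried about will print 404.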

Tags are real pages. They're like category or forum pages. As navigational elements, they're not meant to come and go. When added to a permanent thread, they're permanent as well. I mean, obviously anyone can do anything they want with their site, but (excessive) tags are something to be very careful of.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Unfortunately, that's not how it works.
Wow...if that's true...man was I misinformed! I'll keep my lips sealed!:censored:

I searched our threads but couldn't find the header checker website you suggested previously...thus I tried a couple of other header checker websites for one of the tag URLs I'm getting 404 errors for. If you highly trust the header checker website you use...post the URL again and I'll test with it too. Thanks:)

The tag URL I tested had the format = https://www.example.com/tags/gdpr

Here are the results:

tag test 1 copy.png
tag test 2.png

Edit: Here's a 3rd Header-Checker website that seems to be giving two response codes for the same tag URL tested above:

tag test 3.png
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi - You weren't misinformed. What you described is how most of it works (what I wrote didn't come out the way I intended). What I meant to say was that tags don't automatically delete themselves when they get pushed out of a tag cloud. So yes, the beginning part of your post was correct, but the second part about less-active tags showing 404 header codes wasn't. At least they're not supposed to show 404s automatically like that.

Are you running any tag add-ons? I wonder if something is getting screwed up. There's a lot that I don't know about your site, so I'm not sure how much help I can be. What I am fairly certain of though is that any tag that's attached to at least one thread is supposed to return a 200 OK header code. Actually, this is why tags in general are so dangerous. It's because such a high percentage of them are only used on a few threads. This creates tons of thin content that can affect the crawling of the rest of the site.

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Hi - You weren't misinformed. What you described is how most of it works (what I wrote didn't come out how I intended). What I meant to say was that tags don't automatically delete themselves when they get pushed out of a tag cloud.
Ok good deal. I was thinking my brain needed to be "reformatted" if I had misunderstood most of what I mentioned about tag clouds. Ha ha!

I think what I meant to say about the "auto-delete" part was...if tag URLs are not auto-deleted (when not in the current tag cloud)...then if they do "hang around"...they should be in a format that's favorable to crawlers (200 OK). In my case at least some tag URLs are throwing 404 errors...and that certainly isn't favorable to Google/crawlers.

Are you running any tag add-ons? I wonder if something is getting screwed up. There's a lot that I don't know about your site, so I'm not sure how much help I can be. What I am fairly certain of though is that any tag that's attached to at least one thread is supposed to return a 200 OK header code. Actually, this is why tags in general are so dangerous. It's because such a high percentage of them are only used on a few threads. This creates tons of thin content that can affect the crawling of the rest of the site.
Guess what?...I just discovered something. As I mentioned, my site was migrated from vBulletin to XF about 2.5 years ago. Of course no migration is ever 100% perfect...and of course the vBulletin software has a tagging system as well.

I know which tags are throwing the 404 errors (via the site scan)...and "gdpr" is one of them (mentioned above). I was in the XF Admin panel...in the "Tags" area (Content > Tags)...this area lists all of the tags for the site.

There's a search box on this page...so I decided to search for "gdpr". When I did...XF came back with "No results found". I then tested a few more of the tags that are throwing 404 errors...and they also came back as "No results found".

My guess is something must have happened during the vB to XF migration...maybe some of the tags that existed in vB didn't translate or migrate properly to XF (otherwise why would there be tag URLs for tags the XF Admin panel doesn't think exist?).;)

There's also an "Add Tag" button on the XF Admin panel Tags page...thus I'm thinking if I add all the tags that are throwing 404 errors...maybe that will eliminate those 404s...since the tags will then be in the XF tag list.

Can't hurt to try I guess.:)
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
That's good news! I love solving a good mystery. Well, we haven't solved it yet, I suppose, but you're well on your way. Let me know what you find out.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
A bit of an update. As a reminder...I made a massive update to my robots.txt on January 4th...it's been about 7 weeks...and of course Google has crawled the site a bunch of times since then.

Here's what the "Blocked by Robots" graph from Google Search Console (GSC) looks like after about 7 weeks:

GSC Blocked by Robots.png

* Red circle indicates when robots.txt was significantly changed.
* Blue circle indicates today's date 2-25-22.
* As can be seen...the number of blocked pages continues to increase.
* Valid pages have increased about 3.5% (not sure if this is a blip in the data...or a real increase).
* Average position in search results has not changed. Hoping this improves.
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Well, we know something is working here. This is where things get interesting. Hopefully you'll continue to see those valid pages increase, and over the next year, hopefully you'll see the blocked-by-robots.txt pages (or URLs) peak and then begin to fade away.

I think about this issue a lot. Beyond physically removing the offending links from a XenForo website (which is what I've done on this site for visitors who aren't logged in - an entirely different ballgame) and letting them die off with 301 redirects and 403 responses, the only way to handle this is by blocking the pages in the robots.txt file. It's a constant process of asking yourself, "Did I just do the right thing?" But as you said, you've been dealing with ranking and crawling issues for two years, so I think you've got enough data to go on. Things aren't going to magically fix themselves while you wait. You needed to take action. Keep me updated.
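
One way to sanity-check which URLs your rules actually block, without waiting on GSC to catch up, is Python's built-in robots.txt parser. A rough sketch with placeholder URLs - keep in mind the standard-library parser matches plain path prefixes and doesn't evaluate Google-style * wildcards the way Googlebot does:

from urllib.robotparser import RobotFileParser

# Placeholder domain - point this at your own live robots.txt
rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetches and parses the file

# Placeholder URLs - substitute real paths from your site
for url in (
    "https://www.example.com/members/somebody.123/",
    "https://www.example.com/threads/example-thread.456/",
):
    verdict = "blocked" if not rp.can_fetch("Googlebot", url) else "crawlable"
    print(url, verdict)

Search Console's robots.txt tester does the same check with Google's own wildcard handling, if you'd rather stay in the browser.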
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
...hopefully you'll see the blocked by robots.txt pages (or URLs) peak and then begin to fade away.
Is that the way it usually works?

1. We make changes/additions to the robots.txt
2. Then "Pages Blocked by Robots" increases for a period of time.
3. Pages Blocked peaks at some point...then decreases

* How far will the decrease go (assuming the robots.txt remains unchanged)? Will it go to zero or close to zero?
* Even though the blocked pages number decreases (after peaking)...does Google still "remember" what the blocked pages were somewhere in its system? Or, because they're blocked by robots, is there no need to "remember"? I think maybe you called this "deindexing".

Thanks
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Is that the way it usually works?
If the blocked pages are contained inside of a directory and the path to those pages is blocked in the robots.txt file, then those URLs should eventually fade away. Back in late 2019, I blocked a directory with tens of thousands of URLs that Google previously crawled and it took about a year, but they finally dropped from the graph. If the pages are still linked to on the website though, meaning Google can crawl the pages that link directly to a blocked page, such as attachment and member pages, they may not drop from the graph. I've seen it happen and I've seen it not happen. The thing is, many XenForo sites have these types of pages blocked and they rank very well, so there's evidence that suggests it doesn't make a difference one way or the other (whether or not they appear in the GSC graph).

1. We make changes/additions to the robots.txt
2. Then "Pages Blocked by Robots" increases for a period of time.
3. Pages Blocked peaks at some point...then decreases
This is what typically happens. In your case though, you're blocking many resource files, such as the ones we discussed earlier. I'm not sure what will happen with them. Ideally, none of those resource files would be causing errors, so they wouldn't need to be blocked. And to go one step further, ideally you wouldn't even have any links on the site that produce 403 responses for Google to crawl.

* How far will the decrease go (assuming robots.txt remains unchanged). Will it go to zero or close to zero?
* Even though the blocked pages number decreases (after peaking)...does Google still "remember" what the blocked pages were somewhere in its system? Or because they're blocked by robots...no need to "remember"? I think maybe you called this "deindexing".
If the pages are contained in a directory as I described above, yes, they should eventually fall to zero. That's what mine did. But again, if the pages are still linked on the site, such as member and attachment pages, it's unknown what will happen.

Google remembers everything, but the longer time goes on, the less it remembers and the less it cares.
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
There have been 2 Google crawls since the last post...and I'm starting to get a bit concerned. Before the last post a traffic drop was starting (but I didn't want to "cry wolf" quite yet in case it suddenly bumped back up). After the latest 2 crawls...the traffic drop looks more like a trend (not sure if it will continue or not).

Here's the traffic drop via Google Analytics (last 7 days vs. the 7 days before that)...a 9-14% drop day to day:


1 Week Traffic Drop.png

Valid URLs have also started dropping over the last 2 crawls. As of the last crawl...valid URLs are now lower than they were before the robots.txt was changed about 2 months ago. Right now valid URLs are -1.0% vs. 1-4-22. Valid URLs 2 weeks ago (2-15-22) were about +5%.

Nothing has been changed since 1-4-22.

Any thoughts or ideas?

Thanks
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
The issue here is that you've got a very complex situation and it's difficult to get a grasp on what may be happening. I think, right off the bat, that you may want to unblock the 301 redirects. For instance:

Disallow: /goto/
Disallow: /threads/*/latest
Disallow: /threads/*/post

Depending on your site, these URLs may number in the thousands, and now that they're blocked, the traffic that used to flow through them is being stopped. In your case, it seems like the biggest problem areas in regard to crawl budget are the one we discussed above that has to do with the advertisement add-on (I forget what it was you blocked) and these below:

User-agent: *
Disallow: /attachments/
Disallow: /members/
Disallow: /search/
Disallow: /whats-new/

If this were your total robots.txt file, it would be fine.
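
For reference, a complete file along those lines could look something like this - the Sitemap line is just a placeholder for whatever sitemap URL you already list:

User-agent: *
Disallow: /attachments/
Disallow: /members/
Disallow: /search/
Disallow: /whats-new/

Sitemap: https://www.example.com/sitemap.xml

Keeping the file that small also makes it much easier to tell which change caused what.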
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Thanks for the ideas Jay.

I'm totally ok with tweaking things to get the right balance. I won't put too much in one post to keep things simple (1-2 things at a time).:)

I think, right off the bat, that you may want to unblock the 301 redirects. For instance:

Disallow: /goto/
Disallow: /threads/*/latest
Disallow: /threads/*/post
Yes, these 3 lines are currently in my robots.txt. Just so I'm 100% understanding...are you saying these 3 lines are associated with 301 redirects?

Because my site was migrated from vBulletin...and the server file structure was changed during the migration...the site has lots & lots of URLs that are 301 redirects.

Thus are you saying I should remove these 3 lines from the robots.txt?

Depending on your site, these URLs may be in the thousands and now that they're blocked, the traffic that used to go to them is being stopped. In your case, it seems like the biggest problem areas, in regards to crawl budget are that one we discussed above that has to do with the advertisement add-on (I forget what it was you blocked) and these below:
Yes, in the Google Search Console "Crawl Stats" report...in the "Other Client Error (4xx)" area...there were two VERY common error examples that had the same term in them. I added a line you suggested to the robots.txt containing the common term found in both of these errors. Since 1-26-22 (about 5 weeks)...these 2 errors have not shown up in the "Other Client Error (4xx)" report.:)

User-agent: *
Disallow: /attachments/
Disallow: /members/
Disallow: /search/
Disallow: /whats-new/

If this was your total robots.txt file, it would be fine.
I wasn't 100% sure what was meant here. Are you saying that if I modified my robots.txt to contain just the 5 lines in the quote above...it would be ok?

Right now my robots.txt has about 41 lines (including 2 sitemap lines)...per what we discussed a while ago about matching what you were doing on one of your sites (trying to block everything that's not necessary for Google to crawl)...thus maximizing the crawl budget for the content we do want Google to crawl.

If I went all the way back to just the 5 lines above...I'd pretty much be back to where I started at the end of last year (before the major robots.txt change was put in place on 1-4-22). Please clarify if I'm misunderstanding.:)

Thanks,

Nick
 

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
Hi Nick,

Unfortunately, I think I'm out of gas on the issues you're experiencing. I simply don't have enough information to go on. In order to give you a solid opinion on what to do, I'd need to have access to your site (plus all of the information that pertains to your old site), log files, GSC data, and probably some other things. And then, in order for me to spend the time analyzing those things to come up with suggestions, I'd need to charge you a consulting fee. This type of thing is very complex and isn't something that can be handled with the back and forth of a forum. I'd love to help, so please let me know if you're interested.

Thanks,

Jay
 

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
Thanks much, Jay, for all the help...unfortunately I don't have the budget for paid assistance. I'll keep plodding along like I have been...maybe I'll figure this out someday. Lol

Nick
 