Google Search Console "Queries" and "Pages" Information

Alfuzzy
Member · Site Supporter · Joined: Dec 30, 2021 · Messages: 68 · Reaction score: 2 · Points: 8
  • #1
In Google Search Console, if someone clicks on "Performance" (in the left navigation area)...then clicks on "Queries" and "Pages":

[Screenshot: the Performance report showing the "Queries" and "Pages" tabs]

When each of these is clicked, you get a list of information:

* For "Queries" you get a list of words or statements.
* For "Pages" you get a list of URL's.

What exactly does each of these lists mean?

I'm guessing the "Queries" list is the search terms internet visitors are searching for via a search engine, and the "Pages" list of URLs...is the pages on the site in question that match the search terms from the "Queries" list.

Hopefully this is a close approximation of what each of these means, but I would be very interested in any additional details/thoughts anyone may have. I want to be 100% sure I understand each of these before doing anything or drawing any conclusions.

Thanks
 
JGaulard
Administrator · Staff member · Site Supporter · Sr. Site Supporter · Power User · Joined: May 5, 2021 · Messages: 319 · Reaction score: 2 · Points: 18
  • #2
Hello @Alfuzzy!

Yes, I do believe you are spot on with your assessment of these two areas in the search console. If you scroll up on the very same page and click the "Average Position" tab above the graph, you'll see a new column (with orange numbers) added that will help out a lot in understanding where your site sits in the Google search results. For instance, let's say you have 1000 impressions for one keyword, but only 10 clicks (a 1% click-through rate). If you look at the Average Position column, you'll likely see that that keyword is not on page one. If it's not on page one, unfortunately, it's the equivalent of being buried deeply. Even on page one, if it's not first, you're not getting nearly the traffic you could be getting. I once read that the difference in traffic between the first and second positions in Google is 100% - in other words, the first result gets roughly double the clicks of the second.

You can tell a lot from just the Total Clicks and Total Impressions columns, though. Oftentimes, after a jump in impressions, you won't see a big difference in clicks. It's good that impressions increased, but that can mean that more keywords/phrases were introduced to the search results, yet those keywords/phrases are still in position 35 or something like that. It sometimes takes time for those keywords to climb to page one, where they'll garner actual clicks. It's always a good thing when you watch those average positions fall toward position number one.

Although, after an update, if Google deems your site worthy, you may see a big jump in clicks, impressions, and positions, for that matter. That's when stuff gets crazy. I don't see a big difference in any of these numbers between updates. They're kind of boring/frustrating to watch.

I hope this helps.
 
Alfuzzy
  • #3
JGaulard said:
Yes, I do believe you are spot on with your assessment of these two areas in the search console.
Ok, good deal. So we always hear about how important it is to rank well for keywords...or at least to rank well for the keywords a website wants to rank for.

Does either the "Queries" or "Pages" info in the Google search console Performance area have any relation to which keywords a site ranks for?

Also...one of the reasons I was asking about this "Queries" and "Pages" info in GSC: when I check GSC...if I look at just the first page for each of these areas (the top 10):

* The "Queries" area contains words/terms/statements that I really haven't seen much in forum threads for quite some time...yet they seem to rank high in the GSC "Queries" area.
* The "Pages" area...some of the URLs listed (especially on page #1)...are old (I'm talking like 2010, 2013, 2017). Some URLs listed are 2020/2021...but then there are these old URLs that are ranking high. Seems to me newer topics/info/keywords should rank higher (than old)...due to current relevancy.

The reason why I'm concerned about this info is...I'm wondering if my site is getting crawled properly. If the info found on page #1 (and beyond) for each of the "Queries" and "Pages" areas is outdated or old...I'm wondering if there's something on the site/server that's preventing the site from being crawled properly.

As you know, I recently did a major revamp of my robots.txt file. But much of the info I'm seeing in GSC now...I also remember from a while ago (when I was running a more "standard" robots.txt file). If there is an issue...and it's not the robots.txt file...I'm wondering what else could be preventing the site from being crawled properly.

Thanks
 
JGaulard
  • #4
Alfuzzy said:
Does either the "Queries" or "Pages" info in the Google search console Performance area have any relation to which keywords a site ranks for?
Yes, the Queries are the keywords your site ranks for. Copy and paste those phrases right into Google and you should see your site in the results, give or take a few spots from where the average position claims to be. It's never right on the exact spot. And yes, those pages are your pages that are in the Google search results. Just be aware that the Queries and Pages lists are two distinct lists. The second query in the list isn't associated with the second page on the other list.
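
As an aside, if you ever want to pull this same Performance data programmatically, the Search Console API can return query and page together in one row, which the two separate tabs don't show. A rough sketch (the property URL and credentials file are placeholders; it assumes the google-api-python-client package and a service account that's been added as a user on the property):

from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

# Each returned row pairs a query with the page it surfaced for.
response = service.searchanalytics().query(
    siteUrl="https://www.example.com/",  # placeholder property
    body={
        "startDate": "2022-01-01",
        "endDate": "2022-01-31",
        "dimensions": ["query", "page"],
        "rowLimit": 25,
    },
).execute()

for row in response.get("rows", []):
    query, page = row["keys"]
    print(f"{query!r} -> {page} (clicks={row['clicks']}, "
          f"impressions={row['impressions']}, position={row['position']:.1f})")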

Alfuzzy said:
* The "Queries" area are words/terms/statements that I really don't see much in forum threads for quite some time...yet they seem to be ranking high in GSC "Queries"s area.
* The "Pages" area...some of the URLs listed (especially on page #1)...are old (I'm talking like 2010, 2013, 2017). Some URLs listed are 2020/2021..but then there are these old URLs that are ranking high. Seems to me newer topics/info/keywords should rank higher (than old)...due to current relevancy.
Yeah, I'm not sure about that. If the old URLs from your previous CMS are still being listed there, you may not have good crawl efficiency. The 301 redirects you set up when you moved your site to XenForo might not have been followed. Although, Google has claimed to keep using old URLs, even after they've been redirected, so I wouldn't worry too much about this. And hopefully, after your site begins getting crawled better, you'll see those old URLs transition over to the new ones.

In regards to older queries and pages ranking, that's fine. That just means they're entrenched in the system and most likely have a few links pointing to them.

To see if your site is being crawled properly, go into the search console and scroll down on the left side until you see the Settings link. Then find the Crawling section on the resulting page and click the Open Report link. Find the By Response area and then click into each response code to see which pages are being crawled. If you see new (XenForo) URLs under the 200 section, you should be fine.

This is what the report looks like for one of my sites:

[Image: the By Response section of the Crawl Stats report]

You likely have tons of 301s and 403s.
 
Alfuzzy
  • #5
JGaulard said:
To see if your site is being crawled properly, go into the search console and scroll down on the left side until you see the Settings link. Then find the Crawling section on the resulting page and click the Open Report link. Find the By Response area and then click into each response code to see which pages are being crawled. If you see new (XenForo) URLs under the 200 section, you should be fine.

This is what the report looks like for one of my sites:

[Image: the By Response section of the Crawl Stats report]

You likely have tons of 301s and 403s.
This is an issue I've been trying to solve for quite some time. The issue is the "Other Client Error (4XX)" category (which isn't even in the top 5 in your "By Response" data above).

Approximately last January, the website had what I would call a "standard" robots.txt (similar to XenForo.com's)...and this robots.txt was in place for at least a year (thus Google had lots of time to crawl the site many times).

Here's what the "By Response" data looked like in January 2021:

[Image: By Response data, January 25, 2021]


As can be seen...the "Other Client Error (4xx)" is absolutely out of control...and the "Moved Permanently (301)" is not great either!

I have a long thread over at XenForo.com where I asked for help/feedback on this...but did not get any feedback that helped with the issue.

June 2021 is when I massively changed my robots.txt (it basically wasn't blocking much of anything). By late July 2021 (about 4-6 weeks later), the "By Response" data with the massively changed robots.txt looked like this:

[Image: By Response data, July 21, 2021]

This would seem to indicate that having an almost completely open robots.txt (basically blocking almost nothing) was an improvement.

By late December, 2021 (with the exact same robots.txt in place for about 6 months)..."By Response" data looked like this:

[Image: By Response data, December 21, 2021]

As can be seen...the By Response "Other client error (4xx)" kind of returned to a really high value again.

It would seem that having a "standard" robots.txt (similar to XenForo.com & other XenForo sites)...or a "wide-open" robots.txt...was not the solution to the really high "Other client error (4xx)" statistic/problem.

In late December 2021, the robots.txt file was massively reshaped (it now blocks much more than it ever has). Since then, Google has crawled the site about 5 times. Now (about 4 weeks later)...the "By Response" data looks like this:

[Image: By Response data, January 23, 2022]

Four weeks later with the new robots.txt..."By Response" data is slightly better. But I'm not 100% sure if it's truly better (and will hopefully continue to get better)...or if the "Other client error (4xx)" going from 31% to 28% in 4 weeks is due to the massively different robots.txt...or just random statistical fluctuation. More time should tell, of course.
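
One way to sanity-check the "random fluctuation" question is a two-proportion z-test. A rough Python sketch (the request counts here are made-up placeholders - the real totals are on the Crawl Stats report):

from math import sqrt, erf

n1, p1 = 5000, 0.31  # total crawl requests and 4xx share, before (placeholder)
n2, p2 = 5000, 0.28  # total crawl requests and 4xx share, after (placeholder)

pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal approximation
print(f"z = {z:.2f}, p = {p_value:.4f}")

With request counts in the thousands, a 3-point drop usually comes out statistically significant; with only a few hundred requests, it may not.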

Would love to hear any ideas, comments, or suggestions that may help with this issue...in addition to an adjusted robots.txt. The reason I say this is...I've run the standard robots.txt many other XenForo sites use...and they don't seem to have a GSC "By Response" issue like my site does.

Thus I'm not 100% sure a changed robots.txt is going to be the single solution. I'm hoping it will be...but wondering if there is something else that should be investigated.

Thanks
 
JGaulard
  • #6
You have a few conflicting statements in your post. I have been thinking about it and it may be best for me to gather some information as opposed to parsing out what has gone on over the past year. If you wouldn't mind:

1. Log into your XenForo admin panel and go to Groups & Permissions > User Group Permissions > Unregistered/Unconfirmed and verify that:
- View member lists
- View member profiles
- View attachments to posts
are all set to No.

By having these set to No, anyone who's not logged in, including a search engine crawler, who tries to access either a member profile or an uploaded post image (by clicking a link or otherwise) will be prompted with a login page. That login page returns the 403 response code (otherwise known as 4XX in GSC). Also, have these settings changed over the past year? Have they always been No?

2. I will need to see the URLs that are being crawled and that show the 4XX response code. I need to know if they are member pages (they'll have /members/ in the URL), attachment pages (they'll have /attachments/ in the URL), or something else. We need to make sure that some other pages aren't being crawled.

Thanks.
 
Alfuzzy
  • #7
JGaulard said:
You have a few conflicting statements in your post.
Hopefully the data isn't conflicting (I'm just sharing the data I have). :) Maybe I didn't write things well...and it was confusing. ;)
JGaulard said:
1. Log into your XenForo admin panel and go to Groups & Permissions > User Group Permissions > Unregistered/Unconfirmed and verify that:
- View member lists
- View member profiles
- View attachments to posts
are all set to No.
Checked the XenForo admin panel:

* View member lists was set to "No".
* View Member Profiles was set to "No".
* View Attachments to posts was set to "Yes".
JGaulard said:
Also, have these settings changed over the past year? Have they always been No?
The settings mentioned above (no, no, yes)...as far as I can remember...have been set this way for the Unregistered/Unconfirmed user group since the site was migrated from vBulletin 4.2.5 to XF in Summer 2019. I'm guessing these must have been auto-set during the vB-to-XF migration.

Should I set the "View Attachments to posts" to "No"?
JGaulard said:
2. I will need to see the URLs that are being crawled and that show the 4XX response code. I need to know if they are member pages (they'll have /members/ in the URL), attachment pages (they'll have /attachments/ in the URL), or something else. We need to make sure that some other pages aren't being crawled.
This is a bit more complex...but I'll provide as much detail as possible. After looking at the individual URLs for "Other client error (4xx)"...there's not exactly a clear-cut answer.

First I should share the "Crawl requests: Other Client Error (4xx)" graph (kind of interesting):

[Image: Crawl requests chart for Other client error (4xx), through 1-21-22]

The area circled in red is when the revamped robots.txt was put in place (1-4-22)...and there have been multiple Google crawls since then. The last data point on the graph is 1-21-22. As can be seen...the graph is extremely flat now (compared to before).

I checked the first 50 URLs (the first 5 pages...there are many more) for "Other client error (4xx)":

** The URL crawl dates (for the first 5 pages) range from 12-31-21 to 1-9-22. Some of these dates are from when the old robots.txt was in place...and some are from after the new robots.txt was in place.

The URLs seem to fall into 3 categories:

1. Only 3 out of the first 50 URLs contained "member" in them.

2. 31 of 50 URLs look like this:

[Image: example URL from the second category]

If any of these 31 URLs are visited...this is displayed in the browser window:

[Image: error page shown in the browser]

3. 16 out of 50 URLs look like this:

[Image: example URL from the third category]

If any of these 16 URLs are visited...this is shown in the browser window (one of the XenForo "Oops" pages):

[Image: XenForo "Oops" error page]

As a reminder...some of the "Other client error (4xx)" URL dates are from when the old robots.txt was in place (30 of them)...and 20 are from after the new robots.txt was put in place.

All three categories of URLs mentioned above...are represented both before & after the new robots.txt was put in place on 1-4-22.

Please let me know what you think.

Thanks:)

p.s. None of the first 50 URLs I checked contained "attachments" in the URL. I actually checked the first 100 URLs for "attachment"...and found none.
 
JGaulard
  • #8
I'll respond to this post later on tonight or maybe tomorrow. I have to head out right now. I did want to ask you a few more questions first though. If you go to Setup > Options > Search Engine Optimization (SEO) in your admin area, which boxes are checked? Is the Use Full Friendly URLs box checked? Also, do you have any addons installed from outside vendors or from XenForo themselves (such as the Media Gallery or the Resource Manager)?
 
Alfuzzy
  • #9
JGaulard said:
I'll respond to this post later on tonight or maybe tomorrow. I have to head out right now. I did want to ask you a few more questions first though. If you go to Setup > Options > Search Engine Optimization (SEO) in your admin area, which boxes are checked? Is the Use Full Friendly URLs box checked?
Thanks for the help.

In the Search engine optimization (SEO) area...these are checked:

* Use full friendly URLs
* Include content title in URLs
* Convert URLs to page titles

JGaulard said:
Also, do you have any addons installed from outside vendors or from XenForo themselves (such as the Media Gallery or the Resource Manager)?
Looks like about 29 add-ons are installed. Many are simple uni-tasker add-ons (they just do one thing). Two of them are XenForo add-ons...both related to when the site was imported/migrated to XF (from vBulletin).

Thanks
 
JGaulard
  • #10
Alfuzzy said:
Should I set the "View Attachments to posts" to "No"?
Yes, you should set this value to No. Those attachment URLs are strange. You have the /attachments/ directory blocked in your robots.txt file now, but if the View Attachments value for guests is set to Yes, search engines can't crawl site images easily (for image search). It's weird - just the way XenForo is set up. Change that to No, so people will have to log in to see full-size images. That is, if you're having them uploaded to your server. Some folks use outside servers for this.

Also, I'm not sure what to tell you about the strange URLs that are being crawled on your site. You've got URLs from your prior software that I'm not familiar with, and you've also got 29 add-ons. I have no idea about any of these. They may be spitting out URLs like crazy. I keep just one to two add-ons on my sites, and I know that those create new files that Google just loves to crawl (CSS/XML resource files). I can't imagine what 29 add-ons are doing to Google.

I was also thinking about the pages that are being 301 redirected from your old software to the new. Let's say you have:

OLD MEMBER PAGE A ---> (redirected to) NEW MEMBER PAGE A

...and OLD MEMBER PAGE A isn't blocked in your robots.txt file. That's still going to redirect to the NEW MEMBER PAGE A. The new page will be blocked, but Google is still crawling the old pages all day long. Member profiles should surely be blocked. You may want to find the URLs of the old software's profile pages and block them too, just so Google doesn't continue to crawl them and waste crawl budget.
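
Just as a rough illustration (these paths are guesses - you'd want to confirm what your old vBulletin URLs actually looked like before copying anything), the robots.txt additions might be something like:

User-agent: *
Disallow: /forum/member.php
Disallow: /forum/members/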

I can see that you've got a very complex situation going on. All the 301 redirects, as well as those add-ons, are making things messy. I don't know how your site looks or operates, but you might want to consider removing some of those add-ons if they're not mission critical.
 
Alfuzzy
  • #11
JGaulard said:
Yes, you should set this value to No. Those attachment URLs are strange. You have the /attachments/ directory blocked in your robots.txt file now, but if the View Attachments value for guests is set to Yes, search engines can't crawl site images easily (for image search). It's weird - just the way XenForo is set up. Change that to No, so people will have to log in to see full-size images. That is, if you're having them uploaded to your server. Some folks use outside servers for this.
Good deal...will change this to "No".
JGaulard said:
I was also thinking about the pages that are being 301 redirected from your old software to the new. Let's say you have:

OLD MEMBER PAGE A ---> (redirected to) NEW MEMBER PAGE A

...and OLD MEMBER PAGE A isn't blocked in your robots.txt file. That's still going to redirect to the NEW MEMBER PAGE A. The new page will be blocked, but Google is still crawling the old pages all day long. Member profiles should surely be blocked. You may want to find the URLs of the old software's profile pages and block them too, just so Google doesn't continue to crawl them and waste crawl budget.
Glad you mentioned this...it made me think of something I hadn't thought about before. :)

Call me a "pack-rat";)...but when the site was migrated from vB to XF...we set things up so the original vB site was still fully operational (but located in a different directory on the server). The idea being if the migration to XF had some glitches...or if things got a bit disorganized...we had the old website to refer to...and make adjustments on the new XF site as necessary.

As an example:

* The new XF website resides in the root directory/URL = www.mysite.com
* The old vB site resides in the directory/URL = www.mysite.com/oldsite/

Some thoughts:

** Don't know if Google is crawling the old site or not...or if the old site is "hurting" the new site in some way if it is being crawled.
** The old site is not defined in Google Analytics or Google Search Console (don't know if this matters or not).
** If Google is still crawling the old vB site...the current robots.txt is not blocking anything for the old site.
** The old site is in "maintenance mode" (no old website forum nodes, categories, or threads can be seen/accessed). Not sure if this matters from a Google crawling perspective (if Google can still crawl the old website content when in maintenance mode).
** Not sure if this is important. I think the old site was still submitting sitemaps. I changed the "auto submit sitemap" to "No"...just in case it does matter. I think Google still crawls a site even if sitemaps are not submitted.

I guess the bottom line thoughts here are:

1. Is Google still crawling the old vB site...and if so...is this bad for the new XF site?
2. If the old vB site is in "maintenance mode"...does this block any Google crawling of website content?
3. Do I still want Google to crawl the old vB site?
4. If not...what should I add to the robots.txt to 100% block any crawling of the old vB site?

JGaulard said:
I can see that you've got a very complex situation going on. All the 301 redirects, as well as those add-ons, are making things messy. I don't know how your site looks or operates, but you might want to consider removing some of those add-ons if they're not mission critical.
It does seem like a complex situation. Not sure if the info mentioned above about the old vB site still being active has any significance on things.

29 add-ons does seem like a lot. Many of them are not mission critical...but many of them (in my opinion) add useful features to the site. As mentioned above...many of them are very simple.

For example...one add-on adds a small "OP" banner across the avatar of the thread's original poster. This helps site visitors identify who asked the original question...and to whom replies in the thread could/should be directed (very helpful, especially in long multi-page threads). Many of the 29 add-ons are very simple like this.

Thanks tons for the help!:)
 
JGaulard
  • #12
Alfuzzy said:
Call me a "pack-rat";)...but when the site was migrated from vB to XF...we set things up so the original vB site was still fully operational (but located in a different directory on the server). The idea being if the migration to XF had some glitches...or if things got a bit disorganized...we had the old website to refer to...and make adjustments on the new XF site as necessary.

As an example:

* The new XF website resides in the root directory/URL = www.mysite.com
* The old vB site resides in the directory/URL = www.mysite.com/oldsite/
Okay, I have a few questions about this. Was the original site moved to a different directory when you made the switch? Was the old site once in the root directory and then moved over to /oldsite/? This is important.

Alfuzzy said:
** Don't know if Google is crawling the old site or not...or if the old site is "hurting" the new site in some way if it is being crawled.
You need to look in your log files to see what's being crawled. You need to understand what URLs you'd like to be crawled on your new XenForo site and which URLs are to be redirected from your old site. Without having a clear picture of what's going on, you'll never figure out what to do.

Alfuzzy said:
** If Google is still crawling the old vB site...the current robots.txt is not blocking anything for the old site.
You need to be careful with this. The only thing Google should be crawling from the old site are valuable pages like the homepage, forum pages, and thread pages (pages that should be indexed). Everything else should be allowed to be crawled and return a 404 or should be blocked in the robots.txt file. Most people will say allow the old junk (cruft) to be crawled and allow it to fall out of the index naturally. I've found that this takes years and years. Sometimes it's faster to block it and have it drop from the index that way.

Alfuzzy said:
** The old site is in "maintenance mode" (no old website forum nodes, categories, or threads can be seen/accessed). Not sure if this matters from a Google crawling perspective (if Google can still crawl the old website content when in maintenance mode).
When the old site is in maintenance mode, are the redirects live? Use a header checker like this (https://www.webconfs.com/http-header-check.php) to check some URLs from the old site. They should be accessible and crawlable. That is, if you haven't moved the site into a new directory. If you did move the old site into a new directory, the point is moot. The old URLs are no longer where they were so there are none to crawl and redirect. All the old URLs will be returning 404 errors, unless I'm mistaken here.
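
If you'd rather check a batch of URLs at once than paste them into that page one at a time, here's a rough Python sketch that does the same header check (the URLs are hypothetical placeholders; it deliberately doesn't follow redirects, so a 301 shows up as a 301 along with its Location target):

import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Return None so the 301/302 itself surfaces instead of being followed.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)

urls = [
    "https://www.mysite.com/forum/showthread.php?t=12345",   # hypothetical old URL
    "https://www.mysite.com/oldsite/showthread.php?t=12345", # hypothetical moved copy
]

for url in urls:
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(req) as resp:
            print(url, "->", resp.status)
    except urllib.error.HTTPError as e:
        # Redirects and error codes both land here.
        print(url, "->", e.code, e.headers.get("Location", ""))
    except urllib.error.URLError as e:
        print(url, "-> request failed:", e.reason)

Anything that prints 301 plus a new-site URL is redirecting properly; a bare 403 or 404 is the kind of thing we're hunting for.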

Alfuzzy said:
** Not sure if this is important. I think the old site was still submitting sitemaps. I changed the "auto submit sitemap" to "No"...just in case it does matter. I think Google still crawls a site even if sitemaps are not submitted.
If the old site is in maintenance mode, all the pages are likely returning 401, 403, 404, or some 5XX response code. A sitemap wouldn't matter here, but this should be dealt with by someone who knows SEO.

Alfuzzy said:
It does seem like a complex situation. Not sure if the info mentioned above about the old vB site still being active has any significance on things.
If you had your old site set up for years and then moved everything around, it's having a big impact on the situation today. This needs to get squared away.
 
Alfuzzy
  • #13
I should probably make sure we don't get "lost in the weeds" with some of these details concerning the old site.:)

The site was migrated 2.5 years ago.

I'm guessing in the vast majority of forum migrations:

* The first step is to do a full site backup (of the old site).
* Then the site is migrated from the old software to the new software.
* Once the migration is complete...testing & review of the new site is performed.
* When everything checks out...the old site is immediately taken down...and the old site's data is purged from the server.
* All that remains on the server is the new migrated site...possibly with some redirects for old site URLs in the .htaccess file.

Maybe (in some migrations) the old site is left running (in a different server directory)...for a small amount of time (let's say less than a month)...just to be sure the new migrated site is running fine. But in probably 95+% of the situations...the old site is removed from the server in a short period of time (if not immediately in many cases).

Thus (in my case)...if the site was migrated 2.5 years ago...making the old site "invisible" to the Google crawler shouldn't be an issue...since most folks would have purged the old site long long ago.

I'm not an IT or a forum migration expert...thus if I said something that's not mostly correct...of course please correct me.:)

If what I said regarding typical forum migrations (and purging of the old site from a server either immediately or in a short time)...is true...then we probably don't need to be concerned about any relationship between the old site & new site (at this point 2.5 years later). Since in most "normal" site migrations the old site would have been purged long ago.

If the old site is still up & running (like it is in my case)...then maybe the one thing to ensure is that the old site is 100% invisible to any internet crawlers (Google & otherwise). :)

Please let me know if this makes sense.

Thanks:)
 
JGaulard
  • #14
Hi Nick - Yes, you are correct, and let me say that you handled your post very diplomatically. I notice these things, so thank you. For some reason, I was thinking that the old software is still humming away somewhere and everything from that is being forwarded to the new site. Obviously (now that I've come to my senses), the old one is gone and the new one is here. All the redirects are in your .htaccess file.

I guess the thing you'll need to do is find out what's being redirected. I would hate to see random and unnecessary files and pages being redirected because that's going to cause a lot of crawling you don't need. For instance, you'll want to find out what the directory structure was for your old member pages. If it's something like the new /members/ directory, you can add it to the robots.txt file as well. Also, are images being redirected? I'd like to know what that directory (URL) structure looks like as well.

With XenForo, images are strange. There's the actual image location, such as /data/attachments/image.jpg, but when it comes to clicking on a thumbnail or full-sized image, it looks like /attachments/image-1234-jpeg-4950/. That URL is treated as a unique page, which doesn't need to be crawled either. All you need is for the actual image file itself to be crawled, and it will be, even when the /attachments/ directory is blocked. So that's why I say to look in your log files to see what's being forwarded. Strange URLs with question marks should be noted.
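
Spelled out as robots.txt directives, the attachments idea might look something like this (the /data/ path is just the example location from above - check where your install actually serves image files from):

User-agent: *
Disallow: /attachments/

That blocks the thin attachment pages while the real image files (e.g. under /data/attachments/) stay crawlable for image search.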
 
Alfuzzy
  • #15
I know sometimes when we/folks get involved in technical discussions...it can be easy to get "lost in the weeds".;)

I was thinking if the old site was purged from the server 2+ years ago after it was migrated (as it would have been in many cases)...many of the questions I posed earlier would be unnecessary (since the old site would be gone).:)

JGaulard said:
I guess the thing you'll need to do is find out what's being redirected. I would hate to see random and unnecessary files and pages being redirected because that's going to cause a lot of crawling you don't need. For instance, you'll want to find out what the directory structure was for your old member pages. If it's something like the new /members/ directory, you can add it to the robots.txt file as well. Also, are images being redirected? I'd like to know what that directory (URL) structure looks like as well. With XenForo, images are strange.
If we pretend that the old site does not exist (as it wouldn't in many cases)...what do I need to do in the robots.txt file to make the old site 100% "invisible" to any crawlers (Google & otherwise)?

If for example the old site resided at the URL = www.mysite.com/oldsite/

What would I need to add to my robots.txt so no crawlers see it/crawl it (block everything on the old site from crawlers as if it wasn't even on the server)?
JGaulard said:
So that's why I say to look in your log files to see what's being forwarded. Strange URLs with question marks should be noted.
I think you're a lot more comfortable looking at the log files than me. ;) I've only done it a couple of times...and in each case I needed the help of the folks at my hosting company to do the exact searches (using "grep" commands to narrow the searches to what I was looking for). The host company folks then saved the results in a file for me...and I downloaded the file for examination.

If the searches aren't done to narrow things down...there's just too much "extra" stuff to sift thru. Of course, maybe sifting thru the raw log files is better than nothing...if that's easier (for me). :)

Thanks:)
 
JGaulard
  • #16
Alfuzzy said:
I was thinking if the old site was purged from the server 2+ years ago after it was migrated (as it would have been in many cases)...many of the questions I posed earlier would be unnecessary (since the old site would be gone).
The thing is, those old files last forever. I have had sites that I've moved from one CMS to another and three years later Google is still hitting those old pages like crazy. Google never lets go. It does slow down after a while, but depending on the size of the old site, it can take a very long time. That's why it's important to keep 301 redirects from the old important pages to the new forever.

Also, I'm almost certain that you have many pages from the old site that only get crawled every few years. I've seen this a lot. I've had pages that haven't been crawled in 3+ years, but were still in the index and very visible. It's crazy how long Google's memory is. While many pages on a website get crawled frequently, so many don't. Again, it depends on the size of your site. If you've got thousands and thousands of pages, I can almost guarantee that a good portion of them don't get crawled more than once per year.

Alfuzzy said:
If we pretend that the old site does not exist (as it wouldn't in many cases)...what do I need to do in the robots.txt file to make the old site 100% "invisible" to any crawlers (Google & otherwise)?

If for example the old site resided at the URL = www.mysite.com/oldsite/

What would I need to add to my robots.txt so no crawlers see it/crawl it (block everything on the old site from crawlers as if it wasn't even on the server)?
This is not something you want to do. It's important to distinguish between necessary 301 redirects and feature 301 redirects. The redirects you have from your old site to your new one are necessary. The old URLs hold pagerank and a history within Google. If you were to remove them, your rankings would plummet. The redirects that XenForo creates when someone creates a new thread or post are feature redirects. Normal links can lead a search engine crawler to a thread just fine; XenForo adds all of its redirects just to make things more exciting and helpful for the users of the website. It's okay to block these types of redirects, which you are already doing.
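
To make the distinction concrete, blocking the feature redirects usually amounts to robots.txt lines like these (/posts/ and /goto/ are just commonly cited XenForo examples - double-check them against your own URL structure):

User-agent: *
Disallow: /posts/
Disallow: /goto/

The old vBulletin URLs that 301 to your current threads are the opposite case - leave those crawlable.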

Alfuzzy said:
I think you're a lot more comfortable looking at the log files than me. ;) I've only done it a couple of times...and in each case I needed the help of the folks at my hosting company to do the exact searches (using "grep" commands to narrow the searches to what I was looking for). The host company folks then saved the results in a file for me...and I downloaded the file for examination.
Another way to see what Google is crawling is to simply look through the data from the tables in the screenshots you added to your post above. You'll find the same information there. You'll see what Google is crawling and what header response code it's receiving.
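
And if you do end up back in the raw logs, a short script can stand in for those grep commands. A rough sketch, assuming an Apache-style combined log at a placeholder path (your host will know the real one):

import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # placeholder - ask your host

# Pulls the request path and status code out of a combined-format log line.
pattern = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})')

counts = Counter()
example = {}

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:  # crude user-agent filter
            continue
        match = pattern.search(line)
        if not match:
            continue
        status = match.group("status")
        counts[status] += 1
        example.setdefault(status, match.group("path"))

for status, hits in counts.most_common():
    print(f"{status}: {hits} hits (e.g. {example[status]})")

It just counts the response codes Googlebot received and keeps one example path per code, which is usually enough to spot whether the 4XXs are member pages, attachment pages, or something else.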

A good strategy might be to keep your eye on those tables. If you see URLs come through in the 4XX one, copy them and paste them into a browser. See what they lead to. If one is a member profile page, record that. The same is true for the 404 URLs and 301 URLs. You want to keep the old homepage, forum pages, and thread pages flowing, but block anything else. Once you gather this data over a few weeks, bring it back here and we can discuss what to do with it. Perhaps we'll decide to let some of it die and continue responding with 404 headers, and perhaps we'll decide that it should be blocked in the robots.txt file. But you definitely don't want to block most of the 301 redirects from the old site. You likely won't recover from that.
 
Alfuzzy
  • #17
I think (unless I'm missing something)...one important point being forgotten is that normally the old site would have been taken down/deleted from the server over 2 years ago.

Technically (I think) I should be able to delete the whole directory the old site resides in from the server today...and it shouldn't be a problem.

In probably 95+% of the cases when a forum website is migrated from one software product to a different software product...the old site normally would be removed from the server either immediately...or in a short period of time (30 days or less).

Everything being discussed about "don't do this" or "don't do that"...I'm not sure is important since the old site normally wouldn't be there after a site is migrated.

It just wouldn't be there for Google to crawl.:)
 
Alfuzzy
  • #18
I should add this information as well...maybe this will help.:)

JGaulard said:
When the old site is in maintenance mode, are the redirects live? Use a header checker like this (https://www.webconfs.com/http-header-check.php) to check some URLs from the old site. They should be accessible and crawlable. That is, if you haven't moved the site into a new directory. If you did move the old site into a new directory, the point is moot. The old URLs are no longer where they were so there are none to crawl and redirect. All the old URLs will be returning 404 errors, unless I'm mistaken here.

The directory where the old site exists now...is different (different name) than what it was when the old site was the active site.

I tested the old forum site with the header checker provided above (thanks):

* I tested about 10 URLs from the old site (home page and 9 other random thread URLs).
* I tested the site while it was in maintenance mode.
* I tested the site when not in maintenance mode (active mode).

With all the URLs tested (maintenance mode & non-maintenance mode)...the header checker website returned 403 errors for every one of them.

In the search for why the Google Search Console "By Response" statistics return such a high "Other client error (4xx)" value:

[Image: By Response data, January 23, 2022]

Maybe the old site returning 403 errors for every URL tested is the reason?

Thanks
 
JGaulard
  • #19
Alfuzzy said:
I think (unless I'm missing something)...one important point being forgotten is that normally the old site would have been taken down/deleted from the server over 2 years ago.

Technically (I think) I should be able to delete the whole directory the old site resides in from the server today...and it shouldn't be a problem.

In probably 95+% of the cases when a forum website is migrated from one software product to a different software product...the old site normally would be removed from the server either immediately...or in a short period of time (30 days or less).

Everything being discussed about "don't do this" or "don't do that"...I'm not sure is important since the old site normally wouldn't be there after a site is migrated.

It just wouldn't be there for Google to crawl.:)
I hope we're not talking about two different things here. I tend to get lost in this stuff sometimes. Anyway, I just want to throw this out there: just because the old site has been removed doesn't mean that Google will stop crawling it. If you have a website that's got 1,000,000 pages and then physically delete it and create something new in a new directory, Google has no way of knowing the old one's been deleted until it crawls each and every last page it has in its index. That can take years. And even after that, it will continue to crawl those URLs forever and see either 404 or 301 responses. I'm working on a blog right now that I launched in 2004, and I'm seeing pages from that era being crawled. I deleted the blog and then resurrected it a few times. So, what I'm trying to say is that old websites on the same domain that have been removed can consume tons of crawl budget. The URLs that are being crawled need to be dealt with one way or another. 301 redirects are preferable.
 
JGaulard
  • #20
Alfuzzy said:
The directory where the old site exists now...is different (different name) than what it was when the old site was the active site.

I tested the old forum site with the header checker provided above (thanks):

* I tested about 10 URLs from the old site (home page and 9 other random thread URLs).
* I tested the site while it was in maintenance mode.
* I tested the site when not in maintenance mode (active mode).

With all the URLs tested (maintenance mode & non-maintenance mode)...the header checker website returned 403 errors for every one of them.
Okay, I think we're getting somewhere. You moved the old site into a new directory. That new directory can be blocked in the robots.txt. It would look like this:

User-agent: *
Disallow: /new-directory/

That will block everything inside of the moved website, which is what you want to do because it's not meant to be crawled or shown in the search results. And if sitemaps have already been submitted from this new directory, you'll want to turn them off too. Yes, you certainly want to block this one.

I'm not sure what you mean in the second part of your post. Are you saying that instead of the old URLs redirecting to the new XenForo URLs, they're actually showing 403 errors instead? It's fine if the new directory is showing those 403s, but if the original site's URLs are showing 403s as opposed to redirecting with 301s, that's a problem.

---

Okay, I just reread what you wrote. Obviously, since the old site is no longer live in its old directory (the root has been taken over by the new site), you can't switch that one into maintenance mode. So I think blocking the new directory will cut down on the 403s tremendously. That new directory should never have been seen by Google in the first place. It's like you've got sister sites showing. One that's good and crawlable and one that's submitting sitemaps, yet has tons of pages that only show 403s. I think I remember you saying it was submitting sitemaps...
 
Alfuzzy
  • #21
JGaulard said:
Okay, I think we're getting somewhere. You moved the old site into a new directory. That new directory can be blocked in the robots.txt. It would look like this:

User-agent: *
Disallow: /new-directory/
Thanks. Yes, this is what I'm thinking I want/need to do (block the entire directory the old site resides in). :)

I did do some internet searching earlier today...and that's the same robots.txt addition I was finding. Use the asterisk to block all crawlers...or use Googlebot instead of the asterisk if you only want to block Google. Of course, I probably want to block all crawlers.
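
So, if I'm reading it right, the Google-only variant would just be this (using the same /oldsite/ placeholder):

User-agent: Googlebot
Disallow: /oldsite/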

JGaulard said:
I'm not sure what you mean in the second part of your post. Are you saying that instead of the old URLs redirecting to the new XenForo URLs, they're actually showing 403 errors instead? It's fine if the new directory is showing those 403s, but if the original site's URLs are showing 403s as opposed to redirecting with 301s, that's a problem.
My understanding is/was that when the new XF site was set up (and the data imported from the vB site)...rewrite rules and rewrite conditions were set up in the .htaccess file (old site URLs to new site URLs). And then the old site could be totally deleted from the server.

Additionally...there is a special add-on product supplied by XenForo that handles redirects for forums migrated from vB to XF. This was installed & activated on my site. I guess the URL formats for each software package are different enough to require this special add-on:

[Image: XenForo Redirects for vBulletin add-on]

https://xenforo.com/community/resources/xenforo-redirects-for-vbulletin.6123/

As for the header checker site: when I enter any of the URLs from the old site into it...all of them come back with a 403 error.

Again...since in most cases the former/old version of a website would have been deleted from the server once it was migrated to the new software (XF in this case)...the old site wouldn't be there for Google to scan any longer after the migration (which is why I think I want to make sure Google can no longer crawl it).
JGaulard said:
It's like you've got sister sites showing. One that's good and crawlable and one that's submitting sitemaps, yet has tons of pages that only show 403s. I think I remember you saying it was submitting sitemaps...
Yes, this is an excellent way of putting it: "sister sites". Maybe even more accurate would be "twin sites" or "clone sites". :)

The old vB site and the new XF site...would have been exactly the same in Summer 2019 (the same forum content when the site was migrated to XF). Of course, since that time the new XF site has continued to build new content...and the old vB site has no new member-contributed content newer than Summer 2019.

Both sites (old & new) are complete & intact (they just reside in different server directories). Both sites have fully operational & separate databases. Which unfortunately doubles the amount of space needed on the server...and doubles the size of backups.

Just today I discovered that the old site was still receiving new RSS feed content from RSS feeds I have/had set up. And if the old site was still submitting sitemaps...then I guess to the crawlers...this might make the old site look like it's still active.

The main thing is I don't want this "sister site" situation to continue...which is why I want to totally block crawlers from "seeing"/crawling the old vB site...and will adjust the robots.txt accordingly (and make sure no sitemaps are sent from the old vB site).

I am hoping that if the old site is blocked from Google's crawler...this will greatly reduce the "By Response: Other Client Error (4xx)" statistic in Google Search Console. Up to this point, all the ideas/suggestions I've received to solve this problem have been unsuccessful.

This is the first time the idea came to me that the old vB site being on the server could be the issue...and the directory it's in not being blocked from crawlers in the robots.txt file.

Hopefully this explains things better...in case there was some earlier confusion.

Thanks:)
 
JGaulard
  • #22
Hi - It sounds like you're getting things figured out. I think the confusion on my part was stemming from the terminology we have been using. When you say "old site," it can mean "old site, old directory" or "old site, new directory." So when you say the old site is now showing all 403 response codes, do you mean the "old site, old directory" redirects are showing 403s as opposed to 301s? Or do you mean the "old site, new directory" (the one that's been moved and that's not supposed to be seen by anyone) is showing these 403s?

From what you've written above, I'm assuming it's the "old site, new directory" that's the problem. If so, why not just get rid of it in its entirety? I think you said that you kept it in case you needed it for something, but since so much time has passed, can you now delete both the files and the database? That would solve all of this. As long as those 301 redirects are working properly in your .htaccess files.

Also, you don't need to add the asterisk line if you want to block the directory I mentioned above. You can simply add the Disallow line to the list of blocks that you've already put in there.
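
In other words, something like this (the first two Disallow lines are only hypothetical stand-ins for whatever your file already blocks):

User-agent: *
Disallow: /whats-new/
Disallow: /attachments/
Disallow: /oldsite/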

I'm wondering how much Google has been crawling this sister site of yours. If it's been hitting it hard and all it's been getting are 403 pages, yes, from what I've seen, that will definitely affect your crawl rate and the quality of your site in Google's eyes (in my humble opinion). Let's just say it can't be good, so I'm happy we're getting to the bottom of this issue.
 
Alfuzzy
  • #23
  • #23
JGaulard said:
Hi - It sounds like you're getting things figured out.
Gotta keep trying...maybe eventually I'll strike gold! :)

I've got to trust that the person who did the migration did everything properly. He said he'd done hundreds of XenForo migrations (many of them vB to XF). Probably the biggest difference was my request to keep the old website up & running (which is why it was put into a different server directory).

I'm thinking at that time nothing was said (it got overlooked) about blocking the former vB site's directory in the robots.txt to prevent Google from continuing to crawl it (normally the old site would have been deleted from the server...and this sort of wrinkle would never have been a concern).

JGaulard said:
I think the confusion on my part was stemming from the terminology we have been using. When you say "old site," it can mean "old site, old directory" or "old site, new directory."
* When I say "old site"...I'm talking about the original vBulletin site (intact)...but in a different sub-directory than it originally resided.

* When I say "new site"...I'm talking about the XF site that was created (with the imported vBulletin site data/content). And now resides in the root directory of the server.
JGaulard said:
So when you say the old site is now showing all 403 response codes, do you mean the "old site, old directory" redirects are showing 403s as opposed to 301s? Or do you mean the "old site, new directory" (the one that's been moved and that's not supposed to be seen by anyone) is showing these 403s?
* The new XF site (with the imported data from the vB site)...resides in the root directory of the server...www.mysite.com
* The former website running vBulletin...now resides in the subdirectory...www.mysite.com/oldsite/.

This /oldsite/ directory...is a different directory than the one the vB site used to reside in when the vB site was the ONLY site.

The former vB site (which resides in the www.mysite.com/oldsite/ directory)...it's these URLs that I placed in the header checker...and I got 403 errors for all of the vB site URLs I tested.

To clarify further:

* New XF site resides in the root directory...www.mysite.com
* The directory the former vB site resides in now is...www.mysite.com/oldsite/
* The directory the former vB site USED TO reside in when it was the "live" and ONLY site was...www.mysite.com/forum/

As can be seen...the former vB site now resides in a different sub-directory than it used to (when it was the "live" and only site).
JGaulard said:
From what you've written above, I'm assuming it's the "old site, new directory" that's the problem. If so, why not just get rid of it in its entirety? I think you said that you kept it in case you needed it for something, but since so much time has passed, can you now delete both the files and the database? That would solve all of this.
Yes, I kept the former vB website up & running just in case I needed to refer to it if I ran into some issues with the new XF site. Like I mentioned earlier...call me a "pack rat"...I hate to get rid of it. "Murphy's Law" would more than likely kick in 1 week after I deleted the old website. ;)

I figure if (now) I can block Google from crawling the directory the former vB website resides in...that will be good enough to stop any possible Google crawling of it, now & in the future. Then when I'm finally comfortable enough to delete the former vB site's directory...I'll delete it. :)

JGaulard said:
As long as those 301 redirects are working properly in your .htaccess files.
This is probably getting beyond my level of expertise. I find the .htaccess file a place I shouldn't mess with; if something in it is changed incorrectly...serious things can break!

I did review the current .htaccess file...and the person who set things up placed a lot of comments in there (I believe comments start with the...#...character).

I'm also not 100% sure what a 301 redirect in an .htaccess file should look like. But the only line in my .htaccess file that does not start with the # comment character...and does have...301...in it...is this one line:

RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
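
(From some searching, that looks like the classic force-HTTPS rule rather than a vB-to-XF content redirect; in full form it's usually paired with a condition, something like this generic sketch - not necessarily exactly what's in my file:)

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]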

If you were expecting more...or something different...maybe this is where the special XenForo add-on I mentioned above takes over:

[Image: XenForo Redirects for vBulletin add-on]

JGaulard said:
Also, you don't need to add the asterisk line if you want to block the directory I mentioned above. You can simply add the Disallow line to the list of blocks that you've already put in there.
I think I'll add what you mentioned above:

User-agent: *
Disallow: /new-directory/

I think I understand it better now...and it matches what I found searching the internet earlier today. :)

Do I need to add these 2 lines anywhere special in the robots.txt...so it functions properly?
JGaulard said:
I'm wondering how much Google has been crawling this sister site of yours. If it's been hitting it hard and all it's been getting are 403 pages, yes, from what I've seen, that will definitely affect your crawl rate and the quality of your site in Google's eyes (in my humble opinion). Let's just say it can't be good, so I'm happy we're getting to the bottom of this issue.
I've been wondering this as well. Any forum content from before the Summer 2019 migration...I'm thinking would be seen by the Google crawler as "duplicate content". All forum threads pre-Summer 2019 (vB site and XF site)...would be EXACTLY the same. I'm thinking the Google crawler would not like this! ;)

Plus, since the former vB site would be seeing much less new content (as compared to the XF site)...I think if a site is not generating much new content...Google's crawler crawls it less.

But then again (not 100% sure about this)...Google's crawler may see both sites (vB & XF) as two parts of a single site. And maybe all of the Google crawl data for everything goes into "one bucket"...and this is why the "By Response: Other Client Error (4xx)" statistic in Google Search Console is so messed up (all the 403 errors from the vB site are messing things up for the XF site).

Also...if the former vB site is "gobbling up" a bunch of the website's crawl budget...this could explain why some/many of the good content threads on the XF site are not getting crawled & indexed properly (showing up as "Valid URLs" in Google Search Console).

And it may explain why, in Google Analytics, the #2 most visited forum page on the site is from 2010 (this doesn't make sense). Content from 2010 would be totally irrelevant in 2022...and I doubt many folks in 2022 are "Googling" for information from 2010 (and that it would be the #2 most visited page on the site)! Lol

Thanks!:)
 
JGaulard
  • #24
Alfuzzy said:
This is probably getting beyond my level of expertise. I find the .htaccess file a place I shouldn't mess with; if something in it is changed incorrectly...serious things can break!
There's no reason you would need to mess with the .htaccess file or the redirects. What I wouldn't mind seeing is those redirects in action though. You can use the header checker I sent over earlier to test some URLs. If you can find (or remember) some URLs from your original site when it was in the /forum/ folder, you can plug them into the header checker to make sure the 301s are working properly. When you plug the URL into the checker, it should return the 301 status code and also tell you which new URL the old one is redirecting to. It's a very useful tool to test things like this.

Alfuzzy said:
I've been wondering this as well. Any forum content from before the Summer 2019 migration...I'm thinking would be seen by the Google crawler as "duplicate content". All forum threads pre-Summer 2019 (vB site and XF site)...would be EXACTLY the same. I'm thinking the Google crawler would not like this! ;)
This is why you have the 301 redirects in place. When an old pre-summer 2019 URL is visited by Google, Google will automatically be redirected to the new XenForo matching URL, so Google won't think the two URLs are duplicate. That's what 301 redirects are for.

Also, even if Google is attempting to crawl the directory you moved the old software to (/oldsite/), you mentioned that all those URLs are returning 403 status codes, no matter what you do. If that's the case, Google isn't actually scanning the pages. To Google, 403 pages are dead and they have no content. They are the equivalent of 404 pages. So there's no duplicate content on that side either.
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #25
JGaulard said:
There's no reason you would need to mess with the .htaccess file or the redirects.
I mentioned that in case you were expecting something to be in the .htaccess file...and it wasn't there. I definitely wasn't going to mess with anything in the .htaccess (unless absolutely necessary & with detailed instructions on what to do)! Lol
JGaulard said:
What I wouldn't mind seeing is those redirects in action though. You can use the header checker I sent over earlier to test some URLs. If you can find (or remember) some URLs from your original site when it was in the /forum/ folder, you can plug them into the header checker to make sure the 301s are working properly. When you plug the URL into the checker, it should return the 301 status code and also tell you which new URL the old one is redirecting to. It's a very useful tool to test things like this.
Good news. If I take any random URL from the former vB site (and insert the proper sub-directory name that the vB site used when it was the "live" site)...I do get a 301 redirect response from the header checker website (the blocked URL in the image displays the proper URL format for the current "live" XF site).:)

Header Check 1.png
If I take the same URL and enter it into the header checker website...and this URL has the sub-directory name it resides in today (different sub-directory than it used to be previously)...I get a 403 error (as I mentioned earlier in this thread):

Header Check 2.png
This of course is good news (the 301 redirects are working properly)...and it's these 403 errors which I'm trying to get rid of (hopefully)...by updating the robots.txt file so that crawlers can't "see" the sub-directory the former vB site resides in today.
JGaulard said:
Also, even if Google is attempting to crawl the directory you moved the old software to (/oldsite/), you mentioned that all those URLs are returning 403 status codes, no matter what you do. If that's the case, Google isn't actually scanning the pages. To Google, 403 pages are dead and they have no content. They are the equivalent of 404 pages. So there's no duplicate content on that side either.
Very true what you said about "no duplicate content".:)

I do think there's still the possibility that the 403 errors are a problem if Google is crawling the former vB site (in the different sub-directory it resides in today...compared to where it used to be). Seems like all these 403 errors are about the only explanation for the large amount of "Other client error (4xx)" in the Google Search Console report (below).

If I'm correct...the vast majority of the "Other client error (4xx)" in this GSC report have got to be coming from the 403 responses Google gets when crawling the former vB site's server sub-directory. There's just no way the current "live" XenForo site is returning this many 403 errors.

12-21-21 copy.png

Thanks:)
 
JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
  • #26
Two things:

Alfuzzy said:
Good news. If I take any random URL from the former vB site (and insert the proper sub-directory name that the vB site used when it was the "live" site)...I do get a 301 redirect response from the header checker website (the blocked URL in the image displays the proper URL format for the current "live" XF site):
This is great. One thing though. Your old vB site is likely 301 redirecting member pages (the old ones are stored in Google's memory). Those member pages are now blocked on your XenForo site. Each member page that's being redirected is using crawl budget because of that hop from the old site to the new site. It would probably be good to identify the old member page directory and block that as well, so no redirect even takes place. It'll still be there for users, but search engines won't be able to crawl it.

Alfuzzy said:
If I'm correct...the vast majority of the "Other client error (4xx)" in this GSC report have got to be coming from the 403 responses Google gets when crawling the former vB site's server sub-directory. There's just no way the current "live" XenForo site is returning this many 403 errors.
Why not just click on that row of the report in GSC? If you do that, you'll see a list of the URLs that are giving 403 responses.
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #27
JGaulard said:
This is great. One thing though. Your old vB site is likely 301 redirecting member pages (the old ones are stored in Google's memory). Those member pages are now blocked on your XenForo site. Each member page that's being redirected is using crawl budget because of that hop from the old site to the new site. It would probably be good to identify the old member page directory and block that as well, so no redirect even takes place. It'll still be there for users, but search engines won't be able to crawl it.
If I add the two lines we discussed earlier in the thread...won't that take care of this?

My understanding is/was if these 2 lines are added to the robots.txt...it will take care of it (block everything in that directory):

User-agent: *
Disallow: /new-directory/

JGaulard said:
Why not just click on that row of the report in GSC? If you do that, you'll see a list of the URLs that are giving 403 responses.

Yes, I did this & we discussed it earlier in the thread (post #7 above)...but we really didn't come up with an answer/solution.

Unfortunately when I do this...the URLs from that line of the report ("Other Client error (4xx)") do not match the URLs from the server directory the former vB site resides in now.

In post #7 above...I listed the 3 categories these "Other Client error (4xx)" URLs fall into...and they look nothing like the URLs from the former vB site directory.:(
 
JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
  • #28
Alfuzzy said:
If I add the two lines we discussed earlier in the thread...won't that take care of this?

My understanding is/was if these 2 lines are added to the robots.txt...it will take care of it (block everything in that directory):

User-agent: *
Disallow: /new-directory/
I think this is where we're having a sticking point. Unless I'm missing something. I'll try to explain and please let me know if I'm repeating myself.

The disallow: /new-directory/ will block Google from crawling anything in that directory. So yes, you are correct in that they won't be crawling those 403s any longer. So that directory is pretty much out of the picture now. What I'm referring to is Google's attempted crawling of the pages that once resided in the /forum/ directory. Those old URLs will be crawled for eternity, even though you deleted the actual pages. Even if you deleted everything in that /forum/ directory, Google will still crawl the URLs it once crawled. Even if you deleted everything in that directory and abandoned the website for 10 years and someone from Egypt bought the domain name and set up an entirely new site, Google would still crawl the URLs it once knew about (to your old pages). Google never stops crawling the URLs it has once crawled.

So my point is, if you moved your current XenForo website installation to a new directory called /newxenforowebsite/ tomorrow and deleted every file from the current root directory, Google would continue to crawl all the URLs it's learned about in the root directory forever. There's no way to get rid of them. There's no way to get Google to "un-know" about those old URLs. It's like forum member accounts that have been deleted. Google doesn't care that the account has been removed. It'll still try to crawl those accounts, just in case they return some day.

Also, I am referring to (or am trying to refer to) "URLs" as the things you type into the address bar in your browser and "pages" as things that you physically have that construct your website. Every page will have a URL attached to it that points to the page.

I hope I didn't just over-explain this, but it's important. And I really hope you didn't already know all this and we're just confusing words and ideas.
Alfuzzy said:
Yes, I did this & we discussed it earlier in the thread (post #7 above)...but we really didn't come up with an answer/solution.

Unfortunately when I do this...the URLs from that line of the report ("Other Client error (4xx)") do not match the URLs from the server directory the former vB site resides in now.

In post #7 above...I listed the 3 categories these "Other Client error (4xx)" URLs fall into...and they look nothing like the URLs from the former vB site directory.
Okay, then there's a problem. If this is the case, then it's not your old site that's causing the 403s. At least not in that part of the report. You'll need to figure out where this /index.php?sam-item stuff is coming from. My guess is that it's stemming from one or more of the add-ons that you've got installed. Add-ons create a lot of resource files that aren't "linked" to from anywhere normally visible on the site. They're in the code of the site. If you open up a few of your pages and right-click on the page somewhere and choose "View Page Source" from the menu, or whatever the option is, you'll see your page source code. Then, press CTRL+F on your keyboard and search for:

index.php?sam-item

You should see some files in there that use this file structure. You'll need to track down all the URLs this way. Even though your add-ons are used for simple tasks, they may be creating lots of things for Google to follow. Why they're returning 403 responses, I don't know.
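If clicking through pages one at a time gets tedious, the same search can be scripted. Here's a rough sketch in Python using the requests library; the page URLs below are placeholders, so swap in real pages from the site:

import requests

snippet = "index.php?sam-item"

# Hypothetical pages to scan; use real URLs from the live site.
pages = [
    "https://example.com/",
    "https://example.com/threads/some-thread.123/",
]

for page in pages:
    html = requests.get(page, timeout=10).text
    # Count raw occurrences of the snippet in each page's source code.
    print(page, "->", html.count(snippet), "occurrence(s) of", snippet)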

Also, how many 403 response code URLs are we talking here? I think I only saw percentages. Are there thousands and those sam-item ones are most of them?
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #29
JGaulard said:
I think this is where we're having a sticking point. Unless I'm missing something. I'll try to explain and please let me know if I'm repeating myself.

The disallow: /new-directory/ will block Google from crawling anything in that directory. So yes, you are correct in that they won't be crawling those 403s any longer. So that directory is pretty much out of the picture now. What I'm referring to is Google's attempted crawling of the pages that once resided in the /forum/ directory. Those old URLs will be crawled for eternity, even though you deleted the actual pages. Even if you deleted everything in that /forum/ directory, Google will still crawl the URLs it once crawled. Even if you deleted everything in that directory and abandoned the website for 10 years and someone from Egypt bought the domain name and set up an entirely new site, Google would still crawl the URLs it once knew about (to your old pages). Google never stops crawling the URLs it has once crawled.

So my point is, if you moved your current XenForo website installation to a new directory called /newxenforowebsite/ tomorrow and deleted every file from the current root directory, Google would continue to crawl all the URLs it's learned about in the root directory forever. There's no way to get rid of them. There's no way to get Google to "un-know" about those old URLs. It's like forum member accounts that have been deleted. Google doesn't care that the account has been removed. It'll still try to crawl those accounts, just in case they return some day.

Also, I am referring to (or am trying to refer to) "URLs" as the things you type into the address bar in your browser and "pages" as things that you physically have that construct your website. Every page will have a URL attached to it that points to the page.

I hope I didn't just over-explain this, but it's important. And I really hope you didn't already know all this and we're just confusing words and ideas.
I think I mostly follow what you're saying.:) Where I'm maybe still a little lost is the part where you say...

"Those old URLs will be crawled for eternity, even though you deleted the actual pages. Even if you deleted everything in that /forum/ directory, Google will still crawl the URLs it once crawled."

If I (or anyone) deleted all traces of something in a directory (so the information no longer resides on the server)...and Google continues to crawl the URLs like you said...where or what is Google crawling if the actual information no longer appears on the server?

If this is true (the website owner deleted the info...but Google continues to crawl it)...the website owner really has no control over this. The Google crawler will do what it does.

I guess my understanding is/was...if data/URLs no longer actually exist on a website owner's server (because the website owner deleted them)...then what is Google crawling in future crawls if they're not on the server anymore?

I hope I understand the process:

* Google crawls a website.
* Discovers what content exists there via the crawl.
* Then Google at some point indexes these URLs and declares them "Valid" URLs (which we see in Google Search Console).

Now if a website owner decides to delete a bunch of website pages (or changes the URL structure of the pages & doesn't put the correct 301 redirects in place)...when Google does future crawls, it won't be able to find this information anymore (because it was deleted or moved & Google wasn't told where it is now via 301 redirects).

If Google tries & fails to crawl these URLs in future crawls (and doesn't find them)...then eventually Google will de-index these pages...which is not good for a website (unless the website owner wanted these pages to no longer appear).

Bottom line (I think): whether I modify the robots.txt to block Google from crawling the former vB site directory, or delete all the info in this directory, if Google continues to crawl these URLs (as you mentioned)...there's really nothing a website owner can do about it (I think).
JGaulard said:
Okay, then there's a problem. If this is the case, then it's not your old site that's causing the 403s. At least not in that part of the report. You'll need to figure out where this /index.php?sam-item stuff is coming from.
Yes I agree.:)
JGaulard said:
My guess is that it's stemming from one or more of the add-ons that you've got installed. Add-ons create a lot of resource files that aren't "linked" to from anywhere normally visible on the site. They're in the code of the site. If you open up a few of your pages and right-click on the page somewhere and choose "View Page Source" from the menu, or whatever the option is, you'll see your page source code. Then, press CTRL+F on your keyboard and search for:

index.php?sam-item

You should see some files in there that use this file structure. You'll need to track down all the URLs this way. Even though your add-ons are used for simple tasks, they may be creating lots of things for Google to follow. Why they're returning 403 responses, I don't know.
I can give this a try...but I'm pretty sure the high percentage of "Other Client Error (4xx)" in GSC existed when far fewer XF add-ons were in place.

If I do this (turn off a bunch of add-ons)...is there a quick way of checking whether it was successful (in getting rid of these weird URLs)...rather than waiting months & months for Google to crawl the site many times and "hopefully" watching the "Other Client Error (4xx)" percent in GSC go down?
JGaulard said:
Also, how many 403 response code URLs are we talking here? I think I only saw percentages. Are there thousands...

Thousands! Lol

JGaulard said:
...and those sam-item ones are most of them?
The other day when I went through the first 5 pages of these URLs (the first 50 URLs)...all 50 URLs fell into the 3 categories I mentioned in post #7.

I of course could continue to check more pages. But my gut feeling is these URLs will fall into the same 3 categories.

Edit: I just quickly checked another 15 pages (20 total pages of errors/200 URLs in total)...and they continue to be the same 3 categories of errors mentioned in post #7 of the thread.

As I mentioned, I can try turning off some add-ons. But I think this high percent of "Other Client Error (4xx)" existed before many of these add-ons were put in place. Unfortunately, as we both know...the historical data in Google Search Console only goes back 3 months.

I don't think I have any screenshots from a long time ago (2 years). The oldest data/screenshots I think I have for "Other Client Error (4xx)" are from January 2021 (12 months ago)...which I shared in post #5 above.

Thanks
 
JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
  • #30
Alfuzzy said:
If I (or anyone) deleted all traces of something in a directory (so the information no longer resides on the server)...and Google continues to crawl the URLs like you said...where or what is Google crawling if the actual information no longer appears on the server?
Here's where another definition comes into play. If a page is gone and Google "crawls" the URL, it's not actually crawling it. It's attempting to access the page the URL used to point to. It's hitting the server (a crawl is an actual read of a page for the information on it, while a hit is simply a request to a web server for a file, such as a web page, image, JavaScript, or Cascading Style Sheet). These are the URLs Google keeps in its memory once it's taken the page out of the index. So just because the page content and file are gone, that doesn't mean that Google will stop trying to access them. And if the page content and file are gone and Google attempts to access them via the old URL, then nothing will be found and the server will respond with a 404 header code.
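To make that distinction concrete, here's a rough sketch (my own hypothetical helper, not Google's actual logic) of what each kind of hit tells a crawler, using Python's requests library; the URL at the bottom is a placeholder:

import requests

def classify_hit(url: str) -> str:
    # One "hit": a single request to the server for whatever lives at this URL.
    resp = requests.head(url, allow_redirects=False, timeout=10)
    code = resp.status_code
    if code == 200:
        return "200: page exists; a real crawl of its content can happen"
    if code in (301, 302):
        return f"{code}: redirect to {resp.headers.get('Location')}"
    if code == 403:
        return "403: page seems alive but is inaccessible (dead, as far as Google cares)"
    if code == 404:
        return "404: nothing found; the URL should eventually fall out of the index"
    return f"{code}: some other response"

# Placeholder URL for a deleted page; expect a 404 here.
print(classify_hit("https://example.com/forum/deleted-thread-999/"))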

Eventually, Google will get the message and stop (or at least slow down dramatically) accessing these old URLs. The problem is, as I've mentioned before, that can take a very long time. If a website has one million pages, Google crawls nowhere near all one million with any frequency. It may crawl 10% of them frequently, depending on how popular the site is. So the reality of it is, if Google crawled a page that's since been deleted and doesn't have any intention of recrawling it for another two years, how does Google even know it's gone? It's got to try to access it again, but that won't happen for another two years. This is what's so frustrating about dealing with large websites - trying to get rid of old pages. This is my life.

A bigger problem is a website getting hacked. Oftentimes, hackers will add directories full of thousands and thousands of weird pages to a site. While the site owner is sleeping, Google crawls all these pages. When the site owner wakes up in the morning, the site's rankings are in the toilet. This is when the site owner starts asking Google questions like, "Hey, I used to have 10 pages on my website, but I got hacked and now my GSC account is saying I've got 100,010. I fixed the hack and removed the bad pages, but you're still telling me that I've got all these pages on my site. The hack and fix occurred months ago. How do I tell you the pages are gone?" And Google replies something like, "As we recrawl the site, we'll see that the pages are gone. Don't worry." Yeah right. That takes years and everyone knows that. I would personally block the offending directory in the robots.txt file and let the pages drop out of the index. I think it's faster that way.

The best part of all this is that every time Google crawls a URL that returns a 404, Google says, "Oh, this website is crap, I'll reduce my crawl rate." Obviously, a small number of 404s won't make much of a difference, but a large number will. I've seen this time and time again. And 403s are probably even worse, because a 403 tells Google that the page is actually still alive, but inaccessible. Obviously, from the crawl stats we've been observing, 403s have a negative impact. This is why I prefer to use the robots.txt blocking method in certain cases. I think of it as tying Google's hands behind its back and not letting it do what it wants to do.

Alfuzzy said:
Bottom line (I think): whether I modify the robots.txt to block Google from crawling the former vB site directory, or delete all the info in this directory, if Google continues to crawl these URLs (as you mentioned)...there's really nothing a website owner can do about it (I think).
Correct. The best thing to do, when dealing with isolated directories, is to use both robots.txt and 301 redirects. In your case, the /old-site/ directory (or whatever we're calling it) can be blocked with the robots.txt file because it was never meant to be found anyway. The "phantom" URLs that Google still thinks are in the /forum/ directory can be dealt with by using 301 redirects. So you currently have the correct setup. I was merely mentioning that some of those redirects are most likely being unnecessarily crawled. The member page redirects for instance. Since you're blocking them on the XenForo end, there's no sense in having the first half still be crawled on the vB end.

Alfuzzy said:
I can give this a try...but I'm pretty sure the high percentage of "Other Client Error (4xx)" in GSC existed when far fewer XF add-ons were in place.
Those things are coming from somewhere. You can check the code of other XenForo sites and do a search for that same snippet you're finding on your site. You'll probably see it's not there. I think I checked mine and it wasn't.

Alfuzzy said:
If I do this (turn off a bunch of add-ons)...is there a quick way of checking whether it was successful (in getting rid of these weird URLs)...rather than waiting months & months for Google to crawl the site many times and "hopefully" watching the "Other Client Error (4xx)" percent in GSC go down?
You can turn off a few at a time and then press F5 or CTRL+F5 on your keyboard for a complete refresh of the page. Then do a code search again for the snippet. First though, try to find the code on the pages while the add-ons are still active. If you can't even find that code, then you'll need to dig deeper.
Alfuzzy said:
Thousands! Lol
Then I think we're onto something.

Alfuzzy said:
As I mentioned, I can try turning off some add-ons. But I think this high percent of "Other Client Error (4xx)" existed before many of these add-ons were put in place. Unfortunately, as we both know...the historical data in Google Search Console only goes back 3 months.
I just did a quick search and found this page:

https://xenforo.com/community/threads/ads-manager-2-by-siropu-paid.142629/page-119

I then searched the page for "sam-item" and found that this add-on uses those URLs. You wouldn't happen to be running "Ads Manager 2 by Siropu" would you? Take a look at post 2,363 here. If you are running it, apparently Google really likes its resource URLs and it's sucking up a lot of your crawl budget.
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #31
JGaulard said:
Here's where another definition comes into play. If a page is gone and Google "crawls" the URL, it's not actually crawling it. It's attempting to access the page the URL used to point to. It's hitting the server (a crawl is an actual read of a page for the information on it, while a hit is simply a request to a web server for a file, such as a web page, image, JavaScript, or Cascading Style Sheet). These are the URLs Google keeps in its memory once it's taken the page out of the index. So just because the page content and file are gone, that doesn't mean that Google will stop trying to access them. And if the page content and file are gone and Google attempts to access them via the old URL, then nothing will be found and the server will respond with a 404 header code.
I think the confusion was over the term "crawling"...I guess we need to differentiate between Google "crawling" a site when the pages exist vs. Google "crawling" a site when the pages no longer exist (the website owner has deleted them).

If there are better/more clear terms to use for each of these situations...I'm sure I'd be less confused.:)
JGaulard said:
Eventually, Google will get the message and stop (or at least slow down dramatically) accessing these old URLs. The problem is, as I've mentioned before, that can take a very long time. If a website has one million pages, Google crawls nowhere near all one million with any frequency. It may crawl 10% of them frequently, depending on how popular the site is. So the reality of it is, if Google crawled a page that's since been deleted and doesn't have any intention of recrawling it for another two years, how does Google even know it's gone? It's got to try to access it again, but that won't happen for another two years. This is what's so frustrating about dealing with large websites - trying to get rid of old pages. This is my life.
Good deal...I get this. Even if the pages are gone...Google still "remembers" them somewhere.

I guess all I can do is block the former vB website directory (via robots.txt)...to start the process of Google eventually "forgetting" this content exists. Even if this former vB directory content is not causing any issues...Google really shouldn't be crawling it...since most website owners wouldn't keep an old version of their website up & running (like I am). Lol
JGaulard said:
Those things are coming from somewhere. You can check the code of other XenForo sites and do a search for that same snippet you're finding on your site. You'll probably see it's not there. I think I checked mine and it wasn't.

You can turn off a few at a time and then press F5 or CTRL+F5 on your keyboard for a complete refresh of the page. Then do a code search again for the snippet. First though, try to find the code on the pages while the add-ons are still active. If you can't even find that code, then you'll need to dig deeper.
I'll give this a try...it's possible what you mentioned below may be the source of the issue.
JGaulard said:
I just did a quick search and found this page:

https://xenforo.com/community/threads/ads-manager-2-by-siropu-paid.142629/page-119

I then searched the page for "sam-item" and found that this add-on uses those URLs. You wouldn't happen to be running "Ads Manager 2 by Siropu" would you? Take a look at post 2,363 here. If you are running it, apparently Google really likes its resource URLs and it's sucking up a lot of your crawl budget.
Thanks much for investigating things further. Just so happens I am running this "Ads Manager 2 by Siropu".

I contacted the add-on developer "Siropu"...here was the response:

"sam-item is the route for ad related actions such as statistics or lazy loading ads. Google bot might trigger those errors when crawling. Not sure but make some changes to return the "no index no follow" robot tag when accessing those urls."

Not 100% sure if the developer was saying the add-on is responsible for this issue or not. But since I was a little confused by the developer's reply message...in a follow-up message (just so I was 100% clear on what was being said)...he confirmed that he would update the add-on to hopefully address this issue (and that there wasn't anything I needed to do at this point).

Of course I'm not sure how quickly the developer will finish this add-on update...so I guess I'll need to turn off the add-on until an update is confirmed & I can install it.

Thanks
 
JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
  • #32
Alfuzzy said:
"sam-item is the route for ad related actions such as statistics or lazy loading ads. Google bot might trigger those errors when crawling. Not sure but make some changes to return the "no index no follow" robot tag when accessing those urls."
I'm glad we're getting somewhere. Unfortunately, adding noindex or nofollow to those URLs won't do anything. When they're accessed, they're returning 403 header codes. Google isn't even crawling the "pages" (they're really files). So adding noindex is useless. WordPress does this all the time. Developers of certain addons and features in the core application place the noindex attribute in their files as if that's supposed to do something. As a matter of fact, just the other day, I had to add two lines to one of my WordPress sites:

Disallow: */wp-json/wordpress-popular-posts/
Disallow: */wp-json/oembed/

Google was hammering these pages (hundreds of them - 1 for each post) for months without me knowing and I actually saw a ranking drop because of it. Crazy. Anyway, what I would personally do is stick the following line in the robots.txt file. It'll stop Google in its tracks from crawling those URLs at all. That's what you need. You don't need Googlebot to continue crawling them to find noindex and nofollow. Let me know if this makes sense to you.

User-agent: *
Disallow: /index.php?sam-item
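If you want to sanity-check that rule before uploading it, Python's standard-library robots.txt parser can simulate how a compliant crawler should read it. A sketch only - the sam-item URL below is a made-up example of the pattern:

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /index.php?sam-item",
]

parser = RobotFileParser()
parser.parse(rules)

# The rule is a prefix match, so anything starting with /index.php?sam-item is blocked.
for url in (
    "https://example.com/index.php?sam-item/123/view",  # should come back blocked
    "https://example.com/threads/normal-thread.55/",    # a normal page, should stay allowed
):
    print(url, "->", "blocked" if not parser.can_fetch("Googlebot", url) else "allowed")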
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #33
JGaulard said:
Unfortunately, adding noindex or nofollow to those URLs won't do anything. When they're accessed, they're returning 403 header codes. Google isn't even crawling the "pages" (they're really files). So adding noindex is useless.
Hopefully the developer knows what he's doing.;) Maybe whatever he does adds some value...plus the robots.txt idea you mentioned below. Maybe both ideas will improve things.:)

His Ads Manager 2 add-on is very useful from a monetizing standpoint. It gives a website admin more flexibility in where ads can be placed. I'm sure someone who knows how to code XenForo could do this sort of thing directly. This add-on does a heck of a lot more too.
JGaulard said:
WordPress does this all the time. Developers of certain addons and features in the core application place the noindex attribute in their files as if that's supposed to do something. As a matter of fact, just the other day, I had to add two lines to one of my WordPress sites:

Disallow: */wp-json/wordpress-popular-posts/
Disallow: */wp-json/oembed/

Google was hammering these pages (hundreds of them - 1 for each post) for months without me knowing and I actually saw a ranking drop because of it. Crazy.
I guess catching it sooner rather than later is still good!:)
JGaulard said:
Anyway, what I would personally do is stick the following line in the robots.txt file. It'll stop Google in its tracks from crawling those URLs at all. That's what you need. You don't need Googlebot to continue crawling them to find noindex and nofollow. Let me know if this makes sense to you.

User-agent: *
Disallow: /index.php?sam-item
Wow...it's amazing how specific robots.txt lines can be.

What bothers me is this: as far as I know, my XenForo install is not much different from other XF installs...and I'm fairly sure other XF installs don't have the super high "Other Client Error (4xx)" percent I have. So it would be better if I could find the root cause of this issue. But I'll still do the robots.txt idea in the meantime.:)

By the way...if I wanted to add the 2 lines below (that we also discussed previously) to my robots.txt...do they need to be added anywhere special in the robots.txt...so it functions properly (very top, very bottom, or anywhere)?

User-agent: *
Disallow: /new-directory/

Thanks:)
 
JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
  • #34
Alfuzzy said:
By the way...if I wanted to add the 2 lines below (that we also discussed previously) to my robots.txt...do they need to be added anywhere special in the robots.txt...so it functions properly (very top, very bottom, or anywhere)?
You can add those lines anywhere within the mix of the others I gave you. That way, the wildcard (*) will apply to all directories and files in the following list. Good luck!
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #35
Thanks for confirming...will make the changes...and hopefully it won't take forever to see some results.:)
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #36
2 quick questions. We talked about making the following 2 additions to the robots.txt file:

User-agent: *
Disallow: /new-directory/

User-agent: *
Disallow: /index.php?sam-item

This is the way my robots.txt looks now with the changes:

User-agent: *

User-agent: *
Disallow: /new-directory/

User-agent: *
Disallow: /index.php?sam-item


1. Should the User-agent: * line appear each time with each of these additions, or does the User-agent: * line only need to appear in the robots.txt once?

2. If the User-agent: * line does appear more than once in the robots.txt...does it hurt anything?


Thanks

p.s. Of course I changed the "new-directory" part to the correct directory name.:)
 
JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
  • #37
Alfuzzy said:
1. Should the User-agent: * line appear each time with each of these additions, or does the User-agent: * line only need to appear in the robots.txt once?

2. If the User-agent: * line does appear more than once in the robots.txt...does it hurt anything?
You only need to add the user agent * once in your robots.txt file. So it would look like this:

User-agent: *
Disallow: /all-other-directories-from-earlier/
Disallow: /new-directory/
Disallow: /index.php?sam-item

Just make sure everything you want blocked by "all crawlers" (*) is listed directly after the User-agent: * line, one after the other with no gaps.
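And if you ever want to double-check the finished file, the same standard-library parser from my earlier sketch reads the single group just fine. Again a sketch - the directory names are the placeholders we've been using in this thread:

from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /all-other-directories-from-earlier/",
    "Disallow: /new-directory/",
    "Disallow: /index.php?sam-item",
]

parser = RobotFileParser()
parser.parse(rules)

# One group, several disallows, all applying to every crawler.
for url in (
    "https://example.com/new-directory/showthread.php?t=1",  # old vB site: blocked
    "https://example.com/index.php?sam-item/42/view",        # add-on URL: blocked
    "https://example.com/threads/normal-thread.55/",         # real content: allowed
):
    print(url, "->", "blocked" if not parser.can_fetch("Googlebot", url) else "allowed")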
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #38
JGaulard said:
You only need to add the user agent * once in your robots.txt file. So it would look like this:

User-agent: *
Disallow: /all-other-directories-from-earlier/
Disallow: /new-directory/
Disallow: /index.php?sam-item

Just make sure everything you want blocked by "all crawlers" (*) is listed directly after the User-agent: * line, one after the other with no gaps.
I see, thanks.

Sounds like, as you said, it's important that the disallow lines come immediately after the User-agent: * line.

In my example above (post #36)...is there technically anything wrong with it...other than it's less efficient (more robots.txt lines than necessary)?

Thanks:)
 
JGaulard

JGaulard

Administrator
Staff member
Site Supporter
Sr. Site Supporter
Power User
Joined
May 5, 2021
Messages
319
Reaction Score
2
Points
18
  • #39
Alfuzzy said:
In my example above (post #36)...is there technically anything wrong with it...other than it's less efficient (more robots.txt lines than necessary)?
No. I don't think there is anything wrong with having more than one of those lines. From what I understand, you can have as many as you want.
 
A

Alfuzzy

Member
Site Supporter
Joined
Dec 30, 2021
Messages
68
Reaction Score
2
Points
8
  • #40
Good deal...thanks.:)
 