
Comparison of XenForo Robots.txt Files

  • Thread starter KodyWallice
  • #1
One of the most critical aspects of owning and operating a large online discussion board is technical SEO. While many website owners like to say things like, "Hey, this software handles SEO great right out of the box," they're usually wrong. I'm not sure I've ever seen any piece of web software handle SEO perfectly out of the box. There's always something that needs to be done. Unless the software ships with a fully tested robots.txt file, you'll need to do some work, and that work often includes blocking pages that don't need to be crawled.

I'll admit that XenForo forum software does a lot of great things right out of the box. There are a lot of pages that don't need to be crawled, though. Why not? Well, if you let the search spiders crawl every little thing and leave them to figure it all out, your success might be a long way off. From what I've seen through the years, the more that search engine spiders crawl pages that return 403 status codes, 301 redirects, and noindex tags, the less they like to crawl the website in question. I've allowed search engines such as Google to crawl pages like these and I've watched my websites' crawl rates diminish over the months.

Figuring out what to put in a robots.txt file is an art, though. There's no definitive right or wrong. I've gotten to the point where I'm not sure I know what to block and what not to block, and that's why I look to those who are much smarter and more talented than I am.

With that in mind, I thought I'd do a robots.txt comparison of some of the largest XenForo forums right here in this post. I'll make sure that the forums I discuss below have tons of activity and that they rank well in Google. I'll post the contents of their files here and we can all try to decipher how they achieved their success together. While I haven't come to a definitive conclusion as to whether it's better to block more or fewer pages for this software, I'm leaning towards blocking more. I really think that allowing Googlebot to crawl hundreds of thousands of useless pages, such as member profiles, attachment links, 301 redirects, and thin content that's labeled noindex is not such a good thing. I've been allowing this type of crawling on my sites and nothing good has come of it. And from what I've read on the XenForo forums themselves, there are other users who are having issues as well. Although, I may be wrong. Please correct me if I am.

Large XenForo Forums Robots.txt File Contents

https://www.avforums.com/forums/

User-agent: *
Disallow: /find-new/
Disallow: /account/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /collection/
Disallow: /8833889/
Disallow: /avfadmin/
Disallow: /members/
Disallow: /conversations/
Disallow: /admin.php
Disallow: *AF%81*
Disallow: */write$
Disallow: */viewing$
Disallow: */add-reply$

-----------------------------

https://forums.macrumors.com/

User-agent: *
Disallow: /whats-new/
Disallow: /account/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /admin.php
Disallow: /deferred.php
Disallow: /threads/*?view=reaction_score$
Disallow: /labs/
Disallow: /uix/toggle-
Disallow: */unread$
Disallow: */latest$

Allow: /search/$
Disallow: /search/*

Allow: */threads/post-*
Disallow: */post-*

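One thing worth noticing in the MacRumors file above is the pairing of Allow: /search/$ with Disallow: /search/*. Under Google's documented matching rules, the longest matching pattern wins, and on a tie, Allow beats Disallow. Here's a rough sketch of that precedence logic in Python (my own illustration of the idea, not Google's actual code):

```python
import re

# Sketch of how Google resolves Allow/Disallow conflicts: the longest matching
# pattern wins, and on a tie Allow beats Disallow. '*' matches any characters
# and '$' anchors the end of the URL path.

def rule_matches(pattern: str, path: str) -> bool:
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

def allowed(rules, path):
    """rules is a list of ('allow' | 'disallow', pattern) pairs."""
    best_key, best_verdict = None, True  # no matching rule at all means allowed
    for verdict, pattern in rules:
        if rule_matches(pattern, path):
            key = (len(pattern), verdict == "allow")
            if best_key is None or key > best_key:
                best_key, best_verdict = key, verdict == "allow"
    return best_verdict

rules = [("allow", "/search/$"), ("disallow", "/search/*")]
print(allowed(rules, "/search/"))         # → True: the bare search page itself
print(allowed(rules, "/search/?q=hifi"))  # → False: search result URLs blocked
```

So the bare /search/ page stays crawlable while every search-result URL underneath it gets blocked. That's a clever way to keep the spider trap shut without losing the landing page.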
-----------------------------

https://xenforo.com/community/

User-agent: *
Disallow: /community/whats-new/
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/goto/
Disallow: /community/posts/
Disallow: /community/login/
Disallow: /community/admin.php
Allow: /

-----------------------------

https://www.neogaf.com/

User-agent: Mediapartners-Google
Disallow:

-----------------------------

https://www.resetera.com/

User-agent: *
Disallow: /whats-new/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /admin.php
Allow: /

-----------------------------

https://www.hearth.com/talk/

User-agent: *
Disallow: /talk/conversations/
User-agent: *
Disallow: /talk/conversations/*
Disallow: /talk/conversations/add?to=*
Disallow: /talk/bookmarks/*
Disallow: /talk/members/*
Disallow: /talk/search/*

-----------------------------

https://forums.whathifi.com/

User-agent: *
Disallow: /admin.php
Disallow: /account/
Disallow: /ajax/
Disallow: /conversations/
Disallow: /credits/
Disallow: /attachments/
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /forums/tweets/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /logout/
Disallow: /lost-password/
Disallow: /members/
Disallow: /misc/
Disallow: /online/
Disallow: /posts/
Disallow: /recent-activity/
Disallow: /search/
Disallow: /watched/
Disallow: /trophies/
Disallow: /covers/
Allow: /

-----------------------------

https://forum.wordreference.com/

User-agent: *
Disallow: /test/
Disallow: /account/
Disallow: /admin.php
Disallow: /ajax/
Disallow: /conversations/
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /forums/tweets/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /lost-password/
Disallow: /media/category/
Disallow: /media/keyword/
Disallow: /media/user/
Disallow: /media/service/
Disallow: /media/submit/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /pages/conduct/
Disallow: /pages/privacy/
Disallow: /posts/
Disallow: /threads/tera-tweet-from-*
Disallow: /wiki/special/
Disallow: /forums/*?order=
Disallow: /members/

-----------------------------

Okay, what can we infer from the contents of these robots.txt files? Personally, I think a lot of what's included above is fluff. At a bare minimum, a XenForo robots.txt file should be blocking:

Disallow: /whats-new/
Disallow: /search/

The pages contained in these directories are spider traps and can cause you lots of issues. Even though they're both marked with noindex in the headers of the pages, Google and friends shouldn't really be crawling them.
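If you want to sanity-check rules like these before deploying them, Python's standard library can parse a rule set and tell you what's blocked. One caveat: urllib.robotparser doesn't understand Google's * and $ wildcards, so this only works for plain path prefixes like the two above (example.com and the thread URL below are just placeholders):

```python
from urllib import robotparser

# The minimal rule set from above, fed straight to the parser.
rules = [
    "User-agent: *",
    "Disallow: /whats-new/",
    "Disallow: /search/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Blocked directories come back False; normal thread pages stay crawlable.
print(rp.can_fetch("*", "https://example.com/whats-new/"))       # → False
print(rp.can_fetch("*", "https://example.com/search/?q=seo"))    # → False
print(rp.can_fetch("*", "https://example.com/threads/topic.1/")) # → True
```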

I also think it's a good idea to block:

Disallow: /goto/
Disallow: /posts/

The /goto/ directory is chock-full of 301 redirects that are of no use. There isn't even any PageRank that flows through the URLs because the links are now (as of a recent XenForo software update) marked with nofollow. So basically, they're useless to search engines. The same goes for the /posts/ directory. On top of the 301 redirects, that directory includes very thin noindex pages, which are useless for search engines to crawl.
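Before blocking these, it can be worth measuring how much of your crawl budget the redirects actually eat. A rough sketch of counting Googlebot hits on those directories from an access log (the log lines here are made up for illustration; real entries would come from your web server's log file):

```python
import re

# Hypothetical access-log lines for illustration only.
log_lines = [
    '66.249.66.1 - - "GET /threads/topic.123/ HTTP/1.1" 200 Googlebot',
    '66.249.66.1 - - "GET /goto/post?id=456 HTTP/1.1" 301 Googlebot',
    '66.249.66.1 - - "GET /posts/789/ HTTP/1.1" 301 Googlebot',
]

# Count Googlebot requests that landed on the redirect-only directories.
redirect_hits = sum(
    1 for line in log_lines
    if "Googlebot" in line and re.search(r'"GET /(goto|posts)/', line)
)
print(redirect_hits)  # → 2
```

If a big share of Googlebot's daily requests turn out to be /goto/ and /posts/ hits, that's a pretty strong argument for blocking them.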

Now here's where things get interesting. Should we be blocking these directories?

Disallow: /members/
Disallow: /attachments/

Some of the largest and most successful forums I've seen do, in fact, block those directories. Some don't. I don't know. I guess it depends on how popular these sites are and how many links they've got pointing to them. Who knows. What I do know is that unless you've only got a few members and their pages are full of useful content, you should probably block member pages. The attachment links are useless and they'll end up as soft 404s if you don't block them through permissions or in robots.txt. Directories such as /login/, /misc/, /help/, and so on are small potatoes. Blocking them might be a good idea just to save crawl budget, but the content contained within them is so insignificant that I don't think it'll make a difference one way or the other.

So what's your opinion? What should we be blocking inside of our XenForo robots.txt files? What do we not want the search engine crawlers to crawl?
 
JGaulard

  • #2
I am of the mind that you should block both the /members/ and /attachments/ directories. If those pages are allowed to be crawled, Google will encounter 403 status codes and the overall crawl rate for the website in question will drop. By blocking those pages, the crawl rate should stay the same or even increase.

Also, the /threads/*/latest and /threads/*/post pages should be blocked in the robots.txt file. Those are merely redirects and they waste crawl budget as well. And for some reason, when those redirects are allowed to be crawled, the website's homepage won't appear when using the site: command in Google.
 