
KodyWallice
One of the most critical aspects of owning and operating a large online discussion board is technical SEO. Many website owners like to say things like, "Hey, this software handles SEO great right out of the box", but they'd be wrong. I'm not sure I've ever seen any piece of web software handle SEO perfectly right out of the box. There's always something that needs to be done. Unless the software ships with a robots.txt that's been thoroughly tested, you'll need to do some work. And that work often includes blocking pages that don't need to be crawled.
I'll admit that XenForo forum software does a lot of great things right out of the box. There are a lot of pages that don't need to be crawled though. Why not? Well, if you let the search spiders crawl every little thing and leave them to figure it all out, your success might be a long way off. From what I've seen through the years, the more that search engine spiders crawl pages that return 403 status codes, 301 redirects, and noindex tags, the less they like to crawl the website in question. I've allowed search engines such as Google to crawl pages such as these and I've watched my website's crawl rate diminish through the months.
Figuring out what to put in a robots.txt file is an art though. There's no right or wrong. I've gotten to the point that I'm not sure I know what to block and what not to block and that's why I look to those who are much smarter and much more talented than I am.
With that in mind, I thought I'd do a robots.txt comparison of some of the largest XenForo forums right here in this post. I'll make sure that the forums I discuss below have tons of activity and that they rank well in Google. I'll post the contents of their files here and we can all try to decipher how they achieved their success together. While I haven't come to a definitive conclusion with regard to whether it's better to block more or fewer pages for this software, I'm leaning towards blocking more. I really think that allowing Googlebot to crawl hundreds of thousands of useless pages, such as member profiles, attachment links, 301 redirects, and thin content that's labeled noindex, is not such a good thing. I've been allowing this type of crawling on my sites and nothing good has come of it. And from what I've read on the XenForo forums themselves, there are other users who are having issues as well. Although, I may be wrong. Please correct me if I am.
Large XenForo Forums Robots.txt File Contents
https://www.avforums.com/forums/
User-agent: *
Disallow: /find-new/
Disallow: /account/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /collection/
Disallow: /8833889/
Disallow: /avfadmin/
Disallow: /members/
Disallow: /conversations/
Disallow: /admin.php
Disallow: *AF%81*
Disallow: */write$
Disallow: */viewing$
Disallow: */add-reply$
-----------------------------
https://forums.macrumors.com/
User-agent: *
Disallow: /whats-new/
Disallow: /account/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /admin.php
Disallow: /deferred.php
Disallow: /threads/*?view=reaction_score$
Disallow: /labs/
Disallow: /uix/toggle-
Disallow: */unread$
Disallow: */latest$
Allow: /search/$
Disallow: /search/*
Allow: */threads/post-*
Disallow: */post-*
-----------------------------
https://xenforo.com/community/
User-agent: *
Disallow: /community/whats-new/
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/goto/
Disallow: /community/posts/
Disallow: /community/login/
Disallow: /community/admin.php
Allow: /
-----------------------------
https://www.neogaf.com/
User-agent: Mediapartners-Google
Disallow:
-----------------------------
https://www.resetera.com/
User-agent: *
Disallow: /whats-new/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /admin.php
Allow: /
-----------------------------
https://www.hearth.com/talk/
User-agent: *
Disallow: /talk/conversations/
User-agent: *
Disallow: /talk/conversations/*
Disallow: /talk/conversations/add?to=*
Disallow: /talk/bookmarks/*
Disallow: /talk/members/*
Disallow: /talk/search/*
-----------------------------
https://forums.whathifi.com/
User-agent: *
Disallow: /admin.php
Disallow: /account/
Disallow: /ajax/
Disallow: /conversations/
Disallow: /credits/
Disallow: /attachments/
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /forums/tweets/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /logout/
Disallow: /lost-password/
Disallow: /members/
Disallow: /misc/
Disallow: /online/
Disallow: /posts/
Disallow: /recent-activity/
Disallow: /search/
Disallow: /watched/
Disallow: /trophies/
Disallow: /covers/
Allow: /
-----------------------------
https://forum.wordreference.com/
User-agent: *
Disallow: /test/
Disallow: /account/
Disallow: /admin.php
Disallow: /ajax/
Disallow: /conversations/
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /forums/tweets/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /lost-password/
Disallow: /media/category/
Disallow: /media/keyword/
Disallow: /media/user/
Disallow: /media/service/
Disallow: /media/submit/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /pages/conduct/
Disallow: /pages/privacy/
Disallow: /posts/
Disallow: /threads/tera-tweet-from-*
Disallow: /wiki/special/
Disallow: /forums/*?order=
Disallow: /members/
-----------------------------
Okay, what can we infer from the contents of these robots.txt files? Personally, I think a lot of what's included in the above is fluff. The most bare-bones robots.txt file should be blocking:
Disallow: /whats-new/
Disallow: /search/
The pages contained in these directories are spider traps and can cause you lots of issues. Even though they're both marked with noindex in the headers of the pages, Google and friends shouldn't really be crawling them.
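If you want to sanity check those two rules before uploading anything, here's a rough sketch using Python's standard urllib.robotparser module. The sample paths are made up by me for illustration and aren't taken from any real forum. As far as I know, this module only does plain prefix matching, which is fine for simple directory rules like these two.

```python
from urllib.robotparser import RobotFileParser

# The two rules suggested above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /whats-new/
Disallow: /search/
"""


def blocked_paths(robots_txt, paths, agent="Googlebot"):
    """Return the paths that the given rules forbid for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


# Hypothetical sample paths, only here to exercise the rules.
sample = [
    "/whats-new/posts/",
    "/search/?q=receiver",
    "/threads/example-thread.123/",
    "/forums/general.4/",
]

print(blocked_paths(ROBOTS_TXT, sample))
```

The thread and forum paths should come back as crawlable, while anything under /whats-new/ or /search/ should show up in the blocked list.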
I also think it's a good idea to block:
Disallow: /goto/
Disallow: /posts/
The /goto/ directory is chock full of 301 redirects that are of no use. There isn't even any PageRank that flows through the URLs because the links are now (as of a recent XenForo software update) marked with nofollow. So basically, they're useless to search engines. The same goes for the /posts/ directory. On top of the 301 redirects though, this directory includes very thin noindex pages, which are useless for search engines to crawl.
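The MacRumors file quoted above goes a step beyond blocking whole directories and uses the * and $ wildcards for post and search URLs. Here's a minimal sketch of how I understand that kind of matching to work: * matches any run of characters, a trailing $ anchors the end of the URL, the longest matching rule wins, and Allow wins a tie. The function names and the test paths are my own, and this is not how any particular search engine is actually implemented.

```python
import re

# A few Allow/Disallow pairs copied from the MacRumors file quoted above.
MACRUMORS_RULES = [
    ("disallow", "*/unread$"),
    ("disallow", "*/latest$"),
    ("allow", "/search/$"),
    ("disallow", "/search/*"),
    ("allow", "*/threads/post-*"),
    ("disallow", "*/post-*"),
]


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each '*' match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Longest matching pattern wins; Allow beats Disallow on equal length."""
    best = None  # (pattern length, is_allow)
    for kind, pattern in rules:
        if pattern and rule_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path may be crawled.
    return True if best is None else best[1]


# Hypothetical paths, only for illustration.
for path in ["/search/", "/search/123/", "/threads/foo.1/unread",
             "/threads/foo.1/", "/threads/foo.1/post-55"]:
    print(path, is_allowed(MACRUMORS_RULES, path))
```

Under these assumptions the bare /search/ page stays crawlable while everything beneath it is blocked, which looks like the intent of that Allow and Disallow pair.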
Now here's where things get interesting. Should we be blocking these directories?
Disallow: /members/
Disallow: /attachments/
Some of the largest and most successful forums I've seen do, in fact, block those pages. Some don't. I don't know. I guess it depends on how popular these sites are and how many links they've got pointing to them. Who knows. What I do know is that unless you've only got a few members and their pages are full of useful content, you should probably block member pages. The attachment links are useless and they'll end up as soft 404s if you don't block them in the permissions or inside of robots.txt. Directories such as /login/, /misc/, /help/, and so on are small potatoes. Blocking them might be a good idea just to save crawl budget, but their contents are so insignificant that I don't think there'll be a change one way or the other.
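To see at a glance which of the quoted files block these two directories, something like the following sketch could help. I've pasted in only three of the shorter files from above to keep it brief, and the helper name is my own. It ignores User-agent groups entirely and only looks at what each Disallow path ends with, so treat it as a quick comparison aid and nothing more.

```python
# Three of the robots.txt files quoted above, keyed by site.
FILES = {
    "xenforo.com": """User-agent: *
Disallow: /community/whats-new/
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/goto/
Disallow: /community/posts/
Disallow: /community/login/
Disallow: /community/admin.php
Allow: /""",
    "resetera.com": """User-agent: *
Disallow: /whats-new/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /admin.php
Allow: /""",
    "hearth.com": """User-agent: *
Disallow: /talk/conversations/
User-agent: *
Disallow: /talk/conversations/*
Disallow: /talk/conversations/add?to=*
Disallow: /talk/bookmarks/*
Disallow: /talk/members/*
Disallow: /talk/search/*""",
}


def blocks_directory(robots_txt, directory):
    """True if any Disallow rule ends with /<directory>/ (wildcards ignored)."""
    needle = "/" + directory.strip("/") + "/"
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        # Treat '/talk/members/*' the same as '/talk/members/'.
        pattern = value.strip().rstrip("*$")
        if pattern.endswith(needle):
            return True
    return False


for site, text in FILES.items():
    print(site,
          "members:", blocks_directory(text, "members"),
          "attachments:", blocks_directory(text, "attachments"))
```

Matching on the end of the path is a deliberate shortcut so that installs living under /community/ or /talk/ are counted the same way as root installs.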
So what's your opinion? What should we be blocking inside of our XenForo robots.txt files? What do we not want the search engine crawlers to crawl?