When Should I Block URLs in My Robots.txt File?

Cameron
  • #1
I have a quick question. I recently launched a classifieds website and the software is somewhat of a mess. It's not optimized for search engines at all and I think there are many pages that aren't necessary for the crawlers to crawl. So my question is, how do I know what to block in my robots.txt file? Which pages should I block? What will happen if I block them? I am pretty new at this.
 
CampFireJack
  • #2
This is actually a great question. While not much focus is placed on the robots.txt file in the SEO world anymore, it's one of the best methods for managing search engine crawling on a website. Many newer methods for handling SEO have been introduced through the years, but nothing compares to the power of this one file. All the big sites, from Amazon to eBay to Google itself, use theirs extensively. Have you ever seen these files? I'll link to them here:

https://www.amazon.com/robots.txt

https://www.ebay.com/robots.txt

https://www.google.com/robots.txt

As you can see, the robots.txt file is alive and well. These big players have tons of URLs blocked.

To answer your question, there are three primary reasons you might want to block a URL, a group of URLs, or a directory in your robots.txt file. The first is to prevent duplicate content on your own website. I see that you're running a classifieds site, so I'm sure you've got category pages with lots of sorting options and the like. You may even have ad pages with ancillary pages that display only the ad images, or printer-friendly versions. All of these types of pages can introduce duplicate content, which can severely damage your rankings. I've seen website rankings plummet because of duplicate content. While there are other methods to handle this sort of thing, it's best to just block these duplicate pages and be done with it. Canonical tags and 301 redirects only work some of the time, but the robots.txt file works all of the time.
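Just as a rough sketch, rules for that kind of duplicate content might look something like this. The paths and the sort parameter here are made up, so you'd swap in whatever URL patterns your software actually generates:

User-agent: *
# Block sorted versions of category pages (hypothetical ?sort= parameter)
Disallow: /*?sort=
# Block printer-friendly copies of ad pages (hypothetical /print/ path)
Disallow: /print/

The * wildcard in the middle of a path wasn't part of the original robots.txt standard, but Google and Bing both support it.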

The second reason you might want to use this file is to preserve your crawl budget. Google and the other search engines are only willing to crawl a certain number of pages on your site per day, and if you've got them crawling duplicate and low-value pages, many of your good pages might never be seen. That's not a good thing. Furthermore, the more low-value pages you allow the search engines to crawl, the less they want to crawl. They like high-value content, and they'll reward you with more crawling and better rankings if those are the only types of pages you show them.

The final reason to use your robots.txt file is to block access to low-value, thin content. On your classifieds site, you may have pages that contain only a form, such as a contact-seller or send-to-a-friend page. These are very low value, and your rankings will suffer if you allow them to be crawled. So block them and save yourself a headache.
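For those form-only pages, it's the same idea. Something like this, again with hypothetical paths, so check how your software actually builds those URLs:

User-agent: *
# Block thin, form-only pages (hypothetical paths)
Disallow: /contact-seller/
Disallow: /send-to-friend/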

I hope this helps.
 
KodyWallice
  • #3
I ran a classifieds website for over 14 years and battled all kinds of duplicate content on it. First, the homepage itself had a number of versions. I found that many of the URLs that duplicated the homepage had question marks (?) in them, so I blocked those in robots.txt. Then I found a distinct rewritten URL that was a copy of the homepage, so I blocked that too.

As for the categories, the primary URLs were rewritten like /CategoryName/Number/PageX.html, so they were fine. They included a bunch of sort options, though, and all of those sort URLs had parameters with more question marks in them, so I blocked all of the sort option pages.

As for the ad pages, each one led to another page that displayed just the ad images. Then there was the printer-only page. And then there were the pages with URLs that included more parameters: contact seller, send to a friend, vote on ad, view votes, and a few others. I think there were five in all. So for every ad page, there were seven extra system-generated pages. Those were all thin. The printer-friendly and image pages were duplicate content, and the sort pages for the categories were duplicates too. It was a mess.

I blocked all of these extra pages. The only URLs I allowed the search engines to crawl were the one homepage, the category pages and their associated pagination, and the single ad pages. That seemed to work well. I didn't use 301 redirects or canonical tags; I tried and never had any luck with those options.
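If it helps, here's roughly what the blocking section of my file looked like. I'm going from memory and every site's URL structure is different, so treat all of these paths and parameter names as placeholders:

User-agent: *
# Homepage duplicates that carried query strings
Disallow: /?
# The one rewritten URL that copied the homepage (placeholder name)
Disallow: /home.html
# Sort option URLs on category pages (placeholder parameter)
Disallow: /*?sort=
# Image-only and printer-friendly versions of each ad (placeholder paths)
Disallow: /ad-images/
Disallow: /print/
# The parameter-driven system pages: contact seller, send to a friend, votes (placeholder)
Disallow: /*?action=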

The thing about blocking pages in the robots.txt file is that if you're blocking pages that are linked directly from pages you're keeping crawlable, you shouldn't notice any temporary ranking drop. If you're blocking a directory that has only one link going into it but tons of pages inside, like an entire section of the website, you will notice a temporary drop in rankings. The reason is that the shallow, one-hop pages still have link flow (juice) going directly to them, while a blocked section with only one entrance path becomes problematic because the link flow to all of the interior pages gets cut off completely. So if you have 1,000 pages in a section and you block the doorway into that section, all of the URLs inside have to be drained of their PageRank completely before they get dropped from Google's index. That can take months or even years, depending on the website. But it needs to be done, so just do it.

Basically, my point is: block all pages that don't need to be crawled, the ones you wouldn't expect to show up in Google's results.
 
LukeLewis
  • #4
My question is, is a robots.txt file necessary? I mean, is it 100% necessary to have this file on all websites? What if I don't want to block anything? Will there be a penalty if I don't include it?
 
Newman
  • #5
LukeLewis said:
My question is, is a robots.txt file necessary? I mean, is it 100% necessary to have this file on all websites? What if I don't want to block anything? Will there be a penalty if I don't include it?
There is no penalty for not having a robots.txt file on your site. Many sites don't have one. You only need one if you intend to block a file or directory from being crawled; otherwise, you can leave it off. Why is this? Well, think about all the small mom-and-pop websites whose owners aren't versed in SEO. These people don't know the difference between a robots.txt file and a hole in the wall. Would it be in Google's best interest to leave all of those websites out of its rankings? No, of course not. So again, if you don't want to block anything, you don't need one of these files at all.
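That said, if you'd rather have the file there anyway, maybe just to stop the 404 errors in your logs when crawlers request it, a robots.txt that blocks nothing at all is just two lines:

User-agent: *
# An empty Disallow value means nothing is blocked
Disallow:

That tells every crawler it's allowed to fetch everything.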
 