Important!! Robots.txt gone wrong!

We use the AWS cloud to host ecommerce stores that scale with traffic.

Problem statement

Recently we noticed unusually high traffic on our servers, and the servers were autoscaling to match it. Upon inspection we found that the traffic was coming from Googlebot. We looked at the URLs and realized Googlebot was trying to access the same category page URLs with different filter combinations. This should not happen unless you allow everything in robots.txt.

And there it was: robots.txt allowed everything for Googlebot.

Robots.txt that allows everything for Googlebot

Such a rule becomes problematic in the following situation.

Consider the following product listing page for the Hugo Boss perfume category.

As you can see, the page has size filters.

Perfume website with size filter options; each size is a swatch

These swatches are links. Clicking any swatch takes you to a URL of the format

perfumewebsite.com/hugo-boss.html?size=30ML&size=40ML

perfumewebsite.com/hugo-boss.html?size=40ML&size=30ML

… and so on.

Crawlers follow a link and then crawl the links they find on the new page. In this case, the crawler follows the 30ML swatch and then sees all the other swatches on the filtered page, each of which produces yet another URL.

Because the order of the query parameters matters, every ordering is a distinct URL, and this can lead to more than 3,628,800 URLs being crawled for just 10 swatches. We confirmed this finding in the Crawl stats section of Google Search Console.
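Here is a quick back-of-the-envelope check of that number (a sketch in Python; it assumes 10 size swatches and that every ordering of the size parameter is a distinct URL, as in the examples above):

from math import perm

# Each swatch click appends another size parameter, and parameter order
# matters, so ?size=30ML&size=40ML and ?size=40ML&size=30ML are two URLs.
swatches = 10

# Distinct filter URLs = ordered selections of 1 to 10 sizes out of 10
distinct_urls = sum(perm(swatches, k) for k in range(1, swatches + 1))
print(distinct_urls)             # 9864100
print(perm(swatches, swatches))  # 3628800 -- orderings that use all 10 sizes

The full-length orderings alone give 10! = 3,628,800 URLs, which is where the figure above comes from.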

This harms the website in several ways:

  1. Google has to crawl far more URLs, which wastes the website’s crawl budget.

  2. Every one of those crawls is real traffic, so the servers keep autoscaling (and you keep paying) to serve filter pages no customer asked for.

How to fix it?

Update the robots.txt and remove rules like the following (an empty Disallow value means everything is allowed):

User-agent: Googlebot
Disallow:

User-agent: Googlebot-image
Disallow:

Make sure Googlebot is not crawling URLs that you don’t want it to. Check the Crawl stats report in Google Search Console and keep an eye on your crawl budget.
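If you want to go further than just removing the blanket allow rule, you can also disallow the filtered URLs explicitly. Here is a minimal sketch, assuming the filter parameter is called size as in the example above (swap in your own parameter names):

# Keep crawlers off the faceted filter URLs; the clean category page stays crawlable
User-agent: *
Disallow: /*?size=
Disallow: /*&size=

Google honours the * wildcard in Disallow rules, so these two patterns match size whether it appears as the first or a later query parameter.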
