Important!! Robots.txt gone wrong!
We use the AWS cloud to host ecommerce stores that scale with traffic.
Problem statement
Recently we noticed unusually high traffic on our servers, and they were autoscaling to match it. On inspection we found that the traffic was coming from Googlebot. Looking at the URLs, we realized Googlebot was crawling the same category page URLs with different filter combinations. This should not happen unless you allow everything in robots.txt.
And there it is: robots.txt allows everything for Googlebot.
Such a rule becomes problematic in the following situation.
Consider the following product listing page for the Hugo Boss perfume category.
As you can see, the page has size filters.
These swatches are links. Clicking on any swatch takes you to a URL of the format:
perfumewebsite.com/hugo-boss.html?size=30ML&size=40ML
perfumewebsite.com/hugo-boss.html?size=40ML&size=30ML
… and so on.
Crawlers follow a link and then crawl the links they find on the resulting page. In this case, the crawler follows the 30ML swatch, lands on the filtered page, sees the other swatches as new links, and keeps appending filter parameters to the query string.
With 10 swatches this can lead to more than 3,628,800 (10!) URLs being crawled, because every ordering of the filter parameters produces a distinct URL. We confirmed this by checking the Crawl stats report in Google Search Console.
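To get a feel for where that number comes from, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each filtered page links to every other swatch and that parameters are appended in click order, so every ordering of the size values is a distinct URL:

from math import factorial, perm  # math.perm needs Python 3.8+

# 10 size swatches, as in the example category page
n = 10

# Every ordering of all 10 "size" parameters is a distinct URL,
# e.g. ?size=30ML&size=40ML vs ?size=40ML&size=30ML
print(factorial(n))  # 3628800, the 10! lower bound mentioned above

# Counting the orderings of every non-empty subset of filters as well:
print(sum(perm(n, k) for k in range(1, n + 1)))  # 9864100

In other words, ten harmless-looking swatches can expand into millions of crawlable URLs that all render the same underlying category page.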
This harms the website in three ways:
- Google has to crawl many more URLs, which wastes the website's crawl budget.
- These filter requests are not served from the page cache, so they hit the origin server directly and generate load there (a quick way to check this is sketched after this list). This degrades the experience for real customers, and you pay more for the extra compute.
- You are charged more for network bandwidth, because you are serving more requests.
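If you want to verify the cache-miss behaviour on your own store, one quick check is to request the category page with and without a filter query string and compare the CDN cache headers. The sketch below assumes a CloudFront distribution (CloudFront reports hits and misses in the X-Cache response header) and reuses the hypothetical domain from the example above:

import requests

# Hypothetical URLs, matching the example earlier in the post
base_url = "https://perfumewebsite.com/hugo-boss.html"
filtered_url = base_url + "?size=30ML&size=40ML"

for url in (base_url, filtered_url):
    response = requests.get(url, timeout=10)
    # CloudFront returns "Hit from cloudfront" or "Miss from cloudfront" here;
    # other CDNs expose similar information in headers such as Age or CF-Cache-Status.
    print(url, "->", response.headers.get("X-Cache", "no X-Cache header"))

If the filtered URL consistently comes back as a miss while the plain category URL is a hit, every bot request for a filter permutation is hitting your origin.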
How to fix it?
Update robots.txt and remove a rule like the one below. Because crawlers obey only the most specific matching user-agent group, this block effectively exempts Googlebot from every Disallow rule you have under User-agent: *.
User-agent: Googlebot
Disallow:
User-agent: Googlebot-Image
Disallow:
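Beyond removing that rule, one option (not spelled out above) is to explicitly disallow the faceted URLs so that no crawler wastes budget on them. Googlebot honours * wildcards in robots.txt, so if size is the only filter parameter (as in the URLs above), a starting point could look like this; treat it as a sketch and adjust the parameter name to whatever your store actually uses:

User-agent: *
# Block any URL whose query string contains a size filter
Disallow: /*?*size=

Be careful with broader patterns such as Disallow: /*?, since they would also block query-string URLs you may want crawled, such as pagination.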
Make sure Googlebot is not crawling URLs that you don't want it to crawl. Check the Crawl stats report in Google Search Console and keep an eye on your crawl budget.