Robots.txt: A Practical Guide

A beginner's guide to robots.txt SEO

The robots.txt file can be a powerful, but potentially ☣️ dangerous SEO tool. Like a hammer, when used properly it can be very useful, but if used improperly you can hurt yourself quite badly.
A misconfigured robots.txt file can prevent search engines from discovering the pages you want to drive traffic to.
This guide is for beginner to intermediate professionals.

What does the robots.txt file do?

The robots.txt file is a special text file that tells web crawlers, like Googlebot, which areas of your website they are ✅ allowed to access and which areas are ⛔️ restricted.
Robots.txt files typically block search engines from crawling:
  • Marketing landing pages
  • Unpublished content or media
  • Registered 'member only' areas
  • Admin or login areas
  • Unmoderated user generated content like comments
  • A development or staging server

Where to find or place your robots.txt file

While it is OK to not have a robots.txt file, almost every website should have one.
Your robots.txt file must be at the root of your domain, meaning it can't be in a folder.
 
https://searchsignals.com/robots.txt
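As a minimal sketch of the "root of your domain" rule, the Python snippet below derives the robots.txt address for any page URL using only the standard library. The helper name `robots_url` and the sample page URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://searchsignals.com/blog/some-post?ref=home"))
# prints https://searchsignals.com/robots.txt
```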
Your robots.txt file should be a plain text file, not an HTML file. If you visit your robots.txt file in a browser, the content should look fairly plain, in a blocky (monospaced) font.
User-agent: *
Disallow:

Sitemap: https://searchsignals.com/api/sitemap.xml
An example robots.txt file
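To see how a crawler might read this example file, here is a small sketch using Python's built-in `urllib.robotparser`. The page URL being checked is an assumption for illustration, and real crawlers may parse the file differently.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
User-agent: *
Disallow:

Sitemap: https://searchsignals.com/api/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())

# An empty Disallow: blocks nothing, so any page may be crawled.
page_allowed = parser.can_fetch("Googlebot", "https://searchsignals.com/any-page")
# site_maps() (Python 3.8+) returns the URLs listed on Sitemap: lines.
sitemaps = parser.site_maps()

print(page_allowed)  # True
print(sitemaps)      # ['https://searchsignals.com/api/sitemap.xml']
```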

Controlling where crawlers are allowed to go

If you're using a website builder like WordPress or a web framework like Next.js, you likely already have a robots.txt file with some Disallow: or Allow: directives.
These directives tell search engines where they can and cannot go.
Disallow: directives are like ⛔️ 'do not enter' signs.
 
Allow: directives are like 👉 'enter here' signs.
 
 
It's important to note that the robots.txt file doesn't prevent "😈 bad bots" from accessing your website, just like road signs don't prevent people from entering or going the wrong way.
Following the robots.txt directives is considered a common courtesy, so major search engines and developers of "good" crawlers choose to abide by this internet etiquette.
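In practice, this courtesy is just a check that a well-behaved crawler chooses to run before requesting a page. The sketch below shows roughly what that might look like with Python's standard `urllib.robotparser`; the helper `polite_urls`, the `ExampleBot` name, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

def polite_urls(robots_txt: str, agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that robots_txt lets `agent` crawl (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Nothing enforces this check; a "bad bot" would simply skip it.
    return [url for url in urls if parser.can_fetch(agent, url)]

rules = "User-agent: *\nDisallow: /admin/\n"
queue = ["https://example.com/", "https://example.com/admin/login"]
print(polite_urls(rules, "ExampleBot", queue))
# prints ['https://example.com/']
```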

Common robots.txt commands

The following are commonly used commands found within robots.txt files:
Allow full access
User-agent: *
Disallow:
Block all access
User-agent: *
Disallow: /
Block access to all files in a directory but still allow one file to be accessed
User-agent: *
Disallow: /folder-name/
Allow: /folder-name/specific-file.mp3
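The last example works because Google and other major crawlers apply the most specific (longest) matching rule, and prefer Allow: when two matches tie, as described in RFC 9309. The sketch below is a simplified model of that precedence, not a full parser: it handles plain path prefixes only, and the function name `is_allowed` and the sample paths are assumptions for illustration.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (directive, path prefix) pairs taken from one User-agent group."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value (e.g. "Disallow:") restricts nothing
        is_allow = directive.lower() == "allow"
        # Longest prefix wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed

full_access = [("Disallow", "")]
block_all = [("Disallow", "/")]
one_file = [("Disallow", "/folder-name/"),
            ("Allow", "/folder-name/specific-file.mp3")]

print(is_allowed(full_access, "/any-page"))                    # True
print(is_allowed(block_all, "/any-page"))                      # False
print(is_allowed(one_file, "/folder-name/other-file.mp3"))     # False
print(is_allowed(one_file, "/folder-name/specific-file.mp3"))  # True
```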
💡
What if my robots.txt file is empty or doesn't exist at all? An empty or non-existent robots.txt file allows search engines to crawl any and all of your pages and files.
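Python's standard-library parser treats an empty file the same way, which makes for a quick sanity check. This is only a sketch of one parser's behavior; the URL is a made-up example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([])  # an empty robots.txt: no groups, no rules

empty_allows = parser.can_fetch("Googlebot", "https://example.com/private/page")
print(empty_allows)  # True
```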

What does User-agent do?

The User-agent: command lets you set different rules for different crawlers.
Most commonly you'll see an asterisk *, which sets the default rules for all crawlers.
Common use of User-agent with an asterisk
User-agent: *
Disallow: /restricted-folders/
Disallow: /marketing-landing-pages/
However, you can set different rules for a specific crawler.
For example, it's common to allow Google's AdsBot full access to your site if you are purchasing ads through Google or if you're displaying ads from Google's ad network.
User-agent: *
Disallow: /restricted-folders/
Disallow: /marketing-landing-pages/

User-agent: AdsBot-Google
Disallow: 

User-agent: CCBot
Disallow: /
Crawlers look for their own name and follow only the commands listed under it. In this example, when Google's AdsBot visits your website, it obeys only the one Disallow: directive in the AdsBot-Google section. That blank Disallow: directive means AdsBot may crawl every page. CCBot, by contrast, is blocked from the whole site by Disallow: /.
It's❗️ important to note that even though the /marketing-landing-pages/ area is disallowed under the default User-agent: *, AdsBot ✅ can still crawl that area in this case, since a crawler follows only one set of commands.
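The "one set of commands per crawler" behavior can be checked with Python's built-in `urllib.robotparser`, which also picks a single group for each user agent. This is a sketch of how one parser reads the file above, not a guarantee about any particular search engine; the landing-page URL is hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /restricted-folders/
Disallow: /marketing-landing-pages/

User-agent: AdsBot-Google
Disallow:

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

landing = "https://example.com/marketing-landing-pages/offer"
results = {agent: parser.can_fetch(agent, landing)
           for agent in ("Googlebot", "AdsBot-Google", "CCBot")}
print(results)
# prints {'Googlebot': False, 'AdsBot-Google': True, 'CCBot': False}
```

Googlebot has no section of its own, so it falls back to the * group and is blocked from the landing page. AdsBot-Google and CCBot each use only their own section.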

Testing your robots.txt file

Before making any changes to your robots.txt file, you'll want to test them.
A great resource is the robots.txt validator and testing tool on TechnicalSEO.com. You can simply copy & paste and test any URL from your site to ensure the URLs you want blocked are blocked and the ones you want search engines to know about are allowed.
Alternatively, you can use the Google Search Console (GSC), a free tool provided by Google. Within GSC is a robots.txt testing tool. While it takes a little more effort to sign up and register your website, you'll have confidence that your changes are correct since you're getting direct feedback from Google itself. Additionally, there are many other valuable tools within GSC beyond the robots.txt tester.
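If you prefer to script a quick check before deploying, something like the following sketch could compare a draft robots.txt against the outcomes you expect. It uses Python's standard `urllib.robotparser`; the helper `check_rules`, the draft rules, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def check_rules(robots_txt: str, agent: str, expected: dict[str, bool]) -> list[str]:
    """Return the URLs whose crawlability differs from what you expected."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url, want in expected.items()
            if parser.can_fetch(agent, url) != want]

draft = "User-agent: *\nDisallow: /admin/\nDisallow: /staging/\n"
expected = {
    "https://example.com/": True,               # should stay crawlable
    "https://example.com/blog/post": True,
    "https://example.com/admin/login": False,   # should be blocked
    "https://example.com/staging/home": False,
}
print(check_rules(draft, "Googlebot", expected))  # [] means no surprises
```

One caveat: Python's parser applies rules in the order they appear in the file, while Google picks the most specific matching rule, so results can differ when Allow: and Disallow: rules overlap. Treat this as a sanity check rather than a replacement for the tools above.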

Learn more advanced robots.txt topics

You may have other commands in your robots.txt file, like crawl-delay or sitemap:, or you may be curious about what the dollar sign $, caret symbol ^, or asterisk * does within the Disallow: and Allow: directives.
Practically speaking, you should almost never use crawl-delay and almost always have a sitemap: listed.
You can learn more about these advanced robots.txt topics in the advanced robots.txt guide.
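As a rough sketch, the two rules of thumb above (avoid crawl-delay, list a sitemap) could be checked automatically with Python's standard `urllib.robotparser`. The helper `lint_robots`, its warning messages, and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def lint_robots(robots_txt: str, agent: str = "*") -> list[str]:
    """Return warnings for the two rules of thumb (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    warnings = []
    if parser.crawl_delay(agent) is not None:
        warnings.append("crawl-delay is set; consider removing it")
    if not parser.site_maps():  # None when no Sitemap: line is present
        warnings.append("no sitemap: line found")
    return warnings

sample = "User-agent: *\nCrawl-delay: 10\nDisallow:\n"
print(lint_robots(sample))
# prints ['crawl-delay is set; consider removing it', 'no sitemap: line found']
```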
 

Resources & References