The robots.txt file can be a powerful, but potentially ☣️ dangerous SEO tool. Like a hammer, when used properly it can be very useful, but if used improperly you can hurt yourself quite badly.
A misconfigured robots.txt file can prevent search engines from discovering the pages you want to drive traffic to.
This guide is for beginner to intermediate professionals.
The robots.txt file is a special text file that tells web crawlers, like Googlebot, which areas of your website they are ✅ allowed to access and which areas are ⛔️ restricted.
Typically robots.txt files block search engines from crawling:
- Marketing landing pages
- Unpublished content or media
- Registered member only areas
- Admin or login areas
- Unmoderated user generated content like comments
- A development or staging server
While it is OK to not have a robots.txt file, almost every website should have one.
Your robots.txt file must be at the root of your domain, meaning it can't be in a folder.
Your robots.txt file should be a plain text file, not an HTML file. If you visit your robots.txt file in a browser, the content should look fairly plain, in a blocky (monospaced) font.
User-agent: *
Disallow:
Sitemap: https://searchsignals.com/api/sitemap.xml
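Because the file always lives at the root, you can work out where any site's robots.txt should be from any page URL on that site. The sketch below shows one way to do that with Python's standard library; the function name `robots_txt_url` and the example.com URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host, then point at /robots.txt at the root.
    # The folder, query string, and fragment of the page are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/some-post?ref=home"))
# prints https://example.com/robots.txt
```

Note that each subdomain counts as its own site here, so a page on blog.example.com would map to a separate robots.txt file.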
If you're using any sort of website builder like WordPress or a web framework like Next.js, you likely already have a robots.txt file with some Disallow: or Allow: directives.
These directives tell search engines where they can and cannot go.
Disallow: directives are like ⛔️ 'do not enter' signs.
Allow: directives are like 👉 'enter here' signs.
It's important to note that the robots.txt file doesn't prevent "bad bots" from accessing your website, just like road signs don't prevent people from entering or going the wrong way.
Following the robots.txt directives is considered a common courtesy, so major search engines and developers of "good" crawlers choose to abide by this internet etiquette.
The following are commonly used commands found within robots.txt files:
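Allow all crawlers to access everything:
User-agent: *
Disallow:

Block all crawlers from everything:
User-agent: *
Disallow: /

Block a folder but allow one specific file inside it:
User-agent: *
Disallow: /folder-name/
Allow: /folder-name/specific-file.mp3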
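To see how Disallow: and Allow: interact in these examples, here is a minimal sketch of the matching logic. It assumes the "longest matching path wins, and Allow: wins a tie" rule described in the Robots Exclusion Protocol standard (RFC 9309) and followed by Google. It deliberately ignores wildcards, and the function name `is_allowed` and the rule format are invented for this illustration, so treat it as a teaching aid rather than a real crawler's implementation.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide whether `path` may be crawled under one group's rules.

    `rules` is a list of ("allow" | "disallow", path_prefix) pairs.
    The longest matching prefix wins; on a tie, allow wins.
    """
    best_len = -1
    allowed = True  # if no rule matches, the path is allowed
    for kind, prefix in rules:
        if prefix == "":
            continue  # a blank Disallow: restricts nothing
        if path.startswith(prefix):
            is_allow = kind == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed


allow_all = [("disallow", "")]
block_all = [("disallow", "/")]
folder_with_exception = [
    ("disallow", "/folder-name/"),
    ("allow", "/folder-name/specific-file.mp3"),
]

print(is_allowed(allow_all, "/any/page.html"))                              # True
print(is_allowed(block_all, "/any/page.html"))                              # False
print(is_allowed(folder_with_exception, "/folder-name/other-file.mp3"))     # False
print(is_allowed(folder_with_exception, "/folder-name/specific-file.mp3"))  # True
```

Not every parser follows the longest-match rule. Some simpler ones apply the first rule that matches, so the order of your Allow: and Disallow: lines can matter for those tools.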
What if my robots.txt file is empty or doesn't exist at all?
An empty or non-existent robots.txt file allows search engines to crawl every page and file on your site.
Practically speaking, you should block all access for a development or staging server. You should also consider Disallow: for sensitive areas like admin areas of your site.
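You can confirm both behaviors locally with Python's built-in `urllib.robotparser` module. This is a rough sketch: the helper name `can_crawl` and the staging.example.com URLs are placeholders, and Python's parser is not the one Google uses, so it only approximates how a well-behaved crawler would react.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text held in a string and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An empty file places no restrictions on anyone.
empty_file = ""

# What a development or staging server would typically serve.
staging_file = """\
User-agent: *
Disallow: /
"""

print(can_crawl(empty_file, "https://staging.example.com/new-feature"))    # True
print(can_crawl(staging_file, "https://staging.example.com/new-feature"))  # False
```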
The User-agent: command lets you set different rules for different crawlers.
Most commonly you'll see an asterisk * which is a default set of rules for all crawlers.
User-agent: *
Disallow: /restricted-folders/
Disallow: /marketing-landing-pages/
However, you can set different rules for a specific crawler.
For example, it's common to allow Google's AdsBot full access to your site if you are purchasing ads through Google or if you're displaying ads from Google's ad network.
User-agent: *
Disallow: /restricted-folders/
Disallow: /marketing-landing-pages/

User-agent: AdsBot-Google
Disallow:

User-agent: CCBot
Disallow: /
Crawlers look for their own name and follow only the commands listed under it. So in this example, when Google's AdsBot visits your website, it will obey only the one Disallow: directive under the AdsBot-Google section. The blank Disallow: directive means you allow AdsBot to crawl all pages.
It's ❗️ important to note that even though the /marketing-landing-pages/ area was disallowed under the default User-agent: *, AdsBot in this case ✅ can still crawl that area, since crawlers follow only one set of commands.
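The "one set of commands per crawler" behavior can be checked with Python's standard `urllib.robotparser`. The sketch below feeds it the same rules as the example and asks about one URL for three crawlers. The example.com URL is a placeholder, and Python's parser is only a stand-in for how real crawlers read the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /restricted-folders/
Disallow: /marketing-landing-pages/

User-agent: AdsBot-Google
Disallow:

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/marketing-landing-pages/spring-sale"

# Googlebot has no section of its own, so it falls back to the * rules.
print("Googlebot:", parser.can_fetch("Googlebot", url))          # False
# AdsBot-Google uses only its own section, where nothing is disallowed.
print("AdsBot-Google:", parser.can_fetch("AdsBot-Google", url))  # True
# CCBot uses only its own section, which blocks everything.
print("CCBot:", parser.can_fetch("CCBot", url))                  # False
```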
Before making changes to your robots.txt file, you'll want to test it.
A great resource is the robots.txt validator and testing tool on TechnicalSEO.com. You can simply copy & paste and test any URL from your site to ensure the URLs you want blocked are blocked and the ones you want search engines to know about are allowed.
Alternatively, you can use the Google Search Console (GSC), a free tool provided by Google. Within GSC is a robots.txt testing tool. While it takes a little more effort to sign up and register your website, you'll have confidence that your changes are correct since you're getting direct feedback from Google itself. Additionally, there are many other valuable tools within GSC beyond the robots.txt tester.
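If you'd like a quick local check before pasting a draft into one of those tools, a small script can compare your expectations against the draft. The sketch below uses Python's standard `urllib.robotparser`; the function name `failed_expectations`, the draft rules, and the example.com URLs are all hypothetical. Python's parser applies the first matching rule and does not interpret * or $ inside paths, so it can disagree with Google on more complex files. Treat it as a sanity check, not a replacement for the tools above.

```python
from urllib.robotparser import RobotFileParser


def failed_expectations(robots_txt, expectations, agent="Googlebot"):
    """Return the URLs whose crawlability differs from what you expected.

    `expectations` maps each URL to True (should be crawlable)
    or False (should be blocked).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, should_allow in expectations.items()
        if parser.can_fetch(agent, url) != should_allow
    ]


draft = """\
User-agent: *
Disallow: /admin/
"""

expectations = {
    "https://example.com/": True,
    "https://example.com/blog/robots-guide": True,
    "https://example.com/admin/login": False,
}

# An empty list means every expectation held for this draft.
print(failed_expectations(draft, expectations))
```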
If your robots.txt file has other commands like crawl-delay or sitemap:, or if you're curious about what the dollar sign $, caret ^, or asterisk * does within the Disallow: and Allow: directives, those are more advanced topics.
Practically speaking, you should almost never use crawl-delay and almost always have a sitemap: listed.
You can learn more about these advanced topics in the advanced robots.txt guide.