Set up robots.txt correctly! WordPress examples, rules and recommendations

February 04, 2023

Proper indexing of site pages in search engines is one of the important tasks facing the owner of a resource. If unnecessary pages get into the index, they can push more important documents down in search results. To solve such problems, the robots.txt standard (the Robots Exclusion Protocol) was introduced back in 1994.

What is Robots.txt?

Robots.txt is a text file on the site that contains instructions for robots about which pages may be indexed and which may not. It is not a binding instruction for search engines; the rules are advisory in nature. For example, as Google notes, a page blocked in robots.txt can still end up in the index if external links point to it.

In the illustration you can see the indexing of the resource without and with the Robots.txt file.

What should be closed from indexing (a sample robots.txt covering these cases follows the list):

  • website service pages
  • duplicate documents
  • private data pages
  • site search result pages
  • sorting pages
  • login and registration pages
  • product comparison pages
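
As an illustration only, here is a sketch of how such rules might look; all of the paths (/search/, /login/, /register/, /compare/, the sort parameter) are placeholders that must be replaced with the URLs your site actually uses:

User-agent: *
Disallow: /search/      # site search results
Disallow: /login/       # login page
Disallow: /register/    # registration page
Disallow: /compare/     # product comparisons
Disallow: /*?sort=      # sorting pages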

How do I create and add Robots.txt to my site?

Robots.txt is an ordinary text file that can be created in Notepad, following the standard syntax that will be described below. Only one such file is needed for one site.

The file must be added to the root directory of the site and must be accessible at: http://www.YourWebsite.com/robots.txt
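
As a quick, optional sanity check, you can confirm that the file is actually reachable at that address, for example with Python's standard library (the domain below is a placeholder):

import urllib.request

# Hypothetical URL used for illustration only
url = "http://www.YourWebsite.com/robots.txt"

with urllib.request.urlopen(url) as response:
    # 200 means crawlers can fetch the file; the body should contain your rules
    print(response.status)
    print(response.read().decode("utf-8"))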

Robots.txt file syntax

Instructions for search robots are given by directives with different parameters.

User-agent directive

With this directive you can specify which search engine robot the rules that follow are meant for. The robots.txt file must begin with this directive. There are hundreds of such robots on the Web, so if you do not want to list them individually, you can use the following line:

User-agent: *

Where * is a special symbol for any robot.

  • Googlebot is Google's main crawler;
  • Googlebot-Image crawls images;
  • Googlebot-Mobile indexes mobile versions of pages (an example of addressing individual robots follows this list).
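
For instance, rules can be addressed to a specific robot by naming it in User-agent and starting a separate block for it; the paths below are placeholders:

User-agent: Googlebot-Image
Disallow: /private-images/

User-agent: *
Disallow: /admin/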

Disallow and Allow directives

With these directives you can specify which sections or files may be indexed and which may not.

Disallow is the directive that blocks documents on the resource from being indexed. The directive syntax is as follows:

Disallow: /site/

In this example, all pages under YourWebsite.com/site/ are blocked from indexing by search engines.

Note: If this directive is specified with an empty value, the entire site is open for indexing. Specifying Disallow: / closes the entire site from indexing.

  • To disallow a site folder, specify the following:

    Disallow: /folder/

  • To disallow only one file, write:

    Disallow: /folder/img.jpg

  • If you want to disallow only files with a certain extension:

    Disallow: /*.css$

  • Allow is, on the contrary, a directive that permits indexing.

    User-agent: *
    Allow: /site
    Disallow: /

    This instruction prohibits indexing of the entire site except the /site folder (how conflicting Allow and Disallow rules are resolved is shown right after this list).
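
When Allow and Disallow rules conflict, Google resolves them by the most specific (longest) matching path, with Allow winning an exact tie. A small sketch with placeholder paths:

User-agent: *
Disallow: /folder/
Allow: /folder/public.html

Here /folder/ is blocked, but /folder/public.html may still be crawled, because the Allow rule matches a longer, more specific path.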

Sitemap Directive

If your site has a sitemap.xml file describing its structure, you can specify the path to it in robots.txt with the Sitemap directive. If there are several such files, list them all:

User-agent: *
Disallow: /site/
Allow: /
Sitemap: http://site.com/sitemap1.xml
Sitemap: http://site.com/sitemap2.xml

The Sitemap directive can be placed anywhere in the file; it is not tied to a particular User-agent block and applies to all robots.

Host Directive

Host is a legacy directive for specifying the main mirror of a site when it is reachable under several domains. Note that Google does not support this directive and simply ignores it; it was historically honored by some other search engines, and for Google the preferred domain is signaled with redirects and canonical URLs instead. If a crawler that understands Host is among your targets, it is written like this:

User-agent: *
Disallow: /site/
Host: YourWebsite.com

Note: If the main mirror of the site uses the https protocol, it must be specified together with the protocol:

Host: https://YourWebsite.com

The Host directive is taken into account only once; if the file contains two Host directives, robots that support it consider only the first one.

Crawl-delay directive

If search engine crawlers visit a resource too often, this can increase the server load (relevant for resources with a large number of pages). To reduce the load, you can use the Crawl-delay directive.

The parameter for Crawl-delay is the time in seconds: it tells robots to request pages from the site no more often than once in the specified period. Note that Googlebot ignores Crawl-delay; it is honored by some other crawlers, such as Bingbot.

Example of Crawl-delay directive usage:

User-agent: *
Disallow: /site
Crawl-delay: 4

Features of the Robots.txt file

The main formatting rules are listed below; a short sample file illustrating several of them follows the list.

  • Each directive goes on its own line; do not list several directives on one line
  • No other characters (including spaces) should precede a directive
  • A directive's parameters must be specified on a single line
  • Rules in robots.txt are written in the form: [Directive_name]:[optional space][value][optional space]
  • Parameters do not need to be enclosed in quotes or other characters
  • Do not put ";" after directives
  • A blank line is interpreted as the end of a User-agent block; blank lines elsewhere can be ignored
  • Comments can be added after the # sign (if a comment continues on the next line, that line must also start with #)
  • Directive names are case-insensitive, but the paths in their values are case-sensitive, and the file itself should be named robots.txt in lowercase
  • If the robots.txt file is missing (the server returns a 404) or empty, it is treated as "Disallow:" with an empty value, i.e. everything may be indexed; note that Google processes only roughly the first 500 KB of the file
  • Only one parameter can be specified in an Allow or Disallow directive
  • In Allow and Disallow directives, directory parameters are specified with a leading slash (for example, Disallow: /site)
  • Using Cyrillic characters in robots.txt is not allowed
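
A short sample file illustrating several of the rules above (placeholder paths, comments after #, a blank line separating the User-agent blocks):

# Rules for all robots
User-agent: *
Disallow: /tmp/        # one parameter per Disallow

# Rules for Googlebot only
User-agent: Googlebot
Disallow: /drafts/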

Example of using the Host directive

As noted above, the Host directive is ignored by Google, but if you also target crawlers that honor it, a robots.txt using it might look like this:

User-agent: *
Disallow: /site
Disallow: /admin
Disallow: /users
Disallow: */templates
Disallow: */css
Host: www.site.com

In this case, the Host directive tells supporting robots that the main mirror of the site is www.site.com (the directive is advisory in nature).

Features of setting up robots.txt for Google

For Google, the main peculiarity is that the company recommends not blocking CSS styles and JS scripts from its robots, because Googlebot needs them to render pages. In this case, the robots.txt will look like this:

User-agent: Googlebot
Disallow: /site
Disallow: /admin
Disallow: /users
Disallow: */templates
Allow: *.css
Allow: *.js

With these Allow directives, the style and script files remain available to Googlebot so that pages can be rendered correctly; allowing them to be crawled does not mean the .css and .js files themselves will show up as search results.

Check if robots.txt is set up correctly

Once the file is in place, it is worth checking which pages are actually allowed or disallowed for crawling.

One convenient option is the robots.txt testing tool in Google Search Console, which shows how Googlebot interprets your rules. This tool is only available if the site has been added to Google Search Console.
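
If you prefer to check rules locally, Python's standard urllib.robotparser module implements the basic exclusion rules and can serve as a rough, unofficial sanity check (it does not support every Google-specific extension such as wildcards, so treat its verdicts as approximate). The URLs below are placeholders:

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.YourWebsite.com/robots.txt")
parser.read()

# True means the parsed rules allow this robot to fetch the page
print(parser.can_fetch("Googlebot", "http://www.YourWebsite.com/wp-admin/"))
print(parser.can_fetch("*", "http://www.YourWebsite.com/blog/post-1/"))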

Conclusion

Robots.txt is an important tool for controlling how search engines index your site. It is very important to keep it up to date: do not forget to open the necessary documents for indexing and to close the pages that could hurt the resource's ranking in search results.

Example of setting up robots for WordPress

A correct robots.txt for WordPress can be composed like this (the Host directive is omitted because Google ignores it):

User-agent: Googlebot
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: /search
Disallow: */page/
Disallow: /*print=
Allow: *.css
Allow: *.js

User-agent: *
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: /search
Disallow: */page/
Disallow: /*print=

Sitemap: http://YourWebsite.com/sitemap.xml
