Robots.txt Guide for SEO Beginners & Professionals

Every SEO specialist sooner or later has to create and configure robots.txt. A well-written file helps pages get indexed faster and rank higher in search results for relevant queries. We have written a simple guide for beginner SEO professionals: what an index file is and how to configure it properly.

What is the robots.txt file for?

The robots.txt file is a UTF-8 encoded text document that restricts crawler access to website content (sections and pages). It works for URLs using the HTTP, HTTPS, and FTP protocols.

It is mainly needed to:

Hide pages not intended for publication
Optimize the crawling budget
Prevent content duplication

Pages typically closed from indexing include the admin panel, site search results, registration and authorization pages, feeds, and blank or under-development pages.
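For illustration, a rough sketch of such rules (the section paths here are hypothetical and depend on the CMS):

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /register/
Disallow: /login/
Disallow: /feed/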

Crawling budget – the limit on how many pages search robots can crawl within a given time interval. It is calculated taking into account user demand and server availability.

Sometimes the noindex directive in the robots meta tag is used instead of the index file, for example to pass on the link weight of a page that is being removed from the index. To do this, add the <meta name="robots" content="noindex, follow"> meta tag to <head>.

Important: robots.txt directives and the noindex instruction in the robots meta tag act as recommendations and can be ignored by robots.

Robots.txt instructions

Before you start creating the file, make sure the site does not already have a robots.txt. The easiest way to check is to enter the site URL in a browser and append /robots.txt. One of three things will happen:

You will find an existing, filled-in (though perhaps not very thorough) file
You will discover an almost empty but configured robots file
You will get a 404 error because the page does not exist

Quick Start Guide:

Create a text document with UTF-8 encoding and fill it in
Save it as robots in txt format
Check it and make adjustments
Place robots.txt in the root directory of the site

Before doing so, you should familiarize yourself with the instructions for filling out the file, its directives, and its syntax.

General requirements

The name is written in lowercase – robots.txt
UTF-8 encoding
Format – txt
Size up to 500 KiB
Location at the root of the site
Only one such file per site
Accessible via the required protocol and port number

Page addresses in the file use the same encoding as the site structure.
Please note that for websites with subdomains, a separate robots.txt is specified in the root of each subdomain.
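For example, a site with a blog on a subdomain would need two separate files (the domains are hypothetical):

https://site.com/robots.txt
https://blog.site.com/robots.txt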

Directives and Syntax

Directives give instructions to search robots. Each one is written on a new line. Let's look at their purpose and features:

1. The mandatory User-agent directive. It is used to set the rules for each robot:
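A minimal sketch with separate rule sets for Yandex, Google, and all other robots (the blocked section is hypothetical):

User-agent: Yandex
Disallow: /search/

User-agent: Googlebot
Disallow: /search/

User-agent: *
Disallow: /search/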

 

Search engines pick out the specific rules that apply to them and may ignore the instructions under *. It is therefore recommended to declare a separate agent for each of them, separating the rule sets with a blank line.

2-3. Allow and Disallow regulate access to content for indexing: the first directive opens it, the second closes it. A single slash (/) in Disallow stops crawlers from crawling the entire site: Disallow: /

However, Disallow with an empty value blocks nothing and is equivalent to Allow.
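The individual cases, side by side (these are separate illustrations rather than one working group of rules):

Disallow: /     # the entire site is closed to crawling
Disallow:       # an empty value closes nothing
Allow: /        # the entire site is explicitly open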

Consider a special case:
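Such a combination might look roughly like this (the post address is hypothetical):

User-agent: *
Allow: /blog/some-post.html
Disallow: /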

In this combination, robots see only the specified blog post; the rest of the content is not available to them.

The path of a page is written in full; a section path ends with a slash (/);
Allow and Disallow rules are sorted by URL prefix length (from shortest to longest). If several rules match a page, the last one in the sorted list (the one with the longest prefix) takes precedence (see the sketch after this list);
The special characters * and $ are supported.
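A sketch of how this sorting works (the /catalog/ section is hypothetical):

User-agent: *
Disallow: /catalog/
Allow: /catalog/shoes/

Here pages under /catalog/shoes/ remain open because the Allow rule has the longer prefix, while the rest of /catalog/ stays closed.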

4. Sitemap specifies the location of the sitemap in XML format. This navigation file contains the URLs of the pages that need to be indexed. After each crawl, the robot updates the site information in the search results, taking into account all changes in the file.

Example: Sitemap: https://site.com/sitemap.xml

Place it anywhere in the document, without duplication
When filling it out, specify the full URL
Large sitemaps should be split into several files (see the example after this list)
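For example, a split map can be listed on several lines (the file names are hypothetical):

Sitemap: https://site.com/sitemap-posts.xml
Sitemap: https://site.com/sitemap-pages.xml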

5. Clean-param is an optional directive and is supported only by Yandex.

It excludes dynamic parameters (such as UTM tags) and GET parameters. Such data does not affect page content and therefore should not be indexed.

Parameters are listed separated by "&", followed by the path prefix of all or individual pages to which the rule applies:

Clean-param: parm1&parm2&parm3 /

Clean-param: parm1&parm2&parm3 /page.html

If there are several pages with duplicate information, it is more expedient to reduce their addresses to one:

Clean-param: ref /some_dir/get_products.pl – applies to page addresses such as:

www.robot.com/some_dir/get_products.pl?products_id=123

www.robot.com/some_dir/get_products.pl?ref=site_1&products_id=123

www.robot.com/some_dir/get_products.pl?ref=site_2&products_id=123

www.robot.com/some_dir/get_products.pl?ref=site_3&products_id=123

The ref parameter is used only to track the resource from which the request came, so the robot will reduce all of these addresses to www.robot.com/some_dir/get_products.pl?products_id=123.

The rule must be no more than 500 characters long
Parameter names are case sensitive
Can be placed anywhere in the document
Reduces server load and speeds up indexing, since crawlers do not waste time scanning duplicate pages

6. Crawl-delay sets the minimum interval between page crawls.

Example: Crawl-delay: 2 – a 2-second interval.

Ignored by Google
For Yandex, it is better to configure the crawl rate in Yandex.Webmaster
Allows you to slow down crawling when the server is overloaded

7. The Host directive specifies the main mirror of the site in order to avoid duplicates in the search results. If several values are given, only the first is taken into account; the rest are ignored.
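Historically, the directive looked like this (the domain is hypothetical):

Host: https://site.com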

Not supported by Google; ignored by Yandex since March 20, 2018
Replaced by a 301 redirect

Crawlers interpret directives differently. Yandex follows the rules described in the file, while Google is guided by its own principles. Therefore, when working with Google, it is recommended to close pages with the robots meta tag as well.

Special characters “/, *, $, #”

An asterisk (*) matches any sequence of characters. The $ character marks the end of the URL and cancels the implied trailing asterisk (*).

Comments are placed after the hash sign "#" on the same line; their contents are ignored during crawling.

The slash "/" hides content from crawlers. A single slash in Disallow closes the entire site to crawling. A separate directory is closed by enclosing its name between two slashes.
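A sketch combining the special characters (the paths and file type are hypothetical):

User-agent: *
Disallow: /search/     # closes the /search/ directory
Disallow: /*.pdf$      # closes all URLs ending in .pdf
# the $ sign keeps the rule from matching addresses like /file.pdf?page=2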

We collect data and determine which pages are needed and which are "garbage". Taking these into account, we fill out the document, keeping the requirements and instructions in mind, and end up with a finished robots.txt.
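A rough sketch of what such a file might look like (all paths, parameters, and the sitemap URL are hypothetical):

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /*?utm_
Allow: /*.css
Allow: /*.js

Sitemap: https://site.com/sitemap.xml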

We open access to styles and scripts so that pages render correctly; otherwise the content cannot be indexed properly, which will negatively affect the site's positions.

We add Clean-param if the site has dynamic links or passes parameters in URLs. Using Crawl-delay is also optional and makes sense when the server is under load.

Blank lines are allowed only between groups of instructions for different agents.
At a minimum, the document should contain a User-agent line and a Disallow directive.
Unique rules apply to robots depending on the site type and CMS.
Directives remain in effect for some time if the crawler loses access to the index file.
A blocked page may still appear in the index if a link to it is placed on the site itself or on a third-party resource.
Note: completely blocking crawler access is the biggest mistake in using the index file. Search engines will stop crawling the resource, which can negatively affect organic traffic. We recommend supplementing and updating the file only after testing each new rule, so that errors can be corrected in time. When creating and changing robots.txt, apply the golden rule: fewer lines, more meaning.

If you decide not to implement the index file, crawlers will scan the resource without restrictions. For small sites, the absence of such a file is not critical. Otherwise, take the crawling budget into account and implement the robots document.

Important: robots.txt is a public file. Since there is still a chance that blocked content gets indexed, make sure that pages with sensitive information are protected with passwords and noindex.
