Site indexing in Google: setting up, checking, and accelerating indexing
August 10, 2023
An important component of search engine optimization is work on internal factors. These include managing how a resource is indexed, that is, configuring its interaction with search engine robots. This should be addressed at the design stage, which helps avoid promotion problems in the future.
What are site indexing and crawl budget?
Site indexing is the process by which search engine robots crawl the pages of a web resource and add the collected information to the search engine's database. For a resource to appear in Google's results, it must first be crawled and added to the index.
Search engine robots visit resource pages regularly, but the frequency of their visits depends on several factors:
- frequency of content changes;
- the number of pages on the site;
- traffic volume.
Search engine robots learn about new pages from links that appear on documents they already know, as well as from traffic to them from various sources.
Keep in mind that in a single visit, a robot processes only a certain number of the site's pages. This is because search robots try not to overload the server with their requests. But how is this limit on crawled documents determined?
In early 2017, Google's Gary Illyes described the concept of crawl budget, which combines two indicators: the crawl rate limit and crawl demand (the number of documents Googlebot wants to crawl, based on the popularity of the resource and the freshness of its content). In short, by crawl budget Google means the number of pages of a site that Googlebot can and wants to crawl.
Internal website factors that reduce the crawl rate (according to Google):
- indexable documents with session identifiers, filtering or search parameters, or UTM tags in the URL;
- page duplicates;
- documents returning error responses, including soft 404 pages;
- pages with low-quality and spammy content.
Ways to manage site indexing
To spend the crawl budget efficiently, you need to manage site indexing correctly: give robots the opportunity to index only those pages that matter for the promotion of the resource.
Canonical attribute
By setting the canonical address (rel="canonical"), you can explicitly tell search engines which page is preferred for indexing. The canonical attribute should be set when the site contains documents with identical content:
- pagination pages;
- UTM-tagged pages;
- filter pages;
- etc.
To set a canonical page, add the following tag to the head section: <link rel="canonical" href="...">
If the page should participate in search, its own URL is specified in the href attribute; if it should not, being a full or partial duplicate, the address of the canonical document is specified instead.
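For example, a UTM-tagged copy of a catalog page could declare its clean version as canonical like this (the domain and path are hypothetical):

```html
<!-- Placed in the <head> of the duplicate page,
     e.g. https://example.com/catalog/shoes/?utm_source=newsletter -->
<!-- Tells search engines to index the clean URL instead of the tagged duplicate -->
<link rel="canonical" href="https://example.com/catalog/shoes/">
```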
Robots.txt file
With the robots.txt file, located in the root of the website, you can give instructions to search engine robots:
- the Disallow directive closes the specified pages from crawling;
- User-Agent specifies the search engine robot for which the instructions are written;
- Crawl-delay sets the minimum interval between robot requests to resource pages (Google ignores this directive);
- Clean-param excludes pages with the specified dynamic parameters from crawling (a Yandex directive; Google does not support it).
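Put together, a minimal robots.txt might look like this (the paths and parameter names are illustrative):

```
User-Agent: *
Disallow: /admin/                    # close technical pages from crawling
Disallow: /search/                   # close internal search results
Crawl-delay: 5                       # delay between requests (ignored by Google)
Clean-param: utm_source /catalog/    # strip a dynamic parameter (Yandex only)
```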
Robots meta tag
This meta tag controls the indexing of an individual page. To configure it, add the following to the head section: <meta name="robots" content="...">
List of parameters of the robots meta tag:
- index - permission to index the document;
- noindex - prohibit indexing of the document;
- follow - permission to follow links on the page;
- nofollow - prohibit following links on the page;
- all - equivalent to specifying content="index, follow";
- none - equivalent to specifying content="noindex, nofollow".
If the meta tag is absent from the page code, this is treated as permission to index the document and follow its links.
Note that when a page is closed from indexing this way, the robot still "spends" crawl budget reading it, so this meta tag is best suited for prohibiting link following.
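For example, to keep a page out of the index while still letting the robot follow its links, the tag could look like this:

```html
<!-- In the <head>: do not index this page, but follow the links on it -->
<meta name="robots" content="noindex, follow">
```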
Which pages should be closed from indexing?
The following types of pages should be prevented from being indexed by search engines:
- For pagination pages, specify the canonical address (do not close such pages with the robots meta tag or robots.txt: a wide assortment is one of the important commercial factors);
- Technical pages (without useful content) should be closed in robots.txt;
- Pages with personal information (personal account, registration, etc.) should be closed in robots.txt;
- For pages generated when sorting products in the catalog, specify the canonical address;
- Printable versions of pages should be closed in robots.txt;
- Pages with site search results should be closed in robots.txt or with the robots meta tag if they cannot be optimized to attract additional traffic.
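The robots.txt recommendations above can be sketched as follows (the directory names are hypothetical and depend on the CMS):

```
User-Agent: *
Disallow: /cart/        # technical page without useful content
Disallow: /account/     # personal account
Disallow: /print/       # printable versions of pages
Disallow: /search/      # site search results
```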
Competent indexing management helps optimize the crawl budget and direct its limits to the pages being promoted.
Checking indexed pages
You can use several methods to check the correct indexing of a resource.
Check indexing in Google panels
In Google Search Console, you can see:
- the number of indexed pages;
- the number of pages blocked by the robots.txt file.
Using search operators
Search engines support special search operators that refine a query. For example, the "site:" operator shows the approximate number of indexed pages.
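For example, queries with the "site:" operator might look like this (example.com is a placeholder domain):

```
site:example.com              all indexed pages of the domain
site:example.com/blog/        indexed pages within one section
site:example.com inurl:utm    indexed pages with "utm" in the URL
```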
Checking indexing with RDS bar
RDS bar is a plugin for the Google Chrome and Mozilla Firefox browsers that is displayed as an additional toolbar. It allows you to quickly view the main indicators of a resource:
- the number of indexed pages in Google;
- whether the current page is indexed in Google;
Programs to check indexing
To automate the process of analyzing internal web resource errors and indexing problems, there are special tools - site and search engine index parsers:
- Netpeak Spider allows you to check page response codes and canonical addresses, and to see whether a page is closed in robots.txt or with the robots meta tag.
- Comparser is a specialized program for deep analysis of site indexing that can perform the following operations:
- scanning of pages of the entire web resource (responses and canonical addresses);
- scanning of the Google search engine index;
- search for pages that are in a search engine's index but have no internal links pointing to them on the site;
- automatic removal of unnecessary pages from the Google index.
Causes of pages dropping out of the index
A large number of landing pages dropping out of Google search leads to a drop in the site's positions and traffic. The main reasons pages drop out of the search engine index are:
- a 301 or 302 response (a redirect to another document is configured);
- duplicate pages (e.g. pagination, filtering, sorting and other types of pages with duplicated metadata and content);
- erroneous blocking of a site section or page in the robots.txt file or with the robots meta tag;
- a 404 response;
- a 5xx response, indicating hosting or CMS failures that make pages unavailable to search robots for a long time.
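As an illustration of the list above, here is a small Python sketch (not part of the article's toolset) that maps a page's HTTP response code to the likely indexing problem:

```python
def indexing_issue(status_code: int) -> str:
    """Return the likely reason a page with this HTTP response code
    may drop out of the search engine index."""
    if status_code in (301, 302):
        return "redirect: the page forwards to another document"
    if status_code == 404:
        return "not found: the page no longer exists"
    if 500 <= status_code <= 599:
        return "server error: hosting/CMS failure, page unavailable to robots"
    if status_code == 200:
        return "ok: check for duplicates or robots.txt / meta robots blocking"
    return "unexpected response: inspect manually"

print(indexing_issue(302))  # redirect: the page forwards to another document
```

A crawler such as Netpeak Spider reports these codes for every page, so a check like this is simply the triage logic applied to its output.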
To keep the resource's landing pages from dropping out of the search engine index, monitor the site's technical optimization and fix emerging errors promptly. If a search engine has already removed a page from search, use the following algorithm:
- Determine the reason for dropping out of the index;
- Eliminate the cause;
- Send the dropped page for indexing (re-indexing).
Accelerated indexing methods
If a page is new, or has dropped out of the index for a reason that has since been fixed, you can speed up its addition to the index using the following methods:
- Specifying the page(s) in the sitemap.xml file with the update date and indexing priority;
- Submitting the URL for recrawling in Google Search Console (the URL Inspection tool);
- Posting links to the document on external resources;
- Posting links to the document on social media;
- Attracting immediate traffic with good engagement (even an e-mail newsletter can serve as the source);
- Setting up internal linking on the site correctly.
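The sitemap.xml method can be illustrated with a minimal entry (the URL and values are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/new-page/</loc>
    <lastmod>2023-08-10</lastmod>   <!-- date the page was last updated -->
    <priority>0.8</priority>        <!-- relative priority, 0.0 to 1.0 -->
  </url>
</urlset>
```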
Indexing management is an important part of promotion. Unlike work on external search engine optimization factors, the ability to influence page indexing is always available, and changes are reflected in the search engine index faster. Still, it is best to set up competent interaction between the site and search robots at the development stage.
It is important to track all internal errors on the site in time so they can be eliminated before search engines remove pages from the index. If that has already happened, promptly submit the dropped (or new) pages for indexing.