Robots.txt: What it is and how it works for a website (Googlebot, search engine crawlers, crawler traffic)

What is robots.txt? It's a technical file with access rules for search engine robots: it tells Googlebot and other search engine crawlers which sections of a site can be crawled and which are best left alone. Essentially, it's one of the basic tools for controlling crawling and distributing crawler traffic as part of systematic website promotion.

Definition and role of the robots.txt file for a website

A robots.txt file is a text document that contains directives (such as Disallow and Allow) for different robots. When a bot visits a website, it first checks robots.txt and then decides which URLs to request. This affects:

  • saving crawl budget (fewer unnecessary requests);
  • prioritizing the crawling of important pages;
  • reducing server load by managing crawler traffic.
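
The "check robots.txt first, then decide which URLs to request" step can be sketched with Python's standard urllib.robotparser module. The rules and URLs below are invented for illustration, and the parser is fed text directly instead of fetching a live file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real crawler downloads /robots.txt.
RULES = """\
User-agent: *
Disallow: /search
Disallow: /cart
"""

def make_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = make_parser(RULES)
# A well-behaved crawler asks this question before requesting each URL.
print(parser.can_fetch("Googlebot", "https://example.com/catalog/shoes"))
print(parser.can_fetch("Googlebot", "https://example.com/search?q=shoes"))
```

The first URL is reported as fetchable and the second is not, so the crawler skips internal search and spends its requests elsewhere.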

Important: robots.txt controls crawling; it does not guarantee that a page is removed from search. Removal often requires other mechanisms (see the sections on noindex and X-Robots-Tag below).

Where should robots.txt be located and what protocols are taken into account?

Robots.txt is placed in the top-level directory (the root of the site) so that it is accessible at an address like https://example.com/robots.txt. If the file is located deeper (for example, /folder/robots.txt), robots ignore it.

Also keep in mind that the rules apply per host and protocol. That is, http and https are different versions, and during migrations it is important to check that the file is available and up to date on each of them. The robots.txt specification also covers other schemes, including ftp, but for SEO the ones that matter in practice are HTTP and HTTPS.
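
To make the "root of the site, per host and protocol" rule concrete, here is a small sketch that derives the robots.txt address governing any page URL. The function name robots_url is our own, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Scheme and host (with port) are kept; path, query and fragment are dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/folder/page.html?x=1"))
print(robots_url("http://example.com/folder/page.html"))
```

The two calls return different addresses because the scheme differs, which is why the file has to be checked separately on the http and https versions.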

How Googlebot and other crawlers read rules (robots.txt specification and UTF-8)

According to the robots.txt specification, the bot selects the block of rules whose User-agent matches it best (for example, Googlebot) and, for each URL, applies the most specific matching directive. Frequently used directives:

  • Disallow - prohibits crawling of the path;
  • Allow - allows access within a prohibited path (useful for precise exceptions).
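
How an Allow exception wins inside a Disallow zone can be sketched as a "longest matching rule" check. This is a simplified illustration, not a full implementation of the specification: it handles plain path prefixes only (no wildcards), and the /account/ paths are hypothetical.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide access by the most specific (longest) matching rule.

    rules: ("allow" | "disallow", path_prefix) pairs from one User-agent group.
    Simplification: plain prefix matching only, no * or $ wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means crawling is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = kind == "allow"
        # Longer match wins; on a tie the less restrictive Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

rules = [("disallow", "/account/"), ("allow", "/account/help/")]
print(is_allowed("/account/orders", rules))
print(is_allowed("/account/help/returns", rules))
```

Here /account/orders is blocked, while /account/help/returns stays open because the Allow rule is the longer, more specific match.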

The file must be correctly encoded: save robots.txt in UTF-8 so that rules and paths are unambiguous, especially if URLs contain non-ASCII characters.
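
A minimal sketch of what "correctly encoded" means in practice: the file's bytes must decode as UTF-8, and a non-ASCII path corresponds to the percent-encoded form of its UTF-8 bytes. The helper names and the /каталог/ path are illustrative assumptions.

```python
from urllib.parse import quote

def decode_robots(raw: bytes) -> str:
    """Decode robots.txt bytes as UTF-8, dropping a leading BOM if present.

    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    return raw.decode("utf-8-sig")

def encoded_path(path: str) -> str:
    """Percent-encode a path's UTF-8 bytes, as it appears in a request."""
    return quote(path, safe="/")

text = decode_robots("User-agent: *\nDisallow: /каталог/\n".encode("utf-8"))
print(text.splitlines()[1])
print(encoded_path("/каталог/"))
```

If the same file were saved in a legacy encoding such as Windows-1251, decode_robots would raise an error instead of returning rules, which is a simple way to catch the problem before a crawler does.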

Robots.txt is about crawling and resource control, not about “hiding everything from Google.”

At Web-Raketa we treat robots.txt as part of "a complete guide to website indexing": it helps you build a strategy instead of chaos by directing robots to the pages that bring organic, converting traffic.

Setting Up Robots.txt for SEO: Disallow/Allow, How to Close Pages, and Examples

Basic syntax: User-agent, Disallow, Allow

Once you understand what robots.txt is, the next step is to configure it so that search robots spend their time on pages that actually generate sales. The logic is simple: we manage crawling so that visibility in Google grows through priority crawling and indexing of key categories, product pages, and content rather than endless technical URLs.

Basic Directives:

  • User-agent - which robot the rules apply to (e.g. Googlebot, or * for all robots);
  • Disallow - prohibits crawling of the specified path;
  • Allow - an exception to a prohibition (permission within a prohibited zone).
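
To show how these three directives fit together in a file, the sketch below reads robots.txt text and groups the Allow/Disallow rules under the User-agent they belong to. It is a simplified reader written for this article (the name parse_groups and the sample rules are our own), not a complete parser.

```python
def parse_groups(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group Allow/Disallow rules by the User-agent they apply to."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current: list[str] = []   # user-agents of the group being read
    in_rules = False          # True once the group has at least one rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:      # a User-agent after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current:
            in_rules = True
            for agent in current:
                groups[agent].append((field, value))
    return groups

SAMPLE = """\
User-agent: Googlebot
Disallow: /tmp/

User-agent: *
Disallow: /search
Allow: /search/help
"""
groups = parse_groups(SAMPLE)
print(groups["googlebot"])
print(groups["*"])
```

Note that a crawler with its own group, such as Googlebot here, generally follows only that group and not the * group, so shared rules need to be repeated in both.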

An important principle: setting up robots.txt for SEO is a strategy, not a mess. Close only the URLs that waste crawl budget and add no search value.

How to close pages in robots.txt: common tasks for stores and services

In Ukraine, this task most often comes up for online stores with filters and parameters, where thousands of duplicate URLs can appear. It's also worth limiting the crawling of internal search and service sections.

Sections that are typically closed:

1) filters and sorting with parameters (to avoid duplicates);
2) internal search;
3) shopping cart, account, checkout;
4) temporary technical sections (e.g., /tmp/).
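
As a rough sketch, the four typical tasks can be turned into a rules file with a small generator. Every path and parameter name below is a hypothetical placeholder to be replaced with your site's real URLs, and the * wildcard is assumed to be supported by the target crawlers (major ones such as Googlebot do support it).

```python
from typing import Sequence

def build_robots(disallow: Sequence[str], allow: Sequence[str] = (),
                 user_agent: str = "*") -> str:
    """Render one User-agent group as robots.txt text."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow]
    lines += [f"Allow: {path}" for path in allow]
    return "\n".join(lines) + "\n"

# Hypothetical paths for the four typical tasks; replace with your site's own.
CLOSED = [
    "/*?*sort=",    # 1) sorting parameter anywhere in the query string
    "/*?*filter=",  # 1) filter parameter anywhere in the query string
    "/search",      # 2) internal search
    "/cart",        # 3) shopping cart
    "/account",     # 3) account
    "/checkout",    # 3) checkout
    "/tmp/",        # 4) temporary technical sections
]
print(build_robots(CLOSED), end="")
```

Generating the file from a reviewed list keeps the closed sections in one place, which makes accidental blocking easier to spot during code review.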

It's important not to accidentally close CSS/JS or images if they're needed for proper page rendering and evaluation—otherwise, you could lose indexing quality.
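
One way to catch accidentally blocked rendering resources is to test a page's CSS, JS, and image URLs against the rules before publishing them. The sketch below uses Python's standard urllib.robotparser, which does plain prefix matching, so it is only suitable for rules without wildcards; the rules and resource URLs are invented.

```python
from urllib.robotparser import RobotFileParser

def blocked_resources(robots_text: str, urls: list[str],
                      agent: str = "Googlebot") -> list[str]:
    """Return the URLs from `urls` that the rules forbid `agent` to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]

# Hypothetical rules: /assets/ was closed together with the technical sections.
RULES = "User-agent: *\nDisallow: /tmp/\nDisallow: /assets/\n"
RESOURCES = [
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
    "https://example.com/images/logo.png",
]
print(blocked_resources(RULES, RESOURCES))
```

In this example the stylesheet and the script are reported as blocked, which is the signal to narrow the Disallow rule or add an Allow exception for them.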

Robots.txt Example: Careful Crawl Management Without Losses

Below is an example that can be adapted to suit your website:

Line | What it does
User-agent: * | Rules for all robots
Disallow: /search | Closes internal search
Disallow: /cart | Closes the shopping cart
Disallow: /? | Limits parameterized URLs that start with /? (use /*? to cover parameters on any path; check that important pages are not blocked)
Allow: /catalog/ | Leaves key sections open
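
Before adapting the Disallow: /? line from the example, it helps to see which URLs a pattern actually covers. The sketch below is a simplified pattern matcher that assumes the wildcard behavior supported by major crawlers (* for any run of characters, a trailing $ for end of URL); the function name and test URLs are our own.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """True if a robots.txt path pattern covers the URL path.

    Patterns match from the start of the path; * stands for any run of
    characters and a trailing $ anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None

for url in ["/?sort=price", "/blog/?page=2", "/search?q=shoes"]:
    print(url, matches("/?", url), matches("/*?", url), matches("/search", url))
```

The pattern /? covers only parameterized URLs at the site root, while /*? also covers /blog/?page=2 and /search?q=shoes. Separately, if the crawler resolves conflicts by the longest matching rule, a URL such as /catalog/?sort=price would remain crawlable under either pattern, because Allow: /catalog/ is the longer match.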

Limitation: even a perfectly configured robots.txt file won't remove a page from the index if it is already in search results or has external links pointing to it. Removing pages or controlling indexing requires other tools (such as meta robots "noindex" or HTTP headers). Robots.txt still matters for converting traffic: it helps the robot find commercially important pages faster and crawl them more often.

"A proper robots.txt file is all about focus: less junk crawling, more attention to the pages that generate leads."

Robots.txt Errors and Related Tools: noindex vs. robots.txt, X-Robots-Tag, Password Protection

Common Robots.txt Errors and How to Diagnose Them

Understanding what robots.txt is matters, but avoiding mistakes that cut organic traffic matters even more. In practice, we most often see problems when a file "accidentally" blocks something that should be ranking.

Typical robots.txt errors:

  • blocking important sections (categories, product pages, blog) with an overly general Disallow;
  • blocking CSS/JS/images, which makes Google render the page worse and possibly misjudge its content;
  • Allow/Disallow conflicts (the rules are written so that the robot applies a different one than intended);
  • incorrect paths (a typo, a wrong slash, or the wrong letter case, since paths are case-sensitive);
  • incorrect encoding (we recommend UTF-8 without any “exotic” characters), which makes the rules ambiguous;
  • the file is not in the root of the site or is unavailable (404/403), in which case robots fall back to their default behavior, typically crawling without restrictions.
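
Some of these mistakes can be caught mechanically. The sketch below is a deliberately small linter written for this article: it flags only three patterns (a rule before any User-agent, a path without a leading slash, and a site-wide Disallow) and is no substitute for checking real URLs.

```python
def lint_robots(text: str) -> list[str]:
    """Return human-readable warnings for a few common robots.txt mistakes."""
    warnings = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            seen_agent = True
        elif field in ("allow", "disallow"):
            if not seen_agent:
                warnings.append(f"line {number}: rule before any User-agent")
            if value and not value.startswith(("/", "*")):
                warnings.append(f"line {number}: path should start with '/'")
            if field == "disallow" and value == "/":
                warnings.append(f"line {number}: blocks the entire site")
    return warnings

BROKEN = "Disallow: /tmp/\nUser-agent: *\nDisallow: /\nDisallow: cart\n"
for warning in lint_robots(BROKEN):
    print(warning)
```

Running a check like this in a deployment pipeline is one way to stop a blanket Disallow: / from a staging environment reaching production.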

Diagnostics: Check /robots.txt accessibility, compare actual URLs with the rules, and review Google Search Console reports (crawling, indexing). If visibility has dropped, start by checking robots.txt and blocked resources.
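
When checking accessibility, the status code of /robots.txt matters as much as its contents. The mapping below is a rough sketch of how crawlers are generally documented to react (RFC 9309 and Google's documentation describe behavior along these lines); exact handling varies between crawlers, so treat the labels as assumptions to verify.

```python
def crawl_policy(status: int) -> str:
    """Map the HTTP status of /robots.txt to a crawler's likely reaction."""
    if 200 <= status < 300:
        return "parse rules"
    if status == 429 or status >= 500:
        # Server trouble: crawlers tend to pause crawling to be safe.
        return "assume fully disallowed"
    if 400 <= status < 500:
        # File treated as missing: nothing is restricted.
        return "assume no restrictions"
    return "follow redirect"

for code in (200, 301, 403, 404, 503):
    print(code, crawl_policy(code))
```

The practical consequence: a 404 or 403 silently removes all your restrictions, while a persistent 5xx can stall crawling of the whole site.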

"Robots.txt should control crawling, not accidentally exclude your business from search."

Noindex vs. robots.txt: Which to Choose for Indexing?

Robots.txt controls crawling: the robot may not visit the page, but this does not guarantee that the URL will stay out of search results (for example, if there are links to it). Noindex (meta robots) is a signal specifically about indexing: the page can be crawled but is not added to the index. For noindex to work, the page must not be blocked in robots.txt, otherwise the robot never sees the tag.

Practical guideline:

  • If your goal is to avoid wasting crawl budget on "junk" URLs (filters, search), use robots.txt.
  • If your goal is to remove a page from search results but leave it crawlable (for example, a thank-you page or a technical duplicate), noindex is more suitable.
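
To verify that a page really carries the noindex signal, you can look for the meta robots tag in its HTML. Below is a minimal sketch using Python's standard html.parser; the class and function names are our own, and the sample markup is invented.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())

def has_noindex(html: str) -> bool:
    """True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

THANK_YOU = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(THANK_YOU))
print(has_noindex("<html><head><title>Catalog</title></head></html>"))
```

The check only makes sense for pages the robot is allowed to crawl, since a tag on a blocked page is never read.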

X-Robots-Tag and Password Protection: When Robots.txt Isn't Suitable

The X-Robots-Tag is an HTTP header that allows you to set indexing rules for files and server responses (PDFs, images, or entire templates) where adding a meta tag is difficult. It is convenient for systemic indexing control at the server level.
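
As an illustration of server-level control, the sketch below adds an X-Robots-Tag header to PDF responses in a minimal WSGI application. The rule itself (noindex for every .pdf) and the function names are assumptions made for the example; in production this is usually configured in the web server or framework.

```python
def robots_header_for(path: str) -> list[tuple[str, str]]:
    """Return an X-Robots-Tag header for file types that cannot carry a meta tag."""
    if path.lower().endswith(".pdf"):
        return [("X-Robots-Tag", "noindex")]
    return []

def app(environ, start_response):
    """Minimal WSGI app: every response gets the header chosen for its path."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    headers += robots_header_for(path)
    start_response("200 OK", headers)
    return [b"ok"]

print(robots_header_for("/files/price-list.pdf"))
print(robots_header_for("/catalog/"))
```

Because the decision lives in one function, the same rule can be extended to images or whole URL templates without touching page markup.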

But robots.txt is not suitable for private content: it is a public file, so listing private paths in it only “hints” at where that content is located.

If you need to truly restrict access (personal accounts, partner pricing pages, admin panel), use password protection (HTTP auth), role restrictions in the CMS, or server/firewall-level closure. This is reliable and secure, unlike "masking" access via robots.txt.
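
For comparison with robots.txt, here is a minimal sketch of what an HTTP Basic auth check looks like on the server side. The credentials are placeholders, the function name is our own, and a real setup would rely on the web server's or CMS's built-in authentication over HTTPS.

```python
import base64
import binascii
import hmac
from typing import Optional

def is_authorized(auth_header: Optional[str], user: str, password: str) -> bool:
    """Check an HTTP Basic 'Authorization' header against expected credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[6:], validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return False
    expected = f"{user}:{password}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))

# Placeholder credentials for illustration only.
header = "Basic " + base64.b64encode(b"partner:s3cret").decode("ascii")
print(is_authorized(header, "partner", "s3cret"))
print(is_authorized(None, "partner", "s3cret"))
```

Unlike a Disallow line, a failed check here means the server refuses to send the content at all, whether the visitor is a robot or a person.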

FAQ and conclusions: Robots.txt checklist for effective SEO

FAQ: Frequently asked questions about robots.txt

Is robots.txt always necessary? If the site is very simple and doesn't have any "junk" URLs (search, filters, technical sections), it's not critical. However, in practice, for almost any commercial project, it's easier to keep the basic file in the root and manage crawling consciously. This reduces the risk of chaos as the site grows and helps maintain converting traffic.

What to do when migrating from http to https? Make sure the file is accessible on the https version at the root: /robots.txt. Make sure the rules don't block the new URLs, and that search robots (including Googlebot) receive a 200 OK rather than a redirect or a 403. After the migration, it's helpful to re-run the check in Google Search Console and look at the crawl reports.

Is it possible to remove already indexed content from the index? Robots.txt alone doesn't guarantee deindexing. Use meta robots noindex or X-Robots-Tag instead, plus the removal tools in Search Console to hide a URL temporarily. Robots.txt here only restricts crawling; it does not "remove from search results."

How to check how it works for Googlebot? Use Google Search Console: URL inspection and robots.txt diagnostic tools (if available in your interface version), as well as server logs, to see actual robot visits and which URLs they request.
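
Server logs can be summarized with a short script. The sketch below assumes the common "combined" access log format and counts requests whose user-agent mentions Googlebot; the two sample lines are invented. User-agent strings can be spoofed, so a real audit would also verify the requesting IP addresses.

```python
import re
from collections import Counter

# Combined Log Format: ... "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_lines) -> Counter:
    """Count requested paths for lines whose user-agent mentions Googlebot."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and "googlebot" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits

# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /catalog/shoes HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Jan/2025:10:00:05 +0000] "GET /cart HTTP/1.1" 200 830 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(googlebot_hits(SAMPLE_LOG).most_common())
```

Comparing the most requested paths with your Disallow rules shows whether robots spend their visits on the pages you actually want crawled.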

How to reduce crawler traffic without sacrificing SEO? Block low-value and endless URL spaces (parameters, internal search, technical pages) in robots.txt, but don't block important categories, product pages, or rendering resources. The goal is to direct crawling to pages that drive organic traffic and sales.

Conclusions: robots.txt as part of a strategy, not chaos

In short, what is robots.txt for a business? It is a crawl control tool that helps build a transparent approach to promotion. It doesn't replace optimization, content, or link building, but it supports systematic website promotion, especially as the project grows and the number of URLs multiplies. Combined with noindex, X-Robots-Tag, and proper architecture, it becomes part of a "complete website indexing guide" and a foundation for digital business growth.

“Effective SEO starts with monitoring what is indexed, what is crawled, and why.”

Final robots.txt setup checklist

  • Check that the file is located in the root of the site and is returned with a 200 OK status code, in UTF-8 encoding.
  • Ensure the rules are clear: the User-agent is specified, meaningful Disallow/Allow rules are applied, and there are no accidental restrictions on commercial sections.
  • Check that CSS/JS, important images, and other resources that affect rendering are not blocked.
  • During migrations and redesigns, check robots.txt first to avoid losing visibility in Google.
  • To remove pages from the index, use noindex or X-Robots-Tag; for private access, use password protection.
  • Regularly check your rules against actual robot behavior using Search Console data and logs: this way you keep control over crawling and get SEO results for your business without unnecessary noise.
