XooCode(){

robots.txt Generator

Build a crawler directives file in about a minute.

A form-driven generator for the robots.txt file — the oldest and most universally respected way to tell web crawlers which paths they can and can’t access. Supports multiple user-agent blocks, Allow and Disallow directives, Crawl-delay, and Sitemap declarations. Includes presets for opting out of the major AI training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, PerplexityBot). Runs entirely in your browser.

robots.txt form

User-agent blocks (1)

Bot name (* for all bots). Common values shown in the dropdown.

Paths this bot may crawl. Use / to allow everything. Defaults to allow if no Disallow matches.

Paths this bot must not crawl. Use / to block everything for this bot.

Seconds between requests. Honoured by Bing, Yandex, and a few others. Googlebot ignores this — use Google Search Console for Googlebot rate limiting.

Absolute URLs to your XML sitemap files. Emitted as standalone 'Sitemap:' lines at the end of the file. Multiple sitemaps are allowed.

The canonical hostname for your site. Legacy directive but still used by some crawlers. Leave blank if unsure.

Live robots.txt preview

User-agent: *

Save this file as robots.txt in your site's root directory, so it's accessible at https://your-site.com/robots.txt. Crawlers check this path automatically on their first visit to your site. Everything runs in your browser. Nothing is sent to XooCode's servers.

What robots.txt is (and isn't)

robots.txt is a plain-text file you host at the root of your domain that tells web crawlers which paths they’re allowed (and not allowed) to fetch. It was introduced in 1994 by Martijn Koster as an informal convention, spent nearly three decades as the de-facto standard that every respectful crawler honoured voluntarily, and was finally formalised as RFC 9309 by the IETF in September 2022.

The format is deliberately simple. Each group starts with one or more User-agent: lines identifying the bot, followed by Allow: and Disallow: lines listing path patterns. Optional Crawl-delay: lines request a pause between requests. At the bottom of the file, one or more Sitemap: lines point at your XML sitemap URLs. That’s the whole spec in one paragraph.
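In concrete terms, a file exercising each of those directives looks like this (the paths and sitemap URL are placeholders):

```text
# One group for all bots
User-agent: *
Disallow: /admin/
Allow: /admin/help/
Crawl-delay: 10

# Sitemap lines are file-wide, not part of any group
Sitemap: https://example.com/sitemap.xml
```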

The thing to understand about robots.txt is that it is advisory, not enforcement. The file tells crawlers what you’d like them to do; it doesn’t stop anyone from doing otherwise. Well-behaved bots (Googlebot, Bingbot, most AI training crawlers) honour it religiously. Malicious scrapers ignore it entirely. If you need actual enforcement — blocking IPs, requiring authentication, rate-limiting — that happens at the web server layer, not in robots.txt.
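To see the advisory model from a well-behaved crawler's side, here is a short sketch using Python's standard-library robots.txt parser; the rules and URLs are illustrative:

```python
# Sketch: how a polite crawler interprets a robots.txt file, using
# Python's built-in parser. No network access — we parse raw lines.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A generic crawler may fetch public pages but not /private/.
print(parser.can_fetch("*", "https://example.com/index.html"))    # True
print(parser.can_fetch("*", "https://example.com/private/x"))     # False
# GPTBot is disallowed everywhere.
print(parser.can_fetch("GPTBot", "https://example.com/index.html"))  # False
```

Note that `can_fetch` only reports what the file *requests*; nothing in the protocol stops a client from fetching the URL anyway.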

How to use the generator

The interface is a two-pane form. Fill in one or more user-agent blocks on the left, watch the file preview on the right update as you type, then download when you’re happy with the output.

  1. Start from the seed example (optional)

    Click Load XooCode example above the form to populate a realistic two-block file: one User-agent: * block allowing everything except legacy WordPress paths, and one User-agent: GPTBot block disallowing everything. It's a useful structural reference whether you keep it or clear it and start over.
  2. Add user-agent blocks

    Click Add user-agent block to create a new group. Each block targets one bot. Use * to match all bots, or a specific name like Googlebot or GPTBot for per-crawler rules. The generator includes a dropdown of common bot names so you don’t have to remember them.
  3. Add Allow and Disallow rules

    Inside each block, add path patterns. Every Disallow: line tells the targeted bot to skip paths matching that pattern. Allow: lines create exceptions to broader disallows. Patterns use * as a wildcard and $ to anchor the end of the URL.
  4. Add sitemap URLs

    At the bottom of the form, add one or more absolute Sitemap URLs. These tell crawlers where to find your XML sitemaps and are the single highest-ROI addition you can make to a robots.txt file. Search engines will find your sitemaps faster.
  5. Download and deploy

    Click the download button to save the file as robots.txt. Upload it to your server’s document root so it’s accessible at https://your-site.com/robots.txt. Verify with curl -I https://your-site.com/robots.txt — you should get a 200 OK with content type text/plain.
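For reference, a file with the shape described in step 1 (one permissive catch-all block plus a GPTBot opt-out) comes out looking like this; the exact paths and sitemap URL are illustrative:

```text
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

User-agent: GPTBot
Disallow: /

Sitemap: https://your-site.com/sitemap.xml
```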

The directive reference

RFC 9309 defines a small vocabulary. Here’s everything you can put in a robots.txt file and what each line does.

User-agent

Opens a group and names the bot the rules below apply to. Case-insensitive. Use * as a catch-all. Multiple User-agent lines in a row apply the same rules to each of the named bots. A file can have any number of groups.
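For example, two User-agent lines stacked at the top of a group give both bots the same rules:

```text
# Googlebot and Bingbot share one rule set
User-agent: Googlebot
User-agent: Bingbot
Disallow: /search/
```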

Disallow

Tells the targeted bot not to fetch URLs whose path begins with the given pattern. Disallow: with an empty value means “nothing is disallowed” (i.e., allow everything). Disallow: / means “everything is disallowed”.

Allow

Carves out exceptions to a Disallow for the same user-agent. Longer (more specific) patterns win over shorter (less specific) ones, so you can Disallow: /private/ to block everything in a private folder and then Allow: /private/public-file.pdf to poke one file through the block.
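That precedence rule in file form:

```text
User-agent: *
Disallow: /private/
# Longer pattern, so it wins for this one URL
Allow: /private/public-file.pdf
```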

Crawl-delay

Requests the bot wait the given number of seconds between requests. Honoured by Bing, Yandex, and some smaller crawlers. Googlebot ignores it — Google manages crawl rate in Search Console instead. Still worth setting for the bots that respect it.

Sitemap

Declares the absolute URL of an XML sitemap. Unlike the other directives, this one is not scoped to a user-agent — it applies to all crawlers. Put Sitemap: lines at the bottom of the file for readability, not for any semantic reason.

AI crawler opt-outs

The big shift in robots.txt usage since 2023 has been AI training crawlers. Unlike traditional search bots that index for search result pages, these crawlers fetch content to train language models. Most honour robots.txt, giving publishers a meaningful opt-out for the first time. Here are the main ones:

  • GPTBot — OpenAI’s training crawler for ChatGPT and GPT-family models. Official documentation. Disallowing it opts your site out of ChatGPT training data going forward (not retroactively).
  • ClaudeBot — Anthropic’s training crawler for Claude. Official documentation. Also honours anthropic-ai as an older alias.
  • CCBot — Common Crawl’s crawler. Common Crawl is a shared dataset many AI labs use as a starting point, so blocking it has a multiplier effect: models trained on Common Crawl data never see your content unless their makers also crawl your site directly.
  • Google-Extended — Google’s opt-out token for training Gemini and Bard on your content. Importantly, this is not the same as Googlebot — blocking Google-Extended opts you out of training without affecting your search ranking. Google’s crawler documentation.
  • PerplexityBot — Perplexity’s crawler for answer synthesis. Official documentation. Perplexity uses this for fetching pages it cites in real time, not just for training.
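A file that opts out of all five crawlers listed above is just five short groups, each disallowing everything; it has no effect on any other bot:

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /
```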

Common mistakes

robots.txt looks simple because it mostly is — but a few patterns trip up authors often enough to be worth calling out explicitly.

Disallow: /* blocks everything unexpectedly

Disallow: /* is equivalent to Disallow: / — it blocks the whole site, not just the root. If you wanted to block only files with a specific extension, anchor the pattern: Disallow: /*.pdf$.
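Side by side, using patterns from the paragraph above:

```text
Disallow: /*        # same as "Disallow: /" — blocks the whole site
Disallow: /*.pdf$   # blocks only URLs whose path ends in .pdf
Disallow: /*.pdf    # no end anchor — also matches /report.pdf.html
```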

Putting Disallow: /private on robots.txt leaks the path

robots.txt is publicly readable. If you Disallow: /admin, you’ve just published the fact that /admin exists. Use authentication for private paths instead, or use a non-guessable path if the content has to stay unauthenticated.

Blocking JS and CSS

A common legacy pattern was Disallow: /js/ and Disallow: /css/ to save crawl budget. Google now recommends against this because it prevents Googlebot from rendering your page properly. If Googlebot can’t see your stylesheet, it can’t see what the page actually looks like and may downgrade its mobile-friendly score.
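In a legacy file, the fix is simply to delete the asset rules, for example:

```text
User-agent: *
Disallow: /js/     # remove — Googlebot needs scripts to render the page
Disallow: /css/    # remove — same for stylesheets
Disallow: /admin/  # fine to keep
```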

Missing sitemap reference

If you have a sitemap, link it from robots.txt. Search engines already know to check /sitemap.xml by convention, but they also specifically look for Sitemap: lines in robots.txt and will pick up sitemaps at non-default paths only if you tell them.

What this generator isn't

Small tool, tight scope. Here’s what the generator is and what it isn’t.

It IS a robots.txt builder

Full REP directive support: User-agent, Allow, Disallow, Crawl-delay, Sitemap. Output matches RFC 9309 and common search engine extensions.

It IS AI-crawler aware

Includes ready-to-pick presets for GPTBot, ClaudeBot, CCBot, Google-Extended, PerplexityBot, and other major AI training crawlers.

It is NOT a security tool

Disallow is advisory. It asks polite crawlers to skip paths but does not prevent anyone from fetching them. Real security goes at the web server, not here.

It is NOT a crawler

It doesn't fetch your site to suggest directives. You know your site better than any tool could — the generator just handles the formatting.

It IS download-first

One click produces a downloadable file. No server round-trip, no tracking, no saved state. Runs entirely in your browser.

It is NOT llms.txt

llms.txt is a separate file that describes content for AI agents. robots.txt controls access. If you need both, build each with its own generator.

Authoritative sources

The generator is built on these primary documents. Consult them when you need a detail the tool doesn’t cover.

  • RFC 9309: Robots Exclusion Protocol — the formal specification, published by the IETF in September 2022. Canonical source for directive syntax and precedence rules.
  • robotstxt.org — the original 1994 convention site, still maintained. Good historical context and a database of well-known bot names.
  • Google’s robots.txt documentation — how Googlebot interprets the file, including Google-specific quirks like ignoring Crawl-delay.
  • Google crawler directory — authoritative list of Google’s crawler user agents including Googlebot, Googlebot-Image, Google-Extended, and the rest.
  • ai.robots.txt — a community-maintained list of AI training crawler user agents with curated robots.txt snippets you can paste directly into the generator.