robots.txt Generator
Build a crawler directives file in about a minute.
A form-driven generator for the robots.txt file — the oldest and most universally respected way to tell web crawlers which paths they can and can’t access. Supports multiple user-agent blocks, Allow and Disallow directives, Crawl-delay, and Sitemap declarations. Includes presets for opting out of the major AI training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, PerplexityBot). Runs entirely in your browser.
robots.txt form
User-agent blocks (1)
Bot name (* for all bots). Common values shown in the dropdown.
Paths this bot may crawl. Use / to allow everything. Defaults to allow if no Disallow matches.
Paths this bot must not crawl. Use / to block everything for this bot.
Seconds between requests. Honored by Bing, Yandex, and a few others. Googlebot ignores this — use Google Search Console for Googlebot rate limiting.
Absolute URLs to your XML sitemap files. Emitted as standalone 'Sitemap:' lines at the end of the file. Multiple sitemaps are allowed.
The canonical hostname for your site. Legacy directive but still used by some crawlers. Leave blank if unsure.
Live robots.txt preview
User-agent: *
Save this file as robots.txt in your site's root directory, so it's accessible at https://your-site.com/robots.txt. Crawlers check this path automatically on their first visit to your site. Everything runs in your browser. Nothing is sent to XooCode's servers.
What robots.txt is (and isn't)
robots.txt is a plain-text file you host at the root of your domain that tells web crawlers which paths they’re allowed (and not allowed) to fetch. It was introduced in 1994 by Martijn Koster as an informal convention, spent nearly three decades as the de facto standard that every respectful crawler honoured voluntarily, and was finally formalised as RFC 9309 by the IETF in September 2022.
The format is deliberately simple. Each group starts with one or more User-agent: lines identifying the bot, followed by Allow: and Disallow: lines listing path patterns. Optional Crawl-delay: lines request a pause between requests. At the bottom of the file, one or more Sitemap: lines point at your XML sitemap URLs. That’s the whole spec in one paragraph.
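That whole spec in miniature, for a hypothetical example.com (paths and URLs are placeholders):

```text
# Group 1: rules for all bots
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Crawl-delay: 10

# Group 2: rules for one specific bot
User-agent: GPTBot
Disallow: /

# Sitemap lines sit outside any group
Sitemap: https://example.com/sitemap.xml
```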
The thing to understand about robots.txt is that it is advisory, not enforcement. The file tells crawlers what you’d like them to do; it doesn’t stop anyone from doing otherwise. Well-behaved bots (Googlebot, Bingbot, most AI training crawlers) honour it religiously. Malicious scrapers ignore it entirely. If you need actual enforcement — blocking IPs, requiring authentication, rate-limiting — that happens at the web server layer, not in robots.txt.
How to use the generator
The interface is a two-pane form. Fill in one or more user-agent blocks on the left, watch the file preview on the right update as you type, then download when you’re happy with the output.
1. Start from the seed example (optional)
   Click Load XooCode example above the form to populate a realistic two-block file: one User-agent: * block allowing everything except legacy WordPress paths, and one User-agent: GPTBot block disallowing everything. It’s a good shape reference whether you keep it or clear it and start over.
2. Add user-agent blocks
   Click Add user-agent block to create a new group. Each block targets one bot. Use * to match all bots, or a specific name like Googlebot or GPTBot for per-crawler rules. The generator includes a dropdown of common bot names so you don’t have to remember them.
3. Add Allow and Disallow rules
   Inside each block, add path patterns. Every Disallow: line tells the targeted bot to skip paths matching that pattern. Allow: lines create exceptions to broader disallows. Patterns use * as a wildcard and $ to anchor the end of the URL.
4. Add sitemap URLs
   At the bottom of the form, add one or more absolute Sitemap URLs. These tell crawlers where to find your XML sitemaps and are the single highest-ROI addition you can make to a robots.txt file. Search engines will find your sitemaps faster.
5. Download and deploy
   Click the download button to save the file as robots.txt. Upload it to your server’s document root so it’s accessible at https://your-site.com/robots.txt. Verify with curl -I https://your-site.com/robots.txt — you should get a 200 OK with content type text/plain.
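Beyond the curl check, you can sanity-test the downloaded file before deploying it. A minimal sketch using Python’s standard-library urllib.robotparser (the rules and bot names below are placeholders for your own file; note that the stdlib parser’s Allow/Disallow precedence can differ from RFC 9309’s longest-match rule, so keep the spot checks to simple prefixes):

```python
from urllib import robotparser

# Paste the generated file here, or read it from disk.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Spot-check that the rules do what you intended.
print(rp.can_fetch("GPTBot", "https://example.com/article"))      # False: GPTBot is blocked everywhere
print(rp.can_fetch("SomeBot", "https://example.com/article"))     # True: only /wp-admin/ is off-limits
print(rp.can_fetch("SomeBot", "https://example.com/wp-admin/x"))  # False
```

If a check surprises you, fix the rule in the generator and re-download before uploading anything.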
The directive reference
RFC 9309 defines a small vocabulary. Here’s everything you can put in a robots.txt file and what each line does.
User-agent
Opens a group and names the bot the rules below apply to. Case-insensitive. Use * as a catch-all. Multiple User-agent lines in a row apply the same rules to each of the named bots. A file can have any number of groups.
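For example, a group whose rules apply to two bots at once:

```text
User-agent: GPTBot
User-agent: CCBot
Disallow: /
```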
Disallow
Tells the targeted bot not to fetch URLs whose path begins with the given pattern. Disallow: with an empty value means “nothing is disallowed” (i.e., allow everything). Disallow: / means “everything is disallowed”.
Allow
Carves out exceptions to a Disallow for the same user-agent. Longer (more specific) patterns win over shorter (less specific) ones, so you can Disallow: /private/ everything in a private folder and then Allow: /private/public-file.pdf to poke one file through the block.
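The carve-out in file form (folder and file names are illustrative):

```text
User-agent: *
Disallow: /private/
Allow: /private/public-file.pdf
```

The Allow pattern is longer, so it wins for that one URL; everything else under /private/ stays blocked.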
Crawl-delay
Requests the bot wait the given number of seconds between requests. Honoured by Bing, Yandex, and some smaller crawlers. Googlebot ignores it — Google manages crawl rate in Search Console instead. Still worth setting for the bots that respect it.
Sitemap
Declares the absolute URL of an XML sitemap. Unlike the other directives, this one is not scoped to a user-agent — it applies to all crawlers. Put Sitemap: lines at the bottom of the file for readability, not for any semantic reason.
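Sitemap lines sit outside any group and can simply be repeated (the URLs here are placeholders):

```text
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
```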
AI crawler opt-outs
The big shift in robots.txt usage since 2023 has been AI training crawlers. Unlike traditional search bots that index for search result pages, these crawlers fetch content to train language models. Most honour robots.txt, giving publishers a meaningful opt-out for the first time. Here are the main ones:
- GPTBot — OpenAI’s training crawler for ChatGPT and GPT-family models. Disallowing it opts your site out of ChatGPT training data going forward (not retroactively).
- ClaudeBot — Anthropic’s training crawler for Claude. Also honours anthropic-ai as an older alias.
- CCBot — Common Crawl’s crawler. Common Crawl is a shared dataset many AI labs use as a starting point, so blocking it has a multiplier effect: you’re also blocking every downstream model trained on the Common Crawl corpus.
- Google-Extended — Google’s opt-out token for training Gemini and Bard on your content. Importantly, this is not the same as Googlebot: blocking Google-Extended opts you out of training without affecting your search ranking.
- PerplexityBot — Perplexity’s crawler for answer synthesis. Perplexity also uses it to fetch pages it cites in real time, not just for training.
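Taken together, the full opt-out preset is five short blocks:

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /
```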
Common mistakes
robots.txt looks simple because it mostly is — but a few patterns trip up authors often enough to be worth calling out explicitly.
Disallow: /* blocks everything unexpectedly
Disallow: /* is equivalent to Disallow: / — it blocks the whole site, not just the root. If you wanted to block only files with a specific extension, anchor the pattern: Disallow: /*.pdf$.
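Side by side, with the .pdf extension as an example:

```text
User-agent: *
# Don't do this: /* blocks the whole site, same as Disallow: /
# Disallow: /*

# Do this instead to block only URLs whose path ends in .pdf
Disallow: /*.pdf$
```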
Putting Disallow: /private on robots.txt leaks the path
robots.txt is publicly readable. If you Disallow: /admin, you’ve just published the fact that /admin exists. Use authentication for private paths instead, or use a non-guessable path if the content has to stay unauthenticated.
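If the goal is actually keeping people out, do it at the server layer. A minimal nginx sketch, assuming the /admin/ path and the credentials file location are placeholders for your own setup:

```nginx
# Require HTTP basic auth for everything under /admin/
location /admin/ {
    auth_basic           "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
```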
Blocking JS and CSS
A common legacy pattern was Disallow: /js/ and Disallow: /css/ to save crawl budget. Google now recommends against this because it prevents Googlebot from rendering your page properly. If Googlebot can’t see your stylesheet, it can’t see what the page actually looks like and may downgrade its mobile-friendly score.
Missing sitemap reference
If you have a sitemap, link it from robots.txt. Search engines already know to check /sitemap.xml by convention, but they also specifically look for Sitemap: lines in robots.txt and will pick up sitemaps at non-default paths only if you tell them.
What this generator isn't
Small tool, tight scope. Here’s what the generator is and what it isn’t.
- It IS a robots.txt builder
- It IS AI-crawler aware
- It IS download-first
- It is NOT a security tool
- It is NOT a crawler
- It is NOT llms.txt
Authoritative sources
The generator is built on these primary documents. Consult them when you need a detail the tool doesn’t cover.
- RFC 9309: Robots Exclusion Protocol — the formal specification, published by the IETF in September 2022. Canonical source for directive syntax and precedence rules.
- robotstxt.org — the original 1994 convention site, still maintained. Good historical context and a database of well-known bot names.
- Google’s robots.txt documentation — how Googlebot interprets the file, including Google-specific quirks like ignoring Crawl-delay.
- Google crawler directory — authoritative list of Google’s crawler user agents, including Googlebot, Googlebot-Image, Google-Extended, and the rest.
- ai.robots.txt — a community-maintained list of AI training crawler user agents with curated robots.txt snippets you can paste directly into the generator.