Why Do AI Crawlers Keep Hitting robots.txt Instead of My Content?
Dom Sipowicz


Publish Date: Aug 23

Over the past few weeks I’ve been monitoring traffic from AI crawlers such as OpenAI’s GPTBot and OAI-SearchBot, and Anthropic’s ClaudeBot. The data (see the screenshots below) raises some interesting questions:

Questions I Wanted to Answer

  • Why does GPTBot visit robots.txt so often, sometimes multiple times per day?

  • Why does GPTBot prefer robots.txt over sitemap.xml?

  • Why do I see AI bot traffic but no crawling of fresh content? Just repeated hits to old resources.

(Screenshot 1: Vercel Observability Query Builder: Bot traffic)


1. robots.txt Obsession

(Screenshot 2: OpenAI GPTBot hitting robots.txt multiple times per day)

The charts clearly show GPTBot hammering robots.txt across multiple IPs, sometimes 7 times in 2 days from the same subnet. Unlike Googlebot, which fetches robots.txt a few times per day and caches the rules, GPTBot seems to re-check every time it rotates IPs or restarts.

(Screenshot 3: OpenAI AI crawlers traffic pattern)

That means there’s no centralised “consent” store for the crawler. Every new instance behaves like a fresh bot, wasting its crawl budget on permission checks.
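
One practical takeaway: since every new crawler instance re-fetches robots.txt anyway, make that fetch cheap and make it point somewhere useful. Here’s a minimal sketch for a Next.js App Router project, assuming you actually want these bots in; the domain is a placeholder:

```ts
// app/robots.ts: minimal sketch (the domain is a placeholder).
// Next.js serves this as /robots.txt, so each re-check stays a cheap static hit.
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'OAI-SearchBot', allow: '/' },
      { userAgent: 'ClaudeBot', allow: '/' },
      { userAgent: '*', allow: '/' },
    ],
    // Advertise the sitemap here: it's the one hint every robots.txt fetch gets to see.
    sitemap: 'https://example.com/sitemap.xml',
  }
}
```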


2. sitemap.xml Inconsistencies

I’ve tracked two different projects, and the behaviour is inconsistent. On one site, GPTBot fetched the sitemap exactly once in a month. On another, it skipped the sitemap entirely but went straight for content. Meanwhile, Anthropic’s ClaudeBot actually hit the sitemap multiple times.

The missing piece here is a smart algorithm that keeps score over time for each website. Google solved this years ago: it doesn’t blindly trust every lastmod tag, but instead builds a trust score for each domain based on history, accuracy, and freshness signals. That’s how it decides whether to treat a sitemap update seriously or to ignore it.
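
To make that concrete, here’s a toy sketch of the kind of per-domain signal I mean. Everything in it (the names, the freshness window, the deltas) is invented for illustration; it is emphatically not Google’s or anyone else’s real algorithm:

```ts
// Toy sketch only: a per-domain "sitemap trust" score clamped to [0, 1].
type CrawlObservation = {
  claimedLastmod: Date    // the lastmod the sitemap advertised
  contentChanged: boolean // did the page actually change since the previous crawl?
  fetchedAt: Date         // when the crawler re-fetched the page
}

const FRESHNESS_WINDOW_MS = 7 * 24 * 60 * 60 * 1000 // "recently updated" = last 7 days

export function updateSitemapTrust(score: number, obs: CrawlObservation): number {
  const claimedFresh =
    obs.fetchedAt.getTime() - obs.claimedLastmod.getTime() < FRESHNESS_WINDOW_MS

  // Confirmed freshness claims earn a little trust; false claims cost more than they earn.
  let delta = 0
  if (claimedFresh) delta = obs.contentChanged ? 0.05 : -0.1

  return Math.min(1, Math.max(0, score + delta))
}
```

A crawler could then prioritise sitemap-advertised URLs from domains that have earned a high score and treat low-score sitemaps as background noise.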

AI crawlers aren’t doing this yet. They either underuse sitemaps or waste fetches on them without consistency. To improve, AI labs need to adopt a similar scoring system. Or, as I strongly suspect from patterns I’ve seen, they may simply partner with Google Search and tap into its index instead of reinventing crawling from scratch.

Side note: I’ve even seen OpenAI API results that looked suspiciously close to Google Search outputs ...

(Screenshot 4: AI crawlers traffic pattern - Vercel Observability Query Builder)


3. Crawling Old Content Repeatedly (and the Budget Problem)

This is where the inefficiency really shows. Bots keep returning to old content instead of discovering what’s new. Even when they’ve seen the sitemap, they often ignore it and waste their crawl budget revisiting stale pages.

There should be a smarter way to surface new material—and honestly, respecting lastmod in sitemap.xml would solve a lot of this. I really hope someone on the search teams at OpenAI and Anthropic is reading this.
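
On the publisher side, the least we can do is make lastmod easy to trust. Here’s a minimal sketch for a Next.js App Router site; getPosts() and the domain are placeholders for your own data layer:

```ts
// app/sitemap.ts: minimal sketch. The important part is an honest lastModified per URL.
import type { MetadataRoute } from 'next'

type Post = { slug: string; updatedAt: Date }

async function getPosts(): Promise<Post[]> {
  // Placeholder: fetch from your CMS or database instead of hard-coding.
  return [{ slug: 'hello-world', updatedAt: new Date('2024-01-15') }]
}

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const posts = await getPosts()
  return posts.map((post) => ({
    url: `https://example.com/blog/${post.slug}`,
    lastModified: post.updatedAt, // only bump this when the content really changes
    changeFrequency: 'weekly',
    priority: 0.7,
  }))
}
```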

From what I see:

  • Crawling budgets are tiny. Sometimes a bot “spends” its limited fetches just on robots.txt and pages it has already crawled.

  • No centralised rule cache. Each IP acts independently, re-checking permissions and burning requests on duplicate work.

  • Unstable sessions. The pattern of repeated restarts suggests crawler instances spin up and down often, leading to wasted quota.

And that’s why your fresh blog post doesn’t get fetched, while your robots.txt enjoys multiple visits per day.
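
If you want to check this on your own project without the Query Builder, one quick option is to tag known AI user agents in middleware and filter the log lines later. A minimal sketch; the user-agent list and the log shape are just my convention, not an official registry:

```ts
// middleware.ts: minimal sketch for spotting AI crawler traffic in your own logs.
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'

const AI_BOTS = [/GPTBot/i, /OAI-SearchBot/i, /ChatGPT-User/i, /ClaudeBot/i, /PerplexityBot/i]

export function middleware(request: NextRequest) {
  const ua = request.headers.get('user-agent') ?? ''
  if (AI_BOTS.some((re) => re.test(ua))) {
    // One structured log line per hit; easy to filter in Vercel's request logs.
    console.log(
      JSON.stringify({ aiBot: ua, path: request.nextUrl.pathname, at: new Date().toISOString() })
    )
  }
  return NextResponse.next()
}
```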


4. The Static Assets Surprise (a.k.a. Bots Running Headless Browsers)

Now here’s the real surprise: OpenAI’s crawler sometimes downloads static assets (Next.js chunks, CSS, polyfills). That almost certainly means it’s firing up a headless browser and actually rendering the page. Rendering at scale is expensive, so seeing this in the logs is like catching the bot red-handed burning VC money on your webpack bundles.

Developers, let’s be honest: we shouldn’t force AI labs to reinvent Google Search’s rendering farm from scratch. The sane thing is still to serve content via SSR/ISR so crawlers don’t have to play Chromium roulette just to see your page. Otherwise you’re basically making Sam Altman pay more to crawl your vibe-coded portfolio site.
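
Concretely, that can be as small as one server-rendered, periodically revalidated route. A minimal ISR sketch; the CMS URL and response shape are placeholders:

```tsx
// app/blog/[slug]/page.tsx: minimal ISR sketch. HTML is rendered on the server and
// revalidated periodically, so crawlers get full content without running any JS.
// (Next.js 14-style props; in Next 15, `params` arrives as a Promise.)
export const revalidate = 3600 // re-generate each page at most once per hour

// Placeholder data fetch: swap in your own CMS or database call.
async function getPost(slug: string) {
  const res = await fetch(`https://cms.example.com/posts/${slug}`, {
    next: { revalidate: 3600 },
  })
  return res.json() as Promise<{ title: string; html: string }>
}

export default async function BlogPost({ params }: { params: { slug: string } }) {
  const post = await getPost(params.slug)
  return (
    <article>
      <h1>{post.title}</h1>
      <div dangerouslySetInnerHTML={{ __html: post.html }} />
    </article>
  )
}
```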

The funny bit? This is great news for vibe coders. All those sites built with pure CSR, the “AI slop” nobody thought would ever be indexable, might now actually get pulled into GPTBot’s memory. Your prayers have been heard... at least until the crawl budget runs out.

Fun fact: some vibe coding tools default to CSR, which is basically SEO-assisted suicide. If you care about visibility, whether in Google or in AI engines, please stop.

(Screenshot 5: GPTBot downloading static assets, including some hallucinated 404s)


5. What This Means for AI SEO

The good news:
OpenAI and Anthropic at least play by the rules. They ask permission before scraping, unlike the swarm of shady scrapers hitting your site daily.

The bad news:

  • Crawl budgets are tiny and often wasted.
  • Fresh content gets ignored.
  • Sitemaps and lastmod aren’t respected.
  • JS rendering happens only occasionally, so CSR-only sites are still at risk of being invisible.

Closing Thought

Google has a 25-year head start in crawling, indexing, and ranking. AI crawlers are still in year one of that journey. They’re not true search engines yet, but the scaffolding is going up fast.

If anyone from OpenAI, Anthropic, or xAI is reading this: please, implement smarter crawl budgets and start respecting sitemap freshness. Otherwise, all we’ll get is bots lovingly revisiting robots.txt while the real content sits untouched.


Godspeed

https://x.com/dom_sipowicz
https://www.linkedin.com/in/dominiksipowicz/
