
How to Block Bytespider (TikTok's AI Crawler)

Bytespider crawls your site for Bytedance's AI systems. Learn how to block it with robots.txt and IP ranges, and why it raises data sovereignty concerns.

Jun 20, 2026 • 6 min read

Mike Kutlu Client-Side Security Consultant


TL;DR: block Bytespider after the 2023 robots.txt non-compliance reports

The 2023 trust problem: A User-agent: Bytespider rule with Disallow: / looks like a solved problem until you remember 2023. Public security reporting documented Bytespider crawling pages that had explicit robots.txt disallow directives, which sets a different trust baseline than GPTBot or ClaudeBot.
IP-level enforcement: ByteDance publishes Bytespider IP ranges in its crawler documentation, so a firewall or CDN deny list catches the crawler regardless of whether it reads your robots.txt, and a quarterly review keeps the list current as ranges expand.
The decision: If you are a government contractor, financial services firm, healthcare organisation, or hold competitive IP, treat Bytespider as an IP-block target with robots.txt as the secondary signal. If your data governance policy is silent on Chinese jurisdiction, the robots.txt line alone may be enough.

Short on time? See cside's AI-agent detection. It covers everything below in one deployment.

Bytespider is the web crawler operated by Bytedance, the parent company of TikTok. It collects web content for AI training across Bytedance's product portfolio. Unlike most major AI training crawlers, Bytespider attracted significant public attention in 2023, when reports emerged that it was ignoring robots.txt directives on a number of sites. That compliance history makes it a higher-priority blocking target than most other declared AI crawlers, including crawlers such as ClaudeBot that carry stronger compliance reputations.

What is Bytespider?

Quick answer: Bytespider is Bytedance's AI training crawler. It is used to collect web content for training AI models that power products across Bytedance's portfolio, including TikTok. It uses a declared user-agent string but drew public scrutiny for reportedly bypassing robots.txt restrictions on some sites in 2023.

Bytespider identifies itself with a user-agent string that contains the Bytespider token and references Bytedance's crawler documentation. Like other AI training crawlers, it is an HTTP-based tool that makes GET requests and reads page content; it does not execute JavaScript in a real browser context.
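Because the crawler declares itself, a case-insensitive check for the Bytespider token in the User-Agent header is usually enough to tag its requests in logs or middleware. The Python sketch below is a minimal illustration; the function name and the sample header values are assumptions made up for this example, not strings taken from Bytedance's documentation.

```python
from typing import Optional


def is_declared_bytespider(user_agent: Optional[str]) -> bool:
    """Return True when the User-Agent header contains the Bytespider token."""
    if not user_agent:
        return False
    # Match case-insensitively so "bytespider" and "Bytespider" both count.
    return "bytespider" in user_agent.lower()


# Hypothetical header values, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; Bytespider; +https://example.com/bot)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/125.0 Safari/537.36",
    None,
]
flags = [is_declared_bytespider(ua) for ua in samples]
print(flags)
```

A check like this only identifies traffic that announces itself. Any client can send a different User-Agent, so treat it as a logging and triage aid rather than an enforcement control.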

The key difference between Bytespider and crawlers like GPTBot or ClaudeBot is the compliance history. Reports from security researchers and site owners in 2023 documented instances of Bytespider ignoring robots.txt disallow rules. Bytedance has since updated its crawler practices, but the incident established a different baseline of trust compared to US-based AI crawlers with stronger compliance reputations.

How to block Bytespider with robots.txt

Quick answer: Add Bytespider to your robots.txt. Given its past compliance issues, treat robots.txt as a starting point rather than a complete solution. Supplement it with IP-level blocking for sites where crawler access needs hard enforcement.

To block Bytespider from your entire site:

User-agent: Bytespider
Disallow: /

Given the documented compliance history, this rule alone may not be sufficient if Bytespider resumes the crawling behaviour seen in 2023. IP-level blocking provides the enforcement layer that robots.txt cannot guarantee. The same layering works for any other declared crawler you block alongside it, such as Common Crawl's CCBot: robots.txt states the policy, and network rules enforce it.
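Before relying on the rule, it is worth confirming that it parses the way you expect. The sketch below uses Python's standard urllib.robotparser module to evaluate the two-line rule shown above against a sample URL. The helper name, the example.com URL, and the SomeOtherBot agent are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from this section, inlined so the check runs without network access.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


bytespider_ok = allowed(ROBOTS_TXT, "Bytespider", "https://example.com/pricing")
other_ok = allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/pricing")
print(bytespider_ok, other_ok)
```

This only tells you what a compliant parser would conclude from the file. It says nothing about whether a given crawler actually honours the result, which is the reason for the IP-level layer.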

IP-level blocking for Bytespider

Quick answer: Bytedance publishes Bytespider's IP ranges in its crawler documentation. Denying these ranges at your firewall or CDN provides enforcement independent of robots.txt compliance. For organisations with data governance or competitive concerns about Bytedance access, IP blocking is the more reliable approach.

IP-level blocking steps:

Locate Bytedance's current published IP ranges for Bytespider in its official documentation
Add these ranges to your firewall, CDN edge rules, or reverse proxy configuration
Set a review cycle; quarterly is sufficient for most organisations

The IP blocking approach catches Bytespider regardless of whether it reads your robots.txt, which addresses the core concern raised by the 2023 compliance reports.
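The matching logic behind a deny list can be sketched with Python's standard ipaddress module. The CIDR blocks below are placeholder documentation ranges from RFC 5737, not Bytedance's real ranges, and the names DENY_RANGES and is_denied are hypothetical. In production this check normally lives in the firewall or CDN rather than in application code.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation blocks), not real crawler ranges.
# Replace them with the ranges you obtain from the operator's documentation.
DENY_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25")
]


def is_denied(client_ip: str) -> bool:
    """Return True when client_ip falls inside any configured deny range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in DENY_RANGES)


for ip in ("192.0.2.17", "198.51.100.200", "203.0.113.5"):
    print(ip, "deny" if is_denied(ip) else "allow")
```

Keeping the ranges in one list makes the quarterly review a matter of replacing that list and re-running a handful of known addresses through the check.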

Data sovereignty concerns

Quick answer: Bytedance is a Chinese company operating under Chinese law. Content collected by Bytespider may be subject to the same data access framework that applies to other technology companies under Chinese jurisdiction. For regulated industries or organisations with explicit geopolitical data policies, that exposure is directly relevant to compliance.

The concern here mirrors the reasoning behind blocking DeepSeekBot. It is not a claim of specific data misuse: it is a statement about jurisdictional exposure. Organisations that have explicit policies about data transfer to certain jurisdictions, or that handle content subject to regulatory requirements, have documented reasons to treat Bytedance-operated crawlers differently from crawlers operated by US-based companies.

Government contractors, financial services firms, healthcare organisations, and technology companies with competitive IP concerns have been active in adding Bytespider to their crawler blocklists on this basis.

Competitive intelligence risk

Quick answer: Beyond training data, Bytespider's crawl of retail, media, and tech sites creates competitive intelligence risk, because the collected data could inform Bytedance's own product roadmap. TikTok Shop and Bytedance's e-commerce ambitions make detailed product catalogue and pricing data from competitors commercially valuable.

This is the second-order concern that makes Bytespider different from purely research-oriented AI crawlers. Bytedance operates TikTok Shop and has significant e-commerce infrastructure ambitions. A crawler that systematically collects product pricing, inventory, and catalogue data from retail sites can serve training and competitive intelligence purposes at the same time.

For retailers, media companies, and any site with proprietary product or content data, this dual-use nature of Bytespider's collection is worth factoring into the blocking decision.

Browser-layer detection: what robots.txt leaves uncovered

Quick answer: Blocking Bytespider addresses Bytedance's declared training crawler. The 2023 compliance controversy shows that even declared crawlers can operate outside their stated parameters. Undeclared Bytedance-adjacent agents operating in real browser sessions are entirely invisible to header-based and rule-based detection tools.

Bytespider's compliance history makes browser-layer monitoring especially relevant for organisations blocking it. The declared crawler reportedly bypassed robots.txt in the past, and an undeclared agent browsing your site in a real browser session would not announce itself in headers at all, leaving nothing distinctive to inspect at the network layer. The gap is architectural, not something you can configure away, and it is the same gap that lets undeclared AI content scrapers slip past rule-based controls.

cside observes the behavioural signals inside browser sessions that distinguish automated sessions from human visitors: interaction timing, fingerprint consistency, navigation patterns, and JavaScript execution characteristics. In cside's controlled testing, traditional tools missed AI agents operating inside real browser sessions in 81 out of 100 scenarios.


Consider what a Bytedance-adjacent undeclared agent looks like at the browser layer. A session opens a retail category page in a full browser, renders JavaScript, and begins extracting product pricing and inventory data. The IP is clean, the user-agent is a current Chrome build, and the session presents a valid TLS fingerprint. Nothing at the network layer triggers an alert.

What cside observes is different: the agent opens each product detail page in a fixed sequence matching the category listing order, hover events over product images are absent, and the time between page loads is stable to within tens of milliseconds across dozens of requests. No human browsing session produces that combination of signals. cside classifies the session as automated and surfaces it for review before meaningful data has been extracted. For organisations that have added Bytespider to their robots.txt and IP blocklist, browser-layer monitoring closes the gap those controls leave open.
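To make the timing signal concrete, the Python sketch below flags a session whose gaps between page loads are nearly constant. It is a toy heuristic written for this article, not cside's detection logic: the function name, the 50 ms jitter threshold, and the 12-request minimum are all assumptions, and a real system would combine several signals rather than rely on one.

```python
from statistics import pstdev
from typing import Sequence


def looks_machine_timed(
    timestamps_ms: Sequence[float],
    max_jitter_ms: float = 50.0,  # assumed threshold: "tens of milliseconds"
    min_requests: int = 12,       # assumed minimum before judging a session
) -> bool:
    """Return True when the gaps between page loads are nearly constant."""
    if len(timestamps_ms) < min_requests:
        return False
    gaps = [later - earlier for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
    # A low standard deviation across many gaps means metronome-like pacing.
    return pstdev(gaps) <= max_jitter_ms


# Synthetic sessions for illustration: one paced every ~2 s, one irregular.
agent_session = [i * 2000 + (i % 3) * 10 for i in range(20)]
human_session = [0, 4200, 5100, 19000, 21500, 60000,
                 61200, 90000, 95000, 140000, 141000, 200000]

agent_flag = looks_machine_timed(agent_session)
human_flag = looks_machine_timed(human_session)
print(agent_flag, human_flag)
```

Timing alone is easy for an agent to randomise, which is why it is described above alongside navigation order and missing hover events rather than as a standalone test.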

Client-Side Security Consultant Mike Kutlu

Client-side security consultant at cside. 10+ years of experience implementing technology solutions for enterprises (previously at Oracle, Cloudflare, and Splunk). Now helping teams use client-side intelligence to catch & reduce fraud.



Frequently Asked Questions

What is Bytespider?

Bytespider is Bytedance's AI training crawler. Bytedance is the Chinese parent company of TikTok. Bytespider collects web content to train AI models used across Bytedance's products. It attracted public scrutiny in 2023 for reportedly ignoring robots.txt restrictions on some sites, which sets it apart from crawlers with stronger compliance records.

How do I block Bytespider?

Add `User-agent: Bytespider` followed by `Disallow: /` to your robots.txt file. Given Bytespider's documented compliance issues in 2023, supplement this with IP-level blocking. Locate Bytedance's published IP ranges and add them to your firewall or CDN deny list for hard enforcement.

Does Bytespider respect robots.txt?

Public security reporting from 2023 documented instances of Bytespider crawling pages despite robots.txt disallow directives. Bytedance addressed these issues and updated its crawler practices. Whether current Bytespider versions fully respect robots.txt is a matter of ongoing monitoring by the site-owner community.

Why does Bytespider raise data sovereignty concerns?

Bytedance is a Chinese company subject to Chinese law, including data access requirements that can apply to Chinese technology companies. Organisations with regulatory policies that restrict data transfer to specific jurisdictions, or with IP concerns about AI training data origin, have specific compliance reasons to block Bytespider beyond a general crawler blocking policy.

Is Bytespider the same as traffic from TikTok users?

No. Bytespider is a crawl agent that systematically collects page content for training purposes. It is not representative of users visiting your site from TikTok. TikTok user traffic arriving via links or referrals is standard browser traffic. Bytespider is a distinct, automated system operated at infrastructure level to collect data at scale.
