Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study

May 27, 2025 · Declared Dead · 🏛 ACM/SIGCOMM Internet Measurement Conference

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Chloe Poroslay, Emily Wenger
arXiv ID: 2505.21733
Category: cs.NI (Networking & Internet)
Cross-listed: cs.CR, cs.CY
Citations: 7
Venue: ACM/SIGCOMM Internet Measurement Conference
Last Checked: 3 months ago

Abstract

Online data scraping has taken on new dimensions in recent years, as traditional scrapers have been joined by new AI-specific bots. To counteract unwanted scraping, many sites use tools like the Robots Exclusion Protocol (REP), which places a robots.txt file at the site root to dictate scraper behavior. Yet, the efficacy of the REP is not well-understood. Anecdotal evidence suggests some bots comply poorly with it, but no rigorous study exists to support (or refute) this claim. To understand the merits and limits of the REP, we conduct the first large-scale study of web scraper compliance with robots.txt directives using anonymized web logs from our institution. We analyze the behavior of 130 self-declared bots (and many anonymous ones) over 40 days, using a series of controlled robots.txt experiments. We find that bots are less likely to comply with stricter robots.txt directives, and that certain categories of bots, including AI search crawlers, rarely check robots.txt at all. These findings suggest that relying on robots.txt files to prevent unwanted scraping is risky and highlight the need for alternative approaches.
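For readers unfamiliar with the REP, the following minimal sketch shows what a compliant scraper is expected to do before fetching a page: parse the site's robots.txt and check whether its user agent may request a given URL. It uses Python's standard `urllib.robotparser` module. The robots.txt content, the bot names (`ExampleAIBot`, `OtherBot`), and the `example.edu` URLs are hypothetical, and this is not the authors' measurement method, which relies on web logs rather than on the scraper's own checks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is blocked site-wide, all other bots
# are only kept out of /private/. These rules are assumptions made for
# illustration and do not come from the paper.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A compliant scraper would run this check before every request.
requests_to_check = [
    ("ExampleAIBot", "https://example.edu/index.html"),
    ("OtherBot", "https://example.edu/private/data.html"),
    ("OtherBot", "https://example.edu/index.html"),
]
for agent, url in requests_to_check:
    verdict = "allowed" if is_allowed(parser, agent, url) else "disallowed"
    print(f"{agent} -> {url}: {verdict}")
```

Nothing in the protocol enforces this check. A scraper that never requests robots.txt, or ignores the result, faces no technical barrier, which is why compliance has to be measured empirically.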


📜 Similar Papers

In the same crypt — Networking & Internet

R.I.P. 👻 Ghosted

Efficient Multi-User Computation Offloading for Mobile-Edge Cloud Computing

Xu Chen, Lei Jiao, ... (+2 more)

cs.NI 🏛 IEEE/ACM ToN 📚 2.2K cites 10 years ago

📚 📚 The Cartographer

Federated Learning in Mobile Edge Networks: A Comprehensive Survey

Wei Yang Bryan Lim, Nguyen Cong Luong, ... (+6 more)

cs.NI 🏛 IEEE COMST 📚 2.1K cites 6 years ago

📚 📚 The Cartographer

A Survey of Indoor Localization Systems and Technologies

Faheem Zafari, Athanasios Gkelias, Kin Leung

cs.NI 🏛 IEEE COMST 📚 2.1K cites 8 years ago

R.I.P. 👻 Ghosted

Survey of Important Issues in UAV Communication Networks

Lav Gupta, Raj Jain, Gabor Vaszkun

cs.NI 🏛 IEEE COMST 📚 2.0K cites 10 years ago

📚 📚 The Cartographer

Network Function Virtualization: State-of-the-art and Research Challenges

Rashid Mijumbi, Joan Serrat, ... (+4 more)

cs.NI 🏛 IEEE COMST 📚 1.8K cites 10 years ago

📚 📚 The Cartographer

Applications of Deep Reinforcement Learning in Communications and Networking: A Survey

Nguyen Cong Luong, Dinh Thai Hoang, ... (+5 more)

cs.NI 🏛 IEEE COMST 📚 1.7K cites 7 years ago

Died the same way — 👻 Ghosted

R.I.P. 👻 Ghosted

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, ... (+4 more)

cs.LG 🏛 arXiv 📚 5.2K cites 9 years ago

R.I.P. 👻 Ghosted

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, ... (+73 more)

cs.AR 🏛 ISCA 📚 5.1K cites 9 years ago

R.I.P. 👻 Ghosted

Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning

Hoo-Chang Shin, Holger R. Roth, ... (+7 more)

cs.CV 🏛 IEEE TMI 📚 4.9K cites 10 years ago

R.I.P. 👻 Ghosted

Explanation in Artificial Intelligence: Insights from the Social Sciences

Tim Miller

cs.AI 🏛 AI 📚 4.9K cites 8 years ago