Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study
May 27, 2025 Β· Declared Dead Β· π ACM/SIGCOMM Internet Measurement Conference
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Chloe Poroslay, Emily Wenger
arXiv ID
2505.21733
Category
cs.NI: Networking & Internet
Cross-listed
cs.CR,
cs.CY
Citations
7
Venue
ACM/SIGCOMM Internet Measurement Conference
Last Checked
3 months ago
Abstract
Online data scraping has taken on new dimensions in recent years, as traditional scrapers have been joined by new AI-specific bots. To counteract unwanted scraping, many sites use tools like the Robots Exclusion Protocol (REP), which places a robots$.$txt file at the site root to dictate scraper behavior. Yet, the efficacy of the REP is not well-understood. Anecdotal evidence suggests some bots comply poorly with it, but no rigorous study exists to support (or refute) this claim. To understand the merits and limits of the REP, we conduct the first large-scale study of web scraper compliance with robots$.$txt directives using anonymized web logs from our institution. We analyze the behavior of 130 self-declared bots (and many anonymous ones) over 40 days, using a series of controlled robots$.$txt experiments. We find that bots are less likely to comply with stricter robots$.$txt directives, and that certain categories of bots, including AI search crawlers, rarely check robots$.$txt at all. These findings suggest that relying on robots$.$txt files to prevent unwanted scraping is risky and highlight the need for alternative approaches.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Networking & Internet
R.I.P.
π»
Ghosted
π
π
The Cartographer
Federated Learning in Mobile Edge Networks: A Comprehensive Survey
π
π
The Cartographer
A Survey of Indoor Localization Systems and Technologies
R.I.P.
π»
Ghosted
Survey of Important Issues in UAV Communication Networks
π
π
The Cartographer
Network Function Virtualization: State-of-the-art and Research Challenges
π
π
The Cartographer
Applications of Deep Reinforcement Learning in Communications and Networking: A Survey
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted