The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

March 31, 2020 · Declared Dead · 🏛 ACM/IEEE Joint Conference on Digital Libraries

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Xinyue Wang, Zhiwu Xie
arXiv ID: 2003.14046
Category: cs.DL: Digital Libraries
Cross-listed: cs.DB
Citations: 10
Venue: ACM/IEEE Joint Conference on Digital Libraries
Last Checked: 2 months ago

Abstract
The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing these archives as big data sources for statistical and analytical research, the speed at which these data can be turned into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gains of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
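
The contrast the abstract draws between a record-oriented layout and a columnar one can be illustrated with a small sketch. This is not the authors' code or benchmark: it uses only the Python standard library, builds a simplified, uncompressed WARC-style stream in memory, and pivots it into plain Python lists that stand in for a columnar format such as Parquet. The helper names (make_record, read_records, to_columns) and the example URLs are hypothetical.

```python
import io


def make_record(uri: str, payload: bytes) -> bytes:
    """Build one simplified, uncompressed WARC-style record (illustrative only)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # A record ends with two CRLFs after its content block.
    return header + payload + b"\r\n\r\n"


def read_records(stream):
    """Walk the stream sequentially; every record is visited to reach the next."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        # `line` is the version line; header fields follow until a blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def to_columns(records):
    """Pivot records into per-field lists (a stand-in for a columnar file)."""
    columns = {"uri": [], "length": [], "payload": []}
    for headers, payload in records:
        columns["uri"].append(headers["WARC-Target-URI"])
        columns["length"].append(len(payload))
        columns["payload"].append(payload)
    return columns


archive = (
    make_record("http://example.com/a", b"<html>a</html>")
    + make_record("http://example.com/b", b"<html>bb</html>")
)
columns = to_columns(read_records(io.BytesIO(archive)))

# After the one-time conversion, a query over URIs touches only that column.
uris = columns["uri"]
```

In the record-oriented stream, listing every URI means stepping through each record's headers and payload in order. Once the data is pivoted, the same query reads a single column and never touches the payloads, which is the kind of access pattern columnar formats are designed for.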
