Harvesting Entities from the Web Using Unique Identifiers -- IBEX

May 04, 2015 · Declared Dead · 🏛 International Workshop on the Web and Databases

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Aliaksandr Talaika, Joanna Biega, Antoine Amarilli, Fabian M. Suchanek
arXiv ID: 1505.00841
Category: cs.DB (Databases)
Cross-listed: cs.IR
Citations: 17
Venue: International Workshop on the Web and Databases
Last Checked: 4 months ago

Abstract

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73–96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
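The abstract does not spell out the extraction or filtering procedure, so the following is only a minimal sketch of one property that identifiers such as GTIN-13 and ISBN-13 are known to have: a trailing check digit that lets a harvester discard most random 13-digit strings. The names `CANDIDATE_13`, `gtin13_is_valid`, and `harvest_gtin13` are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical candidate pattern: exactly 13 ASCII digits that are not
# part of a longer run of digits.
CANDIDATE_13 = re.compile(r"(?<![0-9])[0-9]{13}(?![0-9])")


def gtin13_is_valid(digits: str) -> bool:
    """Return True if a 13-digit string passes the GTIN-13 / ISBN-13 checksum."""
    if re.fullmatch(r"[0-9]{13}", digits) is None:
        return False
    # The first 12 digits are weighted 1, 3, 1, 3, ... from left to right.
    total = sum(
        int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12])
    )
    # The check digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10 == int(digits[12])


def harvest_gtin13(text: str) -> list[str]:
    """Extract 13-digit candidates from text and keep those with a valid checksum."""
    return [c for c in CANDIDATE_13.findall(text) if gtin13_is_valid(c)]
```

For example, `harvest_gtin13("ISBN 9780306406157, order no. 1234567890123")` keeps the first number and drops the second, whose check digit does not match. A checksum only removes part of the noise; associating identifiers with names and cleaning the result across a whole corpus, as the abstract describes, would need further steps that are not sketched here.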

📜 Similar Papers

In the same crypt — Databases

R.I.P. 👻 Ghosted

The Case for Learned Index Structures

Tim Kraska, Alex Beutel, ... (+3 more)

cs.DB 🏛 SIGMOD 📚 1.2K cites 8 years ago

R.I.P. 👻 Ghosted

Untangling Blockchain: A Data Processing View of Blockchain Systems

Tien Tuan Anh Dinh, Rui Liu, ... (+4 more)

cs.DB 🏛 IEEE TKDE 📚 997 cites 8 years ago

R.I.P. 👻 Ghosted

Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades

Garrick Orchard, Ajinkya Jayawant, ... (+2 more)

cs.DB 🏛 Frontiers in Neuroscience 📚 905 cites 10 years ago

R.I.P. 👻 Ghosted

BLOCKBENCH: A Framework for Analyzing Private Blockchains

Tien Tuan Anh Dinh, Ji Wang, ... (+4 more)

cs.DB 🏛 SIGMOD 📚 872 cites 9 years ago

R.I.P. 👻 Ghosted

Data Synthesis based on Generative Adversarial Networks

Noseong Park, Mahmoud Mohammadi, ... (+4 more)

cs.DB 🏛 VLDB 📚 568 cites 8 years ago

R.I.P. 👻 Ghosted

HoloClean: Holistic Data Repairs with Probabilistic Inference

Theodoros Rekatsinas, Xu Chu, ... (+2 more)

cs.DB 🏛 VLDB 📚 544 cites 9 years ago

Died the same way — 👻 Ghosted

R.I.P. 👻 Ghosted

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, ... (+4 more)

cs.LG 🏛 arXiv 📚 5.2K cites 9 years ago

R.I.P. 👻 Ghosted

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, ... (+73 more)

cs.AR 🏛 ISCA 📚 5.1K cites 9 years ago

R.I.P. 👻 Ghosted

Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning

Hoo-Chang Shin, Holger R. Roth, ... (+7 more)

cs.CV 🏛 IEEE TMI 📚 4.9K cites 10 years ago

R.I.P. 👻 Ghosted

Explanation in Artificial Intelligence: Insights from the Social Sciences

Tim Miller

cs.AI 🏛 AI 📚 4.9K cites 9 years ago