Harvesting Entities from the Web Using Unique Identifiers -- IBEX

May 04, 2015 Β· Declared Dead Β· πŸ› International Workshop on the Web and Databases

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Aliaksandr Talaika, Joanna Biega, Antoine Amarilli, Fabian M. Suchanek arXiv ID 1505.00841 Category cs.DB: Databases Cross-listed cs.IR Citations 17 Venue International Workshop on the Web and Databases Last Checked 4 months ago
Abstract
In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale. Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Databases

Died the same way β€” πŸ‘» Ghosted