A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page

November 26, 2019 · Declared Dead · 🏛 International Conference on Knowledge and Systems Engineering

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, The Duy Bui
arXiv ID: 1911.11473
Category: cs.IR (Information Retrieval)
Citations: 9
Venue: International Conference on Knowledge and Systems Engineering
Last Checked: 4 months ago

Abstract
Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is that search engines also index non-informative blocks of Web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor that automatically detects the main content blocks in a Web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new Web page from the same website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks appear in the same order as in the original page.
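
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of template-based extraction, not the authors' FastContentExtractor. It assumes each page has already been segmented into an ordered list of `(path, text)` blocks, where `path` is a structural identifier such as `"main/p"`. The function names `learn_template` and `extract_content`, the block representation, and the `min_distinct_ratio` threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def learn_template(pages, min_distinct_ratio=0.5):
    """Learn which structural paths tend to hold page-specific content.

    Assumption (not from the paper): a path whose text repeats across
    pages of the same site (navigation, footer, ads) is non-informative,
    while a path whose text is mostly distinct holds primary content.
    """
    texts_by_path = defaultdict(list)
    for page in pages:
        for path, text in page:
            texts_by_path[path].append(text)

    template = set()
    for path, texts in texts_by_path.items():
        distinct_ratio = len(set(texts)) / len(texts)
        if distinct_ratio > min_distinct_ratio:
            template.add(path)
    return template


def extract_content(page, template):
    """Return the text of blocks matching the stored template.

    The page is scanned once, top to bottom, so the output keeps the
    original block order.
    """
    return [text for path, text in page if path in template]
```

Once a template has been learned from a few sample pages of a site, extracting from a new page is a single pass with set lookups, which is the kind of speed-up that storing templates is meant to provide.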
