A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm

October 07, 2018 Β· Declared Dead Β· πŸ› Turkish J. Electr. Eng. Comput. Sci.

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Hamid Mohammadi, Seyed Hossein Khasteh arXiv ID 1810.03102 Category cs.IR: Information Retrieval Citations 5 Venue Turkish J. Electr. Eng. Comput. Sci. Last Checked 4 months ago
Abstract
One of the important factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate and near-duplicate text document detection system. Traditional approaches to this problem, such as brute force comparisons or simple hash-based algorithms are not suitable as they are not scalable and are not capable of detecting near-duplicate documents effectively. In this paper, a new signature-based approach to text similarity detection is introduced which is fast, scalable, reliable and needs less storage space. The proposed method is examined on popular text document data-sets such as CiteseerX, Enron, Gold Set of Near-duplicate News Articles and etc. The results are promising and comparable with the best cutting-edge algorithms, considering the accuracy and performance. The proposed method is based on the idea of using reference texts to generate signatures for text documents. The novelty of this paper is the use of genetic algorithms to generate better reference texts.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Information Retrieval

Died the same way β€” πŸ‘» Ghosted