The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks

February 26, 2020 · Declared Dead · 🏛 Multimedia tools and applications

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Paul Sheridan, Mikael Onsjö
arXiv ID: 2002.11844
Category: cs.IR (Information Retrieval)
Citations: 9
Venue: Multimedia tools and applications
Last Checked: 4 months ago
Abstract
Term frequency-inverse document frequency, or TF-IDF for short, and its many variants form a class of term weighting functions the members of which are widely used in text analysis applications. While TF-IDF was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have been advanced. In this work, we present an empirical study showing that TF-IDF corresponds very nearly with the hypergeometric test of statistical significance on selected real-data document retrieval, summarization, and classification tasks. These findings suggest that a fundamental mathematical connection between TF-IDF and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We advance the empirical analyses herein as a first step toward explaining the long-standing effectiveness of TF-IDF from a statistical significance testing lens. It is our aspiration that these results will open the door to the systematic evaluation of significance testing derived term weighting functions in text analysis applications.
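The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes one common TF-IDF variant (raw term count times log(D / df)) and a token-level hypergeometric test in which a document of n tokens is treated as a draw without replacement from the N tokens of the whole collection. The toy corpus and the function names `hypergeom_tail`, `tfidf_scores`, and `hypergeom_scores` are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def hypergeom_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n)."""
    total = math.comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return hits / total

def tfidf_scores(docs, d):
    """Assumed variant: raw count in document d times log(D / df)."""
    tf = Counter(docs[d])
    return {t: c * math.log(len(docs) / sum(t in doc for doc in docs))
            for t, c in tf.items()}

def hypergeom_scores(docs, d):
    """Negative log of the hypergeometric tail probability for each term of d.

    Assumption: k is the term's count in document d, K its count in the
    whole collection, n the document length, N the collection length.
    """
    collection = Counter(tok for doc in docs for tok in doc)
    N, n = sum(collection.values()), len(docs[d])
    return {t: -math.log(hypergeom_tail(k, N, collection[t], n))
            for t, k in Counter(docs[d]).items()}

# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the cat".split(),
]

tfidf = tfidf_scores(docs, 2)
hyper = hypergeom_scores(docs, 2)
for term in tfidf:
    print(f"{term:8s} tf-idf={tfidf[term]:.3f}  -log(p)={hyper[term]:.3f}")
```

Both scores put the ubiquitous term "the" last here, but on a corpus this small they need not rank the remaining terms identically. The abstract's claim of close correspondence concerns real-data retrieval, summarization, and classification tasks, which this sketch does not reproduce.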