Generalized Entropies and the Similarity of Texts

November 11, 2016 Β· Declared Dead Β· πŸ› arXiv.org

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Eduardo G. Altmann, Laercio Dias, Martin Gerlach arXiv ID 1611.03596 Category physics.soc-ph Cross-listed cs.CL Citations 21 Venue arXiv.org Last Checked 3 months ago
Abstract
We show how generalized Gibbs-Shannon entropies can provide new insights on the statistical properties of texts. The universal distribution of word frequencies (Zipf's law) implies that the generalized entropies, computed at the word level, are dominated by words in a specific range of frequencies. Here we show that this is the case not only for the generalized entropies but also for the generalized (Jensen-Shannon) divergences, used to compute the similarity between different texts. This finding allows us to identify the contribution of specific words (and word frequencies) for the different generalized entropies and also to estimate the size of the databases needed to obtain a reliable estimation of the divergences. We test our results in large databases of books (from the Google n-gram database) and scientific papers (indexed by Web of Science).
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” physics.soc-ph

R.I.P. πŸ‘» Ghosted

Scale-free networks are rare

Anna D. Broido, Aaron Clauset

physics.soc-ph πŸ› Nat. Commun. πŸ“š 988 cites 8 years ago

Died the same way β€” πŸ‘» Ghosted