Universal and non-universal text statistics: Clustering coefficient for language identification

November 18, 2019 · Declared Dead · 🏛 Physica A: Statistical Mechanics and its Applications

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Diego Espitia, Hernán Larralde
arXiv ID: 1911.08915
Category: physics.soc-ph
Cross-listed: cs.CL
Citations: 1
Venue: Physica A: Statistical Mechanics and its Applications
Last Checked: 4 months ago

Abstract
In this work we analyze statistical properties of 91 relatively small texts in 7 different languages (Spanish, English, French, German, Turkish, Russian, Icelandic) as well as texts with randomly inserted spaces. Despite the size (around 11260 different words), the well known universal statistical laws -- namely Zipf and Herdan-Heap's laws -- are confirmed, and are in close agreement with results obtained elsewhere. We also construct a word co-occurrence network of each text. While the degree distribution is again universal, we note that the distribution of Clustering Coefficients, which depend strongly on the local structure of networks, can be used to differentiate between languages, as well as to distinguish natural languages from random texts.
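
Since no code accompanies the paper, the following is only a minimal Python sketch of the quantities the abstract names, not the authors' implementation. It assumes a simple letter-based tokenizer and a co-occurrence network in which two distinct words are linked when they appear next to each other; the abstract does not state the network definition, so that choice and all function names here are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def rank_frequency(tokens):
    """Word counts in decreasing order: the data behind a Zipf rank-frequency plot."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token: the Herdan-Heaps curve."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def cooccurrence_network(tokens):
    """Undirected graph linking adjacent distinct words (assumed window of size 2)."""
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def clustering_coefficients(adj):
    """Local clustering coefficient: fraction of a node's neighbor pairs that are linked."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            # Fewer than two neighbors: no pairs to link, use 0 by convention.
            coeffs[node] = 0.0
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs
```

Applied to each text, the values returned by `clustering_coefficients` could be collected into a histogram and compared across languages; how the paper bins or compares these distributions is not stated in the abstract.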

📜 Similar Papers

In the same crypt - physics.soc-ph

R.I.P. 👻 Ghosted

Scale-free networks are rare

Anna D. Broido, Aaron Clauset

physics.soc-ph 🏛 Nat. Commun. 📚 988 cites 8 years ago

Died the same way - 👻 Ghosted