Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach

November 07, 2024 · Journal of Quantitative Linguistics

Authors: Gideon Yoffe, Nachum Dershowitz, Ariel Vishne, Barak Sober
arXiv ID: 2411.04950
Category: cs.CL (Computation & Language)
Citations: 0
Venue: Journal of Quantitative Linguistics

Abstract

We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties (such as thematic continuity) on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework across a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised classification settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, revealing that supervised and neural models are more prone to false positives, mistaking shared themes and cross-genre differences for stylistic signals. In contrast, unsupervised models using traditional features often yield high true positive rates with minimal false positives, especially in genre-consistent settings. By disentangling sequential from non-sequential influences, our approach provides a principled way to assess and interpret classification reliability. This is particularly impactful for authorship attribution, forensic linguistics, and the analysis of redacted or composite texts, where conventional methods may conflate theme with style. Our results demonstrate that controlling for sequential correlation is essential for reducing false positives and ensuring that classification outcomes reflect genuine stylistic distinctions.
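
The surrogate-labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary labels over an ordered sequence of text segments, and it swaps in one simple way to generate surrogates (a Gaussian latent sequence drawn from the empirical autocovariance matrix, then rank-thresholded to keep the original class balance), which preserves the sequential dependence only approximately. All function names and the centroid-gap score `mean_gap` are hypothetical and chosen for illustration.

```python
import numpy as np


def empirical_autocovariance(labels):
    """Biased autocovariance estimate of a 1-D label sequence at lags 0..n-1."""
    x = np.asarray(labels, dtype=float)
    x = x - x.mean()
    n = x.size
    # "full" mode returns lags -(n-1)..(n-1); keep the non-negative lags.
    return np.correlate(x, x, mode="full")[n - 1:] / n


def surrogate_labels(labels, n_surrogates, rng):
    """Draw binary surrogate labelings with a similar sequential structure.

    Assumption (not from the paper): sample a Gaussian sequence whose
    covariance is the Toeplitz matrix of empirical autocovariances, then
    mark the k largest values as class 1, where k is the original count.
    """
    labels = np.asarray(labels)
    n = labels.size
    acov = empirical_autocovariance(labels)
    # Toeplitz autocovariance matrix: entry (i, j) holds the lag |i - j|.
    lags = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    cov = acov[lags]
    # Symmetric square root; clip tiny negative eigenvalues from round-off.
    eigvals, eigvecs = np.linalg.eigh(cov)
    root = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    latent = root @ rng.standard_normal((n, n_surrogates))
    k = int(labels.sum())
    order = np.argsort(-latent, axis=0)
    out = np.zeros((n, n_surrogates), dtype=int)
    np.put_along_axis(out, order[:k], 1, axis=0)
    return out.T  # shape: (n_surrogates, n)


def mean_gap(features, labels):
    """Hypothetical score: distance between the two class centroids."""
    a = features[labels == 1].mean(axis=0)
    b = features[labels == 0].mean(axis=0)
    return float(np.linalg.norm(a - b))


def surrogate_p_value(features, labels, score_fn, n_surrogates=500, seed=0):
    """One-sided p-value of the observed score against the surrogate null."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = score_fn(features, labels)
    null = np.array([score_fn(features, s)
                     for s in surrogate_labels(labels, n_surrogates, rng)])
    return (1 + np.sum(null >= observed)) / (1 + n_surrogates)
```

Calling `surrogate_p_value(features, labels, mean_gap)` with an `(n, d)` embedding array and a length-`n` 0/1 label vector returns the fraction of surrogates that separate the embeddings at least as well as the real labeling. Under these assumptions, a large value suggests the observed separation could be explained by sequential correlation alone.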