Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance

November 13, 2022 ยท The Ethereal ยท + Add venue

๐Ÿ”ฎ THE ETHEREAL: The Ethereal
Pure theory โ€” exists on a plane beyond code

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Gianfranco Bilardi, Michele Schimd arXiv ID 2211.07644 Category cs.FL: Formal Languages Cross-listed cs.DS, math.PR Citations 4 Last Checked 2 months ago
Abstract
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $ฮฑ_{k}(n) = e_k(n)/n$ as well as of $ฮฑ_{k} = \lim_{n \to \infty} ฮฑ_{k}(n)$, a limit known to exist. This paper shows that $ฮฑ_k(n)-Q(n) \leq ฮฑ_k \leq ฮฑ_k(n)$, for a specific $Q(n)=ฮ˜(\sqrt{\log n / n})$, a result which implies that $ฮฑ_k$ is computable. The exact computation of $ฮฑ_k(n)$ is explored, leading to an algorithm running in time $T=\mathcal{O}(n^2k\min(3^n,k^n))$, a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how $ฮฑ_k(n)$ can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of $n$ say up to a quarter million. Correspondingly, 99.9\% confidence intervals of width approximately $10^{-2}$ are obtained for $ฮฑ_k$. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound $ฮฒ_k^*$ to $ฮฑ_k$, such that $ \lim_{k \to \infty} ฮฒ_k^*=1$. In general, $ฮฒ_k^* \leq ฮฑ_k \leq 1-1/k$; for $k$ greater than a few dozens, computing $ฮฒ_k^*$ is much faster than generating good statistical estimates with confidence intervals of width $1-1/k-ฮฒ_k^*$. The techniques developed in the paper yield improvements on most previously published numerical values as well as results for alphabet sizes and string lengths not reported before.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Formal Languages