TY - JOUR
T1 - Large-scale detection of repetitions
AU - Smyth, William
PY - 2014
Y1 - 2014
N2 - Combinatorics on words began more than a century ago with a demonstration that an infinitely long string with no repetitions could be constructed on an alphabet of only three letters. Computing all the repetitions (such as ...TTT... or ...CGACGA...) in a given string x of length n is one of the oldest and most important problems of computational stringology, requiring O(n log n) time in the worst case. About a dozen years ago, it was discovered that repetitions can be computed as a by-product of the Θ(n)-time computation of all the maximal periodicities or runs in x. However, even though the computation is linear, it is also brute force: global data structures, such as the suffix array, the longest common prefix array and the Lempel-Ziv factorization, need to be computed in a preprocessing phase. Furthermore, all of this effort is required despite the fact that the expected number of runs in a string is generally a small fraction of the string length. In this paper, I explore the possibility that repetitions (perhaps also other regularities in strings) can be computed in a manner commensurate with the size of the output. © 2014 The Author(s). Published by the Royal Society.
AB - Combinatorics on words began more than a century ago with a demonstration that an infinitely long string with no repetitions could be constructed on an alphabet of only three letters. Computing all the repetitions (such as ...TTT... or ...CGACGA...) in a given string x of length n is one of the oldest and most important problems of computational stringology, requiring O(n log n) time in the worst case. About a dozen years ago, it was discovered that repetitions can be computed as a by-product of the Θ(n)-time computation of all the maximal periodicities or runs in x. However, even though the computation is linear, it is also brute force: global data structures, such as the suffix array, the longest common prefix array and the Lempel-Ziv factorization, need to be computed in a preprocessing phase. Furthermore, all of this effort is required despite the fact that the expected number of runs in a string is generally a small fraction of the string length. In this paper, I explore the possibility that repetitions (perhaps also other regularities in strings) can be computed in a manner commensurate with the size of the output. © 2014 The Author(s). Published by the Royal Society.
UR - https://www.scopus.com/pages/publications/84899106168
U2 - 10.1098/rsta.2013.0138
DO - 10.1098/rsta.2013.0138
M3 - Article
C2 - 24751872
SN - 1364-503X
VL - 372
SP - 10pp
JO - Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
JF - Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
IS - 2016
ER -