main page — CS 566 Efficient Algorithms
Unit 13: Text Indexing – Searching whole genomes
This unit covers data structures over static text that enable very efficient searching in the indexed text, including some more recent advances in this field. We cover
- inverted indexes
- suffix trees
- applications of suffix trees: string matching, longest repeated substrings, longest common substrings
- suffix arrays
- linear-time construction for suffix arrays
Learning outcomes
- Know and understand methods for text indexing: inverted indexes, suffix trees, (enhanced) suffix arrays
- Know and understand generalized suffix trees
- Know properties, in particular performance characteristics, and limitations of the above data structures.
- Design (simple) algorithms based on suffix trees.
- Understand construction algorithms for suffix arrays.
Material
- slides
- Video 13-1 (2026-02-02):
§13.1 Motivation, inverted indexes, tries
- Video 13-2 (2026-02-02):
§13.2 Suffix trees
- Video 13-3 (2026-02-02):
§13.3 Applications of suffix trees: string matching, longest repeated substring
- Video 13-4 (2026-02-02):
§13.4 Generalized suffix trees, longest common substring
- Video 13-5 (2026-02-02):
§13.5 Suffix arrays
- Video 13-6 (2026-02-03):
§13.5 Suffix array via string sorting
- Video 13-7 (2025-02-03):
§13.6 Linear-time suffix sorting – Induced sorting and merging
- Video 13-8 (2025-02-03):
§13.7 Linear-time suffix sorting – DC3 algorithm
Further reading and sources
There is no textbook that covers all topics, but parts can be found in the following sources, all of which cover much more than the topics I selected.
Tries: These are notoriously treated only in passing in most algorithms and data structures textbooks, and there are various little differences in the details. Apart from the fact that they do not insist on prefix-free keys, I find the most helpful treatment of tries is given by Sedgewick and Wayne:
Suffix trees:
- Gusfield: Algorithms on Strings, Trees, and Sequences (1997) Cambridge University Press
- Crochemore, Rytter: Jewels of Stringology (2002), World Scientific
Suffix arrays have not yet found widespread inclusion in standard textbooks; my way of presenting it is mostly my own, but took inspiration from
Suffix arrays and LCP arrays are also discussed (in slightly different form) in