main page — COMP 526 Efficient Algorithms
Unit 6: Text Indexing – Searching whole genomes
This is an archived version of this module from Spring 2021.
Click here for the current iteration.
This unit covers data structures over static text
that enable very efficient searching in the indexed text,
including some more recent advances in this field.
We cover
- inverted indices
- suffix trees
- applications of suffix trees: string matching, longest repeated substrings, longest common substrings, k-mismatch matching
- suffix arrays, LCP array
- linear-time construction for suffix and LCP arrays
Learning outcomes
- Know and understand methods for text indexing: inverted indices, suffix trees, (enhanced) suffix arrays
- Know and understand generalized suffix trees
- Know properties, in particular performance characteristics, and limitations of the above data structures.
- Design (simple) algorithms based on suffix trees.
- Understand construction algorithms for suffix arrays and LCP arrays.
Material
- slides
- lecture notes
- Video 6-1 (2021-03-16):
§6.1 Motivation, inverted indices
- Video 6-2 (2021-03-17):
§6.2 Suffix trees
- Video 6-3 (2021-03-17):
§6.3 Applications of suffix trees
- Video 6-4 (2021-03-17):
§6.4 Longest common extensions
- Video 6-5 (2021-04-13):
§6.4 Longest common extensions (part 2)
- Video 6-6 (2021-04-13):
§6.5 Suffix arrays
- Video 6-6a (2021-04-14):
Addendum to §6.5 Suffix arrays
- Video 6-7 (2021-04-13):
§6.6 Linear-time suffix sorting
- Video 6-8 (2021-04-14):
§6.6 Linear-time suffix sorting (part 2)
- Video 6-9 (2021-04-14):
§6.7 LCP arrays
- Video 6-10 (2021-04-14):
§6.7 LCP array construction & back to suffix trees
Further reading and sources
There is no textbook that covers all topics, but parts can be found in the following sources, all of which cover much more than the topics I selected.
Tries: These are notoriously treated only in passing in most algorithms and data structures textbooks, and there are various little differences in the details. Apart from the fact that they do not insist on prefix-free keys, I find the most helpful treatment of tries is given by Sedgewick and Wayne:
Suffix trees:
- Gusfield: Algorithms on Strings, Trees, and Sequences (1997) Cambridge University Press
- Crochemore, Rytter: Jewels of Stringology (2002), World Scientific
Suffix arrays have not yet found widespread inclusion in standard textbooks; my way of presenting it is mostly my own, but took inspiration from
Suffix arrays and LCP arrays are also discussed (in slightly different form) in