main page — COMP 526 Efficient Algorithms
Unit 6: Text Indexing – Searching whole genomes
This is an archived version of this module from Spring 2020.
Click here for the current iteration.
This unit covers data structures over static text that enable very efficient searching in the indexed text, including some more recent advances in this field. We cover
- inverted indices
- suffix trees
- applications of suffix trees: string matching, longest repeated substrings, longest common substrings, k-mismatch matching
- suffix arrays, LCP array
- linear-time construction for suffix and LCP arrays
Learning outcomes
- Know and understand methods for text indexing: inverted indices, suffix trees, (enhanced) suffix arrays
- Know and understand generalized suffix trees
- Know properties, in particular performance characteristics, and limitations of the above data structures.
- Design (simple) algorithms based on suffix trees.
- Understand construction algorithms for suffix arrays and LCP arrays.
Material
- lecture notes
- slides
- tutorial problems
- George’s tutorial videos
- Lecture 2020-03-09: Tries
- Lecture 2020-03-10: Tries (part 2)
- Lecture 2020-03-10: Suffix trees
- Lecture 2020-03-10: Applications of suffix trees
- Lecture 2020-03-10: Longest common extensions
- Lecture 2020-03-10: Suffix arrays
- Lecture 2020-03-16: Suffix arrays (part 2)
- Lecture 2020-03-16: Linear-time suffix sorting
- Lecture 2020-03-17: Linear-time suffix sorting (part 2)
- Lecture 2020-03-17: LCP arrays
Further reading and sources
There is no textbook that covers all topics, but parts can be found in the following sources, all of which cover much more than the topics I selected.
Tries: These are notoriously treated only in passing in most algorithms and data structures textbooks, and there are various little differences in the details. Apart from the fact that they do not insist on prefix-free keys, I find the most helpful treatment of tries is given by Sedgewick and Wayne:
Suffix trees:
- Gusfield: Algorithms on Strings, Trees, and Sequences (1997) Cambridge University Press
- Crochemore, Rytter: Jewels of Stringology (2002), World Scientific
Suffix arrays have not yet found widespread inclusion in standard textbooks; my way of presenting it is mostly my own, but took inspiration from
Suffix arrays and LCP arrays are also discussed (in slightly different form) in