main page  —  COMP 526 Applied Algorithmics

Unit 8: Text Indexing – Searching whole genomes

This unit covers data structures over static text that enable very efficient searching in the indexed text, including some more recent advances in this field. We cover

  • inverted indices
  • suffix trees
  • applications of suffix trees: string matching, longest repeated substrings, longest common substrings, k-mismatch matching
  • suffix arrays, LCP array
  • linear-time construction for suffix and LCP arrays

Learning outcomes

  1. Know and understand methods for text indexing: inverted indices, suffix trees, (enhanced) suffix arrays
  2. Know and understand generalized suffix trees
  3. Know properties, in particular performance characteristics, and limitations of the above data structures.
  4. Design (simple) algorithms based on suffix trees.
  5. Understand construction algorithms for suffix arrays and LCP arrays.

Material

  • slides
  • lecture notes

  • Video 8-1 (2023-11-24): §8.1 Motivation, inverted indices, tries

  • Video 8-2 (2023-11-24): §8.2 Suffix trees

  • Video 8-3 (2023-11-24): §8.3 Applications of suffix trees: string matching

  • Video 8-4 (2023-11-30): §8.3 Applications of suffix trees (part 2): longest repeated substring, longest common substring

  • Video 8-5 (2023-11-30): §8.4 Longest common extensions

  • Video 8-6 (2023-11-30): §8.5 Suffix arrays

  • Video 8-7 (2023-12-01): §8.6 Linear-time suffix sorting – Induced sorting and merging

  • Video 8-8 (2023-12-01): §8.7 Linear-time suffix sorting – DC3 algorithm

  • Video 8-9 (2023-12-01): §8.8 The LCP array


Further reading and sources

There is no textbook that covers all topics, but parts can be found in the following sources, all of which cover much more than the topics I selected.

Tries: These are notoriously treated only in passing in most algorithms and data structures textbooks, and there are various little differences in the details. Apart from the fact that they do not insist on prefix-free keys, I find the most helpful treatment of tries is given by Sedgewick and Wayne:

Suffix trees:

Suffix arrays have not yet found widespread inclusion in standard textbooks; my way of presenting it is mostly my own, but took inspiration from

Suffix arrays and LCP arrays are also discussed (in slightly different form) in


Unit 7  ⋅  Syllabus  ⋅  Unit 9