main page  —  COMP 526 Applied Algorithmics

Unit 6: Text Indexing – Searching whole genomes

This is an archived version of this module from Spring 2020.
Click here for the current iteration.

This unit covers data structures over static text that enable very efficient searching in the indexed text, including some more recent advances in this field. We cover

  • inverted indices
  • suffix trees
  • applications of suffix trees: string matching, longest repeated substrings, longest common substrings, k-mismatch matching
  • suffix arrays, LCP array
  • linear-time construction for suffix and LCP arrays

Learning outcomes

  1. Know and understand methods for text indexing: inverted indices, suffix trees, (enhanced) suffix arrays
  2. Know and understand generalized suffix trees
  3. Know properties, in particular performance characteristics, and limitations of the above data structures.
  4. Design (simple) algorithms based on suffix trees.
  5. Understand construction algorithms for suffix arrays and LCP arrays.

Material

  • lecture notes
  • slides
  • tutorial problems
  • George’s tutorial videos
  • Lecture 2020-03-09: Tries

  • Lecture 2020-03-10: Tries (part 2)

  • Lecture 2020-03-10: Suffix trees

  • Lecture 2020-03-10: Applications of suffix trees

  • Lecture 2020-03-10: Longest common extensions

  • Lecture 2020-03-10: Suffix arrays

  • Lecture 2020-03-16: Suffix arrays (part 2)

  • Lecture 2020-03-16: Linear-time suffix sorting

  • Lecture 2020-03-17: Linear-time suffix sorting (part 2)

  • Lecture 2020-03-17: LCP arrays

Further reading and sources

There is no textbook that covers all topics, but parts can be found in the following sources, all of which cover much more than the topics I selected.

Tries: These are notoriously treated only in passing in most algorithms and data structures textbooks, and there are various little differences in the details. Apart from the fact that they do not insist on prefix-free keys, I find the most helpful treatment of tries is given by Sedgewick and Wayne:

Suffix trees:

Suffix arrays have not yet found widespread inclusion in standard textbooks; my way of presenting it is mostly my own, but took inspiration from

Suffix arrays and LCP arrays are also discussed (in slightly different form) in