Adaptive sorting for large keys, strings, and database rows
In the context of a (row-oriented) database system, sorting the rows of a table is a fundamental part of query processing and index creation. In this paper, we present a new adaptive sorting algorithm that is particularly well-suited for sorting large keys, strings, and database rows, based on Powersort.
Since comparisons are often very expensive – e.g., when sorting wrt. several columns, each potentially containing data with custom comparison functions such as international strings – a classic optimization is to first generate normalized keys, long binary strings that encode the sort order of the rows. Sorting rows is thus akin to sorting strings, and we can use some the same techniques, in particular remembering longest common prefixes of rows (or indeed, offset-value codes, which additionally contain the next few characters after the LCP) to avoids redundant character comparisons.
In this paper, we combine these techniques from string sorting with the run-adaptive Powersort, which avoids entire string comparisons when the input is partially sorted. We observe that the savings from both approaches are indeed orthogonal and nicely combine into overall performance improvements.