Quicksort Is Optimal for Many Equal Keys

Jan 2018 (Written: Apr 2017)

Sebastian Wild:

Meeting on Analytic Algorithmics and Combinatorics 2018
in Nebel M., Wagner S. (eds.): ANALCO 2018, SIAM, pp 8–22

In this paper, I give the first analysis of Quicksort with median-of-$k$ pivot sampling on inputs with equal elements. I presented this result at ANALCO 2018; the slides are available on speakerdeck.

I also presented the main result at the informal AofA meeting 2017.

The result settles—under some restrictions—a long-standing open problem, the Sedgewick-Bentley conjecture, posed in two talks by Robert Sedgewick:

Median-of-$k$ Quicksort with fat-pivot partitioning has asymptotically optimal average behavior on any randomly ordered input up to a constant factor, and that factor converges to 1 as $k\to\infty$.

(New research on the theory and practice of sorting and Quicksort is Optimal)

Fat-pivot partitioning means that we split the input into three segments: elements strictly smaller than the pivot, strictly larger than the pivot, and equal to the pivot. Hence all duplicates of the pivot are removed from recursive calls.

Input Model: Many Duplicates

The input model are random permutations of fixed multisets, i.e., there are $u$ different values occurring exactly $x_1,\ldots,x_u$ times in the input. For this general setting, no progress has been made since Sedgewick’s seminal 1977 paper.

I therefore restrict the inputs to allow only those that have “many duplicates”: every of the $u$ distinct values appears $\Omega(n^\varepsilon)$ times in expectation for some constant $\varepsilon>0$.

Having many duplicates is arguably the most relevant scenario for studying how Quicksort performs:

This is the setting that differs most from the classic distinct-keys case.
It is the setting that seems to suggest the use of tailor-based methods, so that the result that the general-purpose Quicksort performs just as good is most interesting.

I have no reason to believe that the below result is not true for the unrestricted case, but there are several technical reasons why the restriction is hard to drop.

The result

In a nutshell, we find that Quicksort needs asymptotically at most a factor $\alpha_k = \ln(2) / (H_{k+1} - H_{(k+1)/2})$ more comparisons than needed by any comparison-based method to sort a random multiset permutation with many duplicates.

The key idea is to consider a related input model where each value is drawn i.i.d. (independent and identically distributed) from a distribution with probability weights $\mathbf q = (x_1/n,\ldots,x_u/n)$.

Step 1: Separating n and q

The first ingredient for the analysis is a separation theorem that expresses Quicksort’s cost as $a_{\mathbf q} n \pm O(n^{1-\varepsilon})$.

A form of this theorem is already given in my dissertation (Theorem 8.16) for the general case of $s$-way Quicksort with generic one-pass partitioning. In my dissertation, I used a stronger many-duplicates assumption, but for this step it can be generalized. The ANALCO paper gives the detailed proof for median-of-$k$ Quicksort on $n$ elements drawn i.i.d. from any discrete distribution $\mathbf q$.

Step 2: Path Length of Fringe-Balanced Trees with duplicates

The term $a_{\mathbf q}$ corresponds to the average search costs in a fringe-balanced search tree built from inserting $\mathbf q$ distributed values until the tree contains all $u$ possible values.

Fringe-balanced trees are the search-tree equivalent of median-of-$k$ Quicksort: Leaves collect up to $k$ elements before a new internal nodes is created, which then get the median of the $k$ collected elements as label. That way, the bottommost subtrees, i.e., those at the fringe of the tree, have more balanced splits than in an ordinary search tree.

We can show that $a_{\mathbf q}$ is $\alpha_k$ times the entropy of $\mathbf q$ in an asymptotic sense (unless the distribution becomes degenerate, i.e., the entropy is bounded by a constant). That result might be of independent interest.

Step 3: Translating between the models

These two steps together imply the main result for i.i.d. inputs with many duplicates. We can transfer them back to an upper bound for the multiset model.

General Case for Discrete Uniform Inputs

In the general setting for multiway Quicksort, an explicit expression for $a_{\mathbf q}$ is still to be found, although it is naturally to conjecture a similar entropy-bound as above.

For the special case of a discrete uniform distribution in $ \lbrace 1,\ldots,u \rbrace $, I found a general asymptotic solution in my dissertation (Theorem 8.17) that even applies to more cost measures than comparisons. This case is particularly nice in that—under the many duplicates assumption—the costs for Quicksort have the same form as for random permutations, only with $\ln(n)$ replaced by $\ln(u)$.

I presented this part of the analysis in Dagstuhl, see the slides on slideshare.