Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs

May 2018 (Written: May 2018)

J. Ian Munro and Sebastian Wild:

European Symposium on Algorithms 2018
in Y. Azar and H. Bast and G. Herman (Eds.): ESA 2018, LIPIcs 112, Dagstuhl, 2018, pp 63:1–63:16

Breaking news! [2021-09-07]

Powersort replaces Timsort’s merging policy in the Python library!

If you are using a cutting-edge version of Python you are running powersort whenever you sort a Python list. More precisely, both the CPython reference implementation of the Python programming language, and PyPy, a widely used alternative making heavy use of just-in-time compilation have now adopted powersort; see the changelog for Python 3.11 (search for powersort), and the issue on Python’s bug tracker.

CPython

CPython is the reference implementaton of Python; unless you explicitly install something else, python on your machine will be CPython. Powersort is used in Tim Peters’s new C-code for sorting lists, see listobjects.c. The list sorting algorithm is described in Tim Peters’s words in listsort.txt, including lots of discussion of the taken design decisions and some benchmark results.

PyPy

Powersort has also been adapted by Carl Friedrich Bolz-Tereick for PyPy. See the Python source for powersort on PyPy heptapod: listsort.py.

###

Since June 2025, powersort is also used in numpy, replacing the former Timsort implementation in the development branch.

Other libraries

Powersort is also used in AssemblyScript with apparently quite impressive results.

Adaptive sorting

Mergesort can make use of existing order in the input by picking up existing runs, i.e., segments that are already sorted. Since the lengths of these runs can be arbitrary, simply merging them as they arrive can be wasteful—merging can degenerate to inserting a single element into a long run.

In this paper, we show that we can find an optimal merging order (up to lower order terms of costs) with negligible overhead and thereby get the same worst-case guarantee as for classic mergesort (up to lower order terms), while exploiting existing runs if present. Our two new mergesort variants, peeksort and powersort, are simple, stable, optimally adaptive and fast in practice (never slower than classic mergesort and Timsort, but significantly faster on certain inputs, see §4 of the paper).

Natural Mergesort

The simplest version of run-adaptive mergesort starts with bottom-up mergesort, but instead of blindly treating every element as a minimal run of length 1, we detect existing sorted segments. Knuth presents a clever implementation of this strategy under the name “natural mergesort” (Algorithm 5.2.4–N in TAOCP). It detects runs while merging them and so avoids to ever keep track of all the run lengths.

This can save a lot of work if long runs are present, but the merging order can be suboptimal (see below) and the merging procedure itself uses additional comparisons to re-detect the run boundaries in each pass.

A bad case for natural mergesort

An example where merging left to right is wasteful is the following class of inputs. Suppose $n$ is a power of two and the the runs lengths are $n/2, 1, 1, 2, 4, 8, \ldots, n/4$. For $n=64$, the runs look like the following:

|================================|=|=|==|====|========|================|

There are $1+\lg n$ runs, so bottom-up mergesort would do $\lceil\lg\lg n\rceil$ passes, in each pass touching all $n$ elements for a total cost of $n \lceil\lg\lg n\rceil$. However, by iteratively merging the shortest runs, the average number of merges one element participates in is $2$, so we can merge these runs with total cost of $2n$.

Mergesort and optimal binary search trees

The problem of determining the best order for merges turns out to be equivalent to well-studied coding problems. Indeed, if we take the cost of one merge to be the output size (“merge costs” model), the total merge cost is exactly the total weighted code length of a corresponding prefix-free code, where the symbols are the runs and their weights are the lengths of the runs.

An obvious choice is thus to simulate the Huffman code (iteratively merging shortest runs, as in the above example), but in general, that does not lead to a stable sort since it might involve merging nonadjacent runs. Optimal stable options correspond to alphabetic a.k.a. Hu-Tucker codes, or equivalently, optimal binary search trees.

While these approaches principally work, the overhead for finding the merging order easily exceeds the benefit gained. Recall that even in the specifically crafted example above, the improvement was to replace a multiplicative $\lg\lg n$ by $2$, which, in this universe, is a speedup by a factor of 2–3 really. So if such an optimized mergesort is to result in an overall faster sorting method, the merging order must be found with negligible overhead.

Nearly-optimal mergesort

We therefore rely on methods to compute nearly-optimal binary search trees, simple linear-time methods that compute search trees that are within a small additive constant of the entropy (and thus the optimal tree). They consist in greedily choosing the root so as to balance the topmost split as much as possible, and recursively build the subtrees. There are several variants that differ in the details.

Method 1, a.k.a. the weight-balancing rule, is described in one of Kurt Mehlhorn’s earliest papers ever: Nearly Optimal Binary Search Trees. It uses the most straight-forward way to recurse.
Method 2, a.k.a. the bisection rule, was proposed by Kurt Mehlhorn shortly after that in A Best Possible Bounds for the Weighted Path Length of Binary Search Trees. It continues halving the original interval irrespective of the actual split at the root.

Figure 1 in the paper illustrates the two methods.

Peeksort

Method 1 leads to peeksort: we simulate cutting the probability in half by finding the run boundary closest to the middle of the array. We can do that by simply scanning left and right from the middle, until we find the first out-of-order pair. With two additional indices passed down to the recursive calls, we can avoid every scanning the same run twice in this “peeking-at-the-middle” process.

The awesome property of peeksort is that we do not have to detect the runs up front; we can delay this step until we really need to know a boundary. In case the input does not contain long runs, we do not much more work than standard top-down mergesort.

Powersort

Method 2—or, indeed Method 2’, a slight modification of the bisection rule—is the basis for powersort.

This method is much more similar to Timsort than to standard mergesort: it proceeds in one pass from left to right over the input, and detects the next run in the input. For each new run, we may decide to do some merges now, or we delay them and keep the runs on a to-do-stack.

While Timsort uses a rather complicated set of rules to decide what and when to merge, powersort assigns each adjacent pair of runs an easy to compute integer, its “power” and upon arrival of a new pair, simply executes all postponed merges of higher power.

Like peeksort, powersort provably adapts optimally to existing runs (up to linear terms), something Timsort provably does not.