Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs

## Breaking news! [2021-09-07]

*Powersort replaces Timsort’s merging policy
in the Python library!*

If you are using a cutting-edge version of Python you are running powersort whenever you sort a Python list.
More precisely, both the *CPython* reference implementation of the Python programming language,
and *PyPy*, a widely used alternative making heavy use of
just-in-time compilation have now adopted powersort;
see the changelog for Python 3.11 (search for powersort),
and the issue on Python’s bug tracker.

### CPython

CPython is the reference implementaton of Python;
unless you explicitly install something else, `python`

on your machine will be CPython.
Powersort is used in Tim Peters’s new C-code for sorting lists, see `listobjects.c`

.
The list sorting algorithm is described in Tim Peters’s words in
`listsort.txt`

,
including lots of discussion of the taken design decisions and some benchmark
results.

### PyPy

Powersort has also been adapted by Carl Friedrich Bolz-Tereick for PyPy.
See the Python source for powersort on PyPy heptapod: `listsort.py`

.

### Other libraries

Powersort is also used in AssemblyScript with apparently quite impressive results.

## Adaptive sorting

Mergesort can make use of existing order in the input by
picking up *existing runs*, i.e., segments that are already sorted.
Since the lengths of these runs can be arbitrary,
simply merging them as they arrive can be wasteful—merging can
degenerate to inserting a single element into a long run.

In this paper, we show that we can find an *optimal* merging order (up to lower order
terms of costs) with negligible overhead and thereby get the *same worst-case guarantee*
as for classic mergesort (up to lower order terms),
while exploiting existing runs if present.
Our two new mergesort variants,
*peeksort* and *powersort*, are simple, stable,
optimally adaptive and fast in practice (never slower
than classic mergesort and Timsort,
but significantly faster on certain inputs, see §4 of the paper).

## Natural Mergesort

The simplest version of run-adaptive mergesort starts with
bottom-up mergesort,
but instead of blindly treating every element as a minimal run of length 1,
we detect existing sorted segments.
Knuth presents a clever implementation of this strategy under the name *“natural mergesort”*
(Algorithm 5.2.4–N in TAOCP).
It detects runs while merging them and so avoids to ever keep track of all the
run lengths.

This can save a lot of work if long runs are present, but the merging order can be suboptimal (see below) and the merging procedure itself uses additional comparisons to re-detect the run boundaries in each pass.

### A bad case for natural mergesort

An example where merging left to right is wasteful is the following class of inputs. Suppose $n$ is a power of two and the the runs lengths are $n/2, 1, 1, 2, 4, 8, \ldots, n/4$. For $n=64$, the runs look like the following:

```
|================================|=|=|==|====|========|================|
```

There are $1+\lg n$ runs, so bottom-up mergesort would do $\lceil\lg\lg n\rceil$ passes, in each pass touching all $n$ elements for a total cost of $n \lceil\lg\lg n\rceil$. However, by iteratively merging the shortest runs, the average number of merges one element participates in is $2$, so we can merge these runs with total cost of $2n$.

## Mergesort and optimal binary search trees

The problem of determining the best order for merges turns out to be equivalent
to well-studied coding problems.
Indeed, if we take the cost of one merge to be the output size (*“merge costs” model*),
the total merge cost is exactly the total weighted code length of a corresponding
prefix-free code, where the symbols are the runs and their weights are the
lengths of the runs.

An obvious choice is thus to simulate the Huffman code
(iteratively merging shortest runs, as in the above example),
but in general, that does not lead to a *stable* sort
since it might involve merging nonadjacent runs.
Optimal stable options correspond to alphabetic a.k.a. Hu-Tucker codes, or equivalently, optimal binary search trees.

While these approaches principally work, the overhead for finding the merging order easily exceeds the benefit gained. Recall that even in the specifically crafted example above, the improvement was to replace a multiplicative $\lg\lg n$ by $2$, which, in this universe, is a speedup by a factor of 2–3 really. So if such an optimized mergesort is to result in an overall faster sorting method, the merging order must be found with negligible overhead.

## Nearly-optimal mergesort

We therefore rely on methods to compute
*nearly-optimal binary search trees*, simple linear-time methods that compute
search trees that are within a small additive constant of the entropy
(and thus the optimal tree).
They consist in greedily choosing the root so as to balance the topmost split
as much as possible, and recursively build the subtrees.
There are several variants that differ in the details.

**Method 1**, a.k.a. the*weight-balancing rule,*is described in one of Kurt Mehlhorn’s earliest papers ever:*Nearly Optimal Binary Search Trees*. It uses the most straight-forward way to recurse.**Method 2**, a.k.a. the*bisection rule,*was proposed by Kurt Mehlhorn shortly after that in*A Best Possible Bounds for the Weighted Path Length of Binary Search Trees*. It continues halving the original interval irrespective of the actual split at the root.

Figure 1 in the paper illustrates the two methods.

### Peeksort

Method 1 leads to *peeksort:* we simulate cutting the probability in half
by *finding the run boundary closest to the middle of the array.*
We can do that by simply scanning left and right from the middle, until
we find the first out-of-order pair.
With two additional indices passed down to the recursive calls, we can avoid
every scanning the same run twice in this “peeking-at-the-middle” process.

The awesome property of peeksort is that we do *not* have to detect the runs
*up front;* we can delay this step until we really need to know a boundary.
In case the input does not contain long runs, we do not much more work
than standard top-down mergesort.

### Powersort

Method 2—or, indeed *Method 2’,* a slight modification of the
bisection rule—is the basis for *powersort*.

This method is much more similar to Timsort than to standard mergesort: it proceeds in one pass from left to right over the input, and detects the next run in the input. For each new run, we may decide to do some merges now, or we delay them and keep the runs on a to-do-stack.

While Timsort uses a rather complicated set of rules to decide what and when to merge,
powersort assigns each adjacent pair of runs an easy to compute integer,
its *“power”* and upon arrival of a new pair,
simply executes all postponed merges of higher power.

Like peeksort, powersort provably adapts optimally to existing runs (up to linear terms),
something Timsort provably does *not*.