If you consider contributing sorted lists from your own Python application to our benchmark for adaptive sorting, the steps below show you how to do collect this data. Note: Our instrumentation stores a list of integers with equivalent comparison-behavior to all lists sorted when running Python code through our custom CPython.

Background

The goal of the benchmark is to collect real-world data from Python applications to better understand the effectiveness of adaptive features in the list sort functions. In my PyCon US 2023 talk, I reached out to Pythonistas to contribute their sorting inputs. If sorted lists were completely random data, we would never see (significant) improvements from these, but data hardly is very random.
How much pre-sortedness is there in your use case? Let’s find out!

Step 1: Build instrumented CPython

Clone the instrumented branch of CPython; currently we have support for 3.11 or 3.10. (If we dearly need another version, drop me a line and we can add it.)

git clone https://github.com/sebawild/cpython --branch 3.11-instrumented --single-branch cpython-sorting
cd cpython-sorting

The steps below assume linux and a set up development environment; check the official instructions). For a core installation only standard C build tools are needed, plus OpenSSL headers. (On Ubuntu, you get the latter via sudo apt-get install libssl-dev).

./configure --enable-optimizations && make -j
make test 

Step 2: Set up your project

First, we create a venv (a virtual environment to keep installed package local). Inside cpython-sorting, call

./python -m venv sorting-python
source sorting-python/bin/activate

to create and activate the sorting-python venv. Now you can use pip in the usual way to install any needed packages.

Step 3: Run your application and submit arrays.txt

You run your application as normal: python your-awesome-script.py.

To collect the benchmark data, first delete arrays.txt (results are otherwise appended) and run your application. Then store arrays.txt and send it over, with a quick description of your application.

Afterwards arrays.txt will contain all sorted lists (and some stats). Note that even during the process of starting python, a few dozen calls to list sort are made (mostly on tiny lists); for the benchmark, we are mostly interested in big lists.

A rudimentary script to read an arrays.txt file and compute some presortedness metrics is implemented in run-information.py. Simply running python run-information.py (in the same folder) will print stats on the longest sorted list (by default). This is sufficient to check whether your application sorted substantially long list at all. If so, please send your arrays.txt to me.

Limitations

The instrumentation is a quick hack at this point, not production-ready code. It is hence best to run code via our python in a sandbox environment.

Known limitations:

  • The output arrays.txt is appended each time you run python and it could grow large.
  • Our instrumentation is not ready for multi-threading. The instrumentation may crash python in obscure scenarios such as comparison functions that modify the sorted list.