How to contribute your inputs to the Adaptive Sorting Benchmark
If you are considering contributing sorted lists from your own Python application to our benchmark for adaptive sorting, the steps below show you how to collect this data. Note: For every list that is sorted while running Python code through our custom CPython, our instrumentation stores a list of integers with equivalent comparison behavior.
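For example (this is our reading of that note, not the exact encoding the instrumentation uses): a list of strings can be recorded as a list of integers that triggers the same sequence of comparison outcomes when sorted.
# Illustration only: the integers mirror the relative order of the strings,
# so sorting either list involves the same comparison outcomes.
strings = ["pear", "apple", "plum", "banana"]
integers = [2, 0, 3, 1]  # rank of each string among the four
assert sorted(range(4), key=strings.__getitem__) == sorted(range(4), key=integers.__getitem__)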
Background
The goal of the benchmark is to collect real-world data from Python applications
to better understand how effective the adaptive features of Python's list sorting functions are in practice.
In my PyCon US 2023 talk, I reached out to Pythonistas, asking them to contribute their sorting inputs.
If the lists being sorted were completely random data, we would never see (significant) improvements
from these adaptive features, but real data is hardly ever random.
How much pre-sortedness is there in your use case? Let’s find out!
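To make "pre-sortedness" concrete: one simple measure is the number of maximal ascending runs in the input. CPython's list sort (Timsort) detects such runs and merges them, so fewer runs means less work. A small sketch of this measure (ours, purely for illustration):
def count_runs(xs):
    # Number of maximal weakly ascending runs; 1 means the list is already sorted.
    if not xs:
        return 0
    runs = 1
    for a, b in zip(xs, xs[1:]):
        if b < a:  # a descent starts a new run
            runs += 1
    return runs

count_runs([1, 2, 3, 7, 8, 2, 5])  # -> 2 runs: [1, 2, 3, 7, 8] and [2, 5]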
Step 1: Build instrumented CPython
Clone the instrumented branch of CPython; currently we support 3.11 and 3.10. (If you dearly need another version, drop me a line and we can add it.)
git clone https://github.com/sebawild/cpython --branch 3.11-instrumented --single-branch cpython-sorting
cd cpython-sorting
The steps below assume Linux and an already set-up development environment;
check the official instructions.
For a core installation, only standard C build tools are needed, plus the OpenSSL headers.
(On Ubuntu, you get the latter via sudo apt-get install libssl-dev.)
./configure --enable-optimizations && make -j
make test
Step 2: Set up your project
First, we create a venv (a virtual environment, to keep installed packages local).
Inside cpython-sorting, call
./python -m venv sorting-python
source sorting-python/bin/activate
to create and activate the sorting-python venv.
Now you can use pip in the usual way to install any needed packages.
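For example (numpy is just a placeholder here for whatever packages your project actually needs):
pip install numpy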
Step 3: Run your application and submit arrays.txt
You run your application as normal: python your-awesome-script.py.
To collect the benchmark data, first delete arrays.txt (results are otherwise appended) and run your application. Afterwards, arrays.txt will contain all sorted lists (and some stats). Then store arrays.txt and send it over, together with a quick description of your application.
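Putting it together, a typical collection run could look like this (with your-awesome-script.py standing in for your own entry point):
rm -f arrays.txt
python your-awesome-script.py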
Note that even during the process of starting python, a few dozen calls to list sort
are made (mostly on tiny lists); for the benchmark, we are mostly interested in
big lists.
A rudimentary script to read an arrays.txt file and compute some presortedness metrics is implemented in run-information.py. Simply running python run-information.py (in the same folder) will print stats on the longest sorted list (by default). This is sufficient to check whether your application sorted any substantially long lists at all. If so, please send your arrays.txt to me.
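If you want to inspect the data yourself, a minimal sketch along these lines may work; it assumes, purely for illustration, that each line of arrays.txt holds one whitespace-separated list of integers. The actual format may differ, so treat run-information.py as the reference.
def longest_list(path="arrays.txt"):
    # Hypothetical reader; the real arrays.txt format may differ.
    best = []
    with open(path) as f:
        for line in f:
            try:
                xs = [int(tok) for tok in line.split()]
            except ValueError:
                continue  # skip any stats or comment lines
            if len(xs) > len(best):
                best = xs
    return best

print(len(longest_list()), "elements in the longest recorded list")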
Limitations
The instrumentation is a quick hack at this point, not production-ready code. It is hence best to run code via our Python build in a sandboxed environment.
Known limitations:
- The output arrays.txt is appended to each time you run python, so it can grow large.
- Our instrumentation is not ready for multi-threading.
- The instrumentation may crash Python in obscure scenarios, such as comparison functions that modify the list being sorted.