Home

Interoperating Rust with polars

Saturday, October 14 2023

Interoperating Rust with Polars

Polars is the hot new thing in data processing. It’s a dataframe library that, for many reasons — including because it’s written in Rust — is ultra-fast. It’s a fantastic improvement on Pandas.

It’s so good that unfortunately it’s incredibly noticeable if you need to step out of the Polars API and do some kind of processing that requires a custom function in Python. It’s intolerably slower.

Normally you’d written a Python function and call it on a dataframe using .map_elements1. Then you can step back, and go get coffee. Hell, get lunch too because you’ll be waiting a while. Your other option is to write your function in Rust, compile it to a native Python package and invoke it like any other Rust expression. The downside being that it’ll be so fast you won’t have time to get a coffee.

The contributors to Polars have made it super easy to define functions in Rust to interoperate (interop) with Polars, replete with examples2. This blog post is a more verbose explanation of that repository.

We’re going to walk through defining a Polars interop function in Rust to calculate the Levenshtein distance. This is a metric to determine how similar two strings are, which we use in Swim-free3 to determine the closest match for a location that users search for.

The setup

Python

requirements.txt contains our two Python depdencies: polars and maturin. Maturin is used to build the Rust package and output it as a Python package

pyproject.toml defines our Python package

We write the Python-binding code in ./levenshtein_lib/__init__.py. An __init__.py file tells Python to consider that directory a module, the code within __init__.py is executed when the module is imported.

Rust

Cargo.toml, specifying all the settings and dependencies for the Rust project. The dependencies we have are polars, polars-plan, pyo3 and pyo3-polars. PyO3 defines Rust bindings for Python, letting you write a Python module in Rust.

The src/ directory contains our Rust code.

lib.rs indicating to Rust that we’re writing a library.

expressions.rs defining our polars expressions and Levenshtein algorithm

Writing the Rust code

I’ll skip over this bit as it’s the most well-covered by other great resources. Only things to note here is that:

  1. I use dynamic programming to implement the Levenshtein distance metric. I have no idea if that’s optimal.
  2. I use the .nth method which I know is very suboptimal, but this is a demo. All to say that this implementation is naive and very unoptimised.
  3. Polars handles the query optimisation and parallelistion for you. Fantastic.

Building the Rust code as a native Python package

Maturin handles everything for us. We can invoke

> maturin develop --release -m levenshtein_lib/Cargo.toml

To compile a release version of the code.

Calling it from Python

Invoke the run.py script from within a virtual environment:

> python run.py

The API is a little unfortunate here in that we define a class but the name used in Polars is set in the decorator applied on the class. In our case we name the class LevenshteinDistance but register the expression as dist. The class is imported into our Python script like from expression_lib import LevenshteinDistance but we invoke the method with pl.col("places").dist.levenshtein("hull")). Easy to make mistakes, so best to be careful

‘Benchmarks’

To ‘benchmark’ this implementation I’ve compared the method against what we currently have in Swim-free, which is invoking the Levenshtein distance method in the rapidfuzz library. Rapidfuzz is a ‘highly optimised’ set of implementations written in C++. So in reality we’re testing my naive implementation against a fantastic implementation that has to operate element-wise, but what I’m trying to demonstrate is a quick mock-up against what you’d normally write after googling ‘Levenshtein polars’. Full script here.

Benchmark
Comparing the Rust function against the rapidfuzz library invoked using `.map_elements`.

I have no idea why it’s slower for few elements, but look at that improvement on the large dataframe!

Conclusions

I hope this was helpful. Please email me if you have any questions, or want to send me a Levenshtein algorithm in Rust that’s actually optimised.


  1. .apply is the deprecated keyword for .map_elements ↩︎

  2. This interface changed a month ago and looks a little rough around the edges in some places so I’d guess it’ll still change a fair bit. Polars improvements come in hot and fast, mostly for the better! ↩︎

  3. Check out the matching blog post. Using the Levenshtein distance for user input does lead to the unfortunate Easter egg that ‘Hull’ routes to the town of ‘Dull’ in Wales. The full name for Hull is ‘Kingston upon Hull’ and it’s a very underrated town. ↩︎