Benchmarking with OSM data: DuckDB vs Rust vs Go

Posted on Jan 20, 2024

Lately, I’ve been curious about how DuckDB, Go and Rust would perform when processing OSM data. I’m interested in OSM data mainly because it offers everything from small and medium-sized datasets all the way up to colossal, planet-scale ones. So here it goes!

Test 1: count features

This test consisted of opening the osm.pbf file for the given area and counting the number of features. The code used to run these tests is available here.

| area | features | DuckDB | Go | Rust |
|---|---:|---:|---:|---:|
| Andalucía | 17,189,833 | 0.42 | 1.05 | 0.43 |
| Spain | 162,478,451 | 3.77 | 9.27 | 3.74 |
| Europe | 3,695,049,613 | 94.63 | 218.50 | 95.96 |
| Planet | 9,800,308,007 | 134.61 | 700.39 | 240.66 |

Note: times in seconds

Test 2: average distance

This test built upon the first one and calculated the average distance between health centres and pharmacies in a given area. The code used to run these tests is available here.

| area | features | DuckDB | Go | Rust |
|---|---:|---:|---:|---:|
| Andalucía | 17,189,833 | 0.90 | 0.98 | 0.55 |
| Spain | 162,478,451 | 8.40 | 8.84 | 4.49 |
| Europe | 3,695,049,613 | 179.70 | 286.45 | 108.05 |
| Planet | 9,800,308,007 | 220.47 | 3808.29 | 264.17 |

Note: times in seconds

Results

DuckDB: on average, DuckDB was marginally slower than Rust, but it showed impressive results at planet scale. My only issue with DuckDB is that the ST_Distance function wasn’t working (details here), so I had to implement the haversine formula manually. Still, DuckDB’s performance was very good: with a few lines of code you get really powerful results… and it’s serverless 🙏

Go: while Go may not match the raw speed of its counterparts, it deserves recognition for its capabilities. Processing a substantial 77 GB osm.pbf file and computing distances between every single pharmacy and hospital worldwide in about an hour is quite remarkable. Go’s strength lies in its simplicity and uncomplicated nature, which makes for an enjoyable coding experience. For instance, the tests I developed for this benchmark were written in less than an hour! 🤘🏼

Rust: sheer performance… with a caveat. The benchmarking code took me significantly longer to write than with DuckDB or Go. However, Rust is fast - like really fast. It processed the Europe extract (3.7 billion features!) in under 3 minutes, which is truly impressive 🔥. I also really liked its expressiveness as a language and its safety guarantees.

My thoughts

I am going to ruin everything by saying that none of the three languages tested is better than the others. Each has an important role to play depending on the task. For instance, I’d use DuckDB together with Python to speed up mundane tasks, though there are limits to what SQL can express. Then there are Go and Rust, both powerful in their own right: Go sacrifices some performance for the sake of simplicity, whereas Rust makes no compromises (fight the borrow checker!) but rewards you with performance, expressiveness and safety.

For a small company where sheer speed isn’t the primary concern, Go is probably the best choice of the three. Prototyping in Go reminds me of Python (fast), and I have barely noticed its concise syntax. It is noticeably slower than Rust, but if you don’t work with big data or need low latency, a few extra seconds may not be a big deal. The only real negative with Go is that it has just one mature geospatial library - Orb - and if your geospatial needs go beyond it, you may have to figure out a custom solution.

Enjoy!