Benchmarks

Important

  • Minimal tuning was done

  • Benchmarking is hard (and biased)

  • Take with a grain of salt

We’ve put together some benchmarks to help users understand which DataFrame implementation to use and when. Please take the results with a grain of salt! If you have ideas on how we can further improve performance, please open an issue; we always welcome contributions.
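
For a rough sense of how a single timing can be taken, here is a minimal sketch (this is not the exact harness behind the numbers below; it assumes the pandas datacompy.Compare API and uses toy stand-in data):

```python
import time

import pandas as pd
import datacompy

# Toy stand-in frames; the real runs used the synthetic base/compare
# data described under "The Data" below.
base = pd.DataFrame({"id": range(1_000), "value": range(1_000)})
compare = base.copy()

start = time.perf_counter()
comparison = datacompy.Compare(base, compare, join_columns="id")  # comparison runs on construction
report = comparison.report()  # build the text report as well
print(f"single-run time: {time.perf_counter() - start:.3f}s")
```

The table below reports mean times over repeated runs of this kind of comparison.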

Setup used

Single Machine:

  • 16 CPUs

  • 64GB RAM

Distributed Spark (see the configuration sketch after this list):

  • 20 Executors

  • 8 Cores

  • 32GB RAM
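
For the distributed runs, a setup like the one above could be expressed through SparkSession options roughly as follows (a sketch only; the actual cluster submission settings are not shown on this page, and the app name is hypothetical):

```python
from pyspark.sql import SparkSession

# Sketch of the distributed setup listed above:
# 20 executors, each with 8 cores and 32GB of RAM.
spark = (
    SparkSession.builder
    .appName("datacompy-benchmarks")  # hypothetical app name
    .config("spark.executor.instances", "20")
    .config("spark.executor.cores", "8")
    .config("spark.executor.memory", "32g")
    .getOrCreate()
)
```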

The Data

The data we generated (both base and compare) was purely synthetic, consisting of 10 columns (a generation sketch follows this list):

  • 1 id column (monotonically increasing) used for joining

  • 3 string columns

  • 6 numeric columns
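
A minimal sketch of generating frames with this shape (the column names and value distributions here are illustrative assumptions, not the exact generator we used):

```python
import numpy as np
import pandas as pd

def make_synthetic(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """1 monotonically increasing id, 3 string columns, 6 numeric columns."""
    rng = np.random.default_rng(seed)
    data = {"id": np.arange(n_rows)}  # monotonically increasing join key
    for i in range(3):
        data[f"str_{i}"] = rng.choice(["foo", "bar", "baz"], size=n_rows)
    for i in range(6):
        data[f"num_{i}"] = rng.normal(size=n_rows)
    return pd.DataFrame(data)

# Different seeds give a base and a compare frame that share ids
# but differ in their string and numeric values.
base = make_synthetic(100_000, seed=0)
compare = make_synthetic(100_000, seed=1)
```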

Table of mean benchmark times in seconds:

Number of rows   pandas   polars      spark sql  pandas on spark  spark (fugue)  spark (fugue)
                                  (distributed)    (distributed)       (single)  (distributed)
1,000             0.025    0.025        15.3112          15.2838          2.041          1.109
100,000           0.196    0.120        15.0701          11.1113          1.743          3.175
10,000,000       18.804   11.330        18.2763          20.6274         17.560         16.455
50,000,000       96.494   62.827        31.1257          57.5735         90.578         94.304
100,000,000         DNR  127.194        47.2185          96.3204            DNR        193.234
500,000,000         DNR      DNR       130.9814         262.6094            DNR            DNR

Note

DNR = Did not run

[Figure: benchmarks.png, a plot of the mean benchmark times above]

TLDR

  • Polars can handle a lot of data and is fast!

    • In our experiments, on a 64GB machine it was able to process 100 million records

  • The Pandas on Spark implementation is slower for small to medium data.

    • In the 100 million+ range it starts to shine, and due to its distributed nature it can process vast amounts of data

  • The Spark SQL implementation seems to be the most performant on very large datasets

    • It makes the Pandas on Spark implementation obsolete moving forward.

  • The native Pandas version is best for small and medium data