Benchmarks
Important

- Minimal tuning was done
- Benchmarking is hard (and biased)
- Take with a grain of salt
We've tried to create some benchmarks to help users understand which DataFrame implementation to use when. Please take the results with a grain of salt! If you have ideas on how we can further improve performance, please open an issue; we always welcome contributions.
Setup used
Single Machine:

- 16 CPUs
- 64GB RAM

Distributed Spark:

- 20 executors
- 8 cores
- 32GB RAM
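For reference, here is a minimal sketch of a PySpark session sized like the distributed setup above, reading the core and memory figures as per-executor. The exact configuration used for the benchmarks was not published, so these settings are illustrative:

```python
# Illustrative only: a SparkSession sized like the cluster described above.
# The real benchmark configuration may differ.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dataframe-benchmarks")           # hypothetical app name
    .config("spark.executor.instances", "20")  # 20 executors
    .config("spark.executor.cores", "8")       # 8 cores per executor
    .config("spark.executor.memory", "32g")    # 32GB RAM per executor
    .getOrCreate()
)
```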
The Data
The data (base and compare) we generated was purely synthetic, consisting of 10 columns:

- 1 id column (monotonically increasing) used for joining
- 3 string columns
- 6 numeric columns
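As a rough illustration of this shape (not the actual generator used for the benchmarks, which was not published), a pandas snippet producing such data might look like:

```python
# Hypothetical generator matching the described schema:
# 1 monotonically increasing id, 3 string columns, 6 numeric columns.
import numpy as np
import pandas as pd

def make_synthetic(n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({"id": np.arange(n_rows)})  # monotonically increasing join key
    for i in range(3):  # 3 string columns
        df[f"str_{i}"] = rng.choice(["a", "b", "c"], size=n_rows)
    for i in range(6):  # 6 numeric columns
        df[f"num_{i}"] = rng.random(n_rows)
    return df

base = make_synthetic(1_000)
compare = make_synthetic(1_000, seed=1)
```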
Table of mean benchmark times in seconds:

| Number of rows | pandas | polars | spark sql (distributed) | pandas on spark (distributed) | spark (fugue) (single) | spark (fugue) (distributed) |
|---|---|---|---|---|---|---|
| 1,000 | 0.025 | 0.025 | 15.3112 | 15.2838 | 2.041 | 1.109 |
| 100,000 | 0.196 | 0.120 | 15.0701 | 11.1113 | 1.743 | 3.175 |
| 10,000,000 | 18.804 | 11.330 | 18.2763 | 20.6274 | 17.560 | 16.455 |
| 50,000,000 | 96.494 | 62.827 | 31.1257 | 57.5735 | 90.578 | 94.304 |
| 100,000,000 | DNR | 127.194 | 47.2185 | 96.3204 | DNR | 193.234 |
| 500,000,000 | DNR | DNR | 130.9814 | 262.6094 | DNR | DNR |
Note
DNR = Did not run
TLDR
- Polars can handle a lot of data and is fast! In our experiments, on a 64GB machine it was able to process 100 million records.
- The Pandas on Spark implementation is slower for small to medium-sized data, but in the 100-million-plus range it starts to shine, and thanks to its distributed nature it can process vast amounts of data.
- The Spark SQL implementation appears to be the most performant on very large datasets, making the Pandas on Spark implementation obsolete moving forward.
- The native Pandas version is best for small and medium data.
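Since the fugue runs above execute the same code on different engines, one function can be tested locally on pandas and then scaled out to Spark. The sketch below uses fugue's public transform API with a hypothetical per-partition function (the actual benchmark code was not published); it assumes the `base` DataFrame from the earlier generator sketch:

```python
# Illustrative pattern only: benchmark a single function across engines.
import pandas as pd
from fugue import transform

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical logic: sum the six numeric columns into a new column.
    df["total"] = df[[f"num_{i}" for i in range(6)]].sum(axis=1)
    return df

# Run on native pandas (no engine specified) ...
local_result = transform(base, add_total, schema="*,total:double")

# ... or distribute the same function by switching the engine to Spark.
spark_result = transform(base, add_total, schema="*,total:double", engine="spark")
```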