Benchmarks
==========

.. important::

    - Minimal tuning was done
    - Benchmarking is hard (and biased)
    - Take with a grain of salt

    
We created these benchmarks to help users decide which DataFrame implementation to use and when.
Please take the results with a grain of salt! If you have ideas on how we can further improve
performance, please open an issue; we always welcome contributions.


Setup used
----------

Single Machine:

- 16 CPUs
- 64GB RAM

Distributed Spark:

- 20 Executors
- 8 Cores
- 32GB RAM

The Data
---------

The data (base and compare) was purely synthetic and consisted of 10 columns:

- 1 id (monotonically increasing) column used for joining
- 3 string columns
- 6 numeric columns
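
As a rough illustration of the layout above, the sketch below builds two synthetic datasets with the same shape. It is not the generator used for these benchmarks: the function name ``make_columns``, the column names, the string length, and the value distributions are all assumptions made for the example.

.. code-block:: python

    import random
    import string


    def make_columns(n_rows, seed=0):
        """Return a dict of 10 equal-length columns: 1 id, 3 string, 6 numeric.

        Hypothetical helper; column names and value ranges are illustrative only.
        """
        rng = random.Random(seed)
        # Monotonically increasing id column, used as the join key.
        columns = {"id": list(range(n_rows))}
        for i in range(3):
            # Random 8-character lowercase strings.
            columns[f"str_{i}"] = [
                "".join(rng.choices(string.ascii_lowercase, k=8))
                for _ in range(n_rows)
            ]
        for i in range(6):
            # Uniform floats in [0, 1).
            columns[f"num_{i}"] = [rng.random() for _ in range(n_rows)]
        return columns


    # Different seeds give base and compare the same ids but different values.
    base = make_columns(1_000, seed=0)
    compare = make_columns(1_000, seed=1)

A plain dict of lists keeps the sketch independent of any one library; both ``pandas.DataFrame(base)`` and ``polars.DataFrame(base)`` accept this shape.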


Table of mean benchmark times in seconds:

===========   =======   =======    ===============   ===============   ===============   =============== 
Number of     pandas    polars     spark sql         pandas on spark   spark (fugue)     spark (fugue)   
rows                               (distributed)     (distributed)     (single)          (distributed)   
===========   =======   =======    ===============   ===============   ===============   =============== 
1,000         0.025     0.025      15.3112           15.2838           2.041             1.109
100,000       0.196     0.120      15.0701           11.1113           1.743             3.175           
10,000,000    18.804    11.330     18.2763           20.6274           17.560            16.455          
50,000,000    96.494    62.827     31.1257           57.5735           90.578            94.304          
100,000,000   DNR       127.194    47.2185           96.3204           DNR               193.234         
500,000,000   DNR       DNR        130.9814          262.6094          DNR               DNR             
===========   =======   =======    ===============   ===============   ===============   =============== 

.. note:: DNR = Did not run

.. image:: img/benchmarks.png
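
For readers who want to run similar measurements, the sketch below shows one way a mean wall-clock time could be collected. It is an assumption about methodology, not the harness used for the table above: ``mean_runtime`` is a hypothetical helper, the repeat count is arbitrary, and the workload is a stand-in for the comparison being measured.

.. code-block:: python

    import statistics
    import time


    def mean_runtime(func, repeats=5):
        """Call ``func`` ``repeats`` times and return the mean wall-clock seconds."""
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            func()
            timings.append(time.perf_counter() - start)
        return statistics.mean(timings)


    # Stand-in workload; a real run would call the comparison under test here.
    elapsed = mean_runtime(lambda: sum(range(100_000)), repeats=3)
    print(f"mean: {elapsed:.4f}s")

``time.perf_counter`` is used because it is a monotonic, high-resolution clock meant for measuring intervals. For Spark, note that evaluation is lazy, so the timed callable would need to trigger an action for the measurement to be meaningful.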

TLDR
----

* Polars can handle a lot of data and is fast!

  * In our experiments, on a 64GB machine it was able to process 100 million records.

* The Pandas on Spark implementation is slower for small to medium-sized data.

  * In the 100 million+ range it starts to shine, and due to its distributed
    nature it can process vast amounts of data.

* The Spark SQL implementation seems to be the most performant on very large datasets.

  * It makes the Pandas on Spark implementation obsolete moving forward.

* The native Pandas version performs well on small and medium data, although Polars
  matched or beat it at every size that both completed.