Polars Usage
============

.. important::

    Please note that Polars support is experimental and new in ``datacompy`` as of v0.11.0.

Overview
--------

`Polars <https://pola.rs/>`_ is a DataFrame interface on top of an OLAP query engine implemented in Rust, using the Apache Arrow columnar format as its memory model.

DataComPy's Polars implementation is a very similar port of the Pandas version. There are some differences you should be aware of, as they may yield slightly different results. For instance, Polars does not have indexes like Pandas, so that functionality is not supported.

Basic Usage
-----------

Usage is very similar to the Pandas implementation. Consider the following code:

.. code-block:: python

    from io import StringIO

    import polars as pl

    from datacompy import PolarsCompare

    data1 = """acct_id,dollar_amt,name,float_fld,date_fld
    10000001234,123.45,George Maharis,14530.1555,2017-01-01
    10000001235,0.45,Michael Bluth,1,2017-01-01
    10000001236,1345,George Bluth,,2017-01-01
    10000001237,123456,Bob Loblaw,345.12,2017-01-01
    10000001238,1.05,Lucille Bluth,,2017-01-01
    10000001238,1.05,Loose Seal Bluth,,2017-01-01
    """

    data2 = """acct_id,dollar_amt,name,float_fld
    10000001234,123.4,George Michael Bluth,14530.155
    10000001235,0.45,Michael Bluth,
    10000001236,1345,George Bluth,1
    10000001237,123456,Robert Loblaw,345.12
    10000001238,1.05,Loose Seal Bluth,111
    """

    df1 = pl.read_csv(StringIO(data1))
    df2 = pl.read_csv(StringIO(data2))

Compare Object Setup
--------------------

Unlike the Pandas implementation, there is only one supported method for joining your dataframes: join column(s).

.. code-block:: python

    compare = PolarsCompare(
        df1,
        df2,
        join_columns='acct_id',  # You can also specify a list of columns
        abs_tol=0.0001,
        rel_tol=0,
        df1_name='original',
        df2_name='new')

    # OR

    compare = PolarsCompare(df1, df2, join_columns=['acct_id', 'name'])

Reports
-------

A report is generated by calling ``PolarsCompare.report()``, which returns a string.
Here is a sample report generated by ``datacompy`` for the two tables above, joined on ``acct_id`` (note: if you don't specify ``df1_name`` and/or ``df2_name``, then any instance of "original" or "new" in the report is replaced with "df1" and/or "df2")::

    DataComPy Comparison
    --------------------

    DataFrame Summary
    -----------------

      DataFrame  Columns  Rows
    0  original        5     6
    1       new        4     5

    Column Summary
    --------------

    Number of columns in common: 4
    Number of columns in original but not in new: 1
    Number of columns in new but not in original: 0

    Row Summary
    -----------

    Matched on: acct_id
    Any duplicates on match values: Yes
    Absolute Tolerance: 0.0001
    Relative Tolerance: 0
    Number of rows in common: 5
    Number of rows in original but not in new: 1
    Number of rows in new but not in original: 0

    Number of rows with some compared columns unequal: 5
    Number of rows with all compared columns equal: 0

    Column Comparison
    -----------------

    Number of columns compared with some values unequal: 3
    Number of columns compared with all values equal: 1
    Total number of values which compare unequal: 8

    Columns with Unequal Values or Types
    ------------------------------------

           Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
    0  dollar_amt        Float64   Float64          1    0.0500            0
    1   float_fld        Float64   Float64          4    0.0005            3
    2        name         String    String          3    0.0000            0

    Sample Rows with Unequal Values
    -------------------------------

           acct_id  dollar_amt (original)  dollar_amt (new)
    0  10000001234                 123.45             123.4

           acct_id name (original)            name (new)
    0  10000001234  George Maharis  George Michael Bluth
    1  10000001237      Bob Loblaw         Robert Loblaw
    2  10000001238   Lucille Bluth      Loose Seal Bluth

           acct_id  float_fld (original)  float_fld (new)
    0  10000001234            14530.1555        14530.155
    1  10000001235                1.0000              NaN
    2  10000001236                   NaN            1.000
    3  10000001238                   NaN          111.000

    Sample Rows Only in original (First 10 Columns)
    -----------------------------------------------

           acct_id  dollar_amt              name  float_fld    date_fld
    0  10000001238        1.05  Loose Seal Bluth        NaN  2017-01-01

Convenience Methods
-------------------

There are a few convenience methods available after the comparison has been run:

.. code-block:: python

    print(compare.intersect_rows[['name_df1', 'name_df2', 'name_match']])
    # shape: (5, 3)
    # ┌────────────────┬──────────────────────┬────────────┐
    # │ name_df1       ┆ name_df2             ┆ name_match │
    # │ ---            ┆ ---                  ┆ ---        │
    # │ str            ┆ str                  ┆ bool       │
    # ╞════════════════╪══════════════════════╪════════════╡
    # │ George Maharis ┆ George Michael Bluth ┆ false      │
    # │ Michael Bluth  ┆ Michael Bluth        ┆ true       │
    # │ George Bluth   ┆ George Bluth         ┆ true       │
    # │ Bob Loblaw     ┆ Robert Loblaw        ┆ false      │
    # │ Lucille Bluth  ┆ Loose Seal Bluth     ┆ false      │
    # └────────────────┴──────────────────────┴────────────┘

    print(compare.df1_unq_rows)
    # shape: (1, 5)
    # ┌─────────────┬────────────┬──────────────────┬───────────┬────────────┐
    # │ acct_id     ┆ dollar_amt ┆ name             ┆ float_fld ┆ date_fld   │
    # │ ---         ┆ ---        ┆ ---              ┆ ---       ┆ ---        │
    # │ i64         ┆ f64        ┆ str              ┆ f64       ┆ str        │
    # ╞═════════════╪════════════╪══════════════════╪═══════════╪════════════╡
    # │ 10000001238 ┆ 1.05       ┆ Loose Seal Bluth ┆ null      ┆ 2017-01-01 │
    # └─────────────┴────────────┴──────────────────┴───────────┴────────────┘

    print(compare.df2_unq_rows)
    # shape: (0, 4)
    # ┌─────────┬────────────┬──────┬───────────┐
    # │ acct_id ┆ dollar_amt ┆ name ┆ float_fld │
    # │ ---     ┆ ---        ┆ ---  ┆ ---       │
    # │ i64     ┆ f64        ┆ str  ┆ f64       │
    # ╞═════════╪════════════╪══════╪═══════════╡
    # └─────────┴────────────┴──────┴───────────┘

    print(compare.intersect_columns())
    # OrderedSet(['acct_id', 'dollar_amt', 'name', 'float_fld'])

    print(compare.df1_unq_columns())
    # OrderedSet(['date_fld'])

    print(compare.df2_unq_columns())
    # OrderedSet()

Duplicate rows
--------------

``PolarsCompare`` will try to handle duplicate rows just like the Pandas version. See :ref:`pandas-dup-rows`.

Caveats and Limitations
-----------------------

- Similar to Pandas, the Polars version will try to be resilient to duplicate matching. See Pandas: :ref:`pandas-caveats`.
- As mentioned above, there is no indexing like in Pandas, so you must join on ``join_columns``.
- Polars' ``join`` method uses ``join_nulls``. This tries to match Pandas' behaviour, but there might be unexpected results due to this.
- The Polars documentation has a good section on migrating from Pandas to Polars and some of the things to consider. See `Coming from Pandas <https://docs.pola.rs/user-guide/migration/pandas/>`_.