Polars Usage

Important

Please note that Polars support is experimental and new in datacompy as of v0.11.0

Overview

Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model. DataComPy’s implementation of Polars is very similar port of the Pandas version. There are some differences you should be aware of as they may yield slightly different results. For instance Polars does not have indexes like Pandas so that functionality is not supported.

Basic Usage

Usage is very similar to the Pandas implementation. Consider the following code:

from io import StringIO
import polars as pl
from datacompy import PolarsCompare

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pl.read_csv(StringIO(data1))
df2 = pl.read_csv(StringIO(data2))

Compare Object Setup

Unlike the Pandas implementation there is only one supported method for joining your dataframes: join column(s).

compare = PolarsCompare(
    df1,
    df2,
    join_columns='acct_id',  #You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')

# OR

compare = PolarsCompare(df1, df2, join_columns=['acct_id', 'name'])

Reports

A report is generated by calling PolarsCompare.report(), which returns a string. Here is a sample report generated by datacompy for the two tables above, joined on acct_id (Note: if you don’t specify df1_name and/or df2_name, then any instance of “original” or “new” in the report is replaced with “df1” and/or “df2”.):

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 8

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        Float64   Float64          1    0.0500            0
1   float_fld        Float64   Float64          4    0.0005            3
2        name         String    String          3    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id name (original)            name (new)
0  10000001234  George Maharis  George Michael Bluth
1  10000001237      Bob Loblaw         Robert Loblaw
2  10000001238   Lucille Bluth      Loose Seal Bluth

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
1  10000001235                1.0000              NaN
2  10000001236                   NaN            1.000
3  10000001238                   NaN          111.000

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt              name  float_fld    date_fld
0  10000001238        1.05  Loose Seal Bluth        NaN  2017-01-01

Convenience Methods

There are a few convenience methods available after the comparison has been run:

print(compare.intersect_rows[['name_df1', 'name_df2', 'name_match']])
# shape: (5, 3)
# ┌────────────────┬──────────────────────┬────────────┐
# │ name_df1       ┆ name_df2             ┆ name_match │
# │ ---            ┆ ---                  ┆ ---        │
# │ str            ┆ str                  ┆ bool       │
# ╞════════════════╪══════════════════════╪════════════╡
# │ George Maharis ┆ George Michael Bluth ┆ false      │
# │ Michael Bluth  ┆ Michael Bluth        ┆ true       │
# │ George Bluth   ┆ George Bluth         ┆ true       │
# │ Bob Loblaw     ┆ Robert Loblaw        ┆ false      │
# │ Lucille Bluth  ┆ Loose Seal Bluth     ┆ false      │
# └────────────────┴──────────────────────┴────────────┘

print(compare.df1_unq_rows)
# shape: (1, 5)
# ┌─────────────┬────────────┬──────────────────┬───────────┬────────────┐
# │ acct_id     ┆ dollar_amt ┆ name             ┆ float_fld ┆ date_fld   │
# │ ---         ┆ ---        ┆ ---              ┆ ---       ┆ ---        │
# │ i64         ┆ f64        ┆ str              ┆ f64       ┆ str        │
# ╞═════════════╪════════════╪══════════════════╪═══════════╪════════════╡
# │ 10000001238 ┆ 1.05       ┆ Loose Seal Bluth ┆ null      ┆ 2017-01-01 │
# └─────────────┴────────────┴──────────────────┴───────────┴────────────┘

print(compare.df2_unq_rows)
# shape: (0, 4)
# ┌─────────┬────────────┬──────┬───────────┐
# │ acct_id ┆ dollar_amt ┆ name ┆ float_fld │
# │ ---     ┆ ---        ┆ ---  ┆ ---       │
# │ i64     ┆ f64        ┆ str  ┆ f64       │
# ╞═════════╪════════════╪══════╪═══════════╡
# └─────────┴────────────┴──────┴───────────┘

print(compare.intersect_columns())
# OrderedSet(['acct_id', 'dollar_amt', 'name', 'float_fld'])

print(compare.df1_unq_columns())
# OrderedSet(['date_fld'])

print(compare.df2_unq_columns())
# OrderedSet()

Duplicate rows

The PolarsCompare will try and handle duplicate rows just like the Pandas version. See Duplicate rows.

Caveats and Limitations

  • Similar to Pandas the Polars version will try and be resilient to duplicate matching. See Pandas: Caveats.

  • As mentioned above there is no indexing like in Pandas, so you must join on join_columns

  • Polars join method uses join_nulls. This tries to match Pandas’ behaviour but there might be unexpected results due to this.

  • The Polars documentation has a good section on migrating from Pandas to Polars and some of the things to consider. See Coming from Pandas