Polars Usage¶
Important
Please note that Polars support is experimental and new in datacompy
as of v0.11.0
Overview¶
Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model. DataComPy’s implementation of Polars is very similar port of the Pandas version. There are some differences you should be aware of as they may yield slightly different results. For instance Polars does not have indexes like Pandas so that functionality is not supported.
Basic Usage¶
Usage is very similar to the Pandas implementation. Consider the following code:
from io import StringIO
import polars as pl
from datacompy import PolarsCompare
data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""
data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""
df1 = pl.read_csv(StringIO(data1))
df2 = pl.read_csv(StringIO(data2))
Compare Object Setup¶
Unlike the Pandas implementation there is only one supported method for joining your dataframes: join column(s).
compare = PolarsCompare(
df1,
df2,
join_columns='acct_id', #You can also specify a list of columns
abs_tol=0.0001,
rel_tol=0,
df1_name='original',
df2_name='new')
# OR
compare = PolarsCompare(df1, df2, join_columns=['acct_id', 'name'])
Reports¶
A report is generated by calling PolarsCompare.report()
, which returns a string.
Here is a sample report generated by datacompy
for the two tables above,
joined on acct_id
(Note: if you don’t specify df1_name
and/or df2_name
,
then any instance of “original” or “new” in the report is replaced with “df1”
and/or “df2”.):
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 original 5 6
1 new 4 5
Column Summary
--------------
Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0
Row Summary
-----------
Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0
Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 8
Columns with Unequal Values or Types
------------------------------------
Column original dtype new dtype # Unequal Max Diff # Null Diff
0 dollar_amt Float64 Float64 1 0.0500 0
1 float_fld Float64 Float64 4 0.0005 3
2 name String String 3 0.0000 0
Sample Rows with Unequal Values
-------------------------------
acct_id dollar_amt (original) dollar_amt (new)
0 10000001234 123.45 123.4
acct_id name (original) name (new)
0 10000001234 George Maharis George Michael Bluth
1 10000001237 Bob Loblaw Robert Loblaw
2 10000001238 Lucille Bluth Loose Seal Bluth
acct_id float_fld (original) float_fld (new)
0 10000001234 14530.1555 14530.155
1 10000001235 1.0000 NaN
2 10000001236 NaN 1.000
3 10000001238 NaN 111.000
Sample Rows Only in original (First 10 Columns)
-----------------------------------------------
acct_id dollar_amt name float_fld date_fld
0 10000001238 1.05 Loose Seal Bluth NaN 2017-01-01
Convenience Methods¶
There are a few convenience methods available after the comparison has been run:
print(compare.intersect_rows[['name_df1', 'name_df2', 'name_match']])
# shape: (5, 3)
# ┌────────────────┬──────────────────────┬────────────┐
# │ name_df1 ┆ name_df2 ┆ name_match │
# │ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ bool │
# ╞════════════════╪══════════════════════╪════════════╡
# │ George Maharis ┆ George Michael Bluth ┆ false │
# │ Michael Bluth ┆ Michael Bluth ┆ true │
# │ George Bluth ┆ George Bluth ┆ true │
# │ Bob Loblaw ┆ Robert Loblaw ┆ false │
# │ Lucille Bluth ┆ Loose Seal Bluth ┆ false │
# └────────────────┴──────────────────────┴────────────┘
print(compare.df1_unq_rows)
# shape: (1, 5)
# ┌─────────────┬────────────┬──────────────────┬───────────┬────────────┐
# │ acct_id ┆ dollar_amt ┆ name ┆ float_fld ┆ date_fld │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str ┆ f64 ┆ str │
# ╞═════════════╪════════════╪══════════════════╪═══════════╪════════════╡
# │ 10000001238 ┆ 1.05 ┆ Loose Seal Bluth ┆ null ┆ 2017-01-01 │
# └─────────────┴────────────┴──────────────────┴───────────┴────────────┘
print(compare.df2_unq_rows)
# shape: (0, 4)
# ┌─────────┬────────────┬──────┬───────────┐
# │ acct_id ┆ dollar_amt ┆ name ┆ float_fld │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ str ┆ f64 │
# ╞═════════╪════════════╪══════╪═══════════╡
# └─────────┴────────────┴──────┴───────────┘
print(compare.intersect_columns())
# OrderedSet(['acct_id', 'dollar_amt', 'name', 'float_fld'])
print(compare.df1_unq_columns())
# OrderedSet(['date_fld'])
print(compare.df2_unq_columns())
# OrderedSet()
Duplicate rows¶
The PolarsCompare will try and handle duplicate rows just like the Pandas version. See Duplicate rows.
Caveats and Limitations¶
Similar to Pandas the Polars version will try and be resilient to duplicate matching. See Pandas: Caveats.
As mentioned above there is no indexing like in Pandas, so you must join on
join_columns
Polars
join
method usesjoin_nulls
. This tries to match Pandas’ behaviour but there might be unexpected results due to this.The Polars documentation has a good section on migrating from Pandas to Polars and some of the things to consider. See Coming from Pandas