Polars Usage
============

.. important::

    Please note that Polars support is experimental and new in ``datacompy`` as of v0.11.0.

Overview
--------

`Polars <https://pola.rs/>`_ is a DataFrame interface on top of an OLAP query engine implemented in Rust, using the Apache Arrow columnar format as its memory model.

DataComPy's Polars implementation is a very similar port of the Pandas version. There are some differences you should be aware of, as they may yield slightly different results. For instance, Polars does not have indexes like Pandas, so that functionality is not supported.

Basic Usage
-----------

Usage is very similar to the Pandas implementation. Consider the following code:

.. code-block:: python

    from io import StringIO

    import polars as pl

    from datacompy import PolarsCompare

    data1 = """acct_id,dollar_amt,name,float_fld,date_fld
    10000001234,123.45,George Maharis,14530.1555,2017-01-01
    10000001235,0.45,Michael Bluth,1,2017-01-01
    10000001236,1345,George Bluth,,2017-01-01
    10000001237,123456,Bob Loblaw,345.12,2017-01-01
    10000001238,1.05,Lucille Bluth,,2017-01-01
    10000001238,1.05,Loose Seal Bluth,,2017-01-01
    """

    data2 = """acct_id,dollar_amt,name,float_fld
    10000001234,123.4,George Michael Bluth,14530.155
    10000001235,0.45,Michael Bluth,
    10000001236,1345,George Bluth,1
    10000001237,123456,Robert Loblaw,345.12
    10000001238,1.05,Loose Seal Bluth,111
    """

    df1 = pl.read_csv(StringIO(data1))
    df2 = pl.read_csv(StringIO(data2))

Compare Object Setup
--------------------

Unlike the Pandas implementation, there is only one supported method for joining your dataframes: join column(s).

.. code-block:: python

    compare = PolarsCompare(
        df1,
        df2,
        join_columns='acct_id',  # You can also specify a list of columns
        abs_tol=0.0001,
        rel_tol=0,
        df1_name='original',
        df2_name='new')

    # OR

    compare = PolarsCompare(df1, df2, join_columns=['acct_id', 'name'])

Reports
-------

A report is generated by calling ``PolarsCompare.report()``, which returns a string.
Here is a sample report generated by ``datacompy`` for the two tables above, joined on ``acct_id`` (note: if you don't specify ``df1_name`` and/or ``df2_name``, then any instance of "original" or "new" in the report is replaced with "df1" and/or "df2")::

    DataComPy Comparison
    --------------------

    DataFrame Summary
    -----------------

      DataFrame  Columns  Rows
    0  original        5     6
    1       new        4     5

    Column Summary
    --------------

    Number of columns in common: 4
    Number of columns in original but not in new: 1
    Number of columns in new but not in original: 0

    Row Summary
    -----------

    Matched on: acct_id
    Any duplicates on match values: Yes
    Absolute Tolerance: 0.0001
    Relative Tolerance: 0
    Number of rows in common: 5
    Number of rows in original but not in new: 1
    Number of rows in new but not in original: 0

    Number of rows with some compared columns unequal: 5
    Number of rows with all compared columns equal: 0

    Column Comparison
    -----------------

    Number of columns compared with some values unequal: 3
    Number of columns compared with all values equal: 1
    Total number of values which compare unequal: 8

    Columns with Unequal Values or Types
    ------------------------------------

           Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
    0  dollar_amt        Float64   Float64          1    0.0500            0
    1   float_fld        Float64   Float64          4    0.0005            3
    2        name         String    String          3    0.0000            0

    Sample Rows with Unequal Values
    -------------------------------

           acct_id  dollar_amt (original)  dollar_amt (new)
    0  10000001234                 123.45             123.4

           acct_id name (original)            name (new)
    0  10000001234  George Maharis  George Michael Bluth
    1  10000001237      Bob Loblaw         Robert Loblaw
    2  10000001238   Lucille Bluth      Loose Seal Bluth

           acct_id  float_fld (original)  float_fld (new)
    0  10000001234            14530.1555        14530.155
    1  10000001235                1.0000              NaN
    2  10000001236                   NaN            1.000
    3  10000001238                   NaN          111.000

    Sample Rows Only in original (First 10 Columns)
    -----------------------------------------------

           acct_id  dollar_amt              name  float_fld    date_fld
    0  10000001238        1.05  Loose Seal Bluth        NaN  2017-01-01

Convenience Methods
-------------------

There are a few convenience methods available after the comparison has been run:

.. code-block:: python

    print(compare.intersect_rows[['name_df1', 'name_df2', 'name_match']])
    # shape: (5, 3)
    # ┌────────────────┬──────────────────────┬────────────┐
    # │ name_df1       ┆ name_df2             ┆ name_match │
    # │ ---            ┆ ---                  ┆ ---        │
    # │ str            ┆ str                  ┆ bool       │
    # ╞════════════════╪══════════════════════╪════════════╡
    # │ George Maharis ┆ George Michael Bluth ┆ false      │
    # │ Michael Bluth  ┆ Michael Bluth        ┆ true       │
    # │ George Bluth   ┆ George Bluth         ┆ true       │
    # │ Bob Loblaw     ┆ Robert Loblaw        ┆ false      │
    # │ Lucille Bluth  ┆ Loose Seal Bluth     ┆ false      │
    # └────────────────┴──────────────────────┴────────────┘

    print(compare.df1_unq_rows)
    # shape: (1, 5)
    # ┌─────────────┬────────────┬──────────────────┬───────────┬────────────┐
    # │ acct_id     ┆ dollar_amt ┆ name             ┆ float_fld ┆ date_fld   │
    # │ ---         ┆ ---        ┆ ---              ┆ ---       ┆ ---        │
    # │ i64         ┆ f64        ┆ str              ┆ f64       ┆ str        │
    # ╞═════════════╪════════════╪══════════════════╪═══════════╪════════════╡
    # │ 10000001238 ┆ 1.05       ┆ Loose Seal Bluth ┆ null      ┆ 2017-01-01 │
    # └─────────────┴────────────┴──────────────────┴───────────┴────────────┘

    print(compare.df2_unq_rows)
    # shape: (0, 4)
    # ┌─────────┬────────────┬──────┬───────────┐
    # │ acct_id ┆ dollar_amt ┆ name ┆ float_fld │
    # │ ---     ┆ ---        ┆ ---  ┆ ---       │
    # │ i64     ┆ f64        ┆ str  ┆ f64       │
    # ╞═════════╪════════════╪══════╪═══════════╡
    # └─────────┴────────────┴──────┴───────────┘

    print(compare.intersect_columns())
    # OrderedSet(['acct_id', 'dollar_amt', 'name', 'float_fld'])

    print(compare.df1_unq_columns())
    # OrderedSet(['date_fld'])

    print(compare.df2_unq_columns())
    # OrderedSet()

Duplicate rows
--------------

``PolarsCompare`` will try to handle duplicate rows just like the Pandas version. See :ref:`pandas-dup-rows`.

Caveats and Limitations
-----------------------

- Similar to Pandas, the Polars version will try to be resilient to duplicate matching. See Pandas: :ref:`pandas-caveats`.
- As mentioned above, there is no indexing like in Pandas, so you must join on ``join_columns``.
- Polars' ``join`` method uses ``join_nulls``. This tries to match Pandas' behaviour, but there might be unexpected results due to this.
- The Polars documentation has a good section on migrating from Pandas to Polars and some of the things to consider. See `Coming from Pandas <https://docs.pola.rs/user-guide/migration/pandas/>`_.