# Pandas Usage

## Overview

The main goal of datacompy is to provide a human-readable output describing differences between two dataframes. For example, if you have two dataframes containing data like:
df1

| acct_id | dollar_amt | name | float_fld | date_fld |
|---|---|---|---|---|
| 10000001234 | 123.45 | George Maharis | 14530.1555 | 2017-01-01 |
| 10000001235 | 0.45 | Michael Bluth | 1 | 2017-01-01 |
| 10000001236 | 1345 | George Bluth | | 2017-01-01 |
| 10000001237 | 123456 | Bob Loblaw | 345.12 | 2017-01-01 |
| 10000001238 | 1.05 | Lucille Bluth | | 2017-01-01 |
| 10000001238 | 1.05 | Loose Seal Bluth | | 2017-01-01 |
df2

| acct_id | dollar_amt | name | float_fld |
|---|---|---|---|
| 10000001234 | 123.4 | George Michael Bluth | 14530.155 |
| 10000001235 | 0.45 | Michael Bluth | |
| 10000001236 | 1345 | George Bluth | 1 |
| 10000001237 | 123456 | Robert Loblaw | 345.12 |
| 10000001238 | 1.05 | Loose Seal Bluth | 111 |
Set up like:

```python
from io import StringIO

import pandas as pd

import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001238,1.05,Lucille Bluth,,2017-01-01
10000001238,1.05,Loose Seal Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
```
## Compare Object Setup

There are currently two supported methods for joining your dataframes: by join column(s) or by index.

```python
compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  # You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')

# OR
compare = datacompy.Compare(df1, df2, join_columns=['acct_id', 'name'])

# OR
compare = datacompy.Compare(df1, df2, on_index=True)
```
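The `abs_tol` and `rel_tol` arguments control how far apart two numeric values can be and still count as equal. The comparison behaves like a `numpy.isclose`-style check; the snippet below is an illustrative sketch of that behavior, not datacompy's exact internal code:

```python
import numpy as np

abs_tol, rel_tol = 0.0001, 0

# dollar_amt for acct 10000001234: 123.45 vs 123.4.
# The difference (0.05) exceeds abs_tol, so the values compare unequal.
dollar_equal = np.isclose(123.45, 123.4, atol=abs_tol, rtol=rel_tol)
print(dollar_equal)  # False

# With a looser absolute tolerance, the same pair would compare equal.
loose_equal = np.isclose(123.45, 123.4, atol=0.1, rtol=rel_tol)
print(loose_equal)  # True
```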
## Reports

A report is generated by calling `Compare.report()`, which returns a string. Here is a sample report generated by datacompy for the two tables above, joined on `acct_id`. (Note: if you don't specify `df1_name` and/or `df2_name`, then any instance of "original" or "new" in the report is replaced with "df1" and/or "df2".)
```
DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
1   float_fld        float64   float64          4    0.0005            3
2        name         object    object          2    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
5  10000001238                   NaN          111.000
2  10000001236                   NaN            1.000
1  10000001235                1.0000              NaN

       acct_id name (original)            name (new)
0  10000001234  George Maharis  George Michael Bluth
3  10000001237      Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt           name  float_fld    date_fld
4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01
```
## Convenience Methods

There are a few convenience methods available after the comparison has been run:

```python
print(compare.intersect_rows[['name_df1', 'name_df2', 'name_match']])
#            name_df1              name_df2  name_match
# 0    George Maharis  George Michael Bluth       False
# 1     Michael Bluth         Michael Bluth        True
# 2      George Bluth          George Bluth        True
# 3        Bob Loblaw         Robert Loblaw       False
# 5  Loose Seal Bluth      Loose Seal Bluth        True

print(compare.df1_unq_rows)
#        acct_id  dollar_amt           name  float_fld    date_fld
# 4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

print(compare.df2_unq_rows)
# Empty DataFrame
# Columns: [acct_id, dollar_amt, name, float_fld]
# Index: []

print(compare.intersect_columns())
# {'float_fld', 'acct_id', 'name', 'dollar_amt'}

print(compare.df1_unq_columns())
# {'date_fld'}

print(compare.df2_unq_columns())
# set()
```
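The column-set helpers are equivalent to plain set arithmetic on the two column indexes. A minimal pandas-only sketch (no datacompy required) that reproduces the same sets for the example dataframes:

```python
import pandas as pd

df1 = pd.DataFrame(columns=['acct_id', 'dollar_amt', 'name', 'float_fld', 'date_fld'])
df2 = pd.DataFrame(columns=['acct_id', 'dollar_amt', 'name', 'float_fld'])

cols1, cols2 = set(df1.columns), set(df2.columns)

intersect = cols1 & cols2  # in both dataframes, like intersect_columns()
df1_only = cols1 - cols2   # only in df1, like df1_unq_columns()
df2_only = cols2 - cols1   # only in df2, like df2_unq_columns()

print(df1_only)  # {'date_fld'}
print(df2_only)  # set()
```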
## Duplicate rows

Datacompy will try to handle rows that are duplicated on the join columns. It does this behind the scenes by generating a unique ID within each group of duplicated join-column values. For example, if you have two dataframes you're trying to join on `acct_id`:
| acct_id | name |
|---|---|
| 1 | George Maharis |
| 1 | Michael Bluth |
| 2 | George Bluth |

| acct_id | name |
|---|---|
| 1 | George Maharis |
| 1 | Michael Bluth |
| 1 | Tony Wonder |
| 2 | George Bluth |
Datacompy will generate a unique temporary ID for joining:
| acct_id | name | temp_id |
|---|---|---|
| 1 | George Maharis | 0 |
| 1 | Michael Bluth | 1 |
| 2 | George Bluth | 0 |

| acct_id | name | temp_id |
|---|---|---|
| 1 | George Maharis | 0 |
| 1 | Michael Bluth | 1 |
| 1 | Tony Wonder | 2 |
| 2 | George Bluth | 0 |
It then merges the two dataframes on a combination of the `join_columns` you specified and the temporary ID, before dropping `temp_id` again. So the first two rows in the first dataframe match the first two rows in the second dataframe, and the third row (Tony Wonder) is recognized as existing only in the second dataframe.
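This behind-the-scenes step can be sketched with plain pandas using `groupby(...).cumcount()`. The snippet below is a sketch of the idea, not datacompy's exact implementation:

```python
import pandas as pd

df1 = pd.DataFrame({'acct_id': [1, 1, 2],
                    'name': ['George Maharis', 'Michael Bluth', 'George Bluth']})
df2 = pd.DataFrame({'acct_id': [1, 1, 1, 2],
                    'name': ['George Maharis', 'Michael Bluth', 'Tony Wonder', 'George Bluth']})

# Number the rows within each group of duplicated join-column values
df1['temp_id'] = df1.groupby('acct_id').cumcount()
df2['temp_id'] = df2.groupby('acct_id').cumcount()

# Outer-join on the join column plus the temporary ID, then drop it
merged = df1.merge(df2, on=['acct_id', 'temp_id'], how='outer',
                   suffixes=('_df1', '_df2'), indicator=True)
merged = merged.drop(columns='temp_id')

print(merged)
# Tony Wonder's row is flagged 'right_only': present only in df2
```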
## Caveats

- Duplicate matching is resilient to nulls in your join columns: datacompy converts the join columns to strings and fills null values with `'DATACOMPY_NULL'` before generating the temporary ID. If you already have `'DATACOMPY_NULL'` as a value in your join columns, the merge step will fail with a `ValueError`. You can also fill null values with a value of your choice before initializing the `Compare` class, based on what you know about the data.
- The duplicate matching is somewhat naïve when it comes to picking which rows to match when there are duplicates. Datacompy sorts by the other fields before generating the temporary ID, then matches directly on that field. If there are a lot of duplicates, you may need to join on more columns or handle them separately.
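If `'DATACOMPY_NULL'` could collide with real values, one option is to fill the nulls yourself before constructing the `Compare` object. A minimal sketch (the sentinel value and column name here are just examples, not requirements):

```python
import pandas as pd

df = pd.DataFrame({'acct_id': [1, None, 2],
                   'name': ['George Maharis', 'Michael Bluth', 'George Bluth']})

# Pick a sentinel you know cannot occur in the real data
df['acct_id'] = df['acct_id'].fillna(-1)

print(df['acct_id'].tolist())  # [1.0, -1.0, 2.0]
```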
## Limitations

There are a number of limitations with datacompy:

- The dataframes that you're comparing have to fit in memory. In comparison with SAS `PROC COMPARE`, which can operate on datasets that are on disk, this could be a constraint if you're using very large dataframes.
- If you only need to check whether or not two dataframes are exactly the same, you should look at the testing capabilities within Pandas and NumPy:

```python
from pandas.testing import assert_series_equal
from pandas.testing import assert_frame_equal
import numpy.testing as npt

# Compare two series
assert_series_equal(df1['some_field'], df2['some_field'])

# Compare two dataframes
assert_frame_equal(df1, df2)

# Numpy testing
npt.assert_array_equal(arr1, arr2)
npt.assert_almost_equal(obj1, obj2)
```