Pandas Usage
============

Overview
--------

The main goal of ``datacompy`` is to provide a human-readable output describing
differences between two dataframes.  For example, if you have two dataframes
containing data like:

df1

=========== ========== ================ ========== ==========
acct_id     dollar_amt name             float_fld  date_fld
=========== ========== ================ ========== ==========
10000001234 123.45     George Maharis   14530.1555 2017-01-01
10000001235 0.45       Michael Bluth    1          2017-01-01
10000001236 1345       George Bluth                2017-01-01
10000001237 123456     Bob Loblaw       345.12     2017-01-01
10000001238 1.05       Lucille Bluth               2017-01-01
10000001238 1.05       Loose Seal Bluth            2017-01-01
=========== ========== ================ ========== ==========

df2

=========== ========== ==================== =========
acct_id     dollar_amt name                 float_fld
=========== ========== ==================== =========
10000001234 123.4      George Michael Bluth 14530.155
10000001235 0.45       Michael Bluth
10000001236 1345       George Bluth         1
10000001237 123456     Robert Loblaw        345.12
10000001238 1.05       Loose Seal Bluth     111
=========== ========== ==================== =========

Set up like:

.. code-block:: python

    from io import StringIO
    import pandas as pd
    import datacompy

    data1 = """acct_id,dollar_amt,name,float_fld,date_fld
    10000001234,123.45,George Maharis,14530.1555,2017-01-01
    10000001235,0.45,Michael Bluth,1,2017-01-01
    10000001236,1345,George Bluth,,2017-01-01
    10000001237,123456,Bob Loblaw,345.12,2017-01-01
    10000001238,1.05,Lucille Bluth,,2017-01-01
    10000001238,1.05,Loose Seal Bluth,,2017-01-01
    """

    data2 = """acct_id,dollar_amt,name,float_fld
    10000001234,123.4,George Michael Bluth,14530.155
    10000001235,0.45,Michael Bluth,
    10000001236,1345,George Bluth,1
    10000001237,123456,Robert Loblaw,345.12
    10000001238,1.05,Loose Seal Bluth,111
    """

    df1 = pd.read_csv(StringIO(data1))
    df2 = pd.read_csv(StringIO(data2))

Compare Object Setup
--------------------

There are currently two supported methods for joining your dataframes - by
join column(s) or by index.

.. code-block:: python

    compare = datacompy.Compare(
        df1,
        df2,
        join_columns='acct_id',  #You can also specify a list of columns
        abs_tol=0.0001,
        rel_tol=0,
        df1_name='original',
        df2_name='new')

    # OR

    compare = datacompy.Compare(df1, df2, join_columns=['acct_id', 'name'])

    # OR

    compare = datacompy.Compare(df1, df2, on_index=True)

Reports
-------

A report is generated by calling ``Compare.report()``, which returns a string.
Here is a sample report generated by ``datacompy`` for the two tables above,
joined on ``acct_id`` (Note: if you don't specify ``df1_name`` and/or ``df2_name``,
then any instance of "original" or "new" in the report is replaced with "df1"
and/or "df2".)::

    DataComPy Comparison
    --------------------

    DataFrame Summary
    -----------------

      DataFrame  Columns  Rows
    0  original        5     6
    1       new        4     5

    Column Summary
    --------------

    Number of columns in common: 4
    Number of columns in original but not in new: 1
    Number of columns in new but not in original: 0

    Row Summary
    -----------

    Matched on: acct_id
    Any duplicates on match values: Yes
    Absolute Tolerance: 0.0001
    Relative Tolerance: 0
    Number of rows in common: 5
    Number of rows in original but not in new: 1
    Number of rows in new but not in original: 0

    Number of rows with some compared columns unequal: 5
    Number of rows with all compared columns equal: 0

    Column Comparison
    -----------------

    Number of columns compared with some values unequal: 3
    Number of columns compared with all values equal: 1
    Total number of values which compare unequal: 7

    Columns with Unequal Values or Types
    ------------------------------------

           Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
    0  dollar_amt        float64   float64          1    0.0500            0
    1   float_fld        float64   float64          4    0.0005            3
    2        name         object    object          2    0.0000            0

    Sample Rows with Unequal Values
    -------------------------------

           acct_id  dollar_amt (original)  dollar_amt (new)
    0  10000001234                 123.45             123.4

           acct_id  float_fld (original)  float_fld (new)
    0  10000001234            14530.1555        14530.155
    5  10000001238                   NaN          111.000
    2  10000001236                   NaN            1.000
    1  10000001235                1.0000              NaN

           acct_id name (original)            name (new)
    0  10000001234  George Maharis  George Michael Bluth
    3  10000001237      Bob Loblaw         Robert Loblaw

    Sample Rows Only in original (First 10 Columns)
    -----------------------------------------------

           acct_id  dollar_amt           name  float_fld    date_fld
    4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

Convenience Methods
-------------------

There are a few convenience methods available after the comparison has been run:

.. code-block:: python

    print(compare.intersect_rows[['name_df1', 'name_df2', 'name_match']])
    #            name_df1              name_df2  name_match
    # 0    George Maharis  George Michael Bluth       False
    # 1     Michael Bluth         Michael Bluth        True
    # 2      George Bluth          George Bluth        True
    # 3        Bob Loblaw         Robert Loblaw       False
    # 5  Loose Seal Bluth      Loose Seal Bluth        True

    print(compare.df1_unq_rows)
    #        acct_id  dollar_amt           name  float_fld    date_fld
    # 4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

    print(compare.df2_unq_rows)
    # Empty DataFrame
    # Columns: [acct_id, dollar_amt, name, float_fld]
    # Index: []

    print(compare.intersect_columns())
    # {'float_fld', 'acct_id', 'name', 'dollar_amt'}

    print(compare.df1_unq_columns())
    # {'date_fld'}

    print(compare.df2_unq_columns())
    # set()


.. _pandas-dup-rows:

Duplicate rows
--------------

Datacompy will try to handle rows that are duplicate in the join columns.  It does this behind the
scenes by generating a unique ID within each unique group of the join columns.  For example, if you
have two dataframes you're trying to join on acct_id:

=========== ================
acct_id     name
=========== ================
1           George Maharis
1           Michael Bluth
2           George Bluth
=========== ================

=========== ================
acct_id     name
=========== ================
1           George Maharis
1           Michael Bluth
1           Tony Wonder
2           George Bluth
=========== ================

Datacompy will generate a unique temporary ID for joining:

=========== ================ ========
acct_id     name             temp_id
=========== ================ ========
1           George Maharis   0
1           Michael Bluth    1
2           George Bluth     0
=========== ================ ========

=========== ================ ========
acct_id     name             temp_id
=========== ================ ========
1           George Maharis   0
1           Michael Bluth    1
1           Tony Wonder      2
2           George Bluth     0
=========== ================ ========

And then merge the two dataframes on a combination of the join_columns you specified and the temporary
ID, before dropping the temp_id again.  So the first two rows in the first dataframe will match the
first two rows in the second dataframe, and the third row in the second dataframe will be recognized
as uniquely in the second.

.. _pandas-caveats:

Caveats
+++++++

- Duplicate matching is resilient to nulls in your join columns - it will convert the join
  columns to strings and fill null values with ``'DATACOMPY_NULL'`` before generating the temporary
  ID.  If you already have ``'DATACOMPY_NULL'`` as a value in your join columns, the merge step will
  fail with a ``ValueError``.  You can also fill null values with a value of your choice before
  initializing the ``Compare`` class, based on what you know about the data.
- The duplicate matching is somewhat naïve when it comes to picking which rows to match when there
  are duplicates.  Datacompy sorts by the other fields before generating the temporary ID, then matches
  directly on that field.  If there are a lot of duplicates you may need to join on more columns, or
  handle them separately.

.. _pandas-limitations:

Limitations
-----------

There's a number of limitations with ``datacompy``:

1. The dataframes that you're comparing have to fit in memory.  In comparison
   with SAS ``PROC COMPARE`` which can operate on datasets that are on disk,
   this could be a constraint if you're using very large dataframes.
2. If you only need to check whether or not two dataframes are exactly the
   same, you should look at the testing capabilities within Pandas and Numpy:

   .. code-block:: python

       from pandas.testing import assert_series_equal
       from pandas.testing import assert_frame_equal
       import numpy.testing as npt

       #Compare two series
       assert_series_equal(df1['some_field'], df2['some_field'])

       #Compare two dataframes
       assert_frame_equal(df1, df2)

       #Numpy testing
       npt.assert_array_equal(arr1, arr2)
       npt.assert_almost_equal(obj1, obj2)