.. _comparator_usage:

===========================
Comparator Framework Usage
===========================

.. versionadded:: 1.0.0

Overview
========

Version 1.0.0 of DataComPy introduces a new, modular comparator framework designed for extensibility and customization.
Previously, the logic for comparing different data types was tightly coupled within each backend's main
``Compare`` class (e.g., ``PandasCompare``). This made it difficult to alter how comparisons were performed.

The new framework moves type-specific comparison logic into a series of independent comparator classes
found in the ``datacompy.comparator`` module. This allows users to create and use their own custom comparators to
handle unique data types or implement specialized comparison logic.

Core Concepts
=============

The framework is built around a few key ideas: the ``BaseComparator`` abstract class, a fallback mechanism, and a pipeline of comparators.

The BaseComparator Class
------------------------

All comparators, both built-in and custom, must inherit from ``datacompy.comparator.base.BaseComparator``.
This class defines the interface that DataComPy's comparison engine expects.

The most important part of this interface is the ``compare`` method:

.. code-block:: python

    from abc import ABC, abstractmethod
    from typing import Any

    class BaseComparator(ABC):
        @abstractmethod
        def compare(self, col1: Any, col2: Any, **kwargs) -> Any:
            """Check if two columns are equal."""
            raise NotImplementedError()

When you create a custom comparator, you must implement this method.
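
Because ``compare`` is declared with ``@abstractmethod``, Python refuses to instantiate a subclass
that does not override it. The minimal sketch below illustrates this with a local stand-in that mirrors
the interface shown above rather than importing DataComPy; the two subclass names are hypothetical
and exist only for illustration.

.. code-block:: python

    from abc import ABC, abstractmethod
    from typing import Any


    class BaseComparator(ABC):
        """Local stand-in mirroring the interface above (not DataComPy's class)."""

        @abstractmethod
        def compare(self, col1: Any, col2: Any, **kwargs) -> Any:
            """Check if two columns are equal."""
            raise NotImplementedError()


    class IncompleteComparator(BaseComparator):
        """Hypothetical subclass that forgets to implement ``compare``."""


    class AlwaysDeclineComparator(BaseComparator):
        """Hypothetical subclass that implements ``compare`` but never applies."""

        def compare(self, col1: Any, col2: Any, **kwargs) -> Any:
            return None  # Decline every pair of columns


    try:
        IncompleteComparator()
        instantiated = True
    except TypeError:
        # An unimplemented abstract method blocks instantiation.
        instantiated = False

    # A complete subclass can be instantiated and called normally.
    declined = AlwaysDeclineComparator().compare([1], [1])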

The Fallback Mechanism
----------------------

A crucial feature of the framework is its fallback (or "chain-of-responsibility") mechanism.
When comparing two columns, DataComPy iterates through a list of comparators and calls their ``compare`` method.

.. important::

    If a comparator is not suitable for the data types it receives, it must return ``None``.

Returning ``None`` signals to the comparison engine that the comparator could not handle the columns,
prompting the engine to try the next comparator in the pipeline. If a comparator successfully performs
a comparison, it should return a boolean Series (for Pandas/Polars) or a boolean Column expression
(for Spark/Snowflake) indicating which values are equal.

Comparator Pipeline
-------------------

When you instantiate a ``Compare`` object, it creates a pipeline of comparators to use.
If you provide custom comparators via the ``custom_comparators`` parameter, they are placed
at the **beginning** of this pipeline.

The order of execution is:

1. Your list of custom comparators, in the order you provided them.
2. DataComPy's built-in comparators (for arrays, numerics, and strings).

This ensures that your custom logic is always tried first.
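
The fallback and ordering rules described above can be modelled with a few lines of plain Python.
This is only a toy sketch, not DataComPy's actual engine: it works on lists instead of Series or
Column expressions, and ``run_pipeline``, ``DigitsOnlyComparator``, and ``ExactComparator`` are
hypothetical names introduced for illustration.

.. code-block:: python

    import re


    class DigitsOnlyComparator:
        """Hypothetical custom comparator: applies only when every value is a string."""

        def compare(self, col1, col2, **kwargs):
            if not all(isinstance(v, str) for v in [*col1, *col2]):
                return None  # Not applicable, let the next comparator try
            return [
                re.sub(r"[^0-9]", "", a) == re.sub(r"[^0-9]", "", b)
                for a, b in zip(col1, col2)
            ]


    class ExactComparator:
        """Hypothetical stand-in for a built-in comparator: plain equality."""

        def compare(self, col1, col2, **kwargs):
            return [a == b for a, b in zip(col1, col2)]


    def run_pipeline(comparators, col1, col2):
        """Return the first non-None result, trying comparators in order."""
        for comparator in comparators:
            result = comparator.compare(col1, col2)
            if result is not None:
                return result
        return None  # No comparator could handle the columns


    # Custom comparators come first, followed by the "built-in" one.
    pipeline = [DigitsOnlyComparator(), ExactComparator()]

    # Strings: the custom comparator handles the pair itself.
    strings_result = run_pipeline(pipeline, ["(123) 456-7890"], ["1234567890"])

    # Integers: the custom comparator returns None, so the fallback runs.
    numbers_result = run_pipeline(pipeline, [1, 2], [1, 3])

In this model, ``strings_result`` is ``[True]`` and ``numbers_result`` is ``[True, False]``.
Reversing the list would let ``ExactComparator`` answer first, so the custom logic would never run,
which is why the position of custom comparators at the front of the pipeline matters.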

Creating a Custom Comparator
============================

To create a custom comparator, you need to:

1. Create a class that inherits from ``datacompy.comparator.base.BaseComparator``.
2. Implement the ``compare`` method.
3. Inside ``compare``, add logic to check if your comparator is applicable to the input columns.
   If not, return ``None``.
4. Add your custom comparison logic and return the boolean result.

Example: A Custom Phone Number Comparator
------------------------------------------

Imagine you have two dataframes with phone numbers stored as strings, but in inconsistent formats
(e.g., with or without parentheses, hyphens, or spaces). DataComPy's default string comparator would
treat ``"(123) 456-7890"`` and ``"1234567890"`` as different.

Let's create a custom comparator to handle this. It will strip all non-numeric characters before
comparing.

.. code-block:: python

    import pandas as pd
    import datacompy
    from datacompy.comparator.base import BaseComparator

    class PhoneNumberComparator(BaseComparator):
        """
        Custom comparator for US phone numbers.

        This comparator strips all non-numeric characters from strings
        before comparing them.
        """
        def compare(self, col1: pd.Series, col2: pd.Series, **kwargs) -> pd.Series | None:
            """
            Compare two series of phone numbers.
            """
            # 1. Check if this comparator is applicable. We only want to act on
            #    columns that are string or object type.
            if not (pd.api.types.is_string_dtype(col1) and pd.api.types.is_string_dtype(col2)):
                return None  # Signal to fall back to the next comparator

            # 2. Implement the custom comparison logic.
            # Strip non-numeric characters.
            norm_col1 = col1.str.replace(r'[^0-9]', '', regex=True)
            norm_col2 = col2.str.replace(r'[^0-9]', '', regex=True)

            # 3. Return the boolean result. Handle NaNs to match default behavior.
            return (norm_col1 == norm_col2) | (col1.isnull() & col2.isnull())
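
To see what the normalization step does to individual values, the same regular expression can be
applied with the standard library's ``re`` module. This is a small sketch for illustration only;
``normalize_phone`` is a hypothetical helper and is not part of DataComPy or of the comparator above.

.. code-block:: python

    import re


    def normalize_phone(value: str) -> str:
        """Drop every character that is not an ASCII digit."""
        return re.sub(r"[^0-9]", "", value)


    # Formatting differences disappear after normalization.
    same = normalize_phone("(123) 456-7890") == normalize_phone("1234567890")

    # A leading country code adds a digit, so these still differ.
    with_country_code = (
        normalize_phone("+1 (123) 456-7890") == normalize_phone("1234567890")
    )

Here ``same`` is ``True`` and ``with_country_code`` is ``False``. Stripping punctuation alone does
not reconcile numbers written with and without a country code, so a comparator that needs to treat
those as equal would require additional logic.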

Using the Custom Comparator
===========================

To use your custom comparator, pass an instance of it in a list to the ``custom_comparators``
argument of the ``Compare`` constructor. Let's see it in action with our ``PhoneNumberComparator``.

Setup
-----

First, let's create two sample DataFrames. Notice that customer ID ``2`` has the same phone number
in both, but with different formatting.

.. code-block:: python

    # Sample DataFrames
    df1 = pd.DataFrame({
        'cust_id': [1, 2, 3],
        'phone': ['123-456-7890', '(987) 654-3210', '555-555-5555'],
    })

    df2 = pd.DataFrame({
        'cust_id': [1, 2, 3],
        'phone': ['123-456-7890', '9876543210', '555-123-4567'],
    })

Comparison Without the Custom Comparator
----------------------------------------

If we run the comparison without our custom logic, the phone numbers for customers ``2`` and ``3``
will be marked as mismatches.

.. code-block:: python

    # Without the custom comparator
    compare_default = datacompy.PandasCompare(
        df1,
        df2,
        join_columns=['cust_id']
    )
    # This will show mismatches for cust_id 2 and 3
    print(compare_default.report())

The report will show two unequal values for the ``phone`` column.

.. code-block:: text

   ...
   Sample Rows with Unequal Values
   -------------------------------

      cust_id     phone (df1)   phone (df2)
   0        2  (987) 654-3210    9876543210
   1        3    555-555-5555  555-123-4567
   ...

Comparison With the Custom Comparator
-------------------------------------

Now, let's pass our ``PhoneNumberComparator`` to the ``PandasCompare`` object.

.. code-block:: python

    # With the custom comparator
    compare_custom = datacompy.PandasCompare(
        df1,
        df2,
        join_columns=['cust_id'],
        custom_comparators=[PhoneNumberComparator()]
    )
    # This will now show a match for cust_id 2's phone
    print(compare_custom.report())


This time, our custom logic is applied first. It correctly identifies that the phone numbers for
customer ``2`` are the same after normalization, so the report now lists only customer ``3``
as a mismatch.

.. code-block:: text

    ...
    Sample Rows with Unequal Values
    -------------------------------

       cust_id   phone (df1)   phone (df2)
    0        3  555-555-5555  555-123-4567
    ...

This example demonstrates how you can easily extend DataComPy to handle your project's
specific data comparison needs.