Comparator Framework Usage
Added in version 1.0.0.
Overview
Version 1.0.0 of DataComPy introduces a new, modular comparator framework designed for extensibility and customization.
Previously, the logic for comparing different data types was tightly coupled within each backend’s main
Compare class (e.g., PandasCompare). This made it difficult to alter how comparisons were performed.
The new framework moves type-specific comparison logic into a series of independent comparator classes
found in the datacompy.comparator module. This allows users to create and use their own custom comparators to
handle unique data types or implement specialized comparison logic.
Core Concepts
The framework is built around a few key ideas: the BaseComparator abstract class, a fallback mechanism, and a pipeline of comparators.
The BaseComparator Class
All comparators, both built-in and custom, must inherit from datacompy.comparator.base.BaseComparator.
This class defines the interface that DataComPy’s comparison engine expects.
The most important part of this interface is the compare method:
from abc import ABC, abstractmethod
from typing import Any

class BaseComparator(ABC):
    @abstractmethod
    def compare(self, col1: Any, col2: Any, **kwargs) -> Any:
        """Check if two columns are equal."""
        raise NotImplementedError()
When you create a custom comparator, you must implement this method.
The Fallback Mechanism
A crucial feature of the framework is its fallback (or “chain-of-responsibility”) mechanism.
When comparing two columns, DataComPy iterates through a list of comparators and calls their compare method.
Important
If a comparator is not suitable for the data types it receives, it must return None.
Returning None signals to the comparison engine that the comparator could not handle the columns,
prompting the engine to try the next comparator in the pipeline. If a comparator successfully performs
a comparison, it should return a boolean Series (for Pandas/Polars) or a boolean Column expression
(for Spark/Snowflake) indicating which values are equal.
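The fallback behaviour can be sketched in plain Python. This is an illustration only, not DataComPy's engine code: the first_applicable helper and both toy comparators are hypothetical names, and plain lists stand in for columns.

```python
from typing import Any, Optional


class IntComparator:
    """Hypothetical comparator that only handles lists of ints."""

    def compare(self, col1: list, col2: list, **kwargs: Any) -> Optional[list]:
        if not all(isinstance(v, int) for v in col1 + col2):
            return None  # not applicable: let the next comparator try
        return [a == b for a, b in zip(col1, col2)]


class StrComparator:
    """Hypothetical comparator that only handles lists of strings."""

    def compare(self, col1: list, col2: list, **kwargs: Any) -> Optional[list]:
        if not all(isinstance(v, str) for v in col1 + col2):
            return None  # not applicable: let the next comparator try
        return [a == b for a, b in zip(col1, col2)]


def first_applicable(comparators: list, col1: list, col2: list) -> Optional[list]:
    """Try each comparator in order and return the first non-None result."""
    for comparator in comparators:
        result = comparator.compare(col1, col2)
        if result is not None:
            return result
    return None  # no comparator could handle these columns


# IntComparator declines the string columns, so StrComparator handles them.
result = first_applicable([IntComparator(), StrComparator()], ["a", "b"], ["a", "c"])
print(result)  # [True, False]
```

The point of the sketch is that None means "not my data type" and is distinct from a result in which every value is False.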
Comparator Pipeline
When you instantiate a Compare object, it creates a pipeline of comparators to use.
If you provide custom comparators via the custom_comparators parameter, they are placed
at the beginning of this pipeline.
The order of execution is:
1. Your list of custom comparators, in the order you provided them.
2. DataComPy’s built-in comparators (for arrays, numerics, and strings).
This ensures that your custom logic is always tried first.
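The ordering can be illustrated with a small stand-alone sketch. Everything here is hypothetical (build_pipeline, run, and the two toy comparators are not DataComPy APIs); it only shows why a custom comparator placed first takes precedence over a built-in one that could also handle the column.

```python
class ExactMatch:
    """Hypothetical stand-in for a built-in string comparator."""

    def compare(self, col1, col2, **kwargs):
        return [a == b for a, b in zip(col1, col2)]


class CaseInsensitiveMatch:
    """Hypothetical custom comparator supplied by the user."""

    def compare(self, col1, col2, **kwargs):
        return [a.lower() == b.lower() for a, b in zip(col1, col2)]


def build_pipeline(custom_comparators=None, builtin_comparators=()):
    """Place custom comparators ahead of the built-in ones."""
    return list(custom_comparators or []) + list(builtin_comparators)


def run(pipeline, col1, col2):
    """Return the result of the first comparator that does not decline."""
    for comparator in pipeline:
        result = comparator.compare(col1, col2)
        if result is not None:
            return result
    return None


builtin_stage = [ExactMatch()]
default_result = run(build_pipeline(builtin_comparators=builtin_stage), ["A"], ["a"])
custom_result = run(build_pipeline([CaseInsensitiveMatch()], builtin_stage), ["A"], ["a"])
print(default_result, custom_result)  # [False] [True]
```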
Creating a Custom Comparator
To create a custom comparator, you need to:
1. Create a class that inherits from datacompy.comparator.base.BaseComparator.
2. Implement the compare method.
3. Inside compare, add logic to check if your comparator is applicable to the input columns. If not, return None.
4. Add your custom comparison logic and return the boolean result.
Example: A Custom Phone Number Comparator
Imagine you have two dataframes with phone numbers stored as strings, but in inconsistent formats
(e.g., with or without parentheses, hyphens, or spaces). DataComPy’s default string comparator would
treat "(123) 456-7890" and "1234567890" as different.
Let’s create a custom comparator to handle this. It will strip all non-numeric characters before comparing.
import pandas as pd
import datacompy
from datacompy.comparator.base import BaseComparator
class PhoneNumberComparator(BaseComparator):
    """
    Custom comparator for US phone numbers.

    This comparator strips all non-numeric characters from strings
    before comparing them.
    """

    # **kwargs mirrors the BaseComparator.compare signature, so any extra
    # keyword arguments are accepted and ignored.
    def compare(self, col1: pd.Series, col2: pd.Series, **kwargs) -> pd.Series | None:
        """
        Compare two series of phone numbers.
        """
        # 1. Check if this comparator is applicable. We only want to act on
        # columns that are string or object type.
        if not (pd.api.types.is_string_dtype(col1) and pd.api.types.is_string_dtype(col2)):
            return None  # Signal to fall back to the next comparator

        # 2. Implement the custom comparison logic.
        # Strip non-numeric characters.
        norm_col1 = col1.str.replace(r'[^0-9]', '', regex=True)
        norm_col2 = col2.str.replace(r'[^0-9]', '', regex=True)

        # 3. Return the boolean result. Handle NaNs to match default behavior.
        return (norm_col1 == norm_col2) | (col1.isnull() & col2.isnull())
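The normalization and null handling used above can be checked on their own with plain pandas, without going through DataComPy. The sample values below are made up for illustration; the sketch shows why the extra isnull() term is needed, since two missing values never compare equal with ==.

```python
import pandas as pd

# Illustrative data: one formatting difference, one real mismatch, one null pair.
col1 = pd.Series(["(123) 456-7890", "555-555-5555", None], dtype="object")
col2 = pd.Series(["1234567890", "555-123-4567", None], dtype="object")

# Strip everything that is not a digit; missing values stay missing.
norm1 = col1.str.replace(r"[^0-9]", "", regex=True)
norm2 = col2.str.replace(r"[^0-9]", "", regex=True)

values_equal = norm1 == norm2                  # missing == missing is False
both_null = col1.isnull() & col2.isnull()      # treat null pairs as equal
result = values_equal | both_null

print(values_equal.tolist())  # [True, False, False]
print(result.tolist())        # [True, False, True]
```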
Using the Custom Comparator
To use your custom comparator, pass an instance of it in a list to the custom_comparators
argument of the Compare constructor. Let’s see it in action with our PhoneNumberComparator.
Setup
First, let’s create two sample DataFrames. Notice that customer ID 2 has the same phone number but with different formatting.
# Sample DataFrames
df1 = pd.DataFrame({
    'cust_id': [1, 2, 3],
    'phone': ['123-456-7890', '(987) 654-3210', '555-555-5555'],
})
df2 = pd.DataFrame({
    'cust_id': [1, 2, 3],
    'phone': ['123-456-7890', '9876543210', '555-123-4567'],
})
Comparison Without the Custom Comparator
If we run the comparison without our custom logic, the phone numbers for customers 2 and 3 will both be marked as mismatches.
# Without the custom comparator
compare_default = datacompy.PandasCompare(
    df1,
    df2,
    join_columns=['cust_id']
)
# This will show mismatches for cust_id 2 and cust_id 3
print(compare_default.report())
The report will show two unequal values for the phone column.
...
Sample Rows with Unequal Values
-------------------------------
   cust_id     phone (df1)   phone (df2)
0        2  (987) 654-3210    9876543210
1        3    555-555-5555  555-123-4567
...
Comparison With the Custom Comparator
Now, let’s pass our PhoneNumberComparator to the PandasCompare object.
# With the custom comparator
compare_custom = datacompy.PandasCompare(
    df1,
    df2,
    join_columns=['cust_id'],
    custom_comparators=[PhoneNumberComparator()]
)
# This will now show a match for cust_id 2's phone
print(compare_custom.report())
This time, our custom logic is applied first. It correctly identifies that the phone numbers for customer 2 are the same after normalization. The report now shows customer 3 as the only mismatch.
...
Sample Rows with Unequal Values
-------------------------------
   cust_id   phone (df1)   phone (df2)
0        3  555-555-5555  555-123-4567
...
This example demonstrates how you can easily extend DataComPy to handle your project’s specific data comparison needs.