datacompy.comparator package¶

Submodules¶

datacompy.comparator.array module¶

Array-Like Comparator Class.

class datacompy.comparator.array.PandasArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in Pandas.

compare(col1: Series, col2: Series) → Series | None¶

Compare two array-like columns for equality.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.

Returns:

pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

class datacompy.comparator.array.PolarsArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in Polars.

compare(col1: Series, col2: Series) → Series | None¶

Compare two array-like columns for equality.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.

Returns:

pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

class datacompy.comparator.array.SnowflakeArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str) → DataFrame | None¶

Compare two array-like columns for equality.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column
col_match (str) – The matching column denoting if the compare was a match or not

Returns:

snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

class datacompy.comparator.array.SparkArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str) → Column | None¶

Compare two array-like columns for equality.

Parameters:

dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column

Returns:

pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

datacompy.comparator.base module¶

Base Comparator Class.

class datacompy.comparator.base.BaseComparator¶

Bases: ABC

Base class for all comparators.

This class serves as an abstract base class for implementing specific comparator logic in derived classes.

abstractmethod compare(col1: Any, col2: Any, **kwargs) → Any¶

Check if two columns are equal.

This method should be implemented in derived classes to provide specific comparison logic.

Parameters:

col1 (Any) – The first column to compare.
col2 (Any) – The second column to compare.
**kwargs (Any) – Additional keyword arguments for comparison.

Returns:

Comparison result. (implementation-specific)

Return type:

Any
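
As an illustration of the contract described above, the following is a minimal sketch of a derived comparator. It assumes nothing beyond the documented abstract signature: SketchBaseComparator is a local stand-in so the example runs without datacompy installed, and ListEqualityComparator with its list-based "columns" is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class SketchBaseComparator(ABC):
    """Local stand-in that mirrors the documented BaseComparator interface."""

    @abstractmethod
    def compare(self, col1: Any, col2: Any, **kwargs: Any) -> Any:
        """Return an implementation-specific comparison result."""


class ListEqualityComparator(SketchBaseComparator):
    """Hypothetical comparator that treats plain Python lists as columns."""

    def compare(self, col1: Any, col2: Any, **kwargs: Any) -> Optional[List[bool]]:
        # Decline (return None) when the inputs are not comparable.
        if not isinstance(col1, list) or not isinstance(col2, list):
            return None
        if len(col1) != len(col2):
            return None
        # Element-wise equality, one boolean per row.
        return [a == b for a, b in zip(col1, col2)]


result = ListEqualityComparator().compare([1, 2, 3], [1, 0, 3])
declined = ListEqualityComparator().compare([1, 2], "not a column")
```

Here result holds one boolean per row and declined is None, matching the "None if the columns are not comparable" convention used by the concrete comparators on this page.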

datacompy.comparator.boolean module¶

Boolean comparator classes.

class datacompy.comparator.boolean.PandasBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in Pandas.

compare(col1: Series, col2: Series, **kwargs: Any) → Series | None¶

Compare columns when either column holds Boolean values.

Boolean comparisons are exact and null-safe. When a Boolean column is compared with another dtype, normal Pandas equality semantics are used; for example, True matches 1 and False matches 0.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal. Two nulls are treated as equal.
None – if the columns are not comparable.

Notes

Detection uses infer_dtype rather than the column dtype because Pandas represents Boolean data as object in common cases: a Boolean column containing a null, and any Boolean column that has been through an outer merge (which upcasts bool to object).

class datacompy.comparator.boolean.PolarsBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in Polars.

compare(col1: Series, col2: Series, **kwargs: Any) → Series | None¶

Compare columns when either Polars dtype is Boolean.

Boolean comparisons are exact and null-safe. When a Boolean column is compared with another dtype, normal Polars equality semantics are used; for example, True matches 1 and False matches 0.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal. Two nulls are treated as equal.
None – if the columns are not comparable.

Notes

eq_missing is used rather than == because Polars propagates nulls through ==; eq_missing treats two nulls as equal and a null against a value as unequal.

Boolean against text is declined, matching the Spark comparator. Polars parses the string when comparing the two, so True would match 'true' but not 'True', the form str(True) produces. That casing rule is arbitrary enough to be a trap, and neither Pandas nor Spark reports a match here, so the pair is left to fall through.
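
The null semantics described in the first note can be sketched in plain Python, with None standing in for a null. Both helper names are hypothetical and are not Polars functions; the sketch only contrasts null-propagating equality with the null-safe behaviour attributed to eq_missing above.

```python
from typing import List, Optional


def eq_propagating(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Equality that propagates nulls: any null operand gives a null result."""
    if a is None or b is None:
        return None
    return a == b


def eq_missing_sketch(a: Optional[bool], b: Optional[bool]) -> bool:
    """Null-safe equality: two nulls match, a null against a value does not."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


left: List[Optional[bool]] = [True, False, None, None]
right: List[Optional[bool]] = [True, True, None, False]

propagating = [eq_propagating(a, b) for a, b in zip(left, right)]
null_safe = [eq_missing_sketch(a, b) for a, b in zip(left, right)]
```

With propagating equality the last two rows come back as nulls, so they cannot be counted as matches or mismatches; the null-safe variant always yields a definite boolean.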

class datacompy.comparator.boolean.SnowflakeBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, **kwargs: Any) → DataFrame | None¶

Compare two columns in a Snowflake DataFrame when either is Boolean.

Boolean comparisons are exact and null-safe. A Boolean column may also be compared against a numeric column, in which case True matches exactly 1 and False matches exactly 0.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results. Two nulls are treated as equal.
None – Columns are not comparable if neither is Boolean, or if a Boolean is paired with anything other than a numeric type.

Notes

Both paths are verified against a live Snowflake session. Snowpark’s local testing mode does not fully reproduce them: its eqNullSafe returns True for every row, and it truncates high-precision decimals when a DataFrame is created, so a local run cannot confirm the null or precision semantics asserted in the tests.

Snowflake implicitly converts between BOOLEAN and NUMBER, and the direction of that conversion would decide whether 2 matches True: converting the Boolean gives 1 = 2 (no match, which is what Pandas, Polars, and Spark report), while converting the number gives TRUE = TRUE (a match). Rather than depend on which way Snowflake goes, _boolean_equals_numeric() compares each side against a literal of its own type, pinning the semantics to the 1/0 rule the other three backends use.

Boolean against a non-numeric, non-Boolean type is declined so the pair falls through to the rest of the pipeline.
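
The 1/0 rule from the second note can be sketched in plain Python. This is not the body of _boolean_equals_numeric(), which is not shown on this page; the function name below is hypothetical, and applying the null-safe rule to the mixed Boolean/numeric path is an assumption.

```python
from decimal import Decimal
from typing import Optional, Union

Number = Union[int, float, Decimal]


def boolean_equals_numeric_sketch(
    flag: Optional[bool], number: Optional[Number]
) -> bool:
    """True matches exactly 1 and False matches exactly 0; nothing else matches."""
    # Assumed null handling: two nulls match, a null against a value does not.
    if flag is None or number is None:
        return flag is None and number is None
    # Each side is tested against a literal of its own type, so neither
    # value is ever converted to the other type.
    return (flag is True and number == 1) or (flag is False and number == 0)


matches_one = boolean_equals_numeric_sketch(True, 1)
matches_two = boolean_equals_numeric_sketch(True, 2)
matches_zero = boolean_equals_numeric_sketch(False, 0)
near_one = boolean_equals_numeric_sketch(True, Decimal("1.0000000000000000001"))
```

Because the number is only ever compared with the literals 1 and 0, a value such as 2 or a decimal slightly above 1 cannot match True, whichever way a backend would otherwise convert between the two types.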

class datacompy.comparator.boolean.SparkBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str, **kwargs: Any) → Column | None¶

Compare two columns in a PySpark DataFrame when either is Boolean.

Boolean comparisons are exact and null-safe. A Boolean column may also be compared against a numeric column, in which case True matches exactly 1 and False matches exactly 0.

Parameters:

dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal. Two nulls are treated as equal.
None – Columns are not comparable if neither is Boolean, or if a Boolean is paired with anything other than a numeric type.

Notes

Unlike the Pandas and Polars comparators, this one claims only Boolean/Boolean and Boolean/numeric pairs. Spark builds the comparison lazily, so an unsupported pair (Boolean against a date, array, or struct) raises AnalysisException when the plan is analysed rather than when the Column is constructed, which is too late for this method to catch. Unsupported pairs are therefore rejected up front. Boolean against string is also declined, because Spark would implicitly cast the string to a Boolean and report True == 'yes' as a match, which the other backends do not.

The Boolean/numeric case does not rely on Spark’s implicit coercion, which is only available with spark.sql.ansi.enabled=false. Under ANSI mode, the default from Spark 4, comparing a Boolean against a numeric raises DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES. See _boolean_equals_numeric() for the comparison used instead, which behaves identically under both settings and preserves the numeric column’s precision.
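
Several notes in this module describe a comparator declining a pair (returning None) so that the pair falls through to the rest of the pipeline. A minimal sketch of that pattern follows, assuming an ordered list of comparators tried in turn. All names here are hypothetical, plain lists stand in for columns, and this is not datacompy's actual dispatch code.

```python
from typing import Callable, List, Optional, Sequence

ComparatorFn = Callable[[Sequence, Sequence], Optional[List[bool]]]


def boolean_only(col1: Sequence, col2: Sequence) -> Optional[List[bool]]:
    """Claim the pair only when every value is a bool; otherwise decline."""
    if not all(isinstance(v, bool) for v in [*col1, *col2]):
        return None
    return [a == b for a, b in zip(col1, col2)]


def string_only(col1: Sequence, col2: Sequence) -> Optional[List[bool]]:
    """Claim the pair only when every value is a string; otherwise decline."""
    if not all(isinstance(v, str) for v in [*col1, *col2]):
        return None
    return [a == b for a, b in zip(col1, col2)]


def first_claiming_result(
    col1: Sequence, col2: Sequence, comparators: List[ComparatorFn]
) -> Optional[List[bool]]:
    """Return the result of the first comparator that does not decline."""
    for compare in comparators:
        result = compare(col1, col2)
        if result is not None:
            return result
    return None  # no comparator claimed the pair


pipeline = [boolean_only, string_only]
bool_result = first_claiming_result([True, False], [True, True], pipeline)
text_result = first_claiming_result(["a", "b"], ["a", "c"], pipeline)
unclaimed = first_claiming_result([1.5], [1.5], pipeline)
```

Rejecting unsupported pairs up front, as the Spark comparator does, keeps this kind of loop safe when the backend would only report the type error later, at plan analysis.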

datacompy.comparator.numeric module¶

Numeric Comparator Class.

class datacompy.comparator.numeric.PandasNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in Pandas.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None¶

Compare two Pandas Series for approximate equality within specified tolerances rtol and atol.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.

Notes

The comparison uses np.isclose to check for approximate equality.
If the series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.
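
For reference, the test that np.isclose applies to finite values is |a - b| <= atol + rtol * |b|. The sketch below restates it for scalars in plain Python; is_close is a hypothetical helper, and how the comparator itself treats nulls is not covered here.

```python
import math


def is_close(a: float, b: float, rtol: float = 1e-05, atol: float = 1e-08) -> bool:
    """Scalar restatement of the np.isclose test with its default tolerances."""
    # NaN never compares close to anything (equal_nan defaults to False).
    if math.isnan(a) or math.isnan(b):
        return False
    # Infinities are close only when they are exactly equal.
    if math.isinf(a) or math.isinf(b):
        return a == b
    # The tolerance is scaled by the second argument only, so the test
    # is not symmetric in a and b.
    return abs(a - b) <= atol + rtol * abs(b)


within_rtol = is_close(1.0, 1.000001)
outside_rtol = is_close(1.0, 1.001)
within_atol = is_close(0.0, 1e-9)
```

The relative term dominates for large values and the absolute term for values near zero, which is why both tolerances are exposed.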

class datacompy.comparator.numeric.PolarsNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in Polars.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None¶

Compare two Polars Series for approximate equality within specified tolerances rtol and atol.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.

Notes

The comparison uses np.isclose to check for approximate equality.
If the series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.

class datacompy.comparator.numeric.SnowflakeNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in Snowflake.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, rtol=1e-05, atol=1e-08) → DataFrame | None¶

Compare two columns in a Snowpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – The Snowpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – If the type conditions are not met.

Notes

The comparison uses Snowpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains null values, they are handled explicitly to avoid incorrect comparisons.

class datacompy.comparator.numeric.SparkNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in PySpark.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, rtol=1e-05, atol=1e-08) → Column | None¶

Compare two columns in a PySpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:

dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.

Notes

The comparison uses PySpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains NaN values, they are handled explicitly to avoid incorrect comparisons.

datacompy.comparator.numeric.decimal_comparator()¶

Check equality with decimal(X, Y) types.

Otherwise treated as the string “decimal”.
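
One way to read the description above is as a string-like object that compares equal to any parameterised decimal dtype string, so that a single entry in a list of numeric type names covers every decimal(X, Y). The sketch below illustrates that reading only; DecimalTypeMatcher and NUMERIC_TYPE_NAMES are hypothetical and are not taken from the datacompy source.

```python
class DecimalTypeMatcher(str):
    """Hypothetical str subclass that equals any dtype string starting with 'decimal'."""

    def __eq__(self, other: object) -> bool:
        # Matches "decimal", "decimal(10,2)", "decimal(38, 18)", and so on.
        return isinstance(other, str) and other.startswith("decimal")

    # Overriding __eq__ would otherwise make the class unhashable.
    __hash__ = str.__hash__


# Hypothetical list of numeric dtype names with one decimal entry.
NUMERIC_TYPE_NAMES = ["int", "bigint", "double", DecimalTypeMatcher("decimal")]

decimal_is_numeric = "decimal(10,2)" in NUMERIC_TYPE_NAMES
string_is_numeric = "string" in NUMERIC_TYPE_NAMES
plain_text = str(DecimalTypeMatcher("decimal"))
```

Everywhere except equality checks the object behaves as the plain string "decimal".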

datacompy.comparator.string module¶

String / Dates / Mixed Comparator Class.

class datacompy.comparator.string.PandasStringComparator¶

Bases: BaseComparator

Comparator for string / date / mixed columns in Pandas.

Parameters:

ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None¶

Compare two Pandas Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

pd.Series – A Pandas Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Missing values are handled by treating two nulls as equal.
None – if the columns are not comparable.

Note

Pandas allows mixed-type columns, which is unique among the supported backends; such columns are also handled here.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, attempts to cast both columns to strings for comparison. If this also fails, returns a Series of False values with the same length as col1.

class datacompy.comparator.string.PolarsStringComparator¶

Bases: BaseComparator

Comparator for string / temporal / date columns in Polars.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None¶

Compare two Polars Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

pl.Series – A Polars Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.
None – if the columns are not comparable.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, attempts to cast both columns to strings for comparison. If this also fails, returns a Series of False values with the same length as col1.

class datacompy.comparator.string.SnowflakeStringComparator¶

Bases: BaseComparator

Comparator for string columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, ignore_space: bool = True, ignore_case: bool = True) → DataFrame | None¶

Compare two columns in a Snowflake DataFrame for string equality.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results.
None – If the comparison fails due to incompatible types or other issues, None is returned.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, None is returned.

class datacompy.comparator.string.SparkStringComparator¶

Bases: BaseComparator

Comparator for string / temporal / date columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str, ignore_space: bool = True, ignore_case: bool = True) → Column | None¶

Compare two columns in a PySpark DataFrame for string equality.

Parameters:

dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the column datatypes are not one of the supported string or date combinations.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, None is returned.

datacompy.comparator.string.pandas_compare_string_and_date_columns(col_1: pd.Series[Any], col_2: pd.Series[Any]) → pd.Series[bool]¶

Compare a string column and date column, value-wise.

This tries, in order, to:

- convert the string column to a date column and compare
- retry the conversion with format=mixed
- finally cast both columns to strings and then compare

Parameters:

col_1 (pd.Series) – The first column to look at
col_2 (pd.Series) – The second column

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

pandas.Series
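
The fallback idea (try a date interpretation first, then fall back to comparing as strings) can be sketched with the standard library alone. This is a simplified, row-wise illustration under stated assumptions: the helper name is hypothetical, datetime.fromisoformat stands in for the Pandas date parsing, and the real function works on whole Series rather than value by value.

```python
from datetime import datetime
from typing import Any, List, Sequence


def compare_string_and_date_sketch(
    strings: Sequence[Any], dates: Sequence[datetime]
) -> List[bool]:
    """Compare string values with datetimes, falling back to string equality."""
    results = []
    for text, moment in zip(strings, dates):
        try:
            # First attempt: interpret the string as an ISO 8601 date.
            results.append(datetime.fromisoformat(text) == moment)
        except (TypeError, ValueError):
            # Fallback: compare the string forms of both values.
            results.append(str(text) == str(moment))
    return results


matches = compare_string_and_date_sketch(
    ["2024-01-15", "2024-02-01", "not a date"],
    [datetime(2024, 1, 15), datetime(2024, 2, 2), datetime(2024, 3, 1)],
)
```

The first value parses and matches, the second parses but differs by a day, and the third cannot be parsed, so it is compared as text and does not match.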

datacompy.comparator.string.pandas_normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) → Series¶

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:

column (pd.Series) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

pd.Series

Notes

Will not operate on categorical columns.
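
The effect of the two flags can be sketched on a plain list of values. The helper below is hypothetical and only mirrors the documented behaviour (strip leading and trailing whitespace, convert to upper case); passing non-string values through unchanged is an assumption.

```python
from typing import Any, List, Sequence


def normalize_strings_sketch(
    values: Sequence[Any], ignore_spaces: bool, ignore_case: bool
) -> List[Any]:
    """Strip and/or upper-case string values, leaving other values untouched."""
    normalized = []
    for value in values:
        if isinstance(value, str):
            if ignore_spaces:
                value = value.strip()  # leading and trailing whitespace only
            if ignore_case:
                value = value.upper()
        normalized.append(value)
    return normalized


raw = ["  abc ", "Abc", None]
both = normalize_strings_sketch(raw, ignore_spaces=True, ignore_case=True)
neither = normalize_strings_sketch(raw, ignore_spaces=False, ignore_case=False)
```

After normalization with both flags set, "  abc " and "Abc" become the same value, which is what lets the string comparators treat them as equal.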

datacompy.comparator.string.polars_compare_string_and_date_columns(col_1: Series, col_2: Series) → Series¶

Compare a string column and date column, value-wise.

This tries to convert a string column to a date column and compare that way.

Parameters:

col_1 (pl.Series) – The first column to look at
col_2 (pl.Series) – The second column

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

pl.Series

datacompy.comparator.string.polars_normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) → Series¶

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:

column (pl.Series) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

pl.Series

Notes

Will not operate on categorical columns.

datacompy.comparator.string.snowpark_normalize_string_column(column: Column, ignore_spaces: bool, ignore_case: bool) → Column¶

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:

column (snowflake.snowpark.Column) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

snowflake.snowpark.Column

datacompy.comparator.string.spark_normalize_string_column(column: Column, ignore_spaces: bool, ignore_case: bool) → Column¶

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:

column (pyspark.sql.Column) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

pyspark.sql.Column

datacompy.comparator.utility module¶

Utility and helper functions for data comparison.

datacompy.comparator.utility.get_snowflake_column_dtypes(dataframe: DataFrame, col_1: str, col_2: str) → tuple[str, str]¶

Get the dtypes of two columns.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on
col_1 (str) – The first column to look at
col_2 (str) – The second column

Returns:

Tuple of base and compare datatype

Return type:

tuple[str, str]

datacompy.comparator.utility.get_spark_column_dtypes(dataframe: DataFrame, col_1: str, col_2: str) → tuple[str, str]¶

Get the dtypes of two columns.

Parameters:

dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on
col_1 (str) – The first column to look at
col_2 (str) – The second column

Returns:

Tuple of base and compare datatype

Return type:

tuple[str, str]

Module contents¶

Comparator classes.

class datacompy.comparator.PandasArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in Pandas.

compare(col1: Series, col2: Series) → Series | None¶

Compare two array-like columns for equality.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.

Returns:

pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

class datacompy.comparator.PandasBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in Pandas.

compare(col1: Series, col2: Series, **kwargs: Any) → Series | None¶

Compare columns when either column holds Boolean values.

Boolean comparisons are exact and null-safe. When a Boolean column is compared with another dtype, normal Pandas equality semantics are used; for example, True matches 1 and False matches 0.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal. Two nulls are treated as equal.
None – if the columns are not comparable.

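Notes

Detection uses infer_dtype rather than the column dtype because Pandas represents Boolean data as object in common cases: a Boolean column containing a null, and any Boolean column that has been through an outer merge (which upcasts bool to object).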

class datacompy.comparator.PandasNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in Pandas.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None¶

Compare two Pandas Series for approximate equality within specified tolerances rtol and atol.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.

Notes

The comparison uses np.isclose to check for approximate equality.
If the series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.

class datacompy.comparator.PandasStringComparator¶

Bases: BaseComparator

Comparator for string / date / mixed columns in Pandas.

Parameters:

ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None¶

Compare two Pandas Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:

col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

pd.Series – A Pandas Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Missing values are handled by treating two nulls as equal.
None – if the columns are not comparable.

Note

Pandas allows mixed-type columns, which is unique among the supported backends; such columns are also handled here.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, attempts to cast both columns to strings for comparison. If this also fails, returns a Series of False values with the same length as col1.

class datacompy.comparator.PolarsArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in Polars.

compare(col1: Series, col2: Series) → Series | None¶

Compare two array-like columns for equality.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.

Returns:

pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

class datacompy.comparator.PolarsBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in Polars.

compare(col1: Series, col2: Series, **kwargs: Any) → Series | None¶

Compare columns when either Polars dtype is Boolean.

Boolean comparisons are exact and null-safe. When a Boolean column is compared with another dtype, normal Polars equality semantics are used; for example, True matches 1 and False matches 0.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal. Two nulls are treated as equal.
None – if the columns are not comparable.

Notes

eq_missing is used rather than == because Polars propagates nulls through ==; eq_missing treats two nulls as equal and a null against a value as unequal.

class datacompy.comparator.PolarsNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in Polars.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None¶

Compare two Polars Series for approximate equality within specified tolerances rtol and atol.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.

Notes

The comparison uses np.isclose to check for approximate equality.
If the series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.

class datacompy.comparator.PolarsStringComparator¶

Bases: BaseComparator

Comparator for string / temporal / date columns in Polars.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None¶

Compare two Polars Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:

col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

pl.Series – A Polars Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.
None – if the columns are not comparable.

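Raises:

Exception – If the comparison fails due to incompatible types or other issues, attempts to cast both columns to strings for comparison. If this also fails, returns a Series of False values with the same length as col1.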

class datacompy.comparator.SnowflakeArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str) → DataFrame | None¶

Compare two array-like columns for equality.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column
col_match (str) – The matching column denoting if the compare was a match or not

Returns:

snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

class datacompy.comparator.SnowflakeBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, **kwargs: Any) → DataFrame | None¶

Compare two columns in a Snowflake DataFrame when either is Boolean.

Boolean comparisons are exact and null-safe. A Boolean column may also be compared against a numeric column, in which case True matches exactly 1 and False matches exactly 0.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results. Two nulls are treated as equal.
None – If neither column is Boolean, or if a Boolean column is paired with a column that is neither Boolean nor numeric.

Notes

Boolean against a non-numeric, non-Boolean type is declined so the pair falls through to the rest of the pipeline.
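
The rules above can be summarized in a small plain-Python sketch. It models the documented semantics only and does not use Snowpark: the names `comparable` and `boolean_match`, and the type labels "boolean", "numeric" and "string", are hypothetical.

```python
from typing import Optional, Union

Value = Optional[Union[bool, int, float]]


def comparable(kind1: str, kind2: str) -> bool:
    """Return True when this comparator would accept the pair of column types."""
    kinds = {kind1, kind2}
    # At least one side must be Boolean, and the other Boolean or numeric.
    return "boolean" in kinds and kinds <= {"boolean", "numeric"}


def boolean_match(a: Value, b: Value) -> bool:
    # Null-safe: two nulls match, a null never matches a non-null value.
    if a is None or b is None:
        return a is None and b is None
    # Exact comparison with no tolerance: True equals 1 and False equals 0.
    return a == b


print(comparable("boolean", "numeric"))  # accepted
print(comparable("boolean", "string"))   # declined, left to other comparators
print(boolean_match(True, 1))            # exact match
print(boolean_match(True, 1.0000001))    # no tolerance is applied
```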

class datacompy.comparator.SnowflakeNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in Snowflake.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, rtol=1e-05, atol=1e-08) → DataFrame | None¶

Compare two columns in a Snowpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – The Snowpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – If the type conditions are not met.

Notes

The comparison uses Snowpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains null values, they are handled explicitly to avoid incorrect comparisons.
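
To make the interaction between null handling and tolerances concrete, here is a minimal row-by-row sketch in plain Python. It is not Snowpark code, and the function name `numeric_match` is hypothetical. It assumes the tolerance test takes the form `|a - b| <= atol + rtol * |b|`, which this section does not state for Snowflake.

```python
from typing import List, Optional


def numeric_match(col1: List[Optional[float]], col2: List[Optional[float]],
                  rtol: float = 1e-05, atol: float = 1e-08) -> List[bool]:
    result = []
    for a, b in zip(col1, col2):
        if a is None or b is None:
            # Null-safe equality: only a pair of nulls counts as a match.
            result.append(a is None and b is None)
        else:
            # Assumed tolerance test (see the lead-in).
            result.append(abs(a - b) <= atol + rtol * abs(b))
    return result


matches = numeric_match([1.0, None, None, 100.0], [1.000001, None, 2.0, 100.1])
print(matches)  # [True, True, False, False]
```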

class datacompy.comparator.SnowflakeStringComparator¶

Bases: BaseComparator

Comparator for string columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, ignore_space: bool = True, ignore_case: bool = True) → DataFrame | None¶

Compare two columns in a Snowflake DataFrame for string equality.

Parameters:

dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results.
None – If the comparison fails due to incompatible types or other issues.

Raises:

Exception – Comparison failures due to incompatible types or other issues are not propagated; None is returned instead.

class datacompy.comparator.SparkArrayLikeComparator¶

Bases: BaseComparator

Comparator for array-like columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str) → Column | None¶

Compare two array like columns for equality.

Parameters:

dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column

Returns:

pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.

class datacompy.comparator.SparkBooleanComparator¶

Bases: BaseComparator

Comparator for Boolean columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str, **kwargs: Any) → Column | None¶

Compare two columns in a PySpark DataFrame when either is Boolean.

Boolean comparisons are exact and null-safe. A Boolean column may also be compared against a numeric column, in which case True matches exactly 1 and False matches exactly 0.

Parameters:

dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
**kwargs (Any) – Unused; accepted so this comparator matches the pipeline signature.

Returns:

pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal. Two nulls are treated as equal.
None – If neither column is Boolean, or if a Boolean column is paired with a column that is neither Boolean nor numeric.

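Notes

Boolean against a non-numeric, non-Boolean type is declined so the pair falls through to the rest of the pipeline.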

class datacompy.comparator.SparkNumericComparator¶

Bases: BaseComparator

Comparator for numeric columns in PySpark.

Parameters:

rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, rtol=1e-05, atol=1e-08) → Column | None¶

Compare two columns in a PySpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:

dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.

Notes

The comparison uses PySpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains NaN values, they are handled explicitly to avoid incorrect comparisons.

class datacompy.comparator.SparkStringComparator¶

Bases: BaseComparator

Comparator for string / temporal / date columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str, ignore_space: bool = True, ignore_case: bool = True) → Column | None¶

Compare two columns in a PySpark DataFrame for string equality.

Parameters:

dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – If the columns' data types are not a supported combination of string and date types.

Raises:

Exception – Comparison failures due to incompatible types or other issues are not propagated; None is returned instead.