datacompy.comparator package

Submodules

datacompy.comparator.array module

Array Like Comparator Class.

class datacompy.comparator.array.PandasArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in Pandas.

compare(col1: Series, col2: Series) Series | None

Compare two array like columns for equality.

Parameters:
  • col1 (pd.Series) – The first Pandas Series to compare.

  • col2 (pd.Series) – The second Pandas Series to compare.

Returns:

  • pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

class datacompy.comparator.array.PolarsArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in Polars.

compare(col1: Series, col2: Series) Series | None

Compare two array like columns for equality.

Parameters:
  • col1 (pl.Series) – The first Polars Series to compare.

  • col2 (pl.Series) – The second Polars Series to compare.

Returns:

  • pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

class datacompy.comparator.array.SnowflakeArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str) DataFrame | None

Compare two array like columns for equality.

Parameters:
  • dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on

  • col1 (str) – The first column to look at

  • col2 (str) – The second column

  • col_match (str) – The matching column denoting if the compare was a match or not

Returns:

  • snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

class datacompy.comparator.array.SparkArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str) Column | None

Compare two array like columns for equality.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on

  • col1 (str) – The first column to look at

  • col2 (str) – The second column

Returns:

  • pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

datacompy.comparator.base module

Base Comparator Class.

class datacompy.comparator.base.BaseComparator

Bases: ABC

Base class for all comparators.

This class serves as an abstract base class for implementing specific comparator logic in derived classes.

abstractmethod compare(col1: Any, col2: Any, **kwargs) Any

Check if two columns are equal.

This method should be implemented in derived classes to provide specific comparison logic.

Parameters:
  • col1 (Any) – The first column to compare.

  • col2 (Any) – The second column to compare.

  • **kwargs (Any) – Additional keyword arguments for comparison.

Returns:

Comparison result. (implementation-specific)

Return type:

Any
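
As a rough illustration of the pattern described above, the sketch below defines a stand-in abstract base with the same compare signature and one concrete subclass. StandInBaseComparator and ExactListComparator are hypothetical names invented for this example; they are not part of datacompy, and plain Python lists stand in for columns.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class StandInBaseComparator(ABC):
    """Hypothetical stand-in mirroring the abstract interface documented above."""

    @abstractmethod
    def compare(self, col1: Any, col2: Any, **kwargs) -> Any:
        """Return an implementation-specific comparison result."""


class ExactListComparator(StandInBaseComparator):
    """Hypothetical subclass: element-wise equality on plain lists."""

    def compare(self, col1: Any, col2: Any, **kwargs) -> Optional[List[bool]]:
        # Follow the documented convention of returning None when the
        # inputs are not comparable (here: not lists, or different lengths).
        if not (isinstance(col1, list) and isinstance(col2, list)):
            return None
        if len(col1) != len(col2):
            return None
        return [a == b for a, b in zip(col1, col2)]


result = ExactListComparator().compare([1, 2, 3], [1, 5, 3])
not_comparable = ExactListComparator().compare([1, 2], "ab")
```

Because compare is abstract, instantiating the stand-in base class directly raises TypeError; only subclasses that implement compare can be created.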

datacompy.comparator.numeric module

Numeric Comparator Class.

class datacompy.comparator.numeric.PandasNumericComparator

Bases: BaseComparator

Comparator for numeric columns in Pandas.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) Series | None

Compare two Pandas Series for approximate equality within specified tolerances rtol and atol.

Parameters:
  • col1 (pd.Series) – The first Pandas Series to compare.

  • col2 (pd.Series) – The second Pandas Series to compare.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – if the columns are not comparable.

Notes

  • The comparison uses np.isclose to check for approximate equality.

  • If the Series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.

  • If the Series shapes do not match and neither type is numeric, None is returned.
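
The tolerance rule behind np.isclose can be written out by hand. NumPy documents the check as |a - b| <= atol + rtol * |b|; the sketch below applies it to plain Python floats. is_close_sketch is a hypothetical helper for illustration only, not datacompy's implementation, and it ignores NaN and infinity handling.

```python
def is_close_sketch(a: float, b: float, rtol: float = 1e-05, atol: float = 1e-08) -> bool:
    """Hypothetical helper: the documented np.isclose rule for finite floats."""
    # The allowed difference scales with the magnitude of the second value, b.
    return abs(a - b) <= atol + rtol * abs(b)


col1 = [1.0, 100.0, 0.0]
col2 = [1.000001, 100.1, 1e-09]
matches = [is_close_sketch(a, b) for a, b in zip(col1, col2)]
```

Note that the rule is not symmetric in a and b, and that near zero only atol matters, so very small values may need a smaller atol than the default.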

class datacompy.comparator.numeric.PolarsNumericComparator

Bases: BaseComparator

Comparator for numeric columns in Polars.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) Series | None

Compare two Polars Series for approximate equality within specified tolerances rtol and atol.

Parameters:
  • col1 (pl.Series) – The first Polars Series to compare.

  • col2 (pl.Series) – The second Polars Series to compare.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – if the columns are not comparable.

Notes

  • The comparison uses np.isclose to check for approximate equality.

  • If the Series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.

  • If the Series shapes do not match and neither type is numeric, None is returned.

class datacompy.comparator.numeric.SnowflakeNumericComparator

Bases: BaseComparator

Comparator for numeric columns in Snowflake.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, rtol=1e-05, atol=1e-08) DataFrame | None

Compare two columns in a Snowpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:
  • dataframe (snowflake.snowpark.DataFrame) – The Snowpark DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • col_match (str) – The name of the output column that will store the comparison results.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – If the type conditions are not met.

Notes

  • The comparison uses Snowpark SQL functions to check for approximate equality.

  • Null-safe equality (eqNullSafe) is used to handle null values.

  • If either column contains null values, they are handled explicitly to avoid incorrect comparisons.

class datacompy.comparator.numeric.SparkNumericComparator

Bases: BaseComparator

Comparator for numeric columns in PySpark.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, rtol=1e-05, atol=1e-08) Column | None

Compare two columns in a PySpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – if the columns are not comparable.

Notes

  • The comparison uses PySpark SQL functions to check for approximate equality.

  • Null-safe equality (eqNullSafe) is used to handle null values.

  • If either column contains NaN values, they are handled explicitly to avoid incorrect comparisons.
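
Null-safe equality differs from ordinary SQL equality in how missing values are treated: two nulls compare as equal, and a null never equals a non-null value. The sketch below shows those semantics on plain Python values, with None standing in for null. It also treats two NaN values as a match, which is an assumption made for this example rather than a confirmed detail of the comparator. null_safe_equal is a hypothetical name.

```python
import math
from typing import Optional


def null_safe_equal(a: Optional[float], b: Optional[float]) -> bool:
    """Hypothetical helper showing null-safe equality semantics."""
    # Two nulls are considered equal; a null never equals a value.
    if a is None or b is None:
        return a is None and b is None
    # Assumption for this sketch: two NaN values count as a match.
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b


pairs = [(1.0, 1.0), (None, None), (None, 2.0), (float("nan"), float("nan"))]
results = [null_safe_equal(a, b) for a, b in pairs]
```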

datacompy.comparator.numeric.decimal_comparator()

Check equality with decimal(X, Y) types.

Otherwise treated as the string “decimal”.
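
One way to read this description: engines such as Spark report decimal columns with type strings like decimal(10,2), so a plain string comparison against "decimal" would miss them. The sketch below is an assumption about how such a matcher could work, not the library's code; DecimalMatcher and numeric_types are hypothetical names.

```python
class DecimalMatcher(str):
    """Hypothetical str subclass that equals any 'decimal...' type string."""

    def __eq__(self, other: object) -> bool:
        # Match "decimal", "decimal(10,2)", "decimal(38, 18)", and so on.
        return isinstance(other, str) and other.startswith("decimal")

    # Defining __eq__ clears the inherited hash, so restore it explicitly.
    __hash__ = str.__hash__


numeric_types = ["int", "double", DecimalMatcher("decimal")]
is_numeric = "decimal(10,2)" in numeric_types
is_string_numeric = "string" in numeric_types
```

Placing the matcher in a list of type names lets an ordinary membership test accept every decimal precision and scale without enumerating them.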

datacompy.comparator.string module

String / Dates / Mixed Comparator Class.

class datacompy.comparator.string.PandasStringComparator

Bases: BaseComparator

Comparator for string / date / mixed columns in Pandas.

Parameters:
  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) Series | None

Compare two Pandas Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:
  • col1 (pd.Series) – The first Pandas Series to compare.

  • col2 (pd.Series) – The second Pandas Series to compare.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • pd.Series | None – A Pandas Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.

  • None – if the columns are not comparable.

Note

Pandas dataframes allow for mixed typing which is unique and is also handled here.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, the method attempts to cast both columns to strings and compares those. If that also fails, it returns a Series of False values with the same length as col1.

class datacompy.comparator.string.PolarsStringComparator

Bases: BaseComparator

Comparator for string / temporal / date columns in Polars.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) Series | None

Compare two Polars Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:
  • col1 (pl.Series) – The first Polars Series to compare.

  • col2 (pl.Series) – The second Polars Series to compare.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • pl.Series – A Polars Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.

  • None – if the columns are not comparable.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, the method attempts to cast both columns to strings and compares those. If that also fails, it returns a Series of False values with the same length as col1.

class datacompy.comparator.string.SnowflakeStringComparator

Bases: BaseComparator

Comparator for string columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, ignore_space: bool = True, ignore_case: bool = True) DataFrame | None

Compare two columns in a Snowflake DataFrame for string equality.

Parameters:
  • dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • col_match (str) – The name of the output column that will store the comparison results.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results.

  • None – if the comparison fails due to incompatible types or other issues.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, None is returned.

class datacompy.comparator.string.SparkStringComparator

Bases: BaseComparator

Comparator for string / temporal / date columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str, ignore_space: bool = True, ignore_case: bool = True) Column | None

Compare two columns in a PySpark DataFrame for string equality.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal.

  • None – if the columns' datatypes are not one of the supported string or date combinations.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, None is returned.

datacompy.comparator.string.pandas_compare_string_and_date_columns(col_1: pd.Series[Any], col_2: pd.Series[Any]) pd.Series[bool]

Compare a string column and date column, value-wise.

This tries, in order, to:
  • convert the string column to a date column and compare

  • retry the conversion with format=mixed

  • finally, cast both columns to strings and compare

Parameters:
  • col_1 (pd.Series) – The first column to look at

  • col_2 (pd.Series) – The second column

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

pandas.Series
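
The fallback order described above can be sketched without pandas. The example below parses each string with datetime.fromisoformat and falls back to comparing string forms when parsing fails. It illustrates the try-then-fall-back idea only: compare_string_and_date_sketch is a hypothetical helper, not the library function, which works on whole pandas Series and also tries format=mixed.

```python
from datetime import datetime
from typing import List


def compare_string_and_date_sketch(strings: List[str], dates: List[datetime]) -> List[bool]:
    """Hypothetical helper: compare string values to datetimes, value-wise."""
    results = []
    for text, date in zip(strings, dates):
        try:
            # First attempt: parse the string as a date and compare dates.
            results.append(datetime.fromisoformat(text) == date)
        except ValueError:
            # Last resort: compare the string forms.
            results.append(text == str(date))
    return results


strings = ["2024-01-15", "2024-02-01", "not a date"]
dates = [datetime(2024, 1, 15), datetime(2024, 2, 2), datetime(2024, 3, 1)]
matches = compare_string_and_date_sketch(strings, dates)
```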

datacompy.comparator.string.pandas_normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) Series

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:
  • column (pd.Series) – The column to normalize

  • ignore_spaces (bool) – Whether to ignore spaces when normalizing

  • ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

pd.Series

Notes

Will not operate on categorical columns.
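
The effect of the two flags can be shown on plain Python strings. The sketch below is an illustration under the assumption that ignore_spaces strips leading and trailing whitespace and ignore_case upper-cases the value, as the descriptions above state; normalize_strings_sketch is a hypothetical helper, and lists stand in for Series.

```python
from typing import List


def normalize_strings_sketch(values: List[str], ignore_spaces: bool, ignore_case: bool) -> List[str]:
    """Hypothetical helper: strip whitespace and upper-case, as the flags request."""
    out = []
    for value in values:
        if ignore_spaces:
            value = value.strip()  # leading and trailing whitespace only
        if ignore_case:
            value = value.upper()
        out.append(value)
    return out


left = normalize_strings_sketch(["  Alice", "bob "], ignore_spaces=True, ignore_case=True)
right = normalize_strings_sketch(["ALICE", "Bob"], ignore_spaces=True, ignore_case=True)
matches = [a == b for a, b in zip(left, right)]
```

Normalizing both sides before comparing means values that differ only by padding or letter case are reported as equal.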

datacompy.comparator.string.polars_compare_string_and_date_columns(col_1: Series, col_2: Series) Series

Compare a string column and date column, value-wise.

This tries to convert a string column to a date column and compare that way.

Parameters:
  • col_1 (pl.Series) – The first column to look at

  • col_2 (pl.Series) – The second column

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

pl.Series

datacompy.comparator.string.polars_normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) Series

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:
  • column (pl.Series) – The column to normalize

  • ignore_spaces (bool) – Whether to ignore spaces when normalizing

  • ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

pl.Series

Notes

Will not operate on categorical columns.

datacompy.comparator.string.snowpark_normalize_string_column(column: Column, ignore_spaces: bool, ignore_case: bool) Column

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:
  • column (snowflake.snowpark.Column) – The column to normalize

  • ignore_spaces (bool) – Whether to ignore spaces when normalizing

  • ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

snowflake.snowpark.Column

datacompy.comparator.string.spark_normalize_string_column(column: Column, ignore_spaces: bool, ignore_case: bool) Column

Normalize a string column by converting to upper case and stripping whitespace.

Parameters:
  • column (pyspark.sql.Column) – The column to normalize

  • ignore_spaces (bool) – Whether to ignore spaces when normalizing

  • ignore_case (bool) – Whether to ignore case when normalizing

Returns:

The normalized column

Return type:

pyspark.sql.Column

datacompy.comparator.utility module

Utility and helper functions for data comparison.

datacompy.comparator.utility.get_snowflake_column_dtypes(dataframe: DataFrame, col_1: str, col_2: str) tuple[str, str]

Get the dtypes of two columns.

Parameters:
  • dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

Returns:

Tuple of base and compare datatype

Return type:

tuple[str, str]

datacompy.comparator.utility.get_spark_column_dtypes(dataframe: DataFrame, col_1: str, col_2: str) tuple[str, str]

Get the dtypes of two columns.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

Returns:

Tuple of base and compare datatype

Return type:

tuple[str, str]

Module contents

Comparator classes.

class datacompy.comparator.PandasArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in Pandas.

compare(col1: Series, col2: Series) Series | None

Compare two array like columns for equality.

Parameters:
  • col1 (pd.Series) – The first Pandas Series to compare.

  • col2 (pd.Series) – The second Pandas Series to compare.

Returns:

  • pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

class datacompy.comparator.PandasNumericComparator

Bases: BaseComparator

Comparator for numeric columns in Pandas.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) Series | None

Compare two Pandas Series for approximate equality within specified tolerances rtol and atol.

Parameters:
  • col1 (pd.Series) – The first Pandas Series to compare.

  • col2 (pd.Series) – The second Pandas Series to compare.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – if the columns are not comparable.

Notes

  • The comparison uses np.isclose to check for approximate equality.

  • If the Series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.

  • If the Series shapes do not match and neither type is numeric, None is returned.

class datacompy.comparator.PandasStringComparator

Bases: BaseComparator

Comparator for string / date / mixed columns in Pandas.

Parameters:
  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) Series | None

Compare two Pandas Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:
  • col1 (pd.Series) – The first Pandas Series to compare.

  • col2 (pd.Series) – The second Pandas Series to compare.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • pd.Series | None – A Pandas Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.

  • None – if the columns are not comparable.

Note

Pandas dataframes allow for mixed typing which is unique and is also handled here.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, the method attempts to cast both columns to strings and compares those. If that also fails, it returns a Series of False values with the same length as col1.

class datacompy.comparator.PolarsArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in Polars.

compare(col1: Series, col2: Series) Series | None

Compare two array like columns for equality.

Parameters:
  • col1 (pl.Series) – The first Polars Series to compare.

  • col2 (pl.Series) – The second Polars Series to compare.

Returns:

  • pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

class datacompy.comparator.PolarsNumericComparator

Bases: BaseComparator

Comparator for numeric columns in Polars.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) Series | None

Compare two Polars Series for approximate equality within specified tolerances rtol and atol.

Parameters:
  • col1 (pl.Series) – The first Polars Series to compare.

  • col2 (pl.Series) – The second Polars Series to compare.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – if the columns are not comparable.

Notes

  • The comparison uses np.isclose to check for approximate equality.

  • If the Series cannot be compared directly because of a numeric type mismatch, a cast is attempted. If casting fails, a Series of False values is returned.

  • If the Series shapes do not match and neither type is numeric, None is returned.

class datacompy.comparator.PolarsStringComparator

Bases: BaseComparator

Comparator for string / temporal / date columns in Polars.

compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) Series | None

Compare two Polars Series column-wise, taking into account optional normalization for spaces and case sensitivity.

Parameters:
  • col1 (pl.Series) – The first Polars Series to compare.

  • col2 (pl.Series) – The second Polars Series to compare.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • pl.Series – A Polars Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.

  • None – if the columns are not comparable.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, the method attempts to cast both columns to strings and compares those. If that also fails, it returns a Series of False values with the same length as col1.

class datacompy.comparator.SnowflakeArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str) DataFrame | None

Compare two array like columns for equality.

Parameters:
  • dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on

  • col1 (str) – The first column to look at

  • col2 (str) – The second column

  • col_match (str) – The matching column denoting if the compare was a match or not

Returns:

  • snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

class datacompy.comparator.SnowflakeNumericComparator

Bases: BaseComparator

Comparator for numeric columns in Snowflake.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, rtol=1e-05, atol=1e-08) DataFrame | None

Compare two columns in a Snowpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:
  • dataframe (snowflake.snowpark.DataFrame) – The Snowpark DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • col_match (str) – The name of the output column that will store the comparison results.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – If the type conditions are not met.

Notes

  • The comparison uses Snowpark SQL functions to check for approximate equality.

  • Null-safe equality (eqNullSafe) is used to handle null values.

  • If either column contains null values, they are handled explicitly to avoid incorrect comparisons.

class datacompy.comparator.SnowflakeStringComparator

Bases: BaseComparator

Comparator for string columns in Snowflake.

compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, ignore_space: bool = True, ignore_case: bool = True) DataFrame | None

Compare two columns in a Snowflake DataFrame for string equality.

Parameters:
  • dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • col_match (str) – The name of the output column that will store the comparison results.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results.

  • None – if the comparison fails due to incompatible types or other issues.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, None is returned.

class datacompy.comparator.SparkArrayLikeComparator

Bases: BaseComparator

Comparator for array-like columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str) Column | None

Compare two array like columns for equality.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on

  • col1 (str) – The first column to look at

  • col2 (str) – The second column

Returns:

  • pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are equal.

  • None – if the columns are not comparable.

class datacompy.comparator.SparkNumericComparator

Bases: BaseComparator

Comparator for numeric columns in PySpark.

Parameters:
  • rtol (float) – The relative tolerance to use for comparison.

  • atol (float) – The absolute tolerance to use for comparison.

compare(dataframe: DataFrame, col1: str, col2: str, rtol=1e-05, atol=1e-08) Column | None

Compare two columns in a PySpark DataFrame for approximate equality within specified tolerances rtol and atol.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.

  • atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.

Returns:

  • pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.

  • None – if the columns are not comparable.

Notes

  • The comparison uses PySpark SQL functions to check for approximate equality.

  • Null-safe equality (eqNullSafe) is used to handle null values.

  • If either column contains NaN values, they are handled explicitly to avoid incorrect comparisons.

class datacompy.comparator.SparkStringComparator

Bases: BaseComparator

Comparator for string / temporal / date columns in PySpark.

compare(dataframe: DataFrame, col1: str, col2: str, ignore_space: bool = True, ignore_case: bool = True) Column | None

Compare two columns in a PySpark DataFrame for string equality.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.

  • col1 (str) – The name of the first column to compare.

  • col2 (str) – The name of the second column to compare.

  • ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.

  • ignore_case (bool) – Whether to ignore case when comparing strings.

Returns:

  • pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal.

  • None – if the columns' datatypes are not one of the supported string or date combinations.

Raises:

Exception – If the comparison fails due to incompatible types or other issues, None is returned.