datacompy.comparator package
Submodules
datacompy.comparator.array module
Array Like Comparator Class.
- class datacompy.comparator.array.PandasArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in Pandas.
- compare(col1: Series, col2: Series) → Series | None
Compare two array-like columns for equality.
- Parameters:
col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
- Returns:
pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
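As a rough illustration of row-wise equality on array-like cells, the sketch below uses plain Python lists in place of a Pandas Series. The function name `array_like_equal` is hypothetical, and treating a tuple and a list with the same elements as equal is an assumption made for this example, not a documented behaviour of the comparator.

```python
from collections.abc import Sequence
from typing import Any


def array_like_equal(col1: Sequence[Any], col2: Sequence[Any]) -> list[bool]:
    """Row-wise equality for columns whose cells are themselves sequences."""
    if len(col1) != len(col2):
        raise ValueError("columns must have the same number of rows")
    # Assumption for this sketch: convert each cell to a list so that a tuple
    # and a list holding the same elements count as equal.
    return [list(a) == list(b) for a, b in zip(col1, col2)]


left = [[1, 2, 3], [4, 5], []]
right = [(1, 2, 3), [4, 6], []]
result = array_like_equal(left, right)
```

The result holds one boolean per row, which mirrors the shape of the boolean Series described above.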
- class datacompy.comparator.array.PolarsArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in Polars.
- compare(col1: Series, col2: Series) → Series | None
Compare two array-like columns for equality.
- Parameters:
col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
- Returns:
pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
- class datacompy.comparator.array.SnowflakeArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in Snowflake.
- compare(dataframe: DataFrame, col1: str, col2: str, col_match: str) → DataFrame | None
Compare two array-like columns for equality.
- Parameters:
dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column
col_match (str) – The matching column denoting if the compare was a match or not
- Returns:
snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
- class datacompy.comparator.array.SparkArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in PySpark.
- compare(dataframe: DataFrame, col1: str, col2: str) → Column | None
Compare two array-like columns for equality.
- Parameters:
dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column
- Returns:
pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
datacompy.comparator.base module
Base Comparator Class.
- class datacompy.comparator.base.BaseComparator
Bases: ABC
Base class for all comparators.
This class serves as an abstract base class for implementing specific comparator logic in derived classes.
- abstractmethod compare(col1: Any, col2: Any, **kwargs) → Any
Check if two columns are equal.
This method should be implemented in derived classes to provide specific comparison logic.
- Parameters:
col1 (Any) – The first column to compare.
col2 (Any) – The second column to compare.
**kwargs (Any) – Additional keyword arguments for comparison.
- Returns:
Comparison result (implementation-specific).
- Return type:
Any
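The subclassing pattern can be sketched as follows. To keep the example self-contained it does not import datacompy: `BaseComparatorSketch` is a stand-in that mirrors the documented `compare(col1, col2, **kwargs)` signature, and `ExactListComparator` is a hypothetical subclass operating on plain Python lists.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseComparatorSketch(ABC):
    """Stand-in mirroring the documented BaseComparator interface."""

    @abstractmethod
    def compare(self, col1: Any, col2: Any, **kwargs: Any) -> Any:
        """Check if two columns are equal."""


class ExactListComparator(BaseComparatorSketch):
    """Hypothetical comparator for plain Python lists."""

    def compare(self, col1: Any, col2: Any, **kwargs: Any) -> Any:
        # Follow the convention of the concrete comparators documented here:
        # None means "not comparable", otherwise one boolean per row.
        if not (isinstance(col1, list) and isinstance(col2, list)):
            return None
        if len(col1) != len(col2):
            return None
        return [a == b for a, b in zip(col1, col2)]


comparator = ExactListComparator()
matches = comparator.compare([1, 2, 3], [1, 0, 3])
not_comparable = comparator.compare([1, 2], "ab")
```

Because `compare` is abstract, the stand-in base class cannot be instantiated directly; only subclasses that implement `compare` can.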
datacompy.comparator.numeric module
Numeric Comparator Class.
- class datacompy.comparator.numeric.PandasNumericComparator
Bases: BaseComparator
Comparator for numeric columns in Pandas.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None
Compare two Pandas Series for approximate equality within specified tolerances rtol and atol.
- Parameters:
col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.
Notes
The comparison uses np.isclose to check for approximate equality.
If the series cannot be directly compared due to numeric type mismatches, a cast is attempted; if casting fails, a series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.
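The tolerance rule behind np.isclose can be illustrated without NumPy. The following is a minimal sketch rather than the comparator's implementation: `is_close` and `compare_numeric` are hypothetical names, plain lists stand in for Series, and NaN is treated as unequal, which is the np.isclose default (whether the comparator passes `equal_nan=True` is not stated here).

```python
import math


def is_close(a: float, b: float, rtol: float = 1e-05, atol: float = 1e-08) -> bool:
    """Scalar form of the np.isclose rule: |a - b| <= atol + rtol * |b|."""
    if math.isnan(a) or math.isnan(b):
        # np.isclose treats NaN as unequal unless equal_nan=True is passed.
        return False
    if math.isinf(a) or math.isinf(b):
        # Infinities only match an infinity of the same sign.
        return a == b
    return abs(a - b) <= atol + rtol * abs(b)


def compare_numeric(col1, col2, rtol=1e-05, atol=1e-08):
    """Apply the rule element-wise and return one boolean per row."""
    return [is_close(a, b, rtol, atol) for a, b in zip(col1, col2)]


default_result = compare_numeric([1.0, 100.0, 0.0], [1.000001, 100.1, 1e-9])
loose_result = compare_numeric([100.0], [100.1], rtol=0.0, atol=0.2)
```

Note that the rule is asymmetric: the relative tolerance is scaled by the second value only, so swapping the columns can change results near the threshold.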
- class datacompy.comparator.numeric.PolarsNumericComparator
Bases: BaseComparator
Comparator for numeric columns in Polars.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None
Compare two Polars Series for approximate equality within specified tolerances rtol and atol.
- Parameters:
col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.
Notes
The comparison uses np.isclose to check for approximate equality.
If the series cannot be directly compared due to numeric type mismatches, a cast is attempted; if casting fails, a series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.
- class datacompy.comparator.numeric.SnowflakeNumericComparator
Bases: BaseComparator
Comparator for numeric columns in Snowflake.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, rtol=1e-05, atol=1e-08) → DataFrame | None
Compare two columns in a Snowpark DataFrame for approximate equality within specified tolerances rtol and atol.
- Parameters:
dataframe (snowflake.snowpark.DataFrame) – The Snowpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – If the type conditions are not met.
Notes
The comparison uses Snowpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains null values, they are handled explicitly to avoid incorrect comparisons.
- class datacompy.comparator.numeric.SparkNumericComparator
Bases: BaseComparator
Comparator for numeric columns in PySpark.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(dataframe: DataFrame, col1: str, col2: str, rtol=1e-05, atol=1e-08) → Column | None
Compare two columns in a PySpark DataFrame for approximate equality within specified tolerances rtol and atol.
- Parameters:
dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.
Notes
The comparison uses PySpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains NaN values, they are handled explicitly to avoid incorrect comparisons.
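The difference between null-safe equality and ordinary SQL equality can be shown with a small plain-Python sketch, with `None` standing in for a SQL null. The helper names `eq_null_safe` and `eq_plain` are hypothetical; the semantics follow the standard behaviour of Spark's `eqNullSafe`, and NaN handling is deliberately left out because the notes above do not specify it.

```python
from typing import Optional


def eq_null_safe(a: Optional[float], b: Optional[float]) -> bool:
    """Null-safe equality: two nulls match, a null never matches a value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def eq_plain(a: Optional[float], b: Optional[float]) -> Optional[bool]:
    """SQL-style `=`: any null operand yields null (None here)."""
    if a is None or b is None:
        return None
    return a == b


pairs = [(1.0, 1.0), (None, None), (1.0, None)]
null_safe = [eq_null_safe(a, b) for a, b in pairs]
plain = [eq_plain(a, b) for a, b in pairs]
```

With null-safe equality every row yields a definite True or False, so the resulting match column contains no nulls.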
- datacompy.comparator.numeric.decimal_comparator()
Check equality with decimal(X, Y) types.
Otherwise treated as the string “decimal”.
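One way such a helper could work is a `str` subclass whose equality check accepts any type name beginning with "decimal". This is a sketch under that assumption, not the library's code; the class name `DecimalTypeName` and the `numeric_types` list are hypothetical.

```python
class DecimalTypeName(str):
    """Hypothetical str subclass that equals any 'decimal...' type name."""

    def __eq__(self, other: object) -> bool:
        return isinstance(other, str) and other.startswith("decimal")

    def __ne__(self, other: object) -> bool:
        return not self.__eq__(other)

    # Defining __eq__ removes the inherited hash, so restore it explicitly.
    __hash__ = str.__hash__


decimal_type = DecimalTypeName("decimal")
numeric_types = ["int", "bigint", "double", decimal_type]

# Membership tests use ==, so a parameterised decimal type is recognised.
is_decimal_numeric = "decimal(10,2)" in numeric_types
is_string_numeric = "string" in numeric_types
```

Apart from the equality check, the object still behaves as the string "decimal", for example when printed or concatenated.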
datacompy.comparator.string module
String / Dates / Mixed Comparator Class.
- class datacompy.comparator.string.PandasStringComparator
Bases: BaseComparator
Comparator for string / date / mixed columns in Pandas.
- Parameters:
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None
Compare two Pandas Series column-wise, taking into account optional normalization for spaces and case sensitivity.
- Parameters:
col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
pd.Series – A Pandas Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.
None – if the columns are not comparable.
Note
Pandas dataframes allow for mixed typing which is unique and is also handled here.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, both columns are cast to strings and compared again. If this also fails, a Series of False values with the same length as col1 is returned.
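The effect of `ignore_space` and `ignore_case`, together with nulls being treated as equal, can be sketched in plain Python. The helper names `normalize` and `compare_strings` are hypothetical, lists stand in for Series, and upper-casing is assumed as the case normalization because the normalize helpers documented in this module describe converting to upper case.

```python
from typing import Optional


def normalize(value: Optional[str], ignore_space: bool, ignore_case: bool) -> Optional[str]:
    """Strip surrounding whitespace and/or upper-case a single value."""
    if value is None:
        return None
    if ignore_space:
        value = value.strip()
    if ignore_case:
        value = value.upper()
    return value


def compare_strings(col1, col2, ignore_space=True, ignore_case=True):
    """Compare two columns row by row after optional normalization."""
    out = []
    for a, b in zip(col1, col2):
        a = normalize(a, ignore_space, ignore_case)
        b = normalize(b, ignore_space, ignore_case)
        # None == None is True, so two missing values count as a match.
        out.append(a == b)
    return out


left = ["  Apple", "pear", None, "kiwi"]
right = ["apple ", "PEAR", None, None]
relaxed = compare_strings(left, right)
strict = compare_strings(left, right, ignore_space=False, ignore_case=False)
```

With both flags enabled the first two rows match; with both disabled only the pair of missing values does.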
- class datacompy.comparator.string.PolarsStringComparator
Bases: BaseComparator
Comparator for string / temporal / date columns in Polars.
- compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None
Compare two Polars Series column-wise, taking into account optional normalization for spaces and case sensitivity.
- Parameters:
col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
pl.Series – A Polars Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.
None – if the columns are not comparable.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, both columns are cast to strings and compared again. If this also fails, a Series of False values with the same length as col1 is returned.
- class datacompy.comparator.string.SnowflakeStringComparator
Bases: BaseComparator
Comparator for string columns in Snowflake.
- compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, ignore_space: bool = True, ignore_case: bool = True) → DataFrame | None
Compare two columns in a Snowflake DataFrame for string equality.
- Parameters:
dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results.
None – if the comparison fails due to incompatible types or other issues.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, None is returned.
- class datacompy.comparator.string.SparkStringComparator
Bases: BaseComparator
Comparator for string / temporal / date columns in PySpark.
- compare(dataframe: DataFrame, col1: str, col2: str, ignore_space: bool = True, ignore_case: bool = True) → Column | None
Compare two columns in a PySpark DataFrame for string equality.
- Parameters:
dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable, that is, their datatypes do not form one of the supported string or date combinations.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, None is returned.
- datacompy.comparator.string.pandas_compare_string_and_date_columns(col_1: pd.Series[Any], col_2: pd.Series[Any]) → pd.Series[bool]
Compare a string column and date column, value-wise.
This tries, in order, to:
- convert the string column to a date column and compare
- retry the conversion with format=mixed
- finally cast as strings and then compare
- Parameters:
col_1 (pd.Series) – The first column to look at
col_2 (pd.Series) – The second column
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
pandas.Series
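The convert-then-fall-back idea can be sketched with the standard library alone. This is an illustration, not the library's code: the function name `compare_string_and_date` is hypothetical, plain lists stand in for Series, a single fixed date format is assumed for parsing, and the intermediate format=mixed attempt is omitted.

```python
from datetime import date, datetime


def compare_string_and_date(strings: list[str], dates: list[date]) -> list[bool]:
    """Compare a column of strings with a column of dates, row by row."""
    try:
        # First attempt: convert the whole string column to dates.
        parsed = [datetime.strptime(s, "%Y-%m-%d").date() for s in strings]
    except ValueError:
        # Fallback: cast the dates to strings and compare as text.
        return [s == d.isoformat() for s, d in zip(strings, dates)]
    return [p == d for p, d in zip(parsed, dates)]


clean = compare_string_and_date(
    ["2024-01-05", "2024-01-06"],
    [date(2024, 1, 5), date(2024, 1, 7)],
)
messy = compare_string_and_date(
    ["2024-01-05", "n/a"],
    [date(2024, 1, 5), date(2024, 1, 6)],
)
```

In the second call one value cannot be parsed, so the whole column falls back to string comparison; the row that happens to be a valid ISO date still matches.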
- datacompy.comparator.string.pandas_normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) → Series
Normalize a string column by converting to upper case and stripping whitespace.
- Parameters:
column (pd.Series) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing
- Returns:
The normalized column
- Return type:
pd.Series
Notes
Will not operate on categorical columns.
- datacompy.comparator.string.polars_compare_string_and_date_columns(col_1: Series, col_2: Series) → Series
Compare a string column and date column, value-wise.
This tries to convert a string column to a date column and compare that way.
- Parameters:
col_1 (pl.Series) – The first column to look at
col_2 (pl.Series) – The second column
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
pl.Series
- datacompy.comparator.string.polars_normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) → Series
Normalize a string column by converting to upper case and stripping whitespace.
- Parameters:
column (pl.Series) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing
- Returns:
The normalized column
- Return type:
pl.Series
Notes
Will not operate on categorical columns.
- datacompy.comparator.string.snowpark_normalize_string_column(column: Column, ignore_spaces: bool, ignore_case: bool) → Column
Normalize a string column by converting to upper case and stripping whitespace.
- Parameters:
column (snowflake.snowpark.Column) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing
- Returns:
The normalized column
- Return type:
snowflake.snowpark.Column
- datacompy.comparator.string.spark_normalize_string_column(column: Column, ignore_spaces: bool, ignore_case: bool) → Column
Normalize a string column by converting to upper case and stripping whitespace.
- Parameters:
column (pyspark.sql.Column) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing
- Returns:
The normalized column
- Return type:
pyspark.sql.Column
datacompy.comparator.utility module
Utility and helper functions for data comparison.
- datacompy.comparator.utility.get_snowflake_column_dtypes(dataframe: DataFrame, col_1: str, col_2: str) → tuple[str, str]
Get the dtypes of two columns.
- Parameters:
dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on
col_1 (str) – The first column to look at
col_2 (str) – The second column
- Returns:
Tuple of base and compare datatype
- Return type:
tuple[str, str]
- datacompy.comparator.utility.get_spark_column_dtypes(dataframe: DataFrame, col_1: str, col_2: str) → tuple[str, str]
Get the dtypes of two columns.
- Parameters:
dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on
col_1 (str) – The first column to look at
col_2 (str) – The second column
- Returns:
Tuple of base and compare datatype
- Return type:
tuple[str, str]
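A dtype lookup of this kind can be sketched over the list of `(name, dtype)` pairs that a PySpark DataFrame exposes through its `dtypes` attribute. The helper below is hypothetical and takes that list directly instead of a DataFrame so the example runs without Spark; the column names in `schema` are made up for illustration.

```python
def get_column_dtypes(
    dtypes: list[tuple[str, str]], col_1: str, col_2: str
) -> tuple[str, str]:
    """Return the (base, compare) dtype strings for two named columns."""
    lookup = dict(dtypes)
    return lookup[col_1], lookup[col_2]


# Stand-in for `dataframe.dtypes` on a PySpark DataFrame.
schema = [("id", "bigint"), ("amount_df1", "decimal(10,2)"), ("amount_df2", "double")]
base_dtype, compare_dtype = get_column_dtypes(schema, "amount_df1", "amount_df2")
```

A comparator can then use the two dtype strings to decide whether the columns are comparable before building a comparison expression.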
Module contents
Comparator classes.
- class datacompy.comparator.PandasArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in Pandas.
- compare(col1: Series, col2: Series) → Series | None
Compare two array-like columns for equality.
- Parameters:
col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
- Returns:
pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
- class datacompy.comparator.PandasNumericComparator
Bases: BaseComparator
Comparator for numeric columns in Pandas.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None
Compare two Pandas Series for approximate equality within specified tolerances rtol and atol.
- Parameters:
col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
pd.Series – A Pandas Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.
Notes
The comparison uses np.isclose to check for approximate equality.
If the series cannot be directly compared due to numeric type mismatches, a cast is attempted; if casting fails, a series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.
- class datacompy.comparator.PandasStringComparator
Bases: BaseComparator
Comparator for string / date / mixed columns in Pandas.
- Parameters:
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None
Compare two Pandas Series column-wise, taking into account optional normalization for spaces and case sensitivity.
- Parameters:
col1 (pd.Series) – The first Pandas Series to compare.
col2 (pd.Series) – The second Pandas Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
pd.Series – A Pandas Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.
None – if the columns are not comparable.
Note
Pandas dataframes allow for mixed typing which is unique and is also handled here.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, both columns are cast to strings and compared again. If this also fails, a Series of False values with the same length as col1 is returned.
- class datacompy.comparator.PolarsArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in Polars.
- compare(col1: Series, col2: Series) → Series | None
Compare two array-like columns for equality.
- Parameters:
col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
- Returns:
pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
- class datacompy.comparator.PolarsNumericComparator
Bases: BaseComparator
Comparator for numeric columns in Polars.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(col1: Series, col2: Series, rtol=1e-05, atol=1e-08) → Series | None
Compare two Polars Series for approximate equality within specified tolerances rtol and atol.
- Parameters:
col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
pl.Series – A Polars Series of booleans indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.
Notes
The comparison uses np.isclose to check for approximate equality.
If the series cannot be directly compared due to numeric type mismatches, a cast is attempted; if casting fails, a series of False values is returned.
If the Series shapes do not match and neither type is numeric, None is returned.
- class datacompy.comparator.PolarsStringComparator
Bases: BaseComparator
Comparator for string / temporal / date columns in Polars.
- compare(col1: Series, col2: Series, ignore_space: bool = True, ignore_case: bool = True) → Series | None
Compare two Polars Series column-wise, taking into account optional normalization for spaces and case sensitivity.
- Parameters:
col1 (pl.Series) – The first Polars Series to compare.
col2 (pl.Series) – The second Polars Series to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
pl.Series – A Polars Series of boolean values where each element indicates whether the corresponding elements in col1 and col2 are equal. Handles missing values by treating nulls as equal.
None – if the columns are not comparable.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, both columns are cast to strings and compared again. If this also fails, a Series of False values with the same length as col1 is returned.
- class datacompy.comparator.SnowflakeArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in Snowflake.
- compare(dataframe: DataFrame, col1: str, col2: str, col_match: str) → DataFrame | None
Compare two array-like columns for equality.
- Parameters:
dataframe (snowflake.snowpark.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column
col_match (str) – The matching column denoting if the compare was a match or not
- Returns:
snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
- class datacompy.comparator.SnowflakeNumericComparator
Bases: BaseComparator
Comparator for numeric columns in Snowflake.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, rtol=1e-05, atol=1e-08) → DataFrame | None
Compare two columns in a Snowpark DataFrame for approximate equality within specified tolerances rtol and atol.
- Parameters:
dataframe (snowflake.snowpark.DataFrame) – The Snowpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
snowflake.snowpark.DataFrame – A Snowpark DataFrame with an additional column (col_match) containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – If the type conditions are not met.
Notes
The comparison uses Snowpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains null values, they are handled explicitly to avoid incorrect comparisons.
- class datacompy.comparator.SnowflakeStringComparator
Bases: BaseComparator
Comparator for string columns in Snowflake.
- compare(dataframe: DataFrame, col1: str, col2: str, col_match: str, ignore_space: bool = True, ignore_case: bool = True) → DataFrame | None
Compare two columns in a Snowflake DataFrame for string equality.
- Parameters:
dataframe (snowflake.snowpark.DataFrame) – The Snowflake DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
col_match (str) – The name of the output column that will store the comparison results.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
snowflake.snowpark.DataFrame – The DataFrame with an additional column containing the comparison results.
None – if the comparison fails due to incompatible types or other issues.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, None is returned.
- class datacompy.comparator.SparkArrayLikeComparator
Bases: BaseComparator
Comparator for array-like columns in PySpark.
- compare(dataframe: DataFrame, col1: str, col2: str) → Column | None
Compare two array-like columns for equality.
- Parameters:
dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on
col1 (str) – The first column to look at
col2 (str) – The second column
- Returns:
pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable.
- class datacompy.comparator.SparkNumericComparator
Bases: BaseComparator
Comparator for numeric columns in PySpark.
- Parameters:
rtol (float) – The relative tolerance to use for comparison.
atol (float) – The absolute tolerance to use for comparison.
- compare(dataframe: DataFrame, col1: str, col2: str, rtol=1e-05, atol=1e-08) → Column | None
Compare two columns in a PySpark DataFrame for approximate equality within specified tolerances rtol and atol.
- Parameters:
dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
rtol (float, optional) – The relative tolerance to use for comparison. Default is 1e-5.
atol (float, optional) – The absolute tolerance to use for comparison. Default is 1e-8.
- Returns:
pyspark.sql.Column – A PySpark Column containing boolean values indicating whether the values in col1 and col2 are approximately equal within the given tolerances.
None – if the columns are not comparable.
Notes
The comparison uses PySpark SQL functions to check for approximate equality.
Null-safe equality (eqNullSafe) is used to handle null values.
If either column contains NaN values, they are handled explicitly to avoid incorrect comparisons.
- class datacompy.comparator.SparkStringComparator
Bases: BaseComparator
Comparator for string / temporal / date columns in PySpark.
- compare(dataframe: DataFrame, col1: str, col2: str, ignore_space: bool = True, ignore_case: bool = True) → Column | None
Compare two columns in a PySpark DataFrame for string equality.
- Parameters:
dataframe (pyspark.sql.DataFrame) – The PySpark DataFrame containing the columns to compare.
col1 (str) – The name of the first column to compare.
col2 (str) – The name of the second column to compare.
ignore_space (bool) – Whether to ignore leading and trailing whitespace when comparing strings.
ignore_case (bool) – Whether to ignore case when comparing strings.
- Returns:
pyspark.sql.Column – A Column containing boolean values indicating whether the values in col1 and col2 are equal.
None – if the columns are not comparable, that is, their datatypes do not form one of the supported string or date combinations.
- Raises:
Exception – Handled internally rather than propagated: if the comparison fails due to incompatible types or other issues, None is returned.