datacompy package

Subpackages

Submodules

datacompy.base module

Base module for comparing two DataFrames.

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.base.BaseCompare

Bases: ABC

Base comparison class.

abstractmethod all_columns_match() bool

Check if all columns match.

abstractmethod all_mismatch(ignore_matching_cols: bool = False) Any

Get all rows that mismatch.

abstractmethod all_rows_overlap() bool

Check if all rows overlap.

columns_with_mismatches() list[str]

Return a list of column names where at least one row has a mismatch.

This method identifies columns that have differences between df1 and df2, excluding the join columns. This is useful for identifying problematic columns and potentially rerunning comparisons on a subset of the data.

Returns:

A list of column names that have at least one mismatch.

Return type:

list[str]

Examples

>>> compare = PandasCompare(df1, df2, join_columns=['id'])
>>> mismatched_cols = compare.columns_with_mismatches()
>>> print(mismatched_cols)
['col_a', 'col_b']
abstractmethod count_matching_rows() int

Count the number of matching rows.

abstract property df1: Any

Get the first dataframe.

abstractmethod df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

abstract property df2: Any

Get the second dataframe.

abstractmethod df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

abstractmethod intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

abstractmethod intersect_rows_match() bool

Check if the intersection of rows match.

abstractmethod matches(ignore_extra_columns: bool = False) bool

Check if the dataframes match.

only_join_columns() bool

Boolean on if the only columns are the join columns.

abstractmethod report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used.

Returns:

The report, formatted according to the template.

Return type:

str

reveal_sensitive_columns() None

Reveals all sensitive columns.

Notes

  • This re-runs the full comparison to restore original values.

  • Revealing sensitive columns when there aren’t any is treated as a NOP to avoid redundant computations.

abstractmethod sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) Any

Get a sample of rows that mismatch.

property sensitive_columns: List[str] | None

Get the list of sensitive columns.

abstractmethod subset() bool

Check if one dataframe is a subset of the other.

datacompy.base.df_to_str(df: Any, sample_count: int | None = None, on_index: bool = False) str

Convert a DataFrame to a string representation.

This is a centralized function to handle DataFrame to string conversion for different DataFrame types (pandas, Spark, Polars, etc.)

Parameters:
  • df (Any) – The DataFrame to convert to string. Can be pandas, Spark, or Polars DataFrame.

  • sample_count (int, optional) – For distributed DataFrames (like Spark), limit the number of rows to convert.

  • on_index (bool, default False) – If True, include the index in the output.

Returns:

String representation of the DataFrame

Return type:

str

datacompy.base.get_column_tolerance(column: str, tol_dict: Dict[str, float]) float

Return the tolerance value for a given column from a dictionary of tolerances.

Parameters:
  • column (str) – The name of the column for which to retrieve the tolerance.

  • tol_dict (dict of str to float) – Dictionary mapping column names to their tolerance values. May contain a “default” key for columns not explicitly listed.

Returns:

The tolerance value for the specified column, or the “default” tolerance if the column is not found. Returns 0 if neither the column nor “default” is present in the dictionary.

Return type:

float

datacompy.base.render(template_name: str, **context: Any) str

Render a template using Jinja2.

Parameters:
  • template_name (str) – The name of the template file to render. This can be: - A filename in the default templates directory (with or without .j2 extension) - A relative path from the default templates directory - An absolute path to a template file

  • **context (dict) – The context variables to pass to the template

Returns:

The rendered template

Return type:

str

Raises:

FileNotFoundError – If the template file cannot be found in any of the expected locations

datacompy.base.save_html_report(report: str, html_file: str | Path) None

Save a text report as an HTML file.

Parameters:
  • report (str) – The text report to convert to HTML

  • html_file (str or Path) – The path where the HTML file should be saved

datacompy.base.temp_column_name(*dataframes) str

Get a temp column name that isn’t included in columns of any dataframes.

Parameters:

dataframes (list of DataFrames) – The DataFrames to create a temporary column name for

Returns:

String column name that looks like ‘_temp_x’ for some integer x

Return type:

str

datacompy.base.validate_tolerance_parameter(param_value: float | Dict[str, float], param_name: str, case_mode: str = 'lower') Dict[str, float]

Validate and normalize tolerance parameter input.

Parameters:
  • param_value (float or dict) – The tolerance value to validate. Can be either a float or a dictionary mapping column names to float values.

  • param_name (str) – Name of the parameter being validated (‘abs_tol’ or ‘rel_tol’)

  • case_mode (str) – How to handle column name case. Options are: - “lower”: convert to lowercase - “upper”: convert to uppercase - “preserve”: keep original case

Returns:

Normalized dictionary of tolerance values

Return type:

dict

Raises:
  • TypeError – If param_value is not a float or dict

  • ValueError – If any tolerance values are not numeric or negative or if case_mode is invalid

datacompy.pandas module

Compare two Pandas DataFrames.

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.pandas.PandasCompare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str | None = None, on_index: bool = False, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, custom_comparators: List[BaseComparator] | None = None)

Bases: BaseCompare

Comparison class to be used to compare whether two dataframes as equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (pandas DataFrame) – First dataframe to check

  • df2 (pandas DataFrame) – Second dataframe to check

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used. join_columns or on_index must be set, but not both.

  • on_index (bool, optional) – If True, the index will be used to join the two dataframes. If both join_columns and on_index are provided, an exception will be raised.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.

  • ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls of column names will be cast into lower case

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

all_columns_match() bool

Whether the columns all match in the dataframes.

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

Pandas.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

Returns:

True or False if the dataframes match.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

Examples

Basic usage with default template:

>>> compare = datacompy.Compare(df1, df2, join_columns=['id'])
>>> report = compare.report()
>>> print(report)

Using a custom template file:

>>> # Create a custom template file (custom_report.j2)
>>> with open('custom_report.j2', 'w') as f:
...     f.write('''
...     Comparison Report
...     ================
...
...     DataFrames: {{ df1_name }} vs {{ df2_name }}
...
...     Shape Summary:
...     - {{ df1_name }}: {{ df1_shape[0] }} rows x {{ df1_shape[1] }} columns
...     - {{ df2_name }}: {{ df2_shape[0] }} rows x {{ df2_shape[1] }} columns
...
...     {% if mismatch_stats %}
...     Mismatched Columns ({{ mismatch_stats|length }}):
...     {% for col in mismatch_stats %}
...     - {{ col.column }} ({{ col.unequal_cnt }} mismatches)
...     {% endfor %}
...     {% else %}
...     No mismatches found in any columns!
...     {% endif %}
...     ''')
...
>>> # Generate report with custom template
>>> report = compare.report(template_path='custom_report.j2')
>>> print(report)
sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None

Return sample mismatches.

Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

  • Pandas.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

  • None – When the column being requested is not an intersecting column between dataframes.

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

datacompy.pandas.calculate_max_diff(col_1: pd.Series[Any], col_2: pd.Series[Any]) float

Get a maximum difference between two columns.

Parameters:
  • col_1 (Pandas.Series) – The first column

  • col_2 (Pandas.Series) – The second column

Returns:

Numeric field, or zero.

Return type:

Numeric

datacompy.pandas.columns_equal(col_1: pd.Series[Any], col_2: pd.Series[Any], rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False, comparators: List[BaseComparator] | None = None, **kwargs) pd.Series[bool]

Compare two columns from a dataframe.

Returns a True`//False series, with the same index as column 1.

  • Two nulls (np.nan) will evaluate to True.

  • A null and a non-null value will evaluate to False.

  • Numeric values will use the relative and absolute tolerances.

  • Decimal values (decimal.Decimal) will attempt to be converted to floats before comparing.

  • Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.

Parameters:
  • col_1 (Pandas.Series) – The first column to look at

  • col_2 (Pandas.Series) – The second column

  • rel_tol (float, optional) – Relative tolerance

  • abs_tol (float, optional) – Absolute tolerance

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

  • **kwargs – Additional keyword arguments to pass to custom comparators.

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

pandas.Series

Notes

  • As of version 0.14.0 If a column is of a mixed data type the compare will default to returning False.

  • list and np.array types will be compared row wise using np.array_equal. Depending on the size of your data this might lead to performance issues.

  • All the rows must be of the same type otherwise it is considered “mixed” and will default to being False for everything.

datacompy.pandas.generate_id_within_group(dataframe: DataFrame, join_columns: List[str]) pd.Series[int]

Generate an ID column that can be used to deduplicate identical rows.

The series generated is the order within a unique group, and it handles nulls.

Parameters:
  • dataframe (Pandas.DataFrame) – The dataframe to operate on

  • join_columns (list) – List of strings which are the join columns

Returns:

The ID column that’s unique in each group.

Return type:

Pandas.Series

datacompy.pandas.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str]

Get the columns from an original dataframe, in the new merged dataframe.

Parameters:
  • original_df (Pandas.DataFrame) – The original, pre-merge dataframe

  • merged_df (Pandas.DataFrame) – Post-merge with another dataframe, with suffixes added in.

  • suffix (str) – What suffix was used to distinguish when the original dataframe was overlapping with the other merged dataframe.

datacompy.polars module

Compare two Polars DataFrames.

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.polars.PolarsCompare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, custom_comparators: List[BaseComparator] | None = None)

Bases: BaseCompare

Comparison class to be used to compare whether two dataframes as equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (Polars DataFrame) – First dataframe to check

  • df2 (Polars DataFrame) – Second dataframe to check

  • join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default”

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.

  • ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls of column names will be cast into lower case

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

all_columns_match() bool

Whether the columns all match in the dataframes.

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

Polars.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

Returns:

True or False if the dataframes match.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None

Return sample mismatches.

Get a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

  • Polars.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

  • None – When the column being requested is not an intersecting column between dataframes.

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

datacompy.polars.calculate_max_diff(col_1: Series, col_2: Series) float

Get a maximum difference between two columns.

Parameters:
  • col_1 (Polars.Series) – The first column

  • col_2 (Polars.Series) – The second column

Returns:

Numeric field, or zero.

Return type:

Numeric

datacompy.polars.columns_equal(col_1: Series, col_2: Series, rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False, comparators: List[BaseComparator] | None = None, **kwargs) Series

Compare two columns from a dataframe.

Returns a True/False series, with the same index as column 1.

  • Two nulls (np.nan) will evaluate to True.

  • A null and a non-null value will evaluate to False.

  • Numeric values will use the relative and absolute tolerances.

  • Decimal values (decimal.Decimal) will attempt to be converted to floats before comparing

  • Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.

Parameters:
  • col_1 (Polars.Series) – The first column to look at

  • col_2 (Polars.Series) – The second column

  • rel_tol (float, optional) – Relative tolerance

  • abs_tol (float, optional) – Absolute tolerance

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

  • **kwargs – Additional keyword arguments to pass to custom comparators.

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

Polars.Series

datacompy.polars.generate_id_within_group(dataframe: DataFrame, join_columns: List[str]) Series

Generate an ID column that can be used to deduplicate identical rows.

The series generated is the order within a unique group, and it handles nulls.

Parameters:
  • dataframe (Polars.DataFrame) – The dataframe to operate on

  • join_columns (list) – List of strings which are the join columns

Returns:

The ID column that’s unique in each group.

Return type:

Polars.Series

datacompy.polars.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str]

Get the columns from an original dataframe, in the new merged dataframe.

Parameters:
  • original_df (Polars.DataFrame) – The original, pre-merge dataframe

  • merged_df (Polars.DataFrame) – Post-merge with another dataframe, with suffixes added in.

  • suffix (str) – What suffix was used to distinguish when the original dataframe was overlapping with the other merged dataframe.

datacompy.snowflake module

Compare two Snowpark SQL DataFrames and Snowflake tables.

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.snowflake.SnowflakeCompare(session: Session, df1: str | DataFrame, df2: str | DataFrame, join_columns: List[str] | str | None, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str | None = None, df2_name: str | None = None, ignore_spaces: bool = False, ignore_case: bool = False, custom_comparators: List[BaseComparator] | None = None)

Bases: BaseCompare

Comparison class to be used to compare whether two Snowpark dataframes are equal.

df1 and df2 can refer to either a Snowpark dataframe or the name of a valid Snowflake table. The data structures which df1 and df2 represent must contain all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • session (snowflake.snowpark.session) – Session with the required connection session info for user and targeted tables

  • df1 (Union[str, sp.Dataframe]) – First table to check, provided either as the table’s name or as a Snowpark DF.

  • df2 (Union[str, sp.Dataframe]) – Second table to check, provided either as the table’s name or as a Snowpark DF.

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • df1_name (str, optional) – A string name for the first dataframe. If used alongside a snowflake table, overrides the default convention of naming the dataframe after the table.

  • df2_name (str, optional) – A string name for the second dataframe.

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

all_columns_match() bool

Whether the columns all match in the dataframes.

Returns:

  • bool

  • True if all columns in df1 are in df2 and vice versa

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

sp.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

Parameters:

sensitive_columns (List[str]) – List of column names to hide. Column names are case-insensitive and will be normalized to uppercase. Quoted identifiers (e.g. ‘“col”’) will have quotes stripped before matching.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

Returns:

True or False if the dataframes match.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame

Return sample mismatches.

Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

Return type:

sp.DataFrame

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

datacompy.snowflake.calculate_max_diff(dataframe: DataFrame, col_1: str, col_2: str) float

Get a maximum difference between two columns.

Parameters:
  • dataframe (sp.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

Returns:

max diff

Return type:

float

datacompy.snowflake.calculate_null_diff(dataframe: DataFrame, col_1: str, col_2: str) int

Get the null differences between two columns.

Parameters:
  • dataframe (sp.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

Returns:

null diff

Return type:

int

datacompy.snowflake.columns_equal(dataframe: DataFrame, col_1: str, col_2: str, col_match: str, rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False, comparators: List[BaseComparator] | None = None, **kwargs) DataFrame

Compare two columns from a dataframe.

Returns a True/False series with the same index as column 1.

  • Two nulls (np.nan) will evaluate to True.

  • A null and a non-null value will evaluate to False.

  • Numeric values will use the relative and absolute tolerances.

  • Decimal values (decimal.Decimal) will attempt to be converted to floats before comparing

  • Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.

Parameters:
  • dataframe (sp.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

  • col_match (str) – The matching column denoting if the compare was a match or not

  • rel_tol (float, optional) – Relative tolerance

  • abs_tol (float, optional) – Absolute tolerance

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • comparators (List[BaseComparator] | None = None,) – A list of custom comparator classes to use to compare columns.

  • **kwargs – Additional keyword arguments to pass to custom comparators.

Returns:

A column of boolean values are added. True == the values match, False == the values don’t match.

Return type:

sp.DataFrame

datacompy.snowflake.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str]

Get the columns from an original dataframe, in the new merged dataframe.

Parameters:
  • original_df (sp.DataFrame) – The original, pre-merge dataframe

  • merged_df (sp.DataFrame) – Post-merge with another dataframe, with suffixes added in.

  • suffix (str) – What suffix was used to distinguish when the original dataframe was overlapping with the other merged dataframe.

Returns:

Column list of the original dataframe pre suffix

Return type:

List[str]

datacompy.snowflake.temp_column_name(*dataframes) str

Get a temp column name that isn’t included in columns of any dataframes.

Parameters:

dataframes (list of DataFrames) – The DataFrames to create a temporary column name for

Returns:

String column name that looks like ‘_temp_x’ for some integer x

Return type:

str

datacompy.spark module

Compare two PySpark SQL DataFrames.

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.spark.SparkSQLCompare(spark_session: SparkSession, df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, custom_comparators: List[BaseComparator] | None = None, cache_intermediates: bool = True)

Bases: BaseCompare

Comparison class to be used to compare whether two Spark SQL dataframes are equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • spark_session (pyspark.sql.SparkSession) – A SparkSession to be used to execute Spark commands in the comparison.

  • df1 (pyspark.sql.DataFrame) – First dataframe to check

  • df2 (pyspark.sql.DataFrame) – Second dataframe to check

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls of column names will be cast into lower case

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

  • cache_intermediates (bool, optional) – Flag to enable/disable caching of intermediate DataFrames. Set to False when using Databricks Serverless or other environments that don’t support caching. Defaults to True.

all_columns_match() bool

Whether the columns all match in the dataframes.

Returns:

True if all columns in df1 are in df2 and vice versa

Return type:

bool

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

pyspark.sql.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame

Return sample mismatches.

Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

Return type:

pyspark.sql.DataFrame

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

datacompy.spark.calculate_max_diff(dataframe: DataFrame, col_1: str, col_2: str) float

Get a maximum difference between two columns.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

Returns:

max diff

Return type:

float

datacompy.spark.calculate_null_diff(dataframe: DataFrame, col_1: str, col_2: str) int

Get the null differences between two columns.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

Returns:

null diff

Return type:

int

datacompy.spark.columns_equal(dataframe: DataFrame, col_1: str, col_2: str, rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False, comparators: List[BaseComparator] | None = None, **kwargs) Column

Compare if two columns are considered equal, returns a boolean Spark Column to be used in a .withColumn(…) statement.

Returns a True/False series with the same index as column 1.

  • Two nulls (np.nan) will evaluate to True.

  • A null and a non-null value will evaluate to False.

  • Numeric values will use the relative and absolute tolerances.

  • Decimal values (decimal.Decimal) will attempt to be converted to floats before comparing

  • Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – DataFrame to do comparison on

  • col_1 (str) – The first column to look at

  • col_2 (str) – The second column

  • rel_tol (float, optional) – Relative tolerance

  • abs_tol (float, optional) – Absolute tolerance

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • comparators (List[BaseComparator] | None = None,) – A list of custom comparator classes to use to compare columns.

  • **kwargs – Additional keyword arguments to pass to custom comparators.

Returns:

Boolean Spark Column: True if values match according to the rules above, False otherwise.

Return type:

pyspark.sql.Column

Example

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([
...     (1, 1.1, "ABC", None),
...     (2, 2.0, "DEF ", 4.0),
...     (3, None, "ghi", 5.0)
... ], ["id", "col1", "col2", "col3"])
>>> # Compare numeric columns with tolerance
>>> df = df.withColumn("col1_equals_col3", columns_equal(df, "col1", "col3", rel_tol=0.1))
>>> # Compare string columns ignoring case and spaces
>>> df = df.withColumn("col2_normalized", columns_equal(df, "col2", "col2", ignore_spaces=True, ignore_case=True))

Note

Starting in version 0.18.0, the behavior of this function was changed so rather than returning a DataFrame a Column expression is returned.

datacompy.spark.decimal_comparator()

Check equality with decimal(X, Y) types.

Otherwise treated as the string “decimal”.

datacompy.spark.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str]

Get the columns from an original dataframe, in the new merged dataframe.

Parameters:
  • original_df (pyspark.sql.DataFrame) – The original, pre-merge dataframe

  • merged_df (pyspark.sql.DataFrame) – Post-merge with another dataframe, with suffixes added in.

  • suffix (str) – What suffix was used to distinguish when the original dataframe was overlapping with the other merged dataframe.

Returns:

Column list of the original dataframe pre suffix

Return type:

List[str]

datacompy.utility module

Contains utilities for datacompy.

datacompy.utility.check_module_available(module_available, extra_name)

Create a decorator to check if a module is available.

This is for runtime checks in the library.

Module contents

DataComPy is a package to compare two Pandas DataFrames.

Originally started to be something of a replacement for SAS’s PROC COMPARE for Pandas DataFrames with some more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (in that it prints out some stats, and lets you tweak how accurate matches have to be). Then extended to carry that functionality over to Spark Dataframes.

class datacompy.BaseCompare

Bases: ABC

Base comparison class.

abstractmethod all_columns_match() bool

Check if all columns match.

abstractmethod all_mismatch(ignore_matching_cols: bool = False) Any

Get all rows that mismatch.

abstractmethod all_rows_overlap() bool

Check if all rows overlap.

columns_with_mismatches() list[str]

Return a list of column names where at least one row has a mismatch.

This method identifies columns that have differences between df1 and df2, excluding the join columns. This is useful for identifying problematic columns and potentially rerunning comparisons on a subset of the data.

Returns:

A list of column names that have at least one mismatch.

Return type:

list[str]

Examples

>>> compare = PandasCompare(df1, df2, join_columns=['id'])
>>> mismatched_cols = compare.columns_with_mismatches()
>>> print(mismatched_cols)
['col_a', 'col_b']
abstractmethod count_matching_rows() int

Count the number of matching rows.

abstract property df1: Any

Get the first dataframe.

abstractmethod df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

abstract property df2: Any

Get the second dataframe.

abstractmethod df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

abstractmethod intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

abstractmethod intersect_rows_match() bool

Check if the intersection of rows match.

abstractmethod matches(ignore_extra_columns: bool = False) bool

Check if the dataframes match.

only_join_columns() bool

Boolean on if the only columns are the join columns.

abstractmethod report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used.

Returns:

The report, formatted according to the template.

Return type:

str

reveal_sensitive_columns() None

Reveals all sensitive columns.

Notes

  • This re-runs the full comparison to restore original values.

  • Revealing sensitive columns when there aren’t any is treated as a NOP to avoid redundant computations.

abstractmethod sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) Any

Get a sample of rows that mismatch.

property sensitive_columns: List[str] | None

Get the list of sensitive columns.

abstractmethod subset() bool

Check if one dataframe is a subset of the other.

class datacompy.PandasCompare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str | None = None, on_index: bool = False, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, custom_comparators: List[BaseComparator] | None = None)

Bases: BaseCompare

Comparison class to be used to compare whether two dataframes as equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (pandas DataFrame) – First dataframe to check

  • df2 (pandas DataFrame) – Second dataframe to check

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used. join_columns or on_index must be set, but not both.

  • on_index (bool, optional) – If True, the index will be used to join the two dataframes. If both join_columns and on_index are provided, an exception will be raised.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.

  • ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls of column names will be cast into lower case

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

all_columns_match() bool

Whether the columns all match in the dataframes.

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

Pandas.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

Returns:

True or False if the dataframes match.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

Examples

Basic usage with default template:

>>> compare = datacompy.Compare(df1, df2, join_columns=['id'])
>>> report = compare.report()
>>> print(report)

Using a custom template file:

>>> # Create a custom template file (custom_report.j2)
>>> with open('custom_report.j2', 'w') as f:
...     f.write('''
...     Comparison Report
...     ================
...
...     DataFrames: {{ df1_name }} vs {{ df2_name }}
...
...     Shape Summary:
...     - {{ df1_name }}: {{ df1_shape[0] }} rows x {{ df1_shape[1] }} columns
...     - {{ df2_name }}: {{ df2_shape[0] }} rows x {{ df2_shape[1] }} columns
...
...     {% if mismatch_stats %}
...     Mismatched Columns ({{ mismatch_stats|length }}):
...     {% for col in mismatch_stats %}
...     - {{ col.column }} ({{ col.unequal_cnt }} mismatches)
...     {% endfor %}
...     {% else %}
...     No mismatches found in any columns!
...     {% endif %}
...     ''')
...
>>> # Generate report with custom template
>>> report = compare.report(template_path='custom_report.j2')
>>> print(report)
sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None

Return sample mismatches.

Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

  • Pandas.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

  • None – When the column being requested is not an intersecting column between dataframes.

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

class datacompy.PolarsCompare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, custom_comparators: List[BaseComparator] | None = None)

Bases: BaseCompare

Comparison class to be used to compare whether two dataframes as equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (Polars DataFrame) – First dataframe to check

  • df2 (Polars DataFrame) – Second dataframe to check

  • join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default”

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.

  • ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls of column names will be cast into lower case

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

all_columns_match() bool

Whether the columns all match in the dataframes.

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

Polars.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

Returns:

True or False if the dataframes match.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None

Return sample mismatches.

Get a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

  • Polars.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

  • None – When the column being requested is not an intersecting column between dataframes.

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

class datacompy.SnowflakeCompare(session: Session, df1: str | DataFrame, df2: str | DataFrame, join_columns: List[str] | str | None, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str | None = None, df2_name: str | None = None, ignore_spaces: bool = False, ignore_case: bool = False, custom_comparators: List[BaseComparator] | None = None)

Bases: BaseCompare

Comparison class to be used to compare whether two Snowpark dataframes are equal.

df1 and df2 can refer to either a Snowpark dataframe or the name of a valid Snowflake table. The data structures which df1 and df2 represent must contain all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • session (snowflake.snowpark.session) – Session with the required connection session info for user and targeted tables

  • df1 (Union[str, sp.Dataframe]) – First table to check, provided either as the table’s name or as a Snowpark DF.

  • df2 (Union[str, sp.Dataframe]) – Second table to check, provided either as the table’s name or as a Snowpark DF.

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • df1_name (str, optional) – A string name for the first dataframe. If used alongside a snowflake table, overrides the default convention of naming the dataframe after the table.

  • df2_name (str, optional) – A string name for the second dataframe.

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

all_columns_match() bool

Whether the columns all match in the dataframes.

Returns:

  • bool

  • True if all columns in df1 are in df2 and vice versa

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

sp.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

Parameters:

sensitive_columns (List[str]) – List of column names to hide. Column names are case-insensitive and will be normalized to uppercase. Quoted identifiers (e.g. ‘“col”’) will have quotes stripped before matching.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

Returns:

True or False if the dataframes match.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame

Return sample mismatches.

Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

Return type:

sp.DataFrame

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

class datacompy.SparkSQLCompare(spark_session: SparkSession, df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, custom_comparators: List[BaseComparator] | None = None, cache_intermediates: bool = True)

Bases: BaseCompare

Comparison class to be used to compare whether two Spark SQL dataframes are equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • spark_session (pyspark.sql.SparkSession) – A SparkSession to be used to execute Spark commands in the comparison.

  • df1 (pyspark.sql.DataFrame) – First dataframe to check

  • df2 (pyspark.sql.DataFrame) – Second dataframe to check

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls of column names will be cast into lower case

  • custom_comparators (list of BaseComparator, optional) – A list of custom comparator classes to use to compare columns.

  • cache_intermediates (bool, optional) – Flag to enable/disable caching of intermediate DataFrames. Set to False when using Databricks Serverless or other environments that don’t support caching. Defaults to True.

all_columns_match() bool

Whether the columns all match in the dataframes.

Returns:

True if all columns in df1 are in df2 and vice versa

Return type:

bool

all_mismatch(ignore_matching_cols: bool = False) DataFrame

Get all rows with any columns that have a mismatch.

Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether showing the matching columns in the output or not. The default is False.

Returns:

All rows of the intersection dataframe, containing any columns, that don’t match.

Return type:

pyspark.sql.DataFrame

all_rows_overlap() bool

Whether the rows are all present in both dataframes.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int

Count the number of rows match (on overlapping fields).

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame

Get the first dataframe.

df1_unq_columns() OrderedSet[str]

Get columns that are unique to df1.

property df2: DataFrame

Get the second dataframe.

df2_unq_columns() OrderedSet[str]

Get columns that are unique to df2.

hide_sensitive_columns(sensitive_columns: List[str]) None

Hides sensitive columns of df1 or df2 if applicable in the compare.

intersect_columns() OrderedSet[str]

Get columns that are shared between the two dataframes.

intersect_rows_match() bool

Check whether the intersect rows all match.

matches(ignore_extra_columns: bool = False) bool

Return True or False if the dataframes match.

Parameters:

ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str

Return a string representation of a report.

The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

  • template_path (str, optional) –

    Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:

    • column_summary: Dict with column statistics including: common_columns, df1_unique, df2_unique, df1_name, df2_name

    • row_summary: Dict with row statistics including: match_columns, equal_rows, unequal_rows

    • column_comparison: Dict with column comparison statistics including: unequal_columns, equal_columns, unequal_values

    • mismatch_stats: Dict containing:
      • stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)

      • samples: Sample rows with mismatched values

      • has_samples: Boolean indicating if there are any mismatch samples

      • has_mismatches: Boolean indicating if there are any mismatches

    • df1_unique_rows: Dict with unique rows in df1 including: has_rows, rows, columns

    • df2_unique_rows: Dict with unique rows in df2 including: has_rows, rows, columns

Returns:

The report, formatted according to the template.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame

Return sample mismatches.

Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)

Returns:

A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

Return type:

pyspark.sql.DataFrame

subset() bool

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool