datacompy package#

Submodules#

datacompy.base module#

Compare two Pandas DataFrames

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.base.BaseCompare#

Bases: ABC

abstract all_columns_match() bool#
abstract all_mismatch(ignore_matching_cols: bool = False) Any#
abstract all_rows_overlap() bool#
abstract count_matching_rows() int#
abstract property df1: Any#
abstract df1_unq_columns() OrderedSet[str]#

Get columns that are unique to df1

abstract property df2: Any#
abstract df2_unq_columns() OrderedSet[str]#

Get columns that are unique to df2

abstract intersect_columns() OrderedSet[str]#

Get columns that are shared between the two dataframes

abstract intersect_rows_match() bool#
abstract matches(ignore_extra_columns: bool = False) bool#
abstract report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None) str#
abstract sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) Any#
abstract subset() bool#

datacompy.core module#

Compare two Pandas DataFrames

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.core.Compare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str | None = None, on_index: bool = False, abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True)#

Bases: BaseCompare

Comparison class used to compare whether two dataframes are equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (pandas DataFrame) – First dataframe to check

  • df2 (pandas DataFrame) – Second dataframe to check

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • on_index (bool, optional) – If True, the index will be used to join the two dataframes. If both join_columns and on_index are provided, an exception will be raised.

  • abs_tol (float, optional) – Absolute tolerance between two values.

  • rel_tol (float, optional) – Relative tolerance between two values.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case

Variables:
  • df1_unq_rows (pandas DataFrame) – All records that are only in df1 (based on a join on join_columns)

  • df2_unq_rows (pandas DataFrame) – All records that are only in df2 (based on a join on join_columns)
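
For orientation, a minimal usage sketch; the column names and data below are hypothetical, but the constructor arguments follow the signature above:

>>> import pandas as pd
>>> import datacompy
>>> df1 = pd.DataFrame({"acct_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
>>> df2 = pd.DataFrame({"acct_id": [1, 2, 4], "amount": [10.0, 20.05, 40.0]})
>>> compare = datacompy.Compare(
...     df1, df2, join_columns="acct_id", abs_tol=0.01,
...     df1_name="original", df2_name="new",
... )
>>> compare.matches()  # acct_id 3/4 are unmatched and 20.0 vs 20.05 exceeds abs_tol
False
>>> report_text = compare.report()  # human-readable summary as a string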

all_columns_match() bool#

Whether the columns all match in the dataframes

all_mismatch(ignore_matching_cols: bool = False) DataFrame#

All rows with any columns that have a mismatch. Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether to show matching columns in the output. The default is False.

Returns:

All rows of the intersection dataframe containing any columns that don’t match.

Return type:

Pandas.DataFrame

all_rows_overlap() bool#

Whether the rows are all present in both dataframes

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int#

Count the number of rows that match (on overlapping fields)

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame#
df1_unq_columns() OrderedSet[str]#

Get columns that are unique to df1

property df2: DataFrame#
df2_unq_columns() OrderedSet[str]#

Get columns that are unique to df2

intersect_columns() OrderedSet[str]#

Get columns that are shared between the two dataframes

intersect_rows_match() bool#

Check whether the intersect rows all match

matches(ignore_extra_columns: bool = False) bool#

Return True if the dataframes match, False otherwise.

Parameters:

ignore_extra_columns (bool) – If True, ignore any columns that are in one dataframe but not the other.

Returns:

True if the dataframes match, False otherwise.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None) str#

Returns a string representation of a report. The representation can then be printed or saved to a file.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

Returns:

The report as a formatted string.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame#

Returns a sample sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether the result will only be used for display (if True, the column names are overwritten with display-friendly names)

Returns:

A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

Return type:

Pandas.DataFrame
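
Continuing the hypothetical Compare example above, a short sketch of pulling mismatch detail:

>>> mismatches = compare.all_mismatch()  # join columns plus df1/df2 versions of unequal columns
>>> sample = compare.sample_mismatch("amount", sample_count=5, for_display=True)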

subset() bool#

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

datacompy.core.calculate_max_diff(col_1: pd.Series[Any], col_2: pd.Series[Any]) float#

Get the maximum difference between two columns

Parameters:
  • col_1 (Pandas.Series) – The first column

  • col_2 (Pandas.Series) – The second column

Returns:

Numeric field, or zero.

Return type:

Numeric

datacompy.core.columns_equal(col_1: pd.Series[Any], col_2: pd.Series[Any], rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False) pd.Series[bool]#

Compares two columns from a dataframe, returning a True/False series, with the same index as column 1.

  • Two nulls (np.nan) will evaluate to True.

  • A null and a non-null value will evaluate to False.

  • Numeric values will use the relative and absolute tolerances.

  • Decimal values (decimal.Decimal) will be converted to floats, where possible, before comparing

  • Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.

Parameters:
  • col_1 (Pandas.Series) – The first column to look at

  • col_2 (Pandas.Series) – The second column

  • rel_tol (float, optional) – Relative tolerance

  • abs_tol (float, optional) – Absolute tolerance

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

pandas.Series
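
A small sketch of the tolerance and string-flag behavior described above (values are illustrative):

>>> import pandas as pd
>>> from datacompy.core import columns_equal
>>> num_1 = pd.Series([1.0, 2.0, None])
>>> num_2 = pd.Series([1.004, 2.5, None])
>>> columns_equal(num_1, num_2, abs_tol=0.01).tolist()  # within tolerance, mismatch, two nulls
[True, False, True]
>>> str_1 = pd.Series(["Foo ", "bar"])
>>> str_2 = pd.Series(["foo", "baz"])
>>> columns_equal(str_1, str_2, ignore_spaces=True, ignore_case=True).tolist()
[True, False]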

datacompy.core.compare_string_and_date_columns(col_1: pd.Series[Any], col_2: pd.Series[Any]) pd.Series[bool]#

Compare a string column and date column, value-wise. This tries to convert a string column to a date column and compare that way.

Parameters:
  • col_1 (Pandas.Series) – The first column to look at

  • col_2 (Pandas.Series) – The second column

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

pandas.Series

datacompy.core.generate_id_within_group(dataframe: DataFrame, join_columns: List[str]) pd.Series[int]#

Generate an ID column that can be used to deduplicate identical rows. The series generated is the order within a unique group, and it handles nulls.

Parameters:
  • dataframe (Pandas.DataFrame) – The dataframe to operate on

  • join_columns (list) – List of strings which are the join columns

Returns:

The ID column that’s unique in each group.

Return type:

Pandas.Series
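
A sketch of the behavior with duplicate join keys (hypothetical data): rows sharing the same join-column values receive sequential IDs within their group, so duplicates can be paired up deterministically before merging:

>>> import pandas as pd
>>> from datacompy.core import generate_id_within_group
>>> df = pd.DataFrame({"id": [1, 1, 2], "val": ["a", "b", "c"]})
>>> generate_id_within_group(df, ["id"]).tolist()
[0, 1, 0]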

datacompy.core.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str]#

Get the columns from an original dataframe as they appear in the new merged dataframe

Parameters:
  • original_df (Pandas.DataFrame) – The original, pre-merge dataframe

  • merged_df (Pandas.DataFrame) – Post-merge with another dataframe, with suffixes added in.

  • suffix (str) – The suffix that was used to distinguish columns of the original dataframe where they overlapped with columns of the other dataframe in the merge.

datacompy.core.render(filename: str, *fields: int | float | str) str#

Renders out an individual template. This basically just reads in a template file, and applies .format() on the fields.

Parameters:
  • filename (str) – The file that contains the template. Will automagically prepend the templates directory before opening

  • fields (list) – Fields to be rendered out in the template

Returns:

The fully rendered out file.

Return type:

str

datacompy.core.temp_column_name(*dataframes: DataFrame) str#

Get a temporary column name that is not present in the columns of any of the given dataframes

Parameters:

dataframes (list of Pandas.DataFrame) – The DataFrames to create a temporary column name for

Returns:

String column name that looks like ‘_temp_x’ for some integer x

Return type:

str

datacompy.fugue module#

Compare two DataFrames that are supported by Fugue

datacompy.fugue.all_columns_match(df1: AnyDataFrame, df2: AnyDataFrame) bool#

Whether the columns all match in the dataframes

Parameters:
  • df1 (AnyDataFrame) – First dataframe to check

  • df2 (AnyDataFrame) – Second dataframe to check

Returns:

Boolean indicating whether the columns all match in the dataframes

Return type:

bool

datacompy.fugue.all_rows_overlap(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) bool#

Check if the rows are all present in both dataframes

Parameters:
  • df1 (AnyDataFrame) – First dataframe to check

  • df2 (AnyDataFrame) – Second dataframe to check

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float, optional) – Absolute tolerance between two values.

  • rel_tol (float, optional) – Relative tolerance between two values.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case

  • parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over just vanilla Pandas

  • strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool
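
A minimal sketch with plain pandas inputs; any Fugue-supported dataframe type works the same way (data is hypothetical):

>>> import pandas as pd
>>> from datacompy.fugue import all_rows_overlap
>>> df1 = pd.DataFrame({"id": [1, 2, 3], "val": [1.0, 2.0, 3.0]})
>>> df2 = pd.DataFrame({"id": [3, 2, 1], "val": [3.0, 2.0, 1.0]})
>>> all_rows_overlap(df1, df2, join_columns="id")  # order-insensitive, existence-based
True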

datacompy.fugue.intersect_columns(df1: AnyDataFrame, df2: AnyDataFrame) OrderedSet[str]#

Get columns that are shared between the two dataframes

Parameters:
  • df1 (AnyDataFrame) – First dataframe to check

  • df2 (AnyDataFrame) – Second dataframe to check

Returns:

Set of columns that are shared between the two dataframes

Return type:

OrderedSet

datacompy.fugue.is_match(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) bool#

Check whether two dataframes match.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (AnyDataFrame) – First dataframe to check

  • df2 (AnyDataFrame) – Second dataframe to check

  • join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float, optional) – Absolute tolerance between two values.

  • rel_tol (float, optional) – Relative tolerance between two values.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case

  • parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over just vanilla Pandas

  • strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.

Returns:

Boolean indicating whether the DataFrames match.

Return type:

bool
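
A short sketch with hypothetical data; passing parallelism (or distributed inputs such as Spark DataFrames) routes the comparison through Fugue rather than plain pandas:

>>> import pandas as pd
>>> from datacompy.fugue import is_match
>>> df1 = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
>>> df2 = pd.DataFrame({"id": [1, 2], "amount": [10.001, 20.0]})
>>> is_match(df1, df2, join_columns="id", abs_tol=0.01)  # 10.0 vs 10.001 is within abs_tol
True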

datacompy.fugue.report(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, sample_count: int = 10, column_count: int = 10, html_file: str | None = None, parallelism: int | None = None) str#

Returns a string representation of a report. The representation can then be printed or saved to a file.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (AnyDataFrame) – First dataframe to check

  • df2 (AnyDataFrame) – Second dataframe to check

  • join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float, optional) – Absolute tolerance between two values.

  • rel_tol (float, optional) – Relative tolerance between two values.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case

  • parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over just vanilla Pandas

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

Returns:

The report as a formatted string.

Return type:

str
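
Reusing the hypothetical df1/df2 from the is_match sketch above:

>>> from datacompy.fugue import report
>>> report_text = report(df1, df2, join_columns="id", abs_tol=0.01, sample_count=5)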

datacompy.fugue.unq_columns(df1: AnyDataFrame, df2: AnyDataFrame) OrderedSet[str]#

Get columns that are unique to df1

Parameters:
  • df1 (AnyDataFrame) – First dataframe to check

  • df2 (AnyDataFrame) – Second dataframe to check

Returns:

Set of columns that are unique to df1

Return type:

OrderedSet

datacompy.polars module#

Compare two Polars DataFrames

Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.

class datacompy.polars.PolarsCompare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True)#

Bases: BaseCompare

Comparison class used to compare whether two dataframes are equal.

Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).

Parameters:
  • df1 (Polars DataFrame) – First dataframe to check

  • df2 (Polars DataFrame) – Second dataframe to check

  • join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.

  • abs_tol (float, optional) – Absolute tolerance between two values.

  • rel_tol (float, optional) – Relative tolerance between two values.

  • df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.

  • df2_name (str, optional) – A string name for the second dataframe

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

  • cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case

Variables:
  • df1_unq_rows (Polars DataFrame) – All records that are only in df1 (based on a join on join_columns)

  • df2_unq_rows (Polars DataFrame) – All records that are only in df2 (based on a join on join_columns)
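
A minimal sketch mirroring the pandas Compare example (hypothetical data):

>>> import polars as pl
>>> from datacompy.polars import PolarsCompare
>>> df1 = pl.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
>>> df2 = pl.DataFrame({"id": [1, 2, 4], "amount": [10.0, 20.5, 40.0]})
>>> compare = PolarsCompare(df1, df2, join_columns="id", abs_tol=0.1)
>>> compare.matches()  # ids 3/4 are unmatched and |20.0 - 20.5| > abs_tol
False
>>> report_text = compare.report()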

all_columns_match() bool#

Whether the columns all match in the dataframes

all_mismatch(ignore_matching_cols: bool = False) DataFrame#

All rows with any columns that have a mismatch. Returns all df1 and df2 versions of the columns and join columns.

Parameters:

ignore_matching_cols (bool, optional) – Whether to show matching columns in the output. The default is False.

Returns:

All rows of the intersection dataframe containing any columns that don’t match.

Return type:

Polars.DataFrame

all_rows_overlap() bool#

Whether the rows are all present in both dataframes

Returns:

True if all rows in df1 are in df2 and vice versa (based on existence for join option)

Return type:

bool

count_matching_rows() int#

Count the number of rows that match (on overlapping fields)

Returns:

Number of matching rows

Return type:

int

property df1: DataFrame#
df1_unq_columns() OrderedSet[str]#

Get columns that are unique to df1

property df2: DataFrame#
df2_unq_columns() OrderedSet[str]#

Get columns that are unique to df2

intersect_columns() OrderedSet[str]#

Get columns that are shared between the two dataframes

intersect_rows_match() bool#

Check whether the intersect rows all match

matches(ignore_extra_columns: bool = False) bool#

Return True if the dataframes match, False otherwise.

Parameters:

ignore_extra_columns (bool) – If True, ignore any columns that are in one dataframe but not the other.

Returns:

True if the dataframes match, False otherwise.

Return type:

bool

report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None) str#

Returns a string representation of a report. The representation can then be printed or saved to a file.

Parameters:
  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.

  • html_file (str, optional) – HTML file name to save report output to. If None the file creation will be skipped.

Returns:

The report as a formatted string.

Return type:

str

sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame#

Returns a sample sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.

Parameters:
  • column (str) – The raw column name (i.e. without _df1 appended)

  • sample_count (int, optional) – The number of sample records to return. Defaults to 10.

  • for_display (bool, optional) – Whether the result will only be used for display (if True, the column names are overwritten with display-friendly names)

Returns:

A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.

Return type:

Polars.DataFrame

subset() bool#

Return True if dataframe 2 is a subset of dataframe 1.

Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.

Returns:

True if dataframe 2 is a subset of dataframe 1.

Return type:

bool

datacompy.polars.calculate_max_diff(col_1: Series, col_2: Series) float#

Get the maximum difference between two columns

Parameters:
  • col_1 (Polars.Series) – The first column

  • col_2 (Polars.Series) – The second column

Returns:

Numeric field, or zero.

Return type:

Numeric

datacompy.polars.columns_equal(col_1: Series, col_2: Series, rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False) Series#

Compares two columns from a dataframe, returning a True/False series of the same length as column 1.

  • Two nulls (np.nan) will evaluate to True.

  • A null and a non-null value will evaluate to False.

  • Numeric values will use the relative and absolute tolerances.

  • Decimal values (decimal.Decimal) will be converted to floats, where possible, before comparing

  • Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.

Parameters:
  • col_1 (Polars.Series) – The first column to look at

  • col_2 (Polars.Series) – The second column

  • rel_tol (float, optional) – Relative tolerance

  • abs_tol (float, optional) – Absolute tolerance

  • ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns

  • ignore_case (bool, optional) – Flag to ignore the case of string columns

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

Polars.Series

datacompy.polars.compare_string_and_date_columns(col_1: Series, col_2: Series) Series#

Compare a string column and date column, value-wise. This tries to convert a string column to a date column and compare that way.

Parameters:
  • col_1 (Polars.Series) – The first column to look at

  • col_2 (Polars.Series) – The second column

Returns:

A series of Boolean values. True == the values match, False == the values don’t match.

Return type:

Polars.Series

datacompy.polars.generate_id_within_group(dataframe: DataFrame, join_columns: List[str]) Series#

Generate an ID column that can be used to deduplicate identical rows. The series generated is the order within a unique group, and it handles nulls.

Parameters:
  • dataframe (Polars.DataFrame) – The dataframe to operate on

  • join_columns (list) – List of strings which are the join columns

Returns:

The ID column that’s unique in each group.

Return type:

Polars.Series

datacompy.polars.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str]#

Get the columns from an original dataframe as they appear in the new merged dataframe

Parameters:
  • original_df (Polars.DataFrame) – The original, pre-merge dataframe

  • merged_df (Polars.DataFrame) – Post-merge with another dataframe, with suffixes added in.

  • suffix (str) – The suffix that was used to distinguish columns of the original dataframe where they overlapped with columns of the other dataframe in the merge.

datacompy.polars.render(filename: str, *fields: int | float | str) str#

Renders out an individual template. This basically just reads in a template file, and applies .format() on the fields.

Parameters:
  • filename (str) – The file that contains the template. Will automagically prepend the templates directory before opening

  • fields (list) – Fields to be rendered out in the template

Returns:

The fully rendered out file.

Return type:

str

datacompy.polars.temp_column_name(*dataframes: DataFrame) str#

Get a temporary column name that is not present in the columns of any of the given dataframes

Parameters:

dataframes (list of Polars.DataFrame) – The DataFrames to create a temporary column name for

Returns:

String column name that looks like ‘_temp_x’ for some integer x

Return type:

str

datacompy.spark module#

class datacompy.spark.MatchType(value)#

Bases: Enum

An enumeration of the possible outcomes when comparing a value: a match, a mismatch, or a known difference.

KNOWN_DIFFERENCE = 2#
MATCH = 1#
MISMATCH = 0#
class datacompy.spark.SparkCompare(spark_session: SparkSession, base_df: DataFrame, compare_df: DataFrame, join_columns: List[str | Tuple[str, str]], column_mapping: List[Tuple[str, str]] | None = None, cache_intermediates: bool = False, known_differences: List[Dict[str, Any]] | None = None, rel_tol: float = 0, abs_tol: float = 0, show_all_columns: bool = False, match_rates: bool = False)#

Bases: object

Comparison class used to compare two Spark DataFrames.

Extends the Compare functionality to the wide world of Spark and out-of-memory data.

Parameters:
  • spark_session (pyspark.sql.SparkSession) – A SparkSession to be used to execute Spark commands in the comparison.

  • base_df (pyspark.sql.DataFrame) – The dataframe to serve as a basis for comparison. While you will ultimately get the same results comparing A to B as you will comparing B to A, by convention base_df should be the canonical, gold standard reference dataframe in the comparison.

  • compare_df (pyspark.sql.DataFrame) – The dataframe to be compared against base_df.

  • join_columns (list) – A list of columns comprising the join key(s) of the two dataframes. If the column names are the same in the two dataframes, the names of the columns can be given as strings. If the names differ, the join_columns list should include tuples of the form (base_column_name, compare_column_name).

  • column_mapping (list[tuple], optional) – If columns to be compared have different names in the base and compare dataframes, a list should be provided in column_mapping consisting of tuples of the form (base_column_name, compare_column_name) for each set of differently-named columns to be compared against each other.

  • cache_intermediates (bool, optional) – Whether or not SparkCompare will cache intermediate dataframes (such as the deduplicated version of dataframes, or the joined comparison). This will take a large amount of cache, proportional to the size of your dataframes, but will significantly speed up performance, as multiple steps will not have to recompute transformations. False by default.

  • known_differences (list[dict], optional) –

    A list of dictionaries that define transformations to apply to the compare dataframe to match values when there are known differences between base and compare. The dictionaries should contain:

    • name: A name that describes the transformation

    • types: The types that the transformation should be applied to.

      This prevents certain transformations from being applied to types that don’t make sense and would cause exceptions.

    • transformation: A Spark SQL statement to apply to the column

      in the compare dataset. The string “{input}” will be replaced by the variable in question.

  • abs_tol (float, optional) – Absolute tolerance between two values.

  • rel_tol (float, optional) – Relative tolerance between two values.

  • show_all_columns (bool, optional) – If true, all columns will be shown in the report including columns with a 100% match rate.

  • match_rates (bool, optional) – If true, match rates by column will be shown in the column summary.

Returns:

Instance of a SparkCompare object, ready to run comparisons. Note that if cache_intermediates=True, this instance will already have done some work deduping the input dataframes. If cache_intermediates=False, the instantiation of this object is lazy.

Return type:

SparkCompare
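
A construction sketch showing differently-named join keys, a column mapping, and a known-difference transformation. The data and names are hypothetical, and the "date" type label in known_differences is an assumption; the dictionary keys follow the description above:

>>> import datetime
>>> from pyspark.sql import SparkSession
>>> from datacompy.spark import SparkCompare
>>> spark = SparkSession.builder.getOrCreate()
>>> base_df = spark.createDataFrame(
...     [(1, "2017-01-01"), (2, "2017-02-01")], ["acct_id", "open_date"]
... )
>>> compare_df = spark.createDataFrame(
...     [(1, datetime.date(2017, 1, 1)), (2, datetime.date(2017, 2, 1))],
...     ["account_identifier", "open_dt"],
... )
>>> comparison = SparkCompare(
...     spark,
...     base_df,
...     compare_df,
...     join_columns=[("acct_id", "account_identifier")],
...     column_mapping=[("open_date", "open_dt")],
...     known_differences=[{
...         "name": "date to string",
...         "types": ["date"],  # assumed type label
...         "transformation": "cast({input} as string)",
...     }],
... )
>>> comparison.report()  # prints the report to stdout by default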

property base_row_count: int#

Get the count of rows in the de-duped base dataframe

Type:

int

property columns_compared: List[str]#

Get columns to be compared in both dataframes (all columns in both, excluding the join key(s))

Type:

list[str]

property columns_in_both: Set[str]#

Get columns in both dataframes

Type:

set[str]

property columns_only_base: Set[str]#

Get columns that are unique to the base dataframe

Type:

set[str]

property columns_only_compare: Set[str]#

Get columns that are unique to the compare dataframe

Type:

set[str]

property common_row_count: int#

Get the count of rows in common between base and compare dataframes

Type:

int

property compare_row_count: int#

Get the count of rows in the de-duped compare dataframe

Type:

int

report(file: typing.TextIO = <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>) None#

Creates a comparison report and prints it to the file specified (stdout by default).

Parameters:

file (file, optional) – A filehandle to write the report to. By default, this is sys.stdout, printing the report to stdout. You can also redirect this to an output file, as in the example.

Examples

>>> with open('my_report.txt', 'w') as report_file:
...     comparison.report(file=report_file)

property rows_both_all: DataFrame | None#

Returns all rows in both dataframes

Type:

pyspark.sql.DataFrame

property rows_both_mismatch: DataFrame | None#

Returns all rows in both dataframes that have mismatches

Type:

pyspark.sql.DataFrame

property rows_only_base: DataFrame#

Returns rows only in the base dataframe

Type:

pyspark.sql.DataFrame

property rows_only_compare: DataFrame | None#

Returns rows only in the compare dataframe

Type:

pyspark.sql.DataFrame

datacompy.spark.decimal_comparator() str#

Module contents#