datacompy package¶
Subpackages¶
- datacompy.spark package
- Submodules
- datacompy.spark.helper module
- datacompy.spark.sql module
SparkSQLCompare
SparkSQLCompare.all_columns_match()
SparkSQLCompare.all_mismatch()
SparkSQLCompare.all_rows_overlap()
SparkSQLCompare.count_matching_rows()
SparkSQLCompare.df1
SparkSQLCompare.df1_unq_columns()
SparkSQLCompare.df2
SparkSQLCompare.df2_unq_columns()
SparkSQLCompare.intersect_columns()
SparkSQLCompare.intersect_rows_match()
SparkSQLCompare.matches()
SparkSQLCompare.report()
SparkSQLCompare.sample_mismatch()
SparkSQLCompare.subset()
calculate_max_diff()
calculate_null_diff()
columns_equal()
decimal_comparator()
get_merged_columns()
- Module contents
Submodules¶
datacompy.base module¶
Compare two Pandas DataFrames.
Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.
- class datacompy.base.BaseCompare¶
Bases:
ABC
Base comparison class.
- abstract all_columns_match() bool ¶
Check if all columns match.
- abstract all_mismatch(ignore_matching_cols: bool = False) Any ¶
Get all rows that mismatch.
- abstract all_rows_overlap() bool ¶
Check if all rows overlap.
- abstract count_matching_rows() int ¶
Count the number of matching rows.
- abstract property df1: Any¶
Get the first dataframe.
- abstract df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- abstract property df2: Any¶
Get the second dataframe.
- abstract df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- abstract intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- abstract intersect_rows_match() bool ¶
Check if the intersection of rows match.
- abstract matches(ignore_extra_columns: bool = False) bool ¶
Check if the dataframes match.
- only_join_columns() bool ¶
Check whether the only columns present are the join columns.
- abstract report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used.
- Returns:
The report, formatted according to the template.
- Return type:
str
- abstract sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) Any ¶
Get a sample of rows that mismatch.
- abstract subset() bool ¶
Check if one dataframe is a subset of the other.
- datacompy.base.df_to_str(df: Any, sample_count: int | None = None, on_index: bool = False) str ¶
Convert a DataFrame to a string representation.
This is a centralized function to handle DataFrame to string conversion for different DataFrame types (pandas, Spark, Polars, etc.)
- Parameters:
df (Any) – The DataFrame to convert to string. Can be pandas, Spark, or Polars DataFrame.
sample_count (int, optional) – For distributed DataFrames (like Spark), limit the number of rows to convert.
on_index (bool, default False) – If True, include the index in the output.
- Returns:
String representation of the DataFrame
- Return type:
str
- datacompy.base.get_column_tolerance(column: str, tol_dict: Dict[str, float]) float ¶
Return the tolerance value for a given column from a dictionary of tolerances.
- Parameters:
column (str) – The name of the column for which to retrieve the tolerance.
tol_dict (dict of str to float) – Dictionary mapping column names to their tolerance values. May contain a “default” key for columns not explicitly listed.
- Returns:
The tolerance value for the specified column, or the “default” tolerance if the column is not found. Returns 0 if neither the column nor “default” is present in the dictionary.
- Return type:
float
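The lookup behavior described above can be sketched in a few lines. This is a simplified re-implementation for illustration (the column names in the example dictionary are hypothetical), not the library's exact code:

```python
from typing import Dict

def get_column_tolerance(column: str, tol_dict: Dict[str, float]) -> float:
    # Exact column name wins; otherwise fall back to the "default" key; else 0.
    return tol_dict.get(column, tol_dict.get("default", 0))

tolerances = {"price": 0.01, "default": 0.1}
print(get_column_tolerance("price", tolerances))     # 0.01 (explicit entry)
print(get_column_tolerance("quantity", tolerances))  # 0.1 (falls back to "default")
print(get_column_tolerance("id", {}))                # 0 (neither present)
```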
- datacompy.base.render(template_name: str, **context: Any) str ¶
Render a template using Jinja2.
- Parameters:
template_name (str) – The name of the template file to render. This can be:
- A filename in the default templates directory (with or without .j2 extension)
- A relative path from the default templates directory
- An absolute path to a template file
**context (dict) – The context variables to pass to the template
- Returns:
The rendered template
- Return type:
str
- Raises:
FileNotFoundError – If the template file cannot be found in any of the expected locations
- datacompy.base.save_html_report(report: str, html_file: str | Path) None ¶
Save a text report as an HTML file.
- Parameters:
report (str) – The text report to convert to HTML
html_file (str or Path) – The path where the HTML file should be saved
- datacompy.base.temp_column_name(*dataframes) str ¶
Get a temp column name that isn’t included in columns of any dataframes.
- Parameters:
dataframes (list of DataFrames) – The DataFrames to create a temporary column name for
- Returns:
String column name that looks like ‘_temp_x’ for some integer x
- Return type:
str
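The collision-avoidance idea behind temp_column_name can be sketched as below. This is an illustrative re-implementation, not the library's actual code; the FakeFrame class is a hypothetical stand-in for anything exposing a .columns iterable:

```python
def temp_column_name(*dataframes) -> str:
    # Find the smallest integer x such that "_temp_x" collides with no column
    # in any of the given dataframes.
    existing = {col for df in dataframes for col in df.columns}
    i = 0
    while f"_temp_{i}" in existing:
        i += 1
    return f"_temp_{i}"

class FakeFrame:  # hypothetical stand-in for a pandas/Polars DataFrame
    def __init__(self, columns):
        self.columns = columns

print(temp_column_name(FakeFrame(["a", "_temp_0"]), FakeFrame(["b"])))  # _temp_1
```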
datacompy.core module¶
Compare two Pandas DataFrames.
Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.
- class datacompy.core.Compare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str | None = None, on_index: bool = False, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True)¶
Bases:
BaseCompare
Comparison class to be used to compare whether two dataframes are equal.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (pandas DataFrame) – First dataframe to check
df2 (pandas DataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
on_index (bool, optional) – If True, the index will be used to join the two dataframes. If both join_columns and on_index are provided, an exception will be raised.
abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.
ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case
- Variables:
df1_unq_rows (pandas DataFrame) – All records that are only in df1 (based on a join on join_columns)
df2_unq_rows (pandas DataFrame) – All records that are only in df2 (based on a join on join_columns)
- all_columns_match() bool ¶
Whether the columns all match in the dataframes.
- all_mismatch(ignore_matching_cols: bool = False) DataFrame ¶
Get all rows with any columns that have a mismatch.
Returns all df1 and df2 versions of the columns and join columns.
- Parameters:
ignore_matching_cols (bool, optional) – Whether to show the matching columns in the output. The default is False.
- Returns:
All rows of the intersection dataframe that have any columns that don’t match.
- Return type:
Pandas.DataFrame
- all_rows_overlap() bool ¶
Whether the rows are all present in both dataframes.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence of the join keys)
- Return type:
bool
- count_matching_rows() int ¶
Count the number of rows that match (on overlapping fields).
- Returns:
Number of matching rows
- Return type:
int
- property df1: DataFrame¶
Get the first dataframe.
- df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- property df2: DataFrame¶
Get the second dataframe.
- df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- intersect_rows_match() bool ¶
Check whether the intersect rows all match.
- matches(ignore_extra_columns: bool = False) bool ¶
Return True if the dataframes match, False otherwise.
- Parameters:
ignore_extra_columns (bool) – If True, ignore any columns that are in one dataframe but not in the other.
- Returns:
True if the dataframes match, False otherwise.
- Return type:
bool
- report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:
- column_summary: Dict with column statistics, including common_columns, df1_unique, df2_unique, df1_name, df2_name
- row_summary: Dict with row statistics, including match_columns, equal_rows, unequal_rows
- column_comparison: Dict with column comparison statistics, including unequal_columns, equal_columns, unequal_values
- mismatch_stats: Dict containing stats (a list of dicts with column mismatch statistics: column, match, mismatch, null_diff, etc.), samples (sample rows with mismatched values), has_samples (Boolean indicating if there are any mismatch samples), and has_mismatches (Boolean indicating if there are any mismatches)
- df1_unique_rows: Dict with unique rows in df1, including has_rows, rows, columns
- df2_unique_rows: Dict with unique rows in df2, including has_rows, rows, columns
- Returns:
The report, formatted according to the template.
- Return type:
str
Examples
Basic usage with default template:
>>> compare = datacompy.Compare(df1, df2, join_columns=['id'])
>>> report = compare.report()
>>> print(report)
Using a custom template file:
>>> # Create a custom template file (custom_report.j2)
>>> with open('custom_report.j2', 'w') as f:
...     f.write('''
... Comparison Report
... ================
...
... DataFrames: {{ df1_name }} vs {{ df2_name }}
...
... Shape Summary:
... - {{ df1_name }}: {{ df1_shape[0] }} rows x {{ df1_shape[1] }} columns
... - {{ df2_name }}: {{ df2_shape[0] }} rows x {{ df2_shape[1] }} columns
...
... {% if mismatch_stats %}
... Mismatched Columns ({{ mismatch_stats|length }}):
... {% for col in mismatch_stats %}
... - {{ col.column }} ({{ col.unequal_cnt }} mismatches)
... {% endfor %}
... {% else %}
... No mismatches found in any columns!
... {% endif %}
... ''')
>>> # Generate report with custom template
>>> report = compare.report(template_path='custom_report.j2')
>>> print(report)
- sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None ¶
Return sample mismatches.
Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.
- Parameters:
column (str) – The raw column name (i.e. without _df1 appended)
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
for_display (bool, optional) – Whether this is just going to be used for display (overwrites the column names)
- Returns:
Pandas.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
None – When the column being requested is not an intersecting column between dataframes.
- subset() bool ¶
Return True if dataframe 2 is a subset of dataframe 1.
Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
- Returns:
True if dataframe 2 is a subset of dataframe 1.
- Return type:
bool
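The tolerance rule stated above, abs_tol + rel_tol * abs(df2[‘value’]), is the same check that numpy’s isclose performs with its atol/rtol parameters, which makes it easy to predict which value pairs will count as equal. The arrays below are hypothetical, for illustration only:

```python
import numpy as np

df1_vals = np.array([1.000, 2.000, 3.000])
df2_vals = np.array([1.004, 2.050, 3.000])

abs_tol, rel_tol = 0.01, 0.0
# A pair matches when |v1 - v2| <= abs_tol + rel_tol * |v2|,
# which is exactly np.isclose(v1, v2, rtol=rel_tol, atol=abs_tol).
matches = np.isclose(df1_vals, df2_vals, rtol=rel_tol, atol=abs_tol)
print(matches)  # [ True False  True]
```

With abs_tol=0.01, the 0.004 difference passes, the 0.050 difference fails, and the exact match passes.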
- datacompy.core.calculate_max_diff(col_1: pd.Series[Any], col_2: pd.Series[Any]) float ¶
Get a maximum difference between two columns.
- Parameters:
col_1 (Pandas.Series) – The first column
col_2 (Pandas.Series) – The second column
- Returns:
Numeric field, or zero.
- Return type:
Numeric
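The “Numeric field, or zero” behavior can be sketched as below. This is a simplified illustration of the documented contract, not the library’s exact implementation:

```python
import pandas as pd

def calculate_max_diff(col_1: pd.Series, col_2: pd.Series) -> float:
    # Largest absolute element-wise difference; falls back to 0 when the
    # columns cannot be coerced to floats.
    try:
        return float((col_1.astype(float) - col_2.astype(float)).abs().max())
    except (ValueError, TypeError):
        return 0.0

print(calculate_max_diff(pd.Series([1.0, 2.0]), pd.Series([1.5, 1.0])))  # 1.0
print(calculate_max_diff(pd.Series(["a"]), pd.Series(["b"])))            # 0.0
```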
- datacompy.core.columns_equal(col_1: pd.Series[Any], col_2: pd.Series[Any], rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False) pd.Series[bool] ¶
Compare two columns from a dataframe.
Returns a True/False series, with the same index as column 1.
Two nulls (np.nan) will evaluate to True.
A null and a non-null value will evaluate to False.
Numeric values will use the relative and absolute tolerances.
Decimal values (decimal.Decimal) will be converted to floats, where possible, before comparing.
Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.
Notes
As of version 0.14.0, if a column is of a mixed data type the compare will default to returning False.
list and np.array types will be compared row-wise using np.array_equal. Depending on the size of your data this might lead to performance issues.
All the rows must be of the same type; otherwise the column is considered “mixed” and will default to False for everything.
- Parameters:
col_1 (Pandas.Series) – The first column to look at
col_2 (Pandas.Series) – The second column
rel_tol (float, optional) – Relative tolerance
abs_tol (float, optional) – Absolute tolerance
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns
ignore_case (bool, optional) – Flag to ignore the case of string columns
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
pandas.Series
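The null-handling and tolerance rules above can be sketched for the numeric path. This is an illustrative simplification (string, date, and mixed-type handling omitted), not the library’s actual code:

```python
import numpy as np
import pandas as pd

def columns_equal_numeric(col_1: pd.Series, col_2: pd.Series,
                          rel_tol: float = 0, abs_tol: float = 0) -> pd.Series:
    # Two nulls compare equal; a null vs. a non-null is False; numeric values
    # match within abs_tol + rel_tol * |col_2|.
    both_null = col_1.isna() & col_2.isna()
    close = np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol)  # NaN vs NaN -> False
    return pd.Series(close, index=col_1.index) | both_null

s1 = pd.Series([1.0, np.nan, 3.0])
s2 = pd.Series([1.001, np.nan, 4.0])
print(columns_equal_numeric(s1, s2, abs_tol=0.01).tolist())  # [True, True, False]
```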
- datacompy.core.compare_string_and_date_columns(col_1: pd.Series[Any], col_2: pd.Series[Any]) pd.Series[bool] ¶
Compare a string column and date column, value-wise.
This tries to convert a string column to a date column and compare that way.
- Parameters:
col_1 (Pandas.Series) – The first column to look at
col_2 (Pandas.Series) – The second column
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
pandas.Series
- datacompy.core.generate_id_within_group(dataframe: DataFrame, join_columns: List[str]) pd.Series[int] ¶
Generate an ID column that can be used to deduplicate identical rows.
The series generated is the order within a unique group, and it handles nulls.
- Parameters:
dataframe (Pandas.DataFrame) – The dataframe to operate on
join_columns (list) – List of strings which are the join columns
- Returns:
The ID column that’s unique in each group.
- Return type:
Pandas.Series
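The “order within a unique group” idea can be sketched with a groupby/cumcount. This is a simplified illustration; the sentinel string used here for null keys is hypothetical, not necessarily what the library uses internally:

```python
import pandas as pd

def generate_id_within_group(dataframe: pd.DataFrame, join_columns) -> pd.Series:
    # Replace nulls with a sentinel so null join keys still group together,
    # then number each row within its duplicate-key group (0, 1, 2, ...).
    filled = dataframe[join_columns].astype(object).fillna("DATACOMPY_NULL")
    return filled.groupby(join_columns).cumcount()

df = pd.DataFrame({"id": [1, 1, 2, None, None]})
print(generate_id_within_group(df, ["id"]).tolist())  # [0, 1, 0, 0, 1]
```

The resulting ID, combined with the join columns, gives each duplicate row a unique key, which is what makes deduplicated joins possible.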
- datacompy.core.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str] ¶
Get the columns from an original dataframe, in the new merged dataframe.
- Parameters:
original_df (Pandas.DataFrame) – The original, pre-merge dataframe
merged_df (Pandas.DataFrame) – Post-merge with another dataframe, with suffixes added in.
suffix (str) – The suffix that was used to distinguish the original dataframe’s columns where they overlapped with the other merged dataframe.
- datacompy.core.normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) Series ¶
Normalize a string column by converting to upper case and stripping whitespace.
- Parameters:
column (pd.Series) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing
- Returns:
The normalized column
- Return type:
pd.Series
Notes
Will not operate on categorical columns.
datacompy.fugue module¶
Compare two DataFrames that are supported by Fugue.
- datacompy.fugue.all_columns_match(df1: AnyDataFrame, df2: AnyDataFrame) bool ¶
Whether the columns all match in the dataframes.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
- Returns:
Boolean indicating whether the columns all match in the dataframes
- Return type:
bool
- datacompy.fugue.all_rows_overlap(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) bool ¶
Check if the rows are all present in both dataframes.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)
ignore_case (bool, optional) – Flag to ignore the case of string columns
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over just vanilla Pandas
strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence of the join keys)
- Return type:
bool
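The overlap semantics above (“all rows in df1 are in df2 and vice versa”) can be sketched as a key-existence check. This is an illustrative pandas-only simplification (duplicate-row handling and value tolerances omitted), not the Fugue implementation:

```python
import pandas as pd

def all_rows_overlap_keys_only(df1: pd.DataFrame, df2: pd.DataFrame, join_columns) -> bool:
    # Overlap holds when every join-key combination in df1 appears in df2
    # and vice versa, i.e. an outer merge of the key sets has no one-sided rows.
    k1 = df1[join_columns].drop_duplicates()
    k2 = df2[join_columns].drop_duplicates()
    merged = k1.merge(k2, on=join_columns, how="outer", indicator=True)
    return bool((merged["_merge"] == "both").all())

a = pd.DataFrame({"id": [1, 2, 3]})
b = pd.DataFrame({"id": [3, 2, 1]})
print(all_rows_overlap_keys_only(a, b, ["id"]))                          # True
print(all_rows_overlap_keys_only(a, pd.DataFrame({"id": [1]}), ["id"]))  # False
```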
- datacompy.fugue.count_matching_rows(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) int ¶
Count the number of rows that match (on overlapping fields).
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)
ignore_case (bool, optional) – Flag to ignore the case of string columns
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over just vanilla Pandas
strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.
- Returns:
Number of matching rows
- Return type:
int
- datacompy.fugue.intersect_columns(df1: AnyDataFrame, df2: AnyDataFrame) OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
- Returns:
Set of columns that are shared between the two dataframes
- Return type:
OrderedSet
- datacompy.fugue.is_match(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) bool ¶
Check whether two dataframes match.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)
ignore_case (bool, optional) – Flag to ignore the case of string columns
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over just vanilla Pandas
strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.
- Returns:
Boolean indicating whether the DataFrames match.
- Return type:
bool
- datacompy.fugue.report(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, sample_count: int = 10, column_count: int = 10, html_file: str | None = None, parallelism: int | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns)
ignore_case (bool, optional) – Flag to ignore the case of string columns
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over just vanilla Pandas
strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, file creation will be skipped.
- Returns:
The report, as a human-readable formatted string.
- Return type:
str
- datacompy.fugue.unq_columns(df1: AnyDataFrame, df2: AnyDataFrame) OrderedSet[str] ¶
Get columns that are unique to df1.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
- Returns:
Set of columns that are unique to df1
- Return type:
OrderedSet
datacompy.polars module¶
Compare two Polars DataFrames.
Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.
- class datacompy.polars.PolarsCompare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True)¶
Bases:
BaseCompare
Comparison class to be used to compare whether two dataframes are equal.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (Polars DataFrame) – First dataframe to check
df2 (Polars DataFrame) – Second dataframe to check
join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.
ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case
- Variables:
df1_unq_rows (Polars DataFrame) – All records that are only in df1 (based on a join on join_columns)
df2_unq_rows (Polars DataFrame) – All records that are only in df2 (based on a join on join_columns)
- all_columns_match() bool ¶
Whether the columns all match in the dataframes.
- all_mismatch(ignore_matching_cols: bool = False) DataFrame ¶
Get all rows with any columns that have a mismatch.
Returns all df1 and df2 versions of the columns and join columns.
- Parameters:
ignore_matching_cols (bool, optional) – Whether to show the matching columns in the output. The default is False.
- Returns:
All rows of the intersection dataframe that have any columns that don’t match.
- Return type:
Polars.DataFrame
- all_rows_overlap() bool ¶
Whether the rows are all present in both dataframes.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence of the join keys)
- Return type:
bool
- count_matching_rows() int ¶
Count the number of rows that match (on overlapping fields).
- Returns:
Number of matching rows
- Return type:
int
- property df1: DataFrame¶
Get the first dataframe.
- df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- property df2: DataFrame¶
Get the second dataframe.
- df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- intersect_rows_match() bool ¶
Check whether the intersect rows all match.
- matches(ignore_extra_columns: bool = False) bool ¶
Return True if the dataframes match, False otherwise.
- Parameters:
ignore_extra_columns (bool) – If True, ignore any columns that are in one dataframe but not in the other.
- Returns:
True if the dataframes match, False otherwise.
- Return type:
bool
- report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:
- column_summary: Dict with column statistics, including common_columns, df1_unique, df2_unique, df1_name, df2_name
- row_summary: Dict with row statistics, including match_columns, equal_rows, unequal_rows
- column_comparison: Dict with column comparison statistics, including unequal_columns, equal_columns, unequal_values
- mismatch_stats: Dict containing stats (a list of dicts with column mismatch statistics: column, match, mismatch, null_diff, etc.), samples (sample rows with mismatched values), has_samples (Boolean indicating if there are any mismatch samples), and has_mismatches (Boolean indicating if there are any mismatches)
- df1_unique_rows: Dict with unique rows in df1, including has_rows, rows, columns
- df2_unique_rows: Dict with unique rows in df2, including has_rows, rows, columns
- Returns:
The report, formatted according to the template.
- Return type:
str
- sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None ¶
Return sample mismatches.
Get a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.
- Parameters:
column (str) – The raw column name (i.e. without _df1 appended)
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
for_display (bool, optional) – Whether this is just going to be used for display (overwrites the column names)
- Returns:
Polars.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
None – When the column being requested is not an intersecting column between dataframes.
- subset() bool ¶
Return True if dataframe 2 is a subset of dataframe 1.
Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
- Returns:
True if dataframe 2 is a subset of dataframe 1.
- Return type:
bool
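The subset rule above (all df2 columns exist in df1, and every df2 row matches some df1 row on the shared columns) can be sketched with the standard library alone. This is an illustrative, hypothetical helper operating on rows as plain dicts, not the Polars-backed implementation:

```python
def subset_sketch(df1_rows, df2_rows):
    # df2 is a subset of df1 when every df2 column also exists in df1 and
    # every df2 row matches some df1 row on those shared columns.
    if not df2_rows:
        return True
    shared = set(df2_rows[0])
    if not df1_rows or not shared <= set(df1_rows[0]):
        return False
    # Project df1 rows onto the shared columns for O(1) membership tests.
    seen = {tuple(sorted((c, row[c]) for c in shared)) for row in df1_rows}
    return all(tuple(sorted((c, row[c]) for c in shared)) in seen
               for row in df2_rows)
```

For example, with `df1_rows = [{"id": 1, "a": 2}, {"id": 2, "a": 3}]`, the rows `[{"id": 1}, {"id": 2}]` are a subset, while `[{"id": 3}]` is not.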
- datacompy.polars.calculate_max_diff(col_1: Series, col_2: Series) float ¶
Get the maximum difference between two columns.
- Parameters:
col_1 (Polars.Series) – The first column
col_2 (Polars.Series) – The second column
- Returns:
Numeric field, or zero.
- Return type:
Numeric
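The "numeric field, or zero" behavior can be sketched in plain Python. This is a hypothetical stdlib-only illustration (`max_diff_sketch`), not the Polars implementation:

```python
import math

def max_diff_sketch(col_1, col_2):
    # Maximum absolute difference across paired values; pairs that cannot be
    # coerced to floats (including nulls) are skipped, and a column with no
    # comparable pairs yields 0.0 ("Numeric field, or zero").
    best = 0.0
    for a, b in zip(col_1, col_2):
        try:
            d = abs(float(a) - float(b))
        except (TypeError, ValueError):
            continue
        if not math.isnan(d):
            best = max(best, d)
    return best
```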
- datacompy.polars.columns_equal(col_1: Series, col_2: Series, rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False) Series ¶
Compare two columns from a dataframe.
Returns a True/False series, with the same index as column 1.
Two nulls (np.nan) will evaluate to True.
A null and a non-null value will evaluate to False.
Numeric values will use the relative and absolute tolerances.
Decimal values (decimal.Decimal) will be converted to floats, where possible, before comparing.
Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.
- Parameters:
col_1 (Polars.Series) – The first column to look at
col_2 (Polars.Series) – The second column
rel_tol (float, optional) – Relative tolerance
abs_tol (float, optional) – Absolute tolerance
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns
ignore_case (bool, optional) – Flag to ignore the case of string columns
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
Polars.Series
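The null, tolerance, and string-normalization rules described above can be sketched with the standard library alone. This is an illustrative, hypothetical re-implementation of the documented semantics (including the abs_tol + rel_tol * abs(col_2) formula used elsewhere in this package), not the Polars-backed function:

```python
import math

def columns_equal_sketch(col_1, col_2, rel_tol=0.0, abs_tol=0.0,
                         ignore_spaces=False, ignore_case=False):
    result = []
    for a, b in zip(col_1, col_2):
        a_null = a is None or (isinstance(a, float) and math.isnan(a))
        b_null = b is None or (isinstance(b, float) and math.isnan(b))
        if a_null or b_null:
            # Two nulls compare equal; a null vs. a non-null does not.
            result.append(a_null and b_null)
        elif isinstance(a, (int, float)) and isinstance(b, (int, float)):
            # Numeric values: compare the difference to abs_tol + rel_tol * |col_2|.
            result.append(abs(a - b) <= abs_tol + rel_tol * abs(b))
        else:
            # Non-numeric values: normalize if requested, then require an exact match.
            x, y = str(a), str(b)
            if ignore_spaces:
                x, y = x.strip(), y.strip()
            if ignore_case:
                x, y = x.upper(), y.upper()
            result.append(x == y)
    return result
```

For example, `columns_equal_sketch([1.0, None, " a"], [1.001, None, "A"], rel_tol=0.01, ignore_spaces=True, ignore_case=True)` returns `[True, True, True]`.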
- datacompy.polars.compare_string_and_date_columns(col_1: Series, col_2: Series) Series ¶
Compare a string column and date column, value-wise.
This tries to convert a string column to a date column and compare that way.
- Parameters:
col_1 (Polars.Series) – The first column to look at
col_2 (Polars.Series) – The second column
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
Polars.Series
- datacompy.polars.generate_id_within_group(dataframe: DataFrame, join_columns: List[str]) Series ¶
Generate an ID column that can be used to deduplicate identical rows.
The series generated is the order within a unique group, and it handles nulls.
- Parameters:
dataframe (Polars.DataFrame) – The dataframe to operate on
join_columns (list) – List of strings which are the join columns
- Returns:
The ID column that’s unique in each group.
- Return type:
Polars.Series
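The "order within a unique group" idea can be sketched with a counter keyed on the join-column values. This is a hypothetical stdlib-only illustration of the deduplication mechanism, not the Polars implementation:

```python
from collections import defaultdict

def id_within_group_sketch(rows, join_columns):
    # For each row (a dict), emit its 0-based occurrence number among rows
    # sharing the same join-column values; duplicate rows get distinct IDs,
    # so a later join on (join_columns + id) can line up otherwise-identical
    # rows. None (null) key values simply participate in the key.
    counts = defaultdict(int)
    ids = []
    for row in rows:
        key = tuple(row.get(c) for c in join_columns)
        ids.append(counts[key])
        counts[key] += 1
    return ids
```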
- datacompy.polars.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str] ¶
Get the columns from an original dataframe, in the new merged dataframe.
- Parameters:
original_df (Polars.DataFrame) – The original, pre-merge dataframe
merged_df (Polars.DataFrame) – Post-merge with another dataframe, with suffixes added in.
suffix (str) – The suffix used to distinguish columns of the original dataframe that overlapped with the other merged dataframe.
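The mapping from original to post-merge column names can be sketched over plain column-name lists. This is a hypothetical stdlib-only illustration, not the Polars implementation:

```python
def get_merged_columns_sketch(original_cols, merged_cols, suffix):
    # Map each original column name to its name in the merged frame: keep the
    # bare name when it survived the merge unchanged, otherwise look for the
    # suffixed variant created for overlapping columns.
    merged = set(merged_cols)
    out = []
    for col in original_cols:
        if col in merged:
            out.append(col)
        elif col + suffix in merged:
            out.append(col + suffix)
        else:
            raise ValueError(f"column {col!r} not found in merged dataframe")
    return out
```

For example, `get_merged_columns_sketch(["id", "a"], ["id", "a_df1", "a_df2"], "_df1")` returns `["id", "a_df1"]`.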
- datacompy.polars.normalize_string_column(column: Series, ignore_spaces: bool, ignore_case: bool) Series ¶
Normalize a string column by converting to upper case and stripping whitespace.
- Parameters:
column (pl.Series) – The column to normalize
ignore_spaces (bool) – Whether to ignore spaces when normalizing
ignore_case (bool) – Whether to ignore case when normalizing
- Returns:
The normalized column
- Return type:
pl.Series
Notes
Will not operate on categorical columns.
datacompy.snowflake module¶
Compare two Snowpark SQL DataFrames and Snowflake tables.
Originally this package was meant to provide similar functionality to PROC COMPARE in SAS - i.e. human-readable reporting on the difference between two dataframes.
- class datacompy.snowflake.SnowflakeCompare(session: Session, df1: str | DataFrame, df2: str | DataFrame, join_columns: List[str] | str | None, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str | None = None, df2_name: str | None = None, ignore_spaces: bool = False)¶
Bases:
BaseCompare
Comparison class to be used to compare whether two Snowpark dataframes are equal.
df1 and df2 can refer to either a Snowpark dataframe or the name of a valid Snowflake table. The data structures which df1 and df2 represent must contain all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
session (snowflake.snowpark.session) – Session with the required connection session info for user and targeted tables
df1 (Union[str, sp.Dataframe]) – First table to check, provided either as the table’s name or as a Snowpark DF.
df2 (Union[str, sp.Dataframe]) – Second table to check, provided either as the table’s name or as a Snowpark DF.
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
df1_name (str, optional) – A string name for the first dataframe. If used alongside a snowflake table, overrides the default convention of naming the dataframe after the table.
df2_name (str, optional) – A string name for the second dataframe.
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).
- Variables:
df1_unq_rows (sp.DataFrame) – All records that are only in df1 (based on a join on join_columns)
df2_unq_rows (sp.DataFrame) – All records that are only in df2 (based on a join on join_columns)
- all_columns_match() bool ¶
Whether the columns all match in the dataframes.
- Returns:
True if all columns in df1 are in df2 and vice versa
- Return type:
bool
- all_mismatch(ignore_matching_cols: bool = False) DataFrame ¶
Get all rows with any columns that have a mismatch.
Returns all df1 and df2 versions of the columns and join columns.
- Parameters:
ignore_matching_cols (bool, optional) – Whether to show the matching columns in the output. The default is False.
- Returns:
All rows of the intersection dataframe, containing any columns, that don’t match.
- Return type:
sp.DataFrame
- all_rows_overlap() bool ¶
Whether the rows are all present in both dataframes.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence for join option)
- Return type:
bool
- count_matching_rows() int ¶
Count the number of rows that match (on overlapping fields).
- Returns:
Number of matching rows
- Return type:
int
- property df1: DataFrame¶
Get the first dataframe.
- df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- property df2: DataFrame¶
Get the second dataframe.
- df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- intersect_rows_match() bool ¶
Check whether the intersect rows all match.
- matches(ignore_extra_columns: bool = False) bool ¶
Return True or False if the dataframes match.
- Parameters:
ignore_extra_columns (bool) – If True, ignore any columns that are in one dataframe but not the other.
- Returns:
True or False if the dataframes match.
- Return type:
bool
- report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, the file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:
column_summary: Dict with column statistics, including common_columns, df1_unique, df2_unique, df1_name, df2_name
row_summary: Dict with row statistics, including match_columns, equal_rows, unequal_rows
column_comparison: Dict with column comparison statistics, including unequal_columns, equal_columns, unequal_values
mismatch_stats: Dict containing stats (a list of dicts with per-column mismatch statistics: column, match, mismatch, null_diff, etc.), samples (sample rows with mismatched values), has_samples (whether there are any mismatch samples), and has_mismatches (whether there are any mismatches)
df1_unique_rows: Dict with unique rows in df1, including has_rows, rows, columns
df2_unique_rows: Dict with unique rows in df2, including has_rows, rows, columns
- Returns:
The report, formatted according to the template.
- Return type:
str
- sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame ¶
Return sample mismatches.
Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.
- Parameters:
column (str) – The raw column name (i.e. without _df1 appended)
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)
- Returns:
A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
- Return type:
sp.DataFrame
- subset() bool ¶
Return True if dataframe 2 is a subset of dataframe 1.
Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
- Returns:
True if dataframe 2 is a subset of dataframe 1.
- Return type:
bool
- datacompy.snowflake.calculate_max_diff(dataframe: DataFrame, col_1: str, col_2: str) float ¶
Get the maximum difference between two columns.
- Parameters:
dataframe (sp.DataFrame) – DataFrame to do comparison on
col_1 (str) – The first column to look at
col_2 (str) – The second column
- Returns:
max diff
- Return type:
float
- datacompy.snowflake.calculate_null_diff(dataframe: DataFrame, col_1: str, col_2: str) int ¶
Get the null differences between two columns.
- Parameters:
dataframe (sp.DataFrame) – DataFrame to do comparison on
col_1 (str) – The first column to look at
col_2 (str) – The second column
- Returns:
null diff
- Return type:
int
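The "null diff" counted here is the number of rows where exactly one side is null. A hypothetical stdlib-only sketch of that idea (not the Snowpark implementation, which operates on DataFrame columns):

```python
def null_diff_sketch(col_1, col_2):
    # Count rows where exactly one of the two values is null: these are the
    # "null differences" reported per column in the mismatch statistics.
    return sum((a is None) != (b is None) for a, b in zip(col_1, col_2))
```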
- datacompy.snowflake.columns_equal(dataframe: DataFrame, col_1: str, col_2: str, col_match: str, rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False) DataFrame ¶
Compare two columns from a dataframe.
Returns a True/False series with the same index as column 1.
Two nulls (np.nan) will evaluate to True.
A null and a non-null value will evaluate to False.
Numeric values will use the relative and absolute tolerances.
Decimal values (decimal.Decimal) will be converted to floats, where possible, before comparing.
Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.
- Parameters:
dataframe (sp.DataFrame) – DataFrame to do comparison on
col_1 (str) – The first column to look at
col_2 (str) – The second column
col_match (str) – The matching column denoting if the compare was a match or not
rel_tol (float, optional) – Relative tolerance
abs_tol (float, optional) – Absolute tolerance
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns
- Returns:
The input dataframe with a column of boolean values added. True == the values match, False == the values don’t match.
- Return type:
sp.DataFrame
- datacompy.snowflake.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str] ¶
Get the columns from an original dataframe, in the new merged dataframe.
- Parameters:
original_df (sp.DataFrame) – The original, pre-merge dataframe
merged_df (sp.DataFrame) – Post-merge with another dataframe, with suffixes added in.
suffix (str) – The suffix used to distinguish columns of the original dataframe that overlapped with the other merged dataframe.
- Returns:
Column list of the original dataframe pre suffix
- Return type:
List[str]
- datacompy.snowflake.temp_column_name(*dataframes) str ¶
Get a temp column name that isn’t included in columns of any dataframes.
- Parameters:
dataframes (list of DataFrames) – The DataFrames to create a temporary column name for
- Returns:
String column name that looks like ‘_temp_x’ for some integer x
- Return type:
str
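The "'_temp_x' for some integer x" contract can be sketched over plain column-name collections. This is a hypothetical stdlib-only illustration (taking sets of names rather than DataFrames), not the Snowpark implementation:

```python
def temp_column_name_sketch(*column_sets):
    # Find the smallest integer x such that '_temp_x' collides with no
    # existing column across all of the given column-name collections.
    existing = set()
    for cols in column_sets:
        existing.update(cols)
    i = 0
    while f"_temp_{i}" in existing:
        i += 1
    return f"_temp_{i}"
```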
Module contents¶
DataComPy is a package to compare two Pandas DataFrames.
Originally started to be something of a replacement for SAS’s PROC COMPARE for Pandas DataFrames with some more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (in that it prints out some stats, and lets you tweak how accurate matches have to be). Then extended to carry that functionality over to Spark Dataframes.
- class datacompy.BaseCompare¶
Bases:
ABC
Base comparison class.
- abstract all_columns_match() bool ¶
Check if all columns match.
- abstract all_mismatch(ignore_matching_cols: bool = False) Any ¶
Get all rows that mismatch.
- abstract all_rows_overlap() bool ¶
Check if all rows overlap.
- abstract count_matching_rows() int ¶
Count the number of matching rows.
- abstract property df1: Any¶
Get the first dataframe.
- abstract df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- abstract property df2: Any¶
Get the second dataframe.
- abstract df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- abstract intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- abstract intersect_rows_match() bool ¶
Check if the intersection of rows match.
- abstract matches(ignore_extra_columns: bool = False) bool ¶
Check if the dataframes match.
- only_join_columns() bool ¶
Check whether the only columns are the join columns.
- abstract report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, the file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used.
- Returns:
The report, formatted according to the template.
- Return type:
str
- abstract sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) Any ¶
Get a sample of rows that mismatch.
- abstract subset() bool ¶
Check if one dataframe is a subset of the other.
- class datacompy.Compare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str | None = None, on_index: bool = False, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True)¶
Bases:
BaseCompare
Comparison class to be used to compare whether two dataframes are equal.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (pandas DataFrame) – First dataframe to check
df2 (pandas DataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
on_index (bool, optional) – If True, the index will be used to join the two dataframes. If both join_columns and on_index are provided, an exception will be raised.
abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.
ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast into lower case
- Variables:
df1_unq_rows (pandas DataFrame) – All records that are only in df1 (based on a join on join_columns)
df2_unq_rows (pandas DataFrame) – All records that are only in df2 (based on a join on join_columns)
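The per-column tolerance lookup described for abs_tol/rel_tol (a plain float, or a dict with a special "default" key) can be sketched with the standard library. These are hypothetical helpers illustrating the documented abs_tol + rel_tol * abs(df2['value']) comparison, not datacompy's internals:

```python
def resolve_tolerance_sketch(tol, column):
    # A plain float applies to every column; a dict is looked up by column
    # name, falling back to the special "default" key, then to 0.0.
    if isinstance(tol, dict):
        return tol.get(column, tol.get("default", 0.0))
    return float(tol)

def within_tolerance_sketch(v1, v2, column, abs_tol=0.0, rel_tol=0.0):
    # Differences are compared to abs_tol + rel_tol * abs(df2['value']).
    a = resolve_tolerance_sketch(abs_tol, column)
    r = resolve_tolerance_sketch(rel_tol, column)
    return abs(v1 - v2) <= a + r * abs(v2)
```

For example, with `abs_tol={"price": 0.01, "default": 0.001}`, the "price" column tolerates differences up to 0.01 while every other column tolerates only 0.001.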
- all_columns_match() bool ¶
Whether the columns all match in the dataframes.
- all_mismatch(ignore_matching_cols: bool = False) DataFrame ¶
Get all rows with any columns that have a mismatch.
Returns all df1 and df2 versions of the columns and join columns.
- Parameters:
ignore_matching_cols (bool, optional) – Whether to show the matching columns in the output. The default is False.
- Returns:
All rows of the intersection dataframe, containing any columns, that don’t match.
- Return type:
Pandas.DataFrame
- all_rows_overlap() bool ¶
Whether the rows are all present in both dataframes.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence for join option)
- Return type:
bool
- count_matching_rows() int ¶
Count the number of rows that match (on overlapping fields).
- Returns:
Number of matching rows
- Return type:
int
- property df1: DataFrame¶
Get the first dataframe.
- df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- property df2: DataFrame¶
Get the second dataframe.
- df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- intersect_rows_match() bool ¶
Check whether the intersect rows all match.
- matches(ignore_extra_columns: bool = False) bool ¶
Return True or False if the dataframes match.
- Parameters:
ignore_extra_columns (bool) – If True, ignore any columns that are in one dataframe but not the other.
- Returns:
True or False if the dataframes match.
- Return type:
bool
- report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, the file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:
column_summary: Dict with column statistics, including common_columns, df1_unique, df2_unique, df1_name, df2_name
row_summary: Dict with row statistics, including match_columns, equal_rows, unequal_rows
column_comparison: Dict with column comparison statistics, including unequal_columns, equal_columns, unequal_values
mismatch_stats: Dict containing stats (a list of dicts with per-column mismatch statistics: column, match, mismatch, null_diff, etc.), samples (sample rows with mismatched values), has_samples (whether there are any mismatch samples), and has_mismatches (whether there are any mismatches)
df1_unique_rows: Dict with unique rows in df1, including has_rows, rows, columns
df2_unique_rows: Dict with unique rows in df2, including has_rows, rows, columns
- Returns:
The report, formatted according to the template.
- Return type:
str
Examples
Basic usage with default template:
>>> compare = datacompy.Compare(df1, df2, join_columns=['id'])
>>> report = compare.report()
>>> print(report)
Using a custom template file:
>>> # Create a custom template file (custom_report.j2)
>>> with open('custom_report.j2', 'w') as f:
...     f.write('''
... Comparison Report
... ================
...
... DataFrames: {{ df1_name }} vs {{ df2_name }}
...
... Shape Summary:
... - {{ df1_name }}: {{ df1_shape[0] }} rows x {{ df1_shape[1] }} columns
... - {{ df2_name }}: {{ df2_shape[0] }} rows x {{ df2_shape[1] }} columns
...
... {% if mismatch_stats %}
... Mismatched Columns ({{ mismatch_stats|length }}):
... {% for col in mismatch_stats %}
... - {{ col.column }} ({{ col.unequal_cnt }} mismatches)
... {% endfor %}
... {% else %}
... No mismatches found in any columns!
... {% endif %}
... ''')
...
>>> # Generate report with custom template
>>> report = compare.report(template_path='custom_report.j2')
>>> print(report)
- sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None ¶
Return sample mismatches.
Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.
- Parameters:
column (str) – The raw column name (i.e. without _df1 appended)
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)
- Returns:
Pandas.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
None – When the column being requested is not an intersecting column between dataframes.
- subset() bool ¶
Return True if dataframe 2 is a subset of dataframe 1.
Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
- Returns:
True if dataframe 2 is a subset of dataframe 1.
- Return type:
bool
- class datacompy.PolarsCompare(df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True)¶
Bases:
BaseCompare
Comparison class to be used to compare whether two dataframes are equal.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (Polars DataFrame) – First dataframe to check
df2 (Polars DataFrame) – Second dataframe to check
join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns). Excludes categoricals.
ignore_case (bool, optional) – Flag to ignore the case of string columns. Excludes categoricals.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast into lower case
- Variables:
df1_unq_rows (Polars DataFrame) – All records that are only in df1 (based on a join on join_columns)
df2_unq_rows (Polars DataFrame) – All records that are only in df2 (based on a join on join_columns)
- all_columns_match() bool ¶
Whether the columns all match in the dataframes.
- all_mismatch(ignore_matching_cols: bool = False) DataFrame ¶
Get all rows with any columns that have a mismatch.
Returns all df1 and df2 versions of the columns and join columns.
- Parameters:
ignore_matching_cols (bool, optional) – Whether to show the matching columns in the output. The default is False.
- Returns:
All rows of the intersection dataframe, containing any columns, that don’t match.
- Return type:
Polars.DataFrame
- all_rows_overlap() bool ¶
Whether the rows are all present in both dataframes.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence for join option)
- Return type:
bool
- count_matching_rows() int ¶
Count the number of rows that match (on overlapping fields).
- Returns:
Number of matching rows
- Return type:
int
- property df1: DataFrame¶
Get the first dataframe.
- df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- property df2: DataFrame¶
Get the second dataframe.
- df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- intersect_rows_match() bool ¶
Check whether the intersect rows all match.
- matches(ignore_extra_columns: bool = False) bool ¶
Return True or False if the dataframes match.
- Parameters:
ignore_extra_columns (bool) – If True, ignore any columns that are in one dataframe but not the other.
- Returns:
True or False if the dataframes match.
- Return type:
bool
- report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, the file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:
column_summary: Dict with column statistics, including common_columns, df1_unique, df2_unique, df1_name, df2_name
row_summary: Dict with row statistics, including match_columns, equal_rows, unequal_rows
column_comparison: Dict with column comparison statistics, including unequal_columns, equal_columns, unequal_values
mismatch_stats: Dict containing stats (a list of dicts with per-column mismatch statistics: column, match, mismatch, null_diff, etc.), samples (sample rows with mismatched values), has_samples (whether there are any mismatch samples), and has_mismatches (whether there are any mismatches)
df1_unique_rows: Dict with unique rows in df1, including has_rows, rows, columns
df2_unique_rows: Dict with unique rows in df2, including has_rows, rows, columns
- Returns:
The report, formatted according to the template.
- Return type:
str
- sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame | None ¶
Return sample mismatches.
Get a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.
- Parameters:
column (str) – The raw column name (i.e. without _df1 appended)
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
for_display (bool, optional) – Whether this is just going to be used for display (overwrite the column names)
- Returns:
Polars.DataFrame – A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
None – When the column being requested is not an intersecting column between dataframes.
- subset() bool ¶
Return True if dataframe 2 is a subset of dataframe 1.
Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
- Returns:
True if dataframe 2 is a subset of dataframe 1.
- Return type:
bool
- class datacompy.SnowflakeCompare(session: Session, df1: str | DataFrame, df2: str | DataFrame, join_columns: List[str] | str | None, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str | None = None, df2_name: str | None = None, ignore_spaces: bool = False)¶
Bases:
BaseCompare
Comparison class to be used to compare whether two Snowpark dataframes are equal.
df1 and df2 can refer to either a Snowpark dataframe or the name of a valid Snowflake table. The data structures which df1 and df2 represent must contain all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
session (snowflake.snowpark.session) – Session with the required connection session info for user and targeted tables
df1 (Union[str, sp.Dataframe]) – First table to check, provided either as the table’s name or as a Snowpark DF.
df2 (Union[str, sp.Dataframe]) – Second table to check, provided either as the table’s name or as a Snowpark DF.
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
df1_name (str, optional) – A string name for the first dataframe. If used alongside a snowflake table, overrides the default convention of naming the dataframe after the table.
df2_name (str, optional) – A string name for the second dataframe.
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).
- Variables:
df1_unq_rows (sp.DataFrame) – All records that are only in df1 (based on a join on join_columns)
df2_unq_rows (sp.DataFrame) – All records that are only in df2 (based on a join on join_columns)
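The tolerance rule (differences compared to abs_tol + rel_tol * abs(df2['value'])) and the per-column dictionary form with a "default" key can be sketched in plain numpy. The helper below is illustrative only, not the library's implementation:

```python
import numpy as np

def within_tolerance(val1, val2, column, abs_tol=0, rel_tol=0):
    """Illustrative check mirroring the documented rule:
    |val1 - val2| <= abs_tol + rel_tol * |val2|.
    abs_tol / rel_tol may be floats or dicts keyed by column name,
    with "default" as the fallback key."""
    def resolve(tol):
        if isinstance(tol, dict):
            return tol.get(column, tol.get("default", 0))
        return tol
    a, r = resolve(abs_tol), resolve(rel_tol)
    return np.abs(val1 - val2) <= a + r * np.abs(val2)

# 100 vs 101 passes with a per-column absolute tolerance of 2
print(within_tolerance(100.0, 101.0, "price", abs_tol={"price": 2, "default": 0}))
```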
- all_columns_match() bool ¶
Whether the columns all match in the dataframes.
- Returns:
True if all columns in df1 are in df2 and vice versa
- Return type:
bool
- all_mismatch(ignore_matching_cols: bool = False) DataFrame ¶
Get all rows with any columns that have a mismatch.
Returns all df1 and df2 versions of the columns and join columns.
- Parameters:
ignore_matching_cols (bool, optional) – Whether to show the matching columns in the output. The default is False.
- Returns:
All rows of the intersection dataframe that have any mismatching columns.
- Return type:
sp.DataFrame
- all_rows_overlap() bool ¶
Whether the rows are all present in both dataframes.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence of the join keys only)
- Return type:
bool
- count_matching_rows() int ¶
Count the number of rows that match (on overlapping fields).
- Returns:
Number of matching rows
- Return type:
int
- property df1: DataFrame¶
Get the first dataframe.
- df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- property df2: DataFrame¶
Get the second dataframe.
- df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- intersect_rows_match() bool ¶
Check whether the intersect rows all match.
- matches(ignore_extra_columns: bool = False) bool ¶
Return True if the dataframes match, False otherwise.
- Parameters:
ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.
- Returns:
True if the dataframes match, False otherwise.
- Return type:
bool
- report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:
column_summary: Dict with column statistics including common_columns, df1_unique, df2_unique, df1_name, df2_name
row_summary: Dict with row statistics including match_columns, equal_rows, unequal_rows
column_comparison: Dict with column comparison statistics including unequal_columns, equal_columns, unequal_values
mismatch_stats: Dict containing:
stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)
samples: Sample rows with mismatched values
has_samples: Boolean indicating if there are any mismatch samples
has_mismatches: Boolean indicating if there are any mismatches
df1_unique_rows: Dict with unique rows in df1 including has_rows, rows, columns
df2_unique_rows: Dict with unique rows in df2 including has_rows, rows, columns
- Returns:
The report, formatted according to the template.
- Return type:
str
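As a sketch of how a custom template might use a few of the documented context variables, the snippet below builds a tiny Jinja2 template from a string and renders it against a hand-built context. The context keys come from the list above; the template text and the sample values are purely illustrative:

```python
from jinja2 import Template  # assumes Jinja2 is installed

# A minimal custom report template touching a few documented context
# variables; the dict fields follow the names listed above.
template_source = """\
Comparing {{ column_summary.df1_name }} vs {{ column_summary.df2_name }}
Common columns: {{ column_summary.common_columns }}
Equal rows: {{ row_summary.equal_rows }} / Unequal rows: {{ row_summary.unequal_rows }}
"""

# Render with a hand-built context to preview the layout
preview = Template(template_source).render(
    column_summary={"df1_name": "original", "df2_name": "new", "common_columns": 5},
    row_summary={"equal_rows": 98, "unequal_rows": 2},
)
print(preview)
```

A template saved to disk would be passed to report() via template_path instead.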
- sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame ¶
Return sample mismatches.
Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.
- Parameters:
column (str) – The raw column name (i.e. without _df1 appended)
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
for_display (bool, optional) – Whether the output is only for display (if so, the column names are overwritten)
- Returns:
A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
- Return type:
sp.DataFrame
- subset() bool ¶
Return True if dataframe 2 is a subset of dataframe 1.
Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
- Returns:
True if dataframe 2 is a subset of dataframe 1.
- Return type:
bool
- class datacompy.SparkSQLCompare(spark_session: SparkSession, df1: DataFrame, df2: DataFrame, join_columns: List[str] | str, abs_tol: float | Dict[str, float] = 0, rel_tol: float | Dict[str, float] = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True)¶
Bases:
BaseCompare
Comparison class to be used to compare whether two Spark SQL dataframes are equal.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
spark_session (pyspark.sql.SparkSession) – A SparkSession to be used to execute Spark commands in the comparison.
df1 (pyspark.sql.DataFrame) – First dataframe to check
df2 (pyspark.sql.DataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float or dict, optional) – Absolute tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
rel_tol (float or dict, optional) – Relative tolerance between two values. Can be either a float value applied to all columns, or a dictionary mapping column names to specific tolerance values. The special key “default” in the dictionary specifies the tolerance for columns not explicitly listed.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe.
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).
ignore_case (bool, optional) – Flag to ignore the case of string columns.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case.
- Variables:
df1_unq_rows (pyspark.sql.DataFrame) – All records that are only in df1 (based on a join on join_columns)
df2_unq_rows (pyspark.sql.DataFrame) – All records that are only in df2 (based on a join on join_columns)
intersect_rows (pyspark.sql.DataFrame) – All records that are in both df1 and df2
- all_columns_match() bool ¶
Whether the columns all match in the dataframes.
- Returns:
True if all columns in df1 are in df2 and vice versa
- Return type:
bool
- all_mismatch(ignore_matching_cols: bool = False) DataFrame ¶
Get all rows with any columns that have a mismatch.
Returns all df1 and df2 versions of the columns and join columns.
- Parameters:
ignore_matching_cols (bool, optional) – Whether to show the matching columns in the output. The default is False.
- Returns:
All rows of the intersection dataframe that have any mismatching columns.
- Return type:
pyspark.sql.DataFrame
- all_rows_overlap() bool ¶
Whether the rows are all present in both dataframes.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence of the join keys only)
- Return type:
bool
- count_matching_rows() int ¶
Count the number of rows that match (on overlapping fields).
- Returns:
Number of matching rows
- Return type:
int
- property df1: DataFrame¶
Get the first dataframe.
- df1_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df1.
- property df2: DataFrame¶
Get the second dataframe.
- df2_unq_columns() OrderedSet[str] ¶
Get columns that are unique to df2.
- intersect_columns() OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- intersect_rows_match() bool ¶
Check whether the intersect rows all match.
- matches(ignore_extra_columns: bool = False) bool ¶
Return True if the dataframes match, False otherwise.
- Parameters:
ignore_extra_columns (bool) – Ignores any columns in one dataframe and not in the other.
- report(sample_count: int = 10, column_count: int = 10, html_file: str | None = None, template_path: str | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file. You can customize the report’s appearance by providing a custom Jinja2 template.
- Parameters:
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, file creation will be skipped.
template_path (str, optional) – Path to a custom Jinja2 template file to use for report generation. If None, the default template will be used. The template receives the following context variables:
column_summary: Dict with column statistics including common_columns, df1_unique, df2_unique, df1_name, df2_name
row_summary: Dict with row statistics including match_columns, equal_rows, unequal_rows
column_comparison: Dict with column comparison statistics including unequal_columns, equal_columns, unequal_values
mismatch_stats: Dict containing:
stats: List of dicts with column mismatch statistics (column, match, mismatch, null_diff, etc.)
samples: Sample rows with mismatched values
has_samples: Boolean indicating if there are any mismatch samples
has_mismatches: Boolean indicating if there are any mismatches
df1_unique_rows: Dict with unique rows in df1 including has_rows, rows, columns
df2_unique_rows: Dict with unique rows in df2 including has_rows, rows, columns
- Returns:
The report, formatted according to the template.
- Return type:
str
- sample_mismatch(column: str, sample_count: int = 10, for_display: bool = False) DataFrame ¶
Return sample mismatches.
Gets a sub-dataframe which contains the identifying columns, and df1 and df2 versions of the column.
- Parameters:
column (str) – The raw column name (i.e. without _df1 appended)
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
for_display (bool, optional) – Whether the output is only for display (if so, the column names are overwritten)
- Returns:
A sample of the intersection dataframe, containing only the “pertinent” columns, for rows that don’t match on the provided column.
- Return type:
pyspark.sql.DataFrame
- subset() bool ¶
Return True if dataframe 2 is a subset of dataframe 1.
Dataframe 2 is considered a subset if all of its columns are in dataframe 1, and all of its rows match rows in dataframe 1 for the shared columns.
- Returns:
True if dataframe 2 is a subset of dataframe 1.
- Return type:
bool
- datacompy.all_columns_match(df1: AnyDataFrame, df2: AnyDataFrame) bool ¶
Whether the columns all match in the dataframes.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
- Returns:
Boolean indicating whether the columns all match in the dataframes
- Return type:
bool
- datacompy.all_rows_overlap(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) bool ¶
Check if the rows are all present in both dataframes.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe.
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).
ignore_case (bool, optional) – Flag to ignore the case of string columns.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case.
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over plain Pandas.
strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.
- Returns:
True if all rows in df1 are in df2 and vice versa (based on existence of the join keys only)
- Return type:
bool
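The existence-based overlap check can be approximated in plain pandas with an outer join and the merge indicator. This is a sketch of the idea, not the library's implementation:

```python
import pandas as pd

def rows_overlap(df1, df2, join_columns):
    """Sketch of the overlap check: True when every join-key combination
    in df1 exists in df2 and vice versa (existence only, not value equality)."""
    merged = df1[join_columns].drop_duplicates().merge(
        df2[join_columns].drop_duplicates(),
        on=join_columns, how="outer", indicator=True,
    )
    return (merged["_merge"] == "both").all()

df1 = pd.DataFrame({"id": [1, 2, 3], "val": [10, 20, 30]})
df2 = pd.DataFrame({"id": [1, 2, 3], "val": [10, 99, 30]})
print(rows_overlap(df1, df2, ["id"]))  # True: same keys, even though values differ
```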
- datacompy.calculate_max_diff(col_1: pd.Series[Any], col_2: pd.Series[Any]) float ¶
Get a maximum difference between two columns.
- Parameters:
col_1 (Pandas.Series) – The first column
col_2 (Pandas.Series) – The second column
- Returns:
Numeric field, or zero.
- Return type:
Numeric
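Conceptually this is the largest absolute elementwise gap between the two columns, falling back to zero when the columns are not numeric. A minimal pandas sketch, not the library's implementation:

```python
import pandas as pd

def max_diff(col_1, col_2):
    """Sketch: largest absolute difference between two numeric series,
    returning 0.0 when the columns can't be coerced to numeric."""
    try:
        diff = (pd.to_numeric(col_1) - pd.to_numeric(col_2)).abs()
        return float(diff.max())
    except (ValueError, TypeError):
        return 0.0

print(max_diff(pd.Series([1.0, 2.0, 3.0]), pd.Series([1.1, 2.0, 2.5])))  # 0.5
```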
- datacompy.columns_equal(col_1: pd.Series[Any], col_2: pd.Series[Any], rel_tol: float = 0, abs_tol: float = 0, ignore_spaces: bool = False, ignore_case: bool = False) pd.Series[bool] ¶
Compare two columns from a dataframe.
Returns a True/False series, with the same index as column 1.
Two nulls (np.nan) will evaluate to True.
A null and a non-null value will evaluate to False.
Numeric values will use the relative and absolute tolerances.
Decimal values (decimal.Decimal) will be converted to floats, where possible, before comparing.
Non-numeric values (i.e. where np.isclose can’t be used) will just trigger True on two nulls or exact matches.
Notes
As of version 0.14.0, if a column is of a mixed data type the compare will default to returning False.
list and np.array types will be compared row-wise using np.array_equal. Depending on the size of your data this might lead to performance issues.
All the rows must be of the same type, otherwise the column is considered "mixed" and will default to False for everything.
- Parameters:
col_1 (Pandas.Series) – The first column to look at
col_2 (Pandas.Series) – The second column
rel_tol (float, optional) – Relative tolerance
abs_tol (float, optional) – Absolute tolerance
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns
ignore_case (bool, optional) – Flag to ignore the case of string columns
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
pandas.Series
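The null and tolerance rules above map closely onto numpy.isclose. The mini version below handles only the numeric case (no dates, decimals, or mixed types) and is illustrative, not the library's implementation:

```python
import numpy as np
import pandas as pd

def numeric_columns_equal(col_1, col_2, rel_tol=0.0, abs_tol=0.0):
    """Sketch of the numeric comparison rules:
    two nulls -> True, null vs non-null -> False,
    otherwise |a - b| <= abs_tol + rel_tol * |b| (np.isclose semantics)."""
    both_null = col_1.isna() & col_2.isna()
    close = np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=False)
    return pd.Series(close | both_null, index=col_1.index)

a = pd.Series([1.0, np.nan, np.nan, 5.0])
b = pd.Series([1.05, np.nan, 2.0, 5.0])
print(numeric_columns_equal(a, b, abs_tol=0.1).tolist())  # [True, True, False, True]
```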
- datacompy.compare_string_and_date_columns(col_1: pd.Series[Any], col_2: pd.Series[Any]) pd.Series[bool] ¶
Compare a string column and date column, value-wise.
This tries to convert a string column to a date column and compare that way.
- Parameters:
col_1 (Pandas.Series) – The first column to look at
col_2 (Pandas.Series) – The second column
- Returns:
A series of Boolean values. True == the values match, False == the values don’t match.
- Return type:
pandas.Series
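A minimal sketch of the same idea in pandas: coerce the string column with pd.to_datetime and compare elementwise, so unparseable strings become NaT and compare as unequal. Illustrative only:

```python
import pandas as pd

def string_vs_date_equal(str_col, date_col):
    """Sketch: parse the string column as datetimes and compare elementwise;
    unparseable strings become NaT and therefore never equal a real date."""
    parsed = pd.to_datetime(str_col, errors="coerce")
    return parsed == pd.to_datetime(date_col)

s = pd.Series(["2024-01-01", "not a date"])
d = pd.Series(pd.to_datetime(["2024-01-01", "2024-02-01"]))
print(string_vs_date_equal(s, d).tolist())  # [True, False]
```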
- datacompy.count_matching_rows(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) int ¶
Count the number of rows that match (on overlapping fields).
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe.
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).
ignore_case (bool, optional) – Flag to ignore the case of string columns.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case.
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over plain Pandas.
strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.
- Returns:
Number of matching rows
- Return type:
int
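The row-matching count can be sketched in plain pandas as an inner join on the key columns followed by an equality check on each shared column. Illustrative only, ignoring tolerances and null handling:

```python
import pandas as pd

def count_matching(df1, df2, join_columns):
    """Sketch: inner-join on the key columns, then count rows where
    every overlapping non-key column is equal."""
    merged = df1.merge(df2, on=join_columns, suffixes=("_df1", "_df2"))
    shared = [c for c in df1.columns if c in df2.columns and c not in join_columns]
    match = pd.Series(True, index=merged.index)
    for col in shared:
        match &= merged[f"{col}_df1"] == merged[f"{col}_df2"]
    return int(match.sum())

df1 = pd.DataFrame({"id": [1, 2, 3], "val": [10, 20, 30]})
df2 = pd.DataFrame({"id": [1, 2, 3], "val": [10, 99, 30]})
print(count_matching(df1, df2, ["id"]))  # 2 rows match on "val"
```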
- datacompy.df_to_str(df: Any, sample_count: int | None = None, on_index: bool = False) str ¶
Convert a DataFrame to a string representation.
This is a centralized function to handle DataFrame to string conversion for different DataFrame types (pandas, Spark, Polars, etc.)
- Parameters:
df (Any) – The DataFrame to convert to string. Can be pandas, Spark, or Polars DataFrame.
sample_count (int, optional) – For distributed DataFrames (like Spark), limit the number of rows to convert.
on_index (bool, default False) – If True, include the index in the output.
- Returns:
String representation of the DataFrame
- Return type:
str
- datacompy.generate_id_within_group(dataframe: DataFrame, join_columns: List[str]) pd.Series[int] ¶
Generate an ID column that can be used to deduplicate identical rows.
The series generated is the order within a unique group, and it handles nulls.
- Parameters:
dataframe (Pandas.DataFrame) – The dataframe to operate on
join_columns (list) – List of strings which are the join columns
- Returns:
The ID column that’s unique in each group.
- Return type:
Pandas.Series
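In plain pandas this is essentially a groupby-cumcount over the join columns, with nulls filled so they group together. A sketch, not the library's implementation (the "__null__" sentinel is an assumption for illustration):

```python
import pandas as pd

def id_within_group(df, join_columns):
    """Sketch: rank duplicates within each join-key group so identical
    rows can be paired off; nulls are filled so they group together."""
    filled = df[join_columns].fillna("__null__")  # sentinel is illustrative
    return filled.groupby(join_columns).cumcount()

df = pd.DataFrame({"k": ["a", "a", "b", None, None]})
print(id_within_group(df, ["k"]).tolist())  # [0, 1, 0, 0, 1]
```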
- datacompy.get_merged_columns(original_df: DataFrame, merged_df: DataFrame, suffix: str) List[str] ¶
Get the columns from an original dataframe, in the new merged dataframe.
- Parameters:
original_df (Pandas.DataFrame) – The original, pre-merge dataframe
merged_df (Pandas.DataFrame) – Post-merge with another dataframe, with suffixes added in.
suffix (str) – What suffix was used to distinguish when the original dataframe was overlapping with the other merged dataframe.
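A sketch of the mapping: each original column appears in the merged frame either unchanged or with the suffix appended. The error branch is an assumption for illustration:

```python
def merged_column_names(original_cols, merged_cols, suffix):
    """Sketch: map each original column name to its name in the merged
    frame, which is either unchanged or has the merge suffix appended."""
    out = []
    for col in original_cols:
        if col in merged_cols:
            out.append(col)
        elif f"{col}{suffix}" in merged_cols:
            out.append(f"{col}{suffix}")
        else:
            raise ValueError(f"Column {col} not found in merged frame")
    return out

print(merged_column_names(["id", "val"], ["id", "val_df1", "val_df2"], "_df1"))
```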
- datacompy.intersect_columns(df1: AnyDataFrame, df2: AnyDataFrame) OrderedSet[str] ¶
Get columns that are shared between the two dataframes.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
- Returns:
Set of columns that are shared between the two dataframes
- Return type:
OrderedSet
- datacompy.is_match(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, parallelism: int | None = None, strict_schema: bool = False) bool ¶
Check whether two dataframes match.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str, optional) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe.
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).
ignore_case (bool, optional) – Flag to ignore the case of string columns.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case.
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over plain Pandas.
strict_schema (bool, optional) – The schema must match exactly if set to True. This includes the names and types. Allows for a fast fail.
- Returns:
Boolean indicating whether the DataFrames match.
- Return type:
bool
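Ignoring tolerances and the Fugue machinery, an exact-match check can be sketched in plain pandas: the columns must agree, every join key must exist on both sides, and every shared value column must be equal. Illustrative only:

```python
import pandas as pd

def frames_match(df1, df2, join_columns):
    """Sketch of an exact match check: same columns, same join keys,
    and equal values on every shared column (no tolerances)."""
    if set(df1.columns) != set(df2.columns):
        return False
    merged = df1.merge(df2, on=join_columns, how="outer", indicator=True,
                       suffixes=("_df1", "_df2"))
    if (merged["_merge"] != "both").any():
        return False
    value_cols = [c for c in df1.columns if c not in join_columns]
    return all((merged[f"{c}_df1"] == merged[f"{c}_df2"]).all() for c in value_cols)

df1 = pd.DataFrame({"id": [1, 2], "val": [10, 20]})
print(frames_match(df1, df1.copy(), ["id"]))  # True
```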
- datacompy.render(template_name: str, **context: Any) str ¶
Render a template using Jinja2.
- Parameters:
template_name (str) – The name of the template file to render. This can be:
A filename in the default templates directory (with or without .j2 extension)
A relative path from the default templates directory
An absolute path to a template file
**context (dict) – The context variables to pass to the template
- Returns:
The rendered template
- Return type:
str
- Raises:
FileNotFoundError – If the template file cannot be found in any of the expected locations
- datacompy.report(df1: AnyDataFrame, df2: AnyDataFrame, join_columns: str | List[str], abs_tol: float = 0, rel_tol: float = 0, df1_name: str = 'df1', df2_name: str = 'df2', ignore_spaces: bool = False, ignore_case: bool = False, cast_column_names_lower: bool = True, sample_count: int = 10, column_count: int = 10, html_file: str | None = None, parallelism: int | None = None) str ¶
Return a string representation of a report.
The representation can then be printed or saved to a file.
Both df1 and df2 should be dataframes containing all of the join_columns, with unique column names. Differences between values are compared to abs_tol + rel_tol * abs(df2[‘value’]).
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
join_columns (list or str) – Column(s) to join dataframes on. If a string is passed in, that one column will be used.
abs_tol (float, optional) – Absolute tolerance between two values.
rel_tol (float, optional) – Relative tolerance between two values.
df1_name (str, optional) – A string name for the first dataframe. This allows the reporting to print out an actual name instead of “df1”, and allows human users to more easily track the dataframes.
df2_name (str, optional) – A string name for the second dataframe.
ignore_spaces (bool, optional) – Flag to strip whitespace (including newlines) from string columns (including any join columns).
ignore_case (bool, optional) – Flag to ignore the case of string columns.
cast_column_names_lower (bool, optional) – Boolean indicator that controls whether column names will be cast to lower case.
parallelism (int, optional) – An integer representing the amount of parallelism. Entering a value for this will force the use of Fugue over plain Pandas.
sample_count (int, optional) – The number of sample records to return. Defaults to 10.
column_count (int, optional) – The number of columns to display in the sample records output. Defaults to 10.
html_file (str, optional) – HTML file name to save report output to. If None, file creation will be skipped.
- Returns:
The report, formatted nicely.
- Return type:
str
- datacompy.save_html_report(report: str, html_file: str | Path) None ¶
Save a text report as an HTML file.
- Parameters:
report (str) – The text report to convert to HTML
html_file (str or Path) – The path where the HTML file should be saved
- datacompy.temp_column_name(*dataframes) str ¶
Get a temp column name that isn’t included in columns of any dataframes.
- Parameters:
dataframes (list of DataFrames) – The DataFrames to create a temporary column name for
- Returns:
String column name that looks like ‘_temp_x’ for some integer x
- Return type:
str
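A sketch of the probing loop; the '_temp_x' naming follows the documented return value, while the implementation details below are illustrative:

```python
import pandas as pd

def temp_col_name(*dataframes):
    """Sketch: find a '_temp_x' name not already used as a column
    in any of the given dataframes."""
    taken = set()
    for df in dataframes:
        taken.update(df.columns)
    i = 0
    while f"_temp_{i}" in taken:
        i += 1
    return f"_temp_{i}"

df = pd.DataFrame({"_temp_0": [1], "a": [2]})
print(temp_col_name(df))  # '_temp_1', since '_temp_0' is taken
```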
- datacompy.unq_columns(df1: AnyDataFrame, df2: AnyDataFrame) OrderedSet[str] ¶
Get columns that are unique to df1.
- Parameters:
df1 (AnyDataFrame) – First dataframe to check
df2 (AnyDataFrame) – Second dataframe to check
- Returns:
Set of columns that are unique to df1
- Return type:
OrderedSet
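An order-preserving set difference over the column lists is enough to sketch the behavior; a plain list stands in for the OrderedSet return type:

```python
import pandas as pd

def unique_columns(df1, df2):
    """Sketch: columns of df1 not present in df2, in df1's column order
    (mimicking the OrderedSet return type with a plain list)."""
    df2_cols = set(df2.columns)
    return [c for c in df1.columns if c not in df2_cols]

df1 = pd.DataFrame(columns=["id", "a", "b"])
df2 = pd.DataFrame(columns=["id", "b", "c"])
print(unique_columns(df1, df2))  # ['a']
```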