Snowpark/Snowflake Usage

For SnowflakeCompare

  • on_index is not supported.

  • Joining is done using EQUAL_NULL which is the equality test that is safe for null values.

  • Compares snowflake.snowpark.DataFrame, which can be provided as either raw Snowflake dataframes or as the names of full names of valid snowflake tables, which we will process into Snowpark dataframes.

SnowflakeCompare setup

There are two ways to specify input dataframes for SnowflakeCompare

Provide Snowpark dataframes

from snowflake.snowpark import Session
from snowflake.snowpark import Row
import datetime
import datacompy.snowflake as sp

connection_parameters = {
    ...
}
session = Session.builder.configs(connection_parameters).create()

data1 = [
Row(acct_id=10000001234, dollar_amt=123.45, name='George Maharis', float_fld=14530.1555,
    date_fld=datetime.date(2017, 1, 1)),
Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=1.0,
    date_fld=datetime.date(2017, 1, 1)),
Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=None,
    date_fld=datetime.date(2017, 1, 1)),
Row(acct_id=10000001237, dollar_amt=123456.0, name='Bob Loblaw', float_fld=345.12,
    date_fld=datetime.date(2017, 1, 1)),
Row(acct_id=10000001239, dollar_amt=1.05, name='Lucille Bluth', float_fld=None,
    date_fld=datetime.date(2017, 1, 1)),
]

data2 = [
    Row(acct_id=10000001234, dollar_amt=123.4, name='George Michael Bluth', float_fld=14530.155),
    Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=None),
    Row(acct_id=None, dollar_amt=1345.0, name='George Bluth', float_fld=1.0),
    Row(acct_id=10000001237, dollar_amt=123456.0, name='Robert Loblaw', float_fld=345.12),
    Row(acct_id=10000001238, dollar_amt=1.05, name='Loose Seal Bluth', float_fld=111.0),
]

df_1 = session.createDataFrame(data1)
df_2 = session.createDataFrame(data2)

compare = sp.SnowflakeCompare(
    session,
    df_1,
    df_2,
    join_columns=['acct_id'],
    rel_tol=1e-03,
    abs_tol=1e-04,
)
compare.matches(ignore_extra_columns=False)

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())

Provide the full name (db.schema.table_name) of valid Snowflake tables

Given the dataframes from the prior examples…

df_1.write.mode("overwrite").save_as_table("toy_table_1")
df_2.write.mode("overwrite").save_as_table("toy_table_2")

compare = sp.SnowflakeCompare(
    session,
    f"{db}.{schema}.toy_table_1",
    f"{db}.{schema}.toy_table_2",
    join_columns=['acct_id'],
    rel_tol=1e-03,
    abs_tol=1e-04,
)
compare.matches(ignore_extra_columns=False)

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())

Reports

A report is generated by calling report(), which returns a string. Here is a sample report generated by datacompy for the two tables above, joined on acct_id (Note: the names for your dataframes are extracted from the name of the provided Snowflake table. If you chose to directly use Snowpark dataframes, then the names will default to DF1 and DF2.):

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

DataFrame  Columns  Rows
0       DF1        5     5
1       DF2        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in DF1 but not in DF2: 1
Number of columns in DF2 but not in DF1: 0

Row Summary
-----------

Matched on: ACCT_ID
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 4
Number of rows in DF1 but not in DF2: 1
Number of rows in DF2 but not in DF1: 1

Number of rows with some compared columns unequal: 4
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 6

Columns with Unequal Values or Types
------------------------------------

    Column         DF1 dtype         DF2 dtype  # Unequal  Max Diff  # Null Diff
0  DOLLAR_AMT            double            double          1    0.0500            0
2   FLOAT_FLD            double            double          3    0.0005            2
1        NAME  string(16777216)  string(16777216)          2       NaN            0

Sample Rows with Unequal Values
-------------------------------

    ACCT_ID  DOLLAR_AMT (DF1)  DOLLAR_AMT (DF2)
0  10000001234            123.45             123.4

    ACCT_ID      NAME (DF1)            NAME (DF2)
0  10000001234  George Maharis  George Michael Bluth
1  10000001237      Bob Loblaw         Robert Loblaw

    ACCT_ID  FLOAT_FLD (DF1)  FLOAT_FLD (DF2)
0  10000001234       14530.1555        14530.155
1  10000001235           1.0000              NaN
2  10000001236              NaN            1.000

Sample Rows Only in DF1 (First 10 Columns)
------------------------------------------

ACCT_ID_DF1  DOLLAR_AMT_DF1       NAME_DF1  FLOAT_FLD_DF1 DATE_FLD_DF1
0  10000001239            1.05  Lucille Bluth            NaN   2017-01-01

Sample Rows Only in DF2 (First 10 Columns)
------------------------------------------

ACCT_ID_DF2  DOLLAR_AMT_DF2          NAME_DF2  FLOAT_FLD_DF2
0  10000001238            1.05  Loose Seal Bluth          111.0

Convenience Methods

There are a few convenience methods and attributes available after the comparison has been run:

compare.intersect_rows[['name_df1', 'name_df2', 'name_match']].show()
# --------------------------------------------------------
# |"NAME_DF1"      |"NAME_DF2"            |"NAME_MATCH"  |
# --------------------------------------------------------
# |George Maharis  |George Michael Bluth  |False         |
# |Michael Bluth   |Michael Bluth         |True          |
# |George Bluth    |George Bluth          |True          |
# |Bob Loblaw      |Robert Loblaw         |False         |
# --------------------------------------------------------

compare.df1_unq_rows.show()
# ---------------------------------------------------------------------------------------
# |"ACCT_ID_DF1"  |"DOLLAR_AMT_DF1"  |"NAME_DF1"     |"FLOAT_FLD_DF1"  |"DATE_FLD_DF1"  |
# ---------------------------------------------------------------------------------------
# |10000001239    |1.05              |Lucille Bluth  |NULL             |2017-01-01      |
# ---------------------------------------------------------------------------------------

compare.df2_unq_rows.show()
# -------------------------------------------------------------------------
# |"ACCT_ID_DF2"  |"DOLLAR_AMT_DF2"  |"NAME_DF2"        |"FLOAT_FLD_DF2"  |
# -------------------------------------------------------------------------
# |10000001238    |1.05              |Loose Seal Bluth  |111.0            |
# -------------------------------------------------------------------------

print(compare.intersect_columns())
# OrderedSet(['acct_id', 'dollar_amt', 'name', 'float_fld'])

print(compare.df1_unq_columns())
# OrderedSet(['date_fld'])

print(compare.df2_unq_columns())
# OrderedSet()

Duplicate rows

Datacompy will try to handle rows that are duplicate in the join columns. It does this behind the scenes by generating a unique ID within each unique group of the join columns. For example, if you have two dataframes you’re trying to join on acct_id:

acct_id

name

1

George Maharis

1

Michael Bluth

2

George Bluth

acct_id

name

1

George Maharis

1

Michael Bluth

1

Tony Wonder

2

George Bluth

Datacompy will generate a unique temporary ID for joining:

acct_id

name

temp_id

1

George Maharis

0

1

Michael Bluth

1

2

George Bluth

0

acct_id

name

temp_id

1

George Maharis

0

1

Michael Bluth

1

1

Tony Wonder

2

2

George Bluth

0

And then merge the two dataframes on a combination of the join_columns you specified and the temporary ID, before dropping the temp_id again. So the first two rows in the first dataframe will match the first two rows in the second dataframe, and the third row in the second dataframe will be recognized as uniquely in the second.

Additional considerations

  • It is strongly recommended against joining on float columns or any column with floating point precision. Columns joining tables are compared on the basis of an exact comparison, therefore if the values comparing your float columns are not exact, you will likely get unexpected results.

  • Case-sensitive columns are only partially supported. We essentially treat case-sensitive columns as if they are case-insensitive. Therefore you may use case-sensitive columns as long as you don’t have several columns with the same name differentiated only be case sensitivity.