
Dataloader with Popmon Reports

This demo shows how to use popmon with the dataloader from the dataprofiler.

It covers the following:

- How to install popmon
- Comparison of the dynamic dataloader from dataprofiler to the
    standard dataloader used in pandas
- Popmon's usage example using both dataloaders
- Dataprofiler's examples using both dataloaders
- Usage of the pm_stability_report function (popmon reports)

How to Install Popmon

To install popmon you can use the command below:

pip3 install popmon
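
After installing, it can help to confirm that the packages used in this demo are importable. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper written for this illustration, not part of popmon or the dataprofiler.

```python
import importlib.util


def is_installed(module_name):
    """Hypothetical helper: True if `module_name` can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# Check the three packages this demo imports.
for name in ("popmon", "dataprofiler", "pandas"):
    print(name + ":", "found" if is_installed(name) else "missing")
```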

From here, we can import the libraries needed for this demo.

[ ]:
import os
import sys
try:
    sys.path.insert(0, '..')
    import dataprofiler as dp
except ImportError:
    import dataprofiler as dp
import pandas as pd
import popmon  # noqa

Comparison of Dataloaders

First, we have the standard pandas data loading, which works for specific file types. This is fine when the data format is known ahead of time, but it is less useful for more dynamic cases.

[ ]:
def popmon_dataloader(path, time_index):
    # Load the popmon dataframe (reads only CSVs unless the reader is changed)
    if time_index is not None:
        # Parse the time column so it can serve as the report's time axis
        pm_data = pd.read_csv(path, parse_dates=[time_index])
    else:
        pm_data = pd.read_csv(path)
    return pm_data
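
To see what `parse_dates` changes, the sketch below reads the same small CSV twice from memory. The CSV content and the `DELAY` column are made up for illustration; only `pd.read_csv` and its `parse_dates` argument come from the loader above.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a CSV file (made-up data).
csv_text = "DATE,DELAY\n2015-07-02,5\n2015-07-09,12\n"

as_strings = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])

print(as_strings["DATE"].dtype)  # dates left as plain strings
print(as_dates["DATE"].dtype)    # a datetime64 dtype
```

Without `parse_dates`, the time column stays as text, so it would need a separate conversion before it could be used as a time axis.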

Next, we have the dataprofiler’s dataloader. It loads different data formats dynamically, which is useful when the data format is not known ahead of time. It is intended as an improvement on the standard pandas dataloader.

[ ]:
def dp_dataloader(path):
    # Dataloader from dataprofiler used
    dp_data = dp.Data(path)

    # Profiler used to ensure proper label for datetime even
    # when null values exist
    profiler_options = dp.ProfilerOptions()
    profiler_options.set({'*.is_enabled': False,  # Runs first disabling all options in profiler
                          '*.datetime.is_enabled': True})
    profile = dp.Profiler(dp_data, options=profiler_options)

    # Convert any time/datetime columns from strings to an actual datetime type
    for ind, col in enumerate(dp_data.data.columns):
        if profile.profile[ind].profile.get('data_type') == 'datetime':
            dp_data.data[col] = pd.to_datetime(dp_data.data[col])

    return dp_data.data
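
The conversion step in the loader relies on `pd.to_datetime`, which tolerates missing values. A minimal sketch with a made-up column, independent of the dataprofiler:

```python
import pandas as pd

# Made-up column: date strings with one missing value.
raw = pd.Series(["2015-07-02", None, "2015-07-09"])
converted = pd.to_datetime(raw)

print(converted.dtype)         # a datetime64 dtype
print(converted.isna().sum())  # the missing entry is kept as NaT
```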

Popmon’s usage example using both dataloaders

Next, we’ll load a dataset from popmon’s resources component and decompress it to a local CSV file.

[ ]:
import gzip
import shutil
popmon_tutorial_data = popmon.resources.data("flight_delays.csv.gz")
with gzip.open(popmon_tutorial_data, 'rb') as f_in:
    with open('./flight_delays.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
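
The same `gzip.open` plus `shutil.copyfileobj` pattern can be tried on a throwaway file to confirm the decompressed output is readable. This sketch uses a temporary directory and made-up file names and content, so it does not touch the popmon data.

```python
import gzip
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "sample.csv.gz")
    csv_path = os.path.join(tmp_dir, "sample.csv")

    # Write a small compressed file (made-up content).
    with gzip.open(gz_path, "wt", encoding="utf-8") as f_out:
        f_out.write("DATE,DELAY\n2015-07-02,5\n")

    # Decompress it by streaming bytes from one file object to the other.
    with gzip.open(gz_path, "rb") as f_in, open(csv_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    # Read back the header line to check the result.
    with open(csv_path, encoding="utf-8") as f:
        header = f.readline().strip()

print(header)
```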

Finally, we read in the data with the pandas-based loader and write the popmon report to a file.

[ ]:
# Default csv from popmon example
path = "./flight_delays.csv"
time_index = "DATE"
report_output_dir = "./popmon_output/flight_delays_full"
if not os.path.exists(report_output_dir):
    os.makedirs(report_output_dir)
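
An alternative to the existence check is the `exist_ok` argument of `os.makedirs`, which makes the call safe to repeat. A small sketch in a temporary directory (the directory names are made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = os.path.join(tmp_dir, "popmon_output", "demo")

    os.makedirs(demo_dir, exist_ok=True)  # creates nested directories
    os.makedirs(demo_dir, exist_ok=True)  # second call is a no-op, no error

    created = os.path.isdir(demo_dir)

print(created)
```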

[ ]:
pm_data = popmon_dataloader(path, time_index)

report_pm_loader = pm_data.pm_stability_report(
    time_axis=time_index,
    time_width="1w",
    time_offset="2015-07-02",
    extended_report=False,
    pull_rules={"*_pull": [10, 7, -7, -10]},
)

# Save popmon reports
report_pm_loader.to_file(os.path.join(report_output_dir, "popmon_loader_report.html"))
print("Report printed at:", os.path.join(report_output_dir, "popmon_loader_report.html"))
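
The `pull_rules` bounds can be read as traffic-light thresholds on the pull, meaning how many reference standard deviations a statistic sits from its reference mean. The sketch below assumes the four numbers are ordered red-high, yellow-high, yellow-low, red-low. `pull` and `traffic_light` are hypothetical helpers written for illustration, not popmon functions, and popmon's exact boundary handling may differ.

```python
def pull(value, ref_mean, ref_std):
    """Hypothetical helper: normalized residual of `value` against a reference."""
    return (value - ref_mean) / ref_std


def traffic_light(pull_value, bounds=(10, 7, -7, -10)):
    """Hypothetical helper: map a pull to a color.

    Assumes bounds are ordered (red_high, yellow_high, yellow_low, red_low).
    """
    red_high, yellow_high, yellow_low, red_low = bounds
    if pull_value > red_high or pull_value < red_low:
        return "red"
    if pull_value > yellow_high or pull_value < yellow_low:
        return "yellow"
    return "green"


# Made-up statistics against a reference with mean 50 and std 1.
for value in (50.5, 58.0, 38.0):
    p = pull(value, ref_mean=50.0, ref_std=1.0)
    print(value, p, traffic_light(p))
```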

We then do the same for the dataprofiler loader.

[ ]:
dp_dataframe = dp_dataloader(path)
# Generate pm report using dp dataloader
report_dp_loader = dp_dataframe.pm_stability_report(
    time_axis=time_index,
    time_width="1w",
    time_offset="2015-07-02",
    extended_report=False,
    pull_rules={"*_pull": [10, 7, -7, -10]},
)

# Save popmon reports
report_dp_loader.to_file(os.path.join(report_output_dir, "dataprofiler_loader_report.html"))
print("Report printed at:", os.path.join(report_output_dir, "dataprofiler_loader_report.html"))

Examples of data

Next, we’ll use some data from the dataprofiler’s test files to compare the dynamic loading of the dataprofiler’s dataloader with the standard pandas approach.

Dataprofiler’s examples using both dataloaders

To run this section, choose one of the three examples below, then run the report generation that follows.

[ ]:
# Default csv from popmon example (mini version)
path = "../dataprofiler/tests/data/csv/flight_delays.csv"
time_index = "DATE"
report_output_dir = "./popmon_output/flight_delays_mini"

[ ]:
# Random csv from dataprofiler tests
path = "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
time_index = "datetime"
report_output_dir = "./popmon_output/aws_honeypot_marx_geo"

[ ]:
# Random json file from dataprofiler tests
path = "../dataprofiler/tests/data/json/math.json"

time_index = "data.9"
report_output_dir = "./popmon_output/math"
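
Instead of running exactly one of the three cells, the same settings can be kept in a dictionary and selected by name. This is only a sketch of one possible layout: the `examples` dictionary and the `choice` variable are hypothetical, while the paths and column names are copied from the cells above.

```python
# Hypothetical lookup table; values are copied from the three cells above.
examples = {
    "flight_delays_mini": {
        "path": "../dataprofiler/tests/data/csv/flight_delays.csv",
        "time_index": "DATE",
    },
    "aws_honeypot_marx_geo": {
        "path": "../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv",
        "time_index": "datetime",
    },
    "math": {
        "path": "../dataprofiler/tests/data/json/math.json",
        "time_index": "data.9",
    },
}

choice = "aws_honeypot_marx_geo"  # change this to pick another example

path = examples[choice]["path"]
time_index = examples[choice]["time_index"]
report_output_dir = "./popmon_output/" + choice

print(path, time_index, report_output_dir)
```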

Run the block below to create an output directory for your popmon reports and load the chosen file with the dataprofiler’s dataloader.

[ ]:
if not os.path.exists(report_output_dir):
    os.makedirs(report_output_dir)
dp_dataframe = dp_dataloader(path)

Report comparison

Below, we generate a report for the chosen dataset using the dataprofiler’s dataloader and popmon’s report generator.

The dataprofiler’s dataloader can seamlessly switch between data formats and generate reports with the exact same code in place.

[ ]:
# Generate pm report using dp dataloader
report_dp_loader = dp_dataframe.pm_stability_report(
    time_axis=time_index,
    time_width="1w",
    time_offset="2015-07-02",
    extended_report=False,
    pull_rules={"*_pull": [10, 7, -7, -10]},
)
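
The `time_width` and `time_offset` arguments together define fixed-width time bins. As a rough illustration of that idea, the sketch below assigns timestamps to weekly bins counted from the offset. `time_bin_index` is a hypothetical helper, and popmon's internal binning may differ in its details.

```python
from datetime import datetime, timedelta


def time_bin_index(timestamp, offset=datetime(2015, 7, 2), width=timedelta(weeks=1)):
    """Hypothetical helper: index of the fixed-width bin containing `timestamp`.

    Bin 0 starts at `offset`; earlier timestamps get negative indices.
    """
    return (timestamp - offset) // width


# Made-up timestamps around the offset.
for ts in (datetime(2015, 7, 1), datetime(2015, 7, 2), datetime(2015, 7, 9)):
    print(ts.date(), time_bin_index(ts))
```

Timestamps that fall before the offset land in negative bins, which is worth keeping in mind when reusing the same offset across datasets with different date ranges.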

If the data loaded successfully, you can view the report at the path printed by the block below and compare it with the other reports in the output directory.

[ ]:
# Save dp reports
report_dp_loader.to_file(os.path.join(report_output_dir, "dataprofiler_loader_report.html"))
print("Report printed at:", os.path.join(report_output_dir, "dataprofiler_loader_report.html"))
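
To find the generated reports for side-by-side comparison, the output directory can be listed for HTML files. A minimal sketch; `list_reports` is a hypothetical helper, and the directory passed to it is the one used for the full flight delays example earlier in this demo.

```python
import os


def list_reports(output_dir):
    """Hypothetical helper: (file name, size in bytes) for each HTML report."""
    if not os.path.isdir(output_dir):
        return []
    return sorted(
        (name, os.path.getsize(os.path.join(output_dir, name)))
        for name in os.listdir(output_dir)
        if name.endswith(".html")
    )


# Prints nothing if the directory does not exist yet.
for name, size in list_reports("./popmon_output/flight_delays_full"):
    print(name, size, "bytes")
```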