
Graph Pipeline Demo

DataProfiler can also load and profile graph datasets. As with the other DataProfiler profilers, this is split into two components:

- GraphData
- GraphProfiler

We will demo the use of this graph pipeline.

First, let’s import the libraries needed for this example.

import os
import sys
import pandas as pd
import pprint
sys.path.insert(0, '..')

import dataprofiler as dp
data_path = "../dataprofiler/tests/data"

We now input our dataset into the generic DataProfiler pipeline:

data = dp.Data(os.path.join(data_path, "csv/graph_data_csv_identify.csv"))
profile = dp.Profiler(data)

report = profile.report()

pp = pprint.PrettyPrinter(sort_dicts=False, compact=True)
pp.pprint(report)

Notice that the Data class automatically detected the input file as graph data. The GraphData class can differentiate between tabular and graph CSV data. Once Data identifies the input file as graph data, GraphData does the work needed to load the CSV data into a NetworkX Graph.
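To make the loading step more concrete, here is a minimal sketch of turning an edge-list CSV into an undirected graph structure using only the standard library. It is not the GraphData implementation: the column names `node_id_src` and `node_id_dst`, the sample rows, and the helper `load_edge_list` are all hypothetical, and the real class builds a NetworkX Graph rather than plain dictionaries.

```python
import csv
import io

# Hypothetical edge-list CSV: two node columns plus one numeric edge attribute.
SAMPLE_CSV = """node_id_src,node_id_dst,weight
1,2,0.5
2,3,1.5
3,1,2.0
"""


def load_edge_list(text, src_col, dst_col):
    """Parse an edge-list CSV into (adjacency, edge_attrs) for an undirected graph."""
    adjacency = {}   # node -> set of neighbouring nodes
    edge_attrs = {}  # (src, dst) -> dict of edge attributes
    for row in csv.DictReader(io.StringIO(text)):
        src = row.pop(src_col)
        dst = row.pop(dst_col)
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
        # Remaining columns are treated as numeric edge attributes (an assumption).
        edge_attrs[(src, dst)] = {key: float(value) for key, value in row.items()}
    return adjacency, edge_attrs


adjacency, edge_attrs = load_edge_list(SAMPLE_CSV, "node_id_src", "node_id_dst")
print(len(adjacency), "nodes,", len(edge_attrs), "edges")
```

Every column other than the two node columns becomes an edge attribute, which is the kind of data the profile describes in its attribute and distribution fields.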

Profiler runs GraphProfiler when graph data is input (or when data_type="graph" is specified). The report() function outputs the profile for the user.

Profile

The profile skeleton looks like this:

profile = {
    "num_nodes": ...,
    "num_edges": ...,
    "categorical_attributes": ...,
    "continuous_attributes": ...,
    "avg_node_degree": ...,
    "global_max_component_size": ...,
    "continuous_distribution": ...,
    "categorical_distribution": ...,
    "times": ...,
}

Description of properties in profile:

- num_nodes: number of nodes in the graph
- num_edges: number of edges in the graph
- categorical_attributes: list of categorical edge attributes
- continuous_attributes: list of continuous edge attributes
- avg_node_degree: average degree of nodes in the graph
- global_max_component_size: size of largest global max component in the graph
- continuous_distribution: dictionary of statistical properties for each continuous attribute
- categorical_distribution: dictionary of statistical properties for each categorical attribute
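The structural properties above can be illustrated with a small standard-library sketch. This is not how GraphProfiler computes them (it works on a NetworkX Graph); the edge list and the helper `graph_summary` are hypothetical, the graph is assumed to be undirected with no duplicate edges, and the largest connected component is used as an interpretation of `global_max_component_size`.

```python
from collections import deque

# Hypothetical undirected edge list: a triangle (a, b, c) plus a separate edge (d, e).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]


def graph_summary(edges):
    """Compute node count, edge count, average degree, and largest component size."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    num_nodes = len(adjacency)
    num_edges = len(edges)  # assumes each edge is listed exactly once
    total_degree = sum(len(neighbours) for neighbours in adjacency.values())
    avg_node_degree = total_degree / num_nodes if num_nodes else 0.0

    # Breadth-first search from every unvisited node to measure each component.
    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        largest = max(largest, size)

    return {
        "num_nodes": num_nodes,
        "num_edges": num_edges,
        "avg_node_degree": avg_node_degree,
        "global_max_component_size": largest,
    }


summary = graph_summary(edges)
print(summary)
```

For an undirected graph the average degree equals twice the edge count divided by the node count, which is why the five-node, four-edge example gives 1.6.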

The continuous_distribution and categorical_distribution dictionaries list statistical properties for each edge attribute in the graph:

continuous_distribution = {
    "name": ...,
    "scale": ...,
    "properties": ...,
}
categorical_distribution = {
    "bin_counts": ...,
    "bin_edges": ...,
}

Description of each attribute:

- Continuous distribution:
    - name: name of the distribution
    - scale: negative log likelihood used to scale distributions and compare them in GraphProfiler
    - properties: list of distribution props
- Categorical distribution:
    - bin_counts: histogram bin counts
    - bin_edges: histogram bin edges
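As a rough illustration of how a negative log likelihood can be used to compare candidate distributions, the sketch below fits a normal and an exponential distribution to a small sample by maximum likelihood and keeps the one with the lower value. This is an assumption about the general idea, not GraphProfiler's code: the sample values and the helpers `normal_nll` and `expon_nll` are hypothetical, and the exponential fit fixes its location at 0.

```python
import math

# Hypothetical values of one continuous edge attribute.
sample = [0.2, 0.5, 1.0, 1.3, 3.0]


def normal_nll(data):
    """Negative log likelihood of data under a normal fit (MLE mean and variance)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    squared_error = sum((x - mean) ** 2 for x in data)
    return 0.5 * n * math.log(2 * math.pi * var) + squared_error / (2 * var)


def expon_nll(data):
    """Negative log likelihood under an exponential fit (loc fixed at 0, scale = mean)."""
    n = len(data)
    scale = sum(data) / n
    return n * math.log(scale) + sum(data) / scale


# Lower negative log likelihood means the distribution explains the data better.
scores = {"norm": normal_nll(sample), "expon": expon_nll(sample)}
best_name = min(scores, key=scores.get)
print(best_name, scores)
```

Because every candidate is scored on the same scale, the values can be compared directly across distribution families.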

properties lists the following distribution properties: [optional: shape, loc, scale, mean, variance, skew, kurtosis]. The list has length 6 or 7, depending on whether the distribution has an extra shape parameter:

- length 6: norm, uniform, expon, logistic
- length 7: gamma, lognorm
    - gamma: shape=a (float)
    - lognorm: shape=s (float)

For more information on shape parameters a and s: https://docs.scipy.org/doc/scipy/tutorial/stats.html#shape-parameters
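Since the properties list changes length depending on the distribution, it can help to attach names to its entries before using them. The helper below is a hypothetical convenience, not part of DataProfiler, and it assumes the entries appear in the order given above, with the optional shape parameter first.

```python
# Names of the six properties every distribution reports (order assumed from the text above).
PROPERTY_NAMES = ["loc", "scale", "mean", "variance", "skew", "kurtosis"]


def label_properties(properties):
    """Map a length-6 or length-7 properties list to a name -> value dictionary."""
    if len(properties) == 7:
        # Distributions such as gamma and lognorm carry a leading shape parameter.
        names = ["shape"] + PROPERTY_NAMES
    elif len(properties) == 6:
        names = PROPERTY_NAMES
    else:
        raise ValueError(f"expected 6 or 7 properties, got {len(properties)}")
    return dict(zip(names, properties))


# Made-up values for illustration only.
norm_like = label_properties([0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
gamma_like = label_properties([2.0, 0.0, 1.0, 2.0, 2.0, 1.4, 3.0])
print(norm_like)
print(gamma_like)
```

Checking for the `shape` key afterwards is then enough to tell whether a distribution carries the extra parameter.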

Conclusion

We have shown the graph pipeline in DataProfiler. It follows the same workflow as the rest of DataProfiler: load the file with Data, pass it to Profiler, and call report() to get the profile.