ColumnName Labeler Tutorial

This notebook shows how to use the existing ColumnNameModel:

  1. Load and use the pre-existing ColumnNameModel

  2. Run the labeler

First, let’s import the libraries needed for this example.

[ ]:
import os
import sys
import json
from pprint import pprint

import pandas as pd

try:
    import dataprofiler as dp
except ImportError:
    sys.path.insert(0, '../..')
    import dataprofiler as dp

Loading and predicting with a pre-existing model using load_from_library

The easiest option is load_from_library, which loads a labeler by its name in the resources/ folder. This lets you quickly import any model from the Data Profiler’s library of models and start predicting.

[ ]:
labeler_from_library = dp.DataLabeler.load_from_library('column_name_labeler')
[ ]:
labeler_from_library.predict(data=["ssn"])

Loading and using the pre-existing column name labeler using load_with_components

For example purposes, we will import the existing ColumnName labeler via the load_with_components method of dp.DataLabeler. This shows more of the details of the data labeler’s flow.

[ ]:
parameters = {
    "true_positive_dict": [
        {"attribute": "ssn", "label": "ssn"},
        {"attribute": "suffix", "label": "name"},
        {"attribute": "my_home_address", "label": "address"},
    ],
    "false_positive_dict": [
        {"attribute": "contract_number", "label": "ssn"},
        {"attribute": "role", "label": "name"},
        {"attribute": "send_address", "label": "address"},
    ],
    "negative_threshold_config": 50,
    "positive_threshold_config": 85,
    "include_label": True,
}

label_mapping = {"ssn": 1, "name": 2, "address": 3}
[ ]:
# pre processor
preprocessor = dp.labelers.data_processing.DirectPassPreprocessor()

# model
from dataprofiler.labelers.column_name_model import ColumnNameModel
model = ColumnNameModel(
    parameters=parameters,
    label_mapping=label_mapping,
)


# post processor
postprocessor = dp.labelers.data_processing.ColumnNameModelPostprocessor()
[ ]:
data_labeler = dp.DataLabeler.load_with_components(
    preprocessor=preprocessor,
    model=model,
    postprocessor=postprocessor,
)
data_labeler.model.help()
[ ]:
pprint(data_labeler.label_mapping)
[ ]:
pprint(data_labeler.model._parameters)

Predicting with the ColumnName labeler

In the prediction below, the data is passed through two stages in the background:

  1. compare_negative: This stage first filters out likely false positives. Each string's similarity to the false_positive_dict values is checked, and if it is too close to a false positive, that string is removed from the data and not passed on to compare_positive.

  2. compare_positive: The remaining data is checked for similarity with the true_positive_dict values. As in the first stage, a threshold is applied: positive_threshold_config filters the results to only those data values whose similarity is greater than or equal to the value provided by the user.
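
The two stages described above can be sketched in plain Python. This is only an illustration of the filtering idea, not the library's implementation: the helper names similarity and two_stage_label are hypothetical, difflib's SequenceMatcher stands in for whatever similarity scorer the model actually uses, and the exact comparison operators inside dataprofiler may differ.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100 (a stand-in, not the library's scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def two_stage_label(data, true_positives, false_positives,
                    negative_threshold, positive_threshold):
    """Hypothetical helper: negative filtering first, then positive matching."""
    # Stage 1: drop any name that scores above the negative threshold
    # against at least one false-positive attribute.
    survivors = [
        name for name in data
        if all(similarity(name, fp["attribute"]) <= negative_threshold
               for fp in false_positives)
    ]

    # Stage 2: keep survivors whose best true-positive score meets the
    # positive threshold, and report the label of that best match.
    results = {}
    for name in survivors:
        best = max(true_positives,
                   key=lambda tp: similarity(name, tp["attribute"]))
        score = similarity(name, best["attribute"])
        if score >= positive_threshold:
            results[name] = (best["label"], score)
    return results


true_positives = [
    {"attribute": "ssn", "label": "ssn"},
    {"attribute": "suffix", "label": "name"},
    {"attribute": "my_home_address", "label": "address"},
]
false_positives = [
    {"attribute": "contract_number", "label": "ssn"},
    {"attribute": "role", "label": "name"},
    {"attribute": "send_address", "label": "address"},
]

labels = two_stage_label(
    ["ssn", "contract_number", "zzz"],
    true_positives, false_positives,
    negative_threshold=50, positive_threshold=85,
)
print(labels)
```

In this sketch, "contract_number" is removed in the first stage because it matches a false positive exactly, "zzz" survives the first stage but matches no true positive closely enough, and only "ssn" receives a label.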

[ ]:
# evaluate a prediction using the default parameters
data_labeler.predict(data=["ssn", "name", "address"])

Replacing the parameters in the existing labeler

We can achieve this by:

  1. Setting the label mapping to the new labels

  2. Setting the model parameters, which include: true_positive_dict, false_positive_dict, negative_threshold_config, positive_threshold_config, and include_label

where true_positive_dict and false_positive_dict are lists of dicts, negative_threshold_config and positive_threshold_config are integer values between 0 and 100, and include_label is a boolean value that determines if the output should include the prediction labels or only the confidence values.
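
A small check of the parameter shapes described above can catch mistakes before they reach the model. The function find_param_problems below is a hypothetical helper, not part of dataprofiler, and it only encodes the constraints stated in this tutorial; the library's own validation may be stricter or looser.

```python
def find_param_problems(params):
    """Hypothetical checker: return a list of problems (empty if none found)."""
    problems = []

    # Both dictionaries are expected to be lists of dicts.
    for key in ("true_positive_dict", "false_positive_dict"):
        entries = params.get(key)
        if not isinstance(entries, list) or not all(
            isinstance(entry, dict) for entry in entries
        ):
            problems.append(f"{key} must be a list of dicts")

    # Both thresholds are expected to be integers between 0 and 100.
    for key in ("negative_threshold_config", "positive_threshold_config"):
        value = params.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= 100:
            problems.append(f"{key} must be an integer between 0 and 100")

    # include_label is expected to be a boolean.
    if not isinstance(params.get("include_label"), bool):
        problems.append("include_label must be a boolean")

    return problems


example_params = {
    "true_positive_dict": [{"attribute": "ssn", "label": "ssn"}],
    "false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}],
    "negative_threshold_config": 50,
    "positive_threshold_config": 85,
    "include_label": True,
}
print(find_param_problems(example_params))
```

Returning a list of messages rather than raising on the first failure makes it easy to report every problem at once.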

Below, we create four labels, where other is the default_label.

[ ]:
data_labeler.set_labels({'other': 0, "funky_one": 1, "funky_two": 2, "funky_three": 3})
data_labeler.model.set_params(
    true_positive_dict=[
        {"attribute": "ssn", "label": "funky_one"},
        {"attribute": "suffix", "label": "funky_two"},
        {"attribute": "my_home_address", "label": "funky_three"},
    ],
    false_positive_dict=[
        {"attribute": "contract_number", "label": "ssn"},
        {"attribute": "role", "label": "name"},
        {"attribute": "not_my_address", "label": "address"},
    ],
    negative_threshold_config=50,
    positive_threshold_config=85,
    include_label=True,
)
data_labeler.label_mapping
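
After replacing labels, it can be useful to confirm that every label referenced in true_positive_dict also exists in the label mapping. The helper labels_missing_from_mapping below is hypothetical and not part of dataprofiler; it assumes that such a mismatch is something you would want to catch before predicting.

```python
def labels_missing_from_mapping(true_positive_dict, label_mapping):
    """Hypothetical helper: labels used by true positives but absent from the mapping."""
    used = {entry["label"] for entry in true_positive_dict}
    return sorted(used - set(label_mapping))


new_mapping = {"other": 0, "funky_one": 1, "funky_two": 2, "funky_three": 3}
new_true_positives = [
    {"attribute": "ssn", "label": "funky_one"},
    {"attribute": "suffix", "label": "funky_two"},
    {"attribute": "my_home_address", "label": "funky_three"},
]
print(labels_missing_from_mapping(new_true_positives, new_mapping))
```

An empty list means every true-positive label is present in the mapping.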

Predicting with the new labels

Here we test the predict() method with brand-new labels in label_mapping. As we can see, the new labels flow through to the output of the data labeler.

[ ]:
data_labeler.predict(data=["ssn", "suffix"], predict_options=dict(show_confidences=True))

Saving the Data Labeler for future use

[ ]:
if not os.path.isdir('new_column_name_labeler'):
    os.mkdir('new_column_name_labeler')
data_labeler.save_to_disk('new_column_name_labeler')

Loading the saved Data Labeler

[ ]:
saved_labeler = dp.DataLabeler.load_from_disk('new_column_name_labeler')
[ ]:
# ensure the parameters are what we saved.
print("label_mapping:")
pprint(saved_labeler.label_mapping)
print("\nmodel parameters:")
pprint(saved_labeler.model._parameters)
print()
print("postprocessor: " + saved_labeler.postprocessor.__class__.__name__)
[ ]:
# predicting with the loaded labeler.
saved_labeler.predict(["ssn", "name", "address"])