ColumnName Labeler Tutorial
This notebook teaches how to use the existing ColumnNameModel:
Loading and utilizing the pre-existing ColumnNameModel
Running the labeler
First, let’s import the libraries needed for this example.
[ ]:
import os
import sys
import json
from pprint import pprint
import pandas as pd
try:
    import dataprofiler as dp
except ImportError:
    # fall back to a local checkout two directories up
    sys.path.insert(0, '../..')
    import dataprofiler as dp
Loading and predicting using a pre-existing model using load_from_library
The easiest option is load_from_library, which loads a labeler by its name in the resources/ folder. This lets you quickly import and start predicting with any model available in the Data Profiler’s library of models.
[ ]:
labeler_from_library = dp.DataLabeler.load_from_library('column_name_labeler')
[ ]:
labeler_from_library.predict(data=["ssn"])
Loading and using the pre-existing column name labeler using load_with_components
For example purposes, we will import the existing ColumnName labeler via the load_with_components command from dp.DataLabeler. This shows more of the details of the data labeler’s flow.
[ ]:
parameters = {
    "true_positive_dict": [
        {"attribute": "ssn", "label": "ssn"},
        {"attribute": "suffix", "label": "name"},
        {"attribute": "my_home_address", "label": "address"},
    ],
    "false_positive_dict": [
        {
            "attribute": "contract_number",
            "label": "ssn",
        },
        {
            "attribute": "role",
            "label": "name",
        },
        {
            "attribute": "send_address",
            "label": "address",
        },
    ],
    "negative_threshold_config": 50,
    "positive_threshold_config": 85,
    "include_label": True,
}
label_mapping = {"ssn": 1, "name": 2, "address": 3}
[ ]:
# pre processor
preprocessor = dp.labelers.data_processing.DirectPassPreprocessor()
# model
from dataprofiler.labelers.column_name_model import ColumnNameModel
model = ColumnNameModel(
    parameters=parameters,
    label_mapping=label_mapping,
)
# post processor
postprocessor = dp.labelers.data_processing.ColumnNameModelPostprocessor()
[ ]:
data_labeler = dp.DataLabeler.load_with_components(
    preprocessor=preprocessor,
    model=model,
    postprocessor=postprocessor,
)
data_labeler.model.help()
[ ]:
pprint(data_labeler.label_mapping)
[ ]:
pprint(data_labeler.model._parameters)
Predicting with the ColumnName labeler
In the prediction below, the data is passed through two stages in the background:
compare_negative: The idea behind compare_negative is to first filter out any possibility of flagging a false positive in the model prediction. In this step, the confidence value is checked, and if the similarity is too close to being a false positive, that particular string in the data is removed and not passed on to compare_positive.
compare_positive: Finally, the data is passed to the compare_positive step and checked for similarity with the true_positive_dict values. Again, during this stage the positive_threshold_config is used to filter the results to only those data values whose similarity is greater than or equal to the positive_threshold_config provided by the user.
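The two stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the similarity helper uses difflib.SequenceMatcher from the standard library as a stand-in scorer, and the function two_stage_predict, its argument names, the output keys, and the exact comparison used against the negative threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stand-in scorer, an assumption)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def two_stage_predict(data, true_positives, false_positives,
                      negative_threshold=50, positive_threshold=85):
    """Hypothetical sketch of the compare_negative / compare_positive flow."""
    # Stage 1 (compare_negative): drop names that score above the negative
    # threshold against any known false positive.
    survivors = [
        name for name in data
        if all(similarity(name, fp["attribute"]) <= negative_threshold
               for fp in false_positives)
    ]

    # Stage 2 (compare_positive): keep the best true-positive match for each
    # surviving name, but only if it reaches the positive threshold.
    results = []
    for name in survivors:
        scored = [(similarity(name, tp["attribute"]), tp["label"])
                  for tp in true_positives]
        if not scored:
            continue
        best_score, best_label = max(scored, key=lambda pair: pair[0])
        if best_score >= positive_threshold:
            results.append(
                {"attribute": name, "label": best_label, "score": best_score}
            )
    return results


TRUE_POSITIVES = [
    {"attribute": "ssn", "label": "ssn"},
    {"attribute": "suffix", "label": "name"},
    {"attribute": "my_home_address", "label": "address"},
]
FALSE_POSITIVES = [
    {"attribute": "contract_number", "label": "ssn"},
    {"attribute": "role", "label": "name"},
    {"attribute": "send_address", "label": "address"},
]

predictions = two_stage_predict(
    ["ssn", "contract_number"], TRUE_POSITIVES, FALSE_POSITIVES
)
print(predictions)
```

In this sketch, "contract_number" is removed in the first stage because it matches a false positive exactly, while "ssn" survives and is matched in the second stage. Scores from the real labeler may differ because its similarity measure is not reproduced here.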
[ ]:
# evaluate a prediction using the default parameters
data_labeler.predict(data=["ssn", "name", "address"])
Replacing the parameters in the existing labeler
We can achieve this by:
Setting the label mapping to the new labels
Setting the model parameters which include:
true_positive_dict, false_positive_dict, negative_threshold_config, positive_threshold_config, and include_label
where true_positive_dict and false_positive_dict are lists of dicts, negative_threshold_config and positive_threshold_config are integer values between 0 and 100, and include_label is a boolean value that determines if the output should include the prediction labels or only the confidence values.
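The parameter shapes described above can be checked before they are handed to the model. The helper below is a rough sketch: validate_parameters is a hypothetical name, not a Data Profiler function, and the requirement that every entry carries both an "attribute" and a "label" key is an assumption that mirrors the examples in this notebook.

```python
def validate_parameters(params: dict) -> list:
    """Return a list of problems found (empty if the parameters look valid)."""
    problems = []

    # Both dictionaries are expected to be lists of dicts.
    for key in ("true_positive_dict", "false_positive_dict"):
        entries = params.get(key)
        if not isinstance(entries, list):
            problems.append(f"{key} must be a list of dicts")
            continue
        for i, entry in enumerate(entries):
            # Assumption: each entry has "attribute" and "label" keys,
            # as in the examples shown in this notebook.
            if not isinstance(entry, dict) or not {"attribute", "label"} <= set(entry):
                problems.append(f"{key}[{i}] needs 'attribute' and 'label' keys")

    # Thresholds are integers between 0 and 100.
    for key in ("negative_threshold_config", "positive_threshold_config"):
        value = params.get(key)
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 100:
            problems.append(f"{key} must be an integer between 0 and 100")

    # include_label toggles whether labels appear in the output.
    if not isinstance(params.get("include_label"), bool):
        problems.append("include_label must be a boolean")

    return problems


good_params = {
    "true_positive_dict": [{"attribute": "ssn", "label": "funky_one"}],
    "false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}],
    "negative_threshold_config": 50,
    "positive_threshold_config": 85,
    "include_label": True,
}
bad_params = dict(good_params, positive_threshold_config=150)

print(validate_parameters(good_params))
print(validate_parameters(bad_params))
```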
Below, we create four labels, where other is the default_label.
[ ]:
data_labeler.set_labels({'other': 0, "funky_one": 1, "funky_two": 2, "funky_three": 3})
data_labeler.model.set_params(
    true_positive_dict=[
        {"attribute": "ssn", "label": "funky_one"},
        {"attribute": "suffix", "label": "funky_two"},
        {"attribute": "my_home_address", "label": "funky_three"},
    ],
    false_positive_dict=[
        {
            "attribute": "contract_number",
            "label": "ssn",
        },
        {
            "attribute": "role",
            "label": "name",
        },
        {
            "attribute": "not_my_address",
            "label": "address",
        },
    ],
    negative_threshold_config=50,
    positive_threshold_config=85,
    include_label=True,
)
data_labeler.label_mapping
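After swapping labels, it can be useful to confirm that every label referenced in true_positive_dict also appears in the label mapping. The following is a small sanity-check sketch; labels_missing_from_mapping is a hypothetical helper introduced here, not part of the Data Profiler API, and whether the library enforces this itself is not assumed.

```python
def labels_missing_from_mapping(true_positive_dict, label_mapping):
    """Return true-positive labels that have no entry in label_mapping."""
    used = {entry["label"] for entry in true_positive_dict}
    return sorted(used - set(label_mapping))


new_mapping = {"other": 0, "funky_one": 1, "funky_two": 2, "funky_three": 3}
new_true_positives = [
    {"attribute": "ssn", "label": "funky_one"},
    {"attribute": "suffix", "label": "funky_two"},
    {"attribute": "my_home_address", "label": "funky_three"},
]
old_true_positives = [
    {"attribute": "ssn", "label": "ssn"},
    {"attribute": "suffix", "label": "name"},
    {"attribute": "my_home_address", "label": "address"},
]

# The new true positives line up with the new mapping ...
print(labels_missing_from_mapping(new_true_positives, new_mapping))
# ... while the old ones would reference labels that no longer exist.
print(labels_missing_from_mapping(old_true_positives, new_mapping))
```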
Predicting with the new labels
Here we test the predict() method with brand new labels for label_mapping. As we can see, the new labels flow through to the output of the data labeler.
[ ]:
data_labeler.predict(data=["ssn", "suffix"], predict_options=dict(show_confidences=True))
Saving the Data Labeler for future use
[ ]:
if not os.path.isdir('new_column_name_labeler'):
    os.mkdir('new_column_name_labeler')
data_labeler.save_to_disk('new_column_name_labeler')
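The directory check above can also be written with os.makedirs and exist_ok=True, which creates the directory when it is missing and does nothing when it already exists. The sketch below demonstrates this inside a temporary directory so that it leaves no files behind; the save_to_disk call itself is omitted, and the variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "new_column_name_labeler")

    # Creates the directory on the first call.
    os.makedirs(target, exist_ok=True)
    # A second call does not raise, because exist_ok=True.
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)

print(created)
```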
Loading the saved Data Labeler
[ ]:
saved_labeler = dp.DataLabeler.load_from_disk('new_column_name_labeler')
[ ]:
# ensuring the parameters are what we saved.
print("label_mapping:")
pprint(saved_labeler.label_mapping)
print("\nmodel parameters:")
pprint(saved_labeler.model._parameters)
print()
print("postprocessor: " + saved_labeler.postprocessor.__class__.__name__)
[ ]:
# predicting with the loaded labeler.
saved_labeler.predict(["ssn", "name", "address"])