ColumnName Labeler Tutorial
This notebook teaches how to use the existing ColumnNameModel:
- Loading and utilizing the pre-existing ColumnNameModel
- Running the labeler
First, let’s import the libraries needed for this example.
[ ]:
import os
import sys
import json
from pprint import pprint
import pandas as pd
try:
    import dataprofiler as dp
except ImportError:
    # fall back to the local source tree when the package is not installed
    sys.path.insert(0, '../..')
    import dataprofiler as dp
Loading and predicting using a pre-existing model using load_from_library
The easiest option is to call load_from_library with the name of a labeler in the resources/ folder. This lets you quickly import and start predicting with any model available in the Data Profiler's library of models.
[ ]:
labeler_from_library = dp.DataLabeler.load_from_library('column_name_labeler')
[ ]:
labeler_from_library.predict(data=["ssn"])
Loading and using the pre-existing column name labeler using load_with_components
For example purposes, we will load the existing ColumnName labeler via the load_with_components command of dp.DataLabeler. This shows a bit more of the detail of the data labeler's flow.
[ ]:
parameters = {
    "true_positive_dict": [
        {"attribute": "ssn", "label": "ssn"},
        {"attribute": "suffix", "label": "name"},
        {"attribute": "my_home_address", "label": "address"},
    ],
    "false_positive_dict": [
        {
            "attribute": "contract_number",
            "label": "ssn",
        },
        {
            "attribute": "role",
            "label": "name",
        },
        {
            "attribute": "send_address",
            "label": "address",
        },
    ],
    "negative_threshold_config": 50,
    "positive_threshold_config": 85,
    "include_label": True,
}
label_mapping = {"ssn": 1, "name": 2, "address": 3}
[ ]:
# pre processor
preprocessor = dp.labelers.data_processing.DirectPassPreprocessor()
# model
from dataprofiler.labelers.column_name_model import ColumnNameModel
model = ColumnNameModel(
parameters=parameters,
label_mapping=label_mapping,
)
# post processor
postprocessor = dp.labelers.data_processing.ColumnNameModelPostprocessor()
[ ]:
data_labeler = dp.DataLabeler.load_with_components(
    preprocessor=preprocessor,
    model=model,
    postprocessor=postprocessor,
)
data_labeler.model.help()
[ ]:
pprint(data_labeler.label_mapping)
[ ]:
pprint(data_labeler.model._parameters)
Predicting with the ColumnName labeler
In the prediction below, the data passes through two stages in the background:
1) compare_negative: This stage first filters out anything likely to be flagged as a false positive. The similarity of each string in the data to the false_positive_dict values is checked against the negative_threshold_config, and any string that is too close to a false positive is removed and not passed on to compare_positive.
2) compare_positive: The remaining data is then checked for similarity with the true_positive_dict values. Here the positive_threshold_config is used to filter the results to only those data values whose similarity is greater than or equal to the threshold provided by the user.
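The two stages can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names similarity, compare_negative, and compare_positive below are hypothetical stand-ins, difflib.SequenceMatcher is a substitute similarity score scaled to 0-100, and treating "too close" as strictly above the negative threshold is an assumption.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: not the library's metric)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_negative(data, false_positive_dict, negative_threshold):
    """Keep only names that are not too similar to any false-positive attribute."""
    kept = []
    for name in data:
        closest = max(
            (similarity(name, fp["attribute"]) for fp in false_positive_dict),
            default=0.0,
        )
        # assumption: a name is dropped when its score exceeds the threshold
        if closest <= negative_threshold:
            kept.append(name)
    return kept


def compare_positive(data, true_positive_dict, positive_threshold):
    """Return (name, label, score) for names whose best match meets the threshold."""
    results = []
    for name in data:
        best = max(true_positive_dict, key=lambda tp: similarity(name, tp["attribute"]))
        score = similarity(name, best["attribute"])
        if score >= positive_threshold:
            results.append((name, best["label"], score))
    return results


true_positive_dict = [
    {"attribute": "ssn", "label": "ssn"},
    {"attribute": "suffix", "label": "name"},
    {"attribute": "my_home_address", "label": "address"},
]
false_positive_dict = [
    {"attribute": "contract_number", "label": "ssn"},
    {"attribute": "role", "label": "name"},
    {"attribute": "send_address", "label": "address"},
]

# "contract_number" is removed in stage 1; "zip" survives it but matches nothing in stage 2
survivors = compare_negative(["ssn", "contract_number", "zip"], false_positive_dict, 50)
matches = compare_positive(survivors, true_positive_dict, 85)
```

Running the negative filter first means a name that resembles a known false positive never reaches the positive comparison, whatever it might have scored there.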
[ ]:
# evaluate a prediction using the default parameters
data_labeler.predict(data=["ssn", "name", "address"])
Replacing the parameters in the existing labeler
We can achieve this by:
1. Setting the label mapping to the new labels
2. Setting the model parameters, which include true_positive_dict, false_positive_dict, negative_threshold_config, positive_threshold_config, and include_label
Here true_positive_dict and false_positive_dict are lists of dicts, negative_threshold_config and positive_threshold_config are integer values between 0 and 100, and include_label is a boolean value that determines whether the output should include the prediction labels or only the confidence values.
Below, we create 4 labels, where other is the default_label.
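The parameter shapes described above can be checked before calling set_params. The following is a rough sketch only: check_column_name_params is a hypothetical helper, not part of dataprofiler, and the required "attribute" and "label" keys are inferred from the examples in this notebook rather than from the library's own validation.

```python
def check_column_name_params(params):
    """Return a list of problems; an empty list means the shapes look right."""
    problems = []

    for key in ("true_positive_dict", "false_positive_dict"):
        entries = params.get(key)
        if not isinstance(entries, list):
            problems.append(f"{key} should be a list of dicts")
            continue
        for i, entry in enumerate(entries):
            if not isinstance(entry, dict) or "attribute" not in entry:
                problems.append(f"{key}[{i}] should be a dict with an 'attribute' key")
            elif key == "true_positive_dict" and "label" not in entry:
                # assumption: every true-positive entry names the label it maps to
                problems.append(f"{key}[{i}] should also have a 'label' key")

    for key in ("negative_threshold_config", "positive_threshold_config"):
        value = params.get(key)
        # bool is a subclass of int in Python, so it is excluded explicitly
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= 100:
            problems.append(f"{key} should be an integer between 0 and 100")

    if not isinstance(params.get("include_label"), bool):
        problems.append("include_label should be a boolean")

    return problems


good_params = {
    "true_positive_dict": [{"attribute": "ssn", "label": "funky_one"}],
    "false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}],
    "negative_threshold_config": 50,
    "positive_threshold_config": 85,
    "include_label": True,
}
bad_params = dict(good_params, positive_threshold_config=185, include_label="yes")

good_problems = check_column_name_params(good_params)
bad_problems = check_column_name_params(bad_params)
```

Collecting every problem in a list, rather than raising on the first one, makes it easier to fix a whole parameter set in one pass.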
[ ]:
data_labeler.set_labels({'other': 0, "funky_one": 1, "funky_two": 2, "funky_three": 3})
data_labeler.model.set_params(
    true_positive_dict=[
        {"attribute": "ssn", "label": "funky_one"},
        {"attribute": "suffix", "label": "funky_two"},
        {"attribute": "my_home_address", "label": "funky_three"},
    ],
    false_positive_dict=[
        {
            "attribute": "contract_number",
            "label": "ssn",
        },
        {
            "attribute": "role",
            "label": "name",
        },
        {
            "attribute": "not_my_address",
            "label": "address",
        },
    ],
    negative_threshold_config=50,
    positive_threshold_config=85,
    include_label=True,
)
data_labeler.label_mapping
Predicting with the new labels
Here we test the predict() method with brand new labels for label_mapping. As we can see, the new labels flow through to the output of the data labeler.
[ ]:
data_labeler.predict(data=["ssn", "suffix"], predict_options=dict(show_confidences=True))
Saving the Data Labeler for future use
[ ]:
# create the target directory if it does not exist yet
if not os.path.isdir('new_column_name_labeler'):
    os.mkdir('new_column_name_labeler')
data_labeler.save_to_disk('new_column_name_labeler')
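The directory check in the cell above can also be written with os.makedirs and exist_ok=True, which creates the folder if needed and does nothing when it already exists. Here is a small sketch that uses a temporary directory so it leaves nothing behind. The save_to_disk call is left as a comment because it depends on the data_labeler object built earlier.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workspace:
    save_dir = os.path.join(workspace, "new_column_name_labeler")

    os.makedirs(save_dir, exist_ok=True)
    # calling it again is safe: no error is raised for an existing directory
    os.makedirs(save_dir, exist_ok=True)
    created = os.path.isdir(save_dir)

    # data_labeler.save_to_disk(save_dir)  # the labeler would be saved here
```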
Loading the saved Data Labeler
[ ]:
saved_labeler = dp.DataLabeler.load_from_disk('new_column_name_labeler')
[ ]:
# ensuring the parameters are what we saved.
print("label_mapping:")
pprint(saved_labeler.label_mapping)
print("\nmodel parameters:")
pprint(saved_labeler.model._parameters)
print()
print("postprocessor: " + saved_labeler.postprocessor.__class__.__name__)
[ ]:
# predicting with the loaded labeler.
saved_labeler.predict(["ssn", "name", "address"])