Labeler (Sensitive Data)

In this library, the term data labeling refers to entity recognition.

Built into the data profiler is a classifier that evaluates the complex data types of a dataset. For structured data, it determines the complex data type of each column. When a data profile is run, it uses the default data labeling model built into the library. Users can also train their own data labeler.

Data labels are determined per cell for structured data (column/row when the profiler is used) or at the character level for unstructured data. The default labels are:

  • UNKNOWN

  • ADDRESS

  • BAN (bank account number, 10-18 digits)

  • CREDIT_CARD

  • EMAIL_ADDRESS

  • UUID

  • HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)

  • IPV4

  • IPV6

  • MAC_ADDRESS

  • PERSON

  • PHONE_NUMBER

  • SSN

  • URL

  • US_STATE

  • DRIVERS_LICENSE

  • DATE

  • TIME

  • DATETIME

  • INTEGER

  • FLOAT

  • QUANTITY

  • ORDINAL

Identify Entities in Structured Data

Make predictions and identify labels:

import dataprofiler as dp

# load data and data labeler
data = dp.Data("your_data.csv")
data_labeler = dp.DataLabeler(labeler_type='structured')

# make predictions and get labels per cell
predictions = data_labeler.predict(data)

Identify Entities in Unstructured Data

Predict which class characters belong to in unstructured text:

import dataprofiler as dp

data_labeler = dp.DataLabeler(labeler_type='unstructured')

# Example sample string, must be in an array (multiple arrays can be passed)
sample = ["Help\tJohn Macklemore\tneeds\tfood.\tPlease\tCall\t555-301-1234."
          "\tHis\tssn\tis\tnot\t334-97-1234. I'm a BAN: 000043219499392912.\n"]

# Predict which class each character belongs to
model_predictions = data_labeler.predict(
    sample, predict_options=dict(show_confidences=True))

# Predictions / confidences are at the character level
final_results = model_predictions["pred"]
final_confidences = model_predictions["conf"]
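Character-level output can be pictured as one confidence vector per character, with the predicted label being the highest-scoring entry. The sketch below is library-independent and uses made-up numbers. The three-entry `labels` list, the shape of `confidences`, and the helper name `argmax_labels` are assumptions for illustration, so the real output shapes should be checked against the library.

```python
def argmax_labels(confidences, labels):
    """Pick the highest-scoring label for each character."""
    picked = []
    for scores in confidences:
        best_index = max(range(len(scores)), key=scores.__getitem__)
        picked.append(labels[best_index])
    return picked


labels = ["UNKNOWN", "PERSON", "SSN"]  # hypothetical subset of the label set
text = "Al 1"
# One score list per character of `text` (made-up numbers, not model output)
confidences = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.3, 0.1, 0.6],
]

for char, label in zip(text, argmax_labels(confidences, labels)):
    print(repr(char), label)
```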

It's also possible to change the output format, for example to one similar to the spaCy format:

import dataprofiler as dp

data_labeler = dp.DataLabeler(labeler_type='unstructured', trainable=True)

# Example sample string, must be in an array (multiple arrays can be passed)
sample = ["Help\tJohn Macklemore\tneeds\tfood.\tPlease\tCall\t555-301-1234."
          "\tHis\tssn\tis\tnot\t334-97-1234. I'm a BAN: 000043219499392912.\n"]

# Set the output to the NER format (start position, end position, label)
data_labeler.set_params(
    { 'postprocessor': { 'output_format':'ner', 'use_word_level_argmax':True } }
)

results = data_labeler.predict(sample)

print(results)
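To make the (start position, end position, label) format concrete, the following sketch collapses a per-character label sequence into such tuples. It does not call the library. The helper `to_spans`, the hand-written labels, and the choice of an exclusive end position are assumptions for illustration.

```python
from itertools import groupby


def to_spans(char_labels, background="UNKNOWN"):
    """Collapse per-character labels into (start, end, label) tuples.

    `end` is exclusive here, so text[start:end] recovers the entity text.
    """
    spans = []
    position = 0
    for label, run in groupby(char_labels):
        length = len(list(run))
        if label != background:
            spans.append((position, position + length, label))
        position += length
    return spans


text = "Call 555-301-1234 now"
# Hand-written labels, one per character of `text` (not model output)
char_labels = ["UNKNOWN"] * 5 + ["PHONE_NUMBER"] * 12 + ["UNKNOWN"] * 4

for start, end, label in to_spans(char_labels):
    print(start, end, label, repr(text[start:end]))
```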

Train a New Data Labeler

Train your own data labeler on your own set of structured (tabular) data:

import dataprofiler as dp

# Will need one column with a default label of UNKNOWN
data = dp.Data("your_file.csv")

data_labeler = dp.train_structured_labeler(
    data=data,
    save_dirpath="/path/to/save/labeler",
    epochs=2
)

data_labeler.save_to_disk("my/save/path") # Saves the data labeler for reuse

Load an Existing Data Labeler

Load an existing data labeler:

import dataprofiler as dp

data_labeler = dp.DataLabeler(
    labeler_type='structured', dirpath="/path/to/my/labeler")

# get information about the parameters/inputs/output formats for the DataLabeler
data_labeler.help()

Extending a Data Labeler with Transfer Learning

The labels of a data labeler can be extended or changed with transfer learning. Note: By default, a loaded labeler is not trainable. To load a trainable DataLabeler, set trainable=True or load the labeler using the TrainableDataLabeler class.

The following illustrates how to change the labels:

import dataprofiler as dp

labels = ['label1', 'label2', ...]  # new label set can also be an encoding dict
data = dp.Data("your_file.csv")  # contains data with new labels

# load default structured Data Labeler w/ trainable set to True
data_labeler = dp.DataLabeler(labeler_type='structured', trainable=True)

# this will use transfer learning to retrain the data labeler on your new
# dataset and labels.
# NOTE: data must be in an acceptable format for the preprocessor to interpret.
#       please refer to the preprocessor/model for the expected data format.
#       Currently, the DataLabeler cannot take in Tabular data, but requires
#       data to be ingested with two columns [X, y] where X is the samples and
#       y is the labels.
model_results = data_labeler.fit(x=data['samples'], y=data['labels'],
                                 validation_split=0.2, epochs=2, labels=labels)

# final_results, final_confidences are a list of results for each epoch
epoch_id = 0
final_results = model_results[epoch_id]["pred"]
final_confidences = model_results[epoch_id]["conf"]
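Because training expects two columns [X, y] rather than a table, tabular data has to be flattened first. The sketch below shows one way that could look using plain Python structures. The helper `flatten_to_xy`, the example table, and the column-to-label mapping are hypothetical and assume each column carries a single known label.

```python
def flatten_to_xy(table, column_labels):
    """Turn {column: [cells]} into parallel sample and label lists."""
    samples, labels = [], []
    for column, cells in table.items():
        for cell in cells:
            samples.append(str(cell))  # every sample is treated as text
            labels.append(column_labels[column])
    return samples, labels


# Hypothetical table and per-column labels
table = {"email": ["a@example.com", "b@example.com"], "count": [12345]}
column_labels = {"email": "EMAIL_ADDRESS", "count": "INTEGER"}

x, y = flatten_to_xy(table, column_labels)
for sample, label in zip(x, y):
    print(sample, label)
```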

The following illustrates how to extend the labels:

import dataprofiler as dp

new_labels = ['label1', 'label2', ...]
data = dp.Data("your_file.csv")  # contains data with new labels

# load default structured Data Labeler w/ trainable set to True
data_labeler = dp.DataLabeler(labeler_type='structured', trainable=True)

# this will maintain current labels and model weights, but extend the model's
# labels
for label in new_labels:
    data_labeler.add_label(label)

# NOTE: a user can also add a label which maps to the same index as an existing
# label
# data_labeler.add_label(label, same_as='<label_name>')

# For a trainable model, the user must then train the model to be able to
# continue using the labeler since the model's graph has likely changed
# NOTE: data must be in an acceptable format for the preprocessor to interpret.
#       please refer to the preprocessor/model for the expected data format.
#       Currently, the DataLabeler cannot take in Tabular data, but requires
#       data to be ingested with two columns [X, y] where X is the samples and
#       y is the labels.
model_results = data_labeler.fit(x=data['samples'], y=data['labels'],
                                 validation_split=0.2, epochs=2)

# final_results, final_confidences are a list of results for each epoch
epoch_id = 0
final_results = model_results[epoch_id]["pred"]
final_confidences = model_results[epoch_id]["conf"]
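The idea of a label mapping to the same index as an existing label can be illustrated with a plain dictionary. This is a sketch of the concept only, not the library's implementation. The function `add_label_to_mapping` and the labels PASSPORT and HOME_ADDRESS are hypothetical.

```python
def add_label_to_mapping(label_mapping, label, same_as=None):
    """Return a copy of `label_mapping` with `label` added.

    With `same_as`, the new label shares the existing label's index;
    otherwise it receives the next unused index.
    """
    if label in label_mapping:
        raise ValueError(f"label already exists: {label}")
    updated = dict(label_mapping)
    if same_as is not None:
        updated[label] = label_mapping[same_as]  # KeyError if `same_as` is unknown
    else:
        updated[label] = max(label_mapping.values(), default=-1) + 1
    return updated


mapping = {"UNKNOWN": 0, "ADDRESS": 1}
mapping = add_label_to_mapping(mapping, "PASSPORT")
mapping = add_label_to_mapping(mapping, "HOME_ADDRESS", same_as="ADDRESS")
print(mapping)
```

A label added with `same_as` does not enlarge the set of distinct indices, whereas a label with a fresh index does, which is consistent with the note above that the model must be retrained after its labels change.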

Changing pipeline parameters:

import dataprofiler as dp

# load default Data Labeler
data_labeler = dp.DataLabeler(labeler_type='structured')

# change parameters of specific component
data_labeler.preprocessor.set_params({'param1': 'value1'})

# change multiple simultaneously.
data_labeler.set_params({
    'preprocessor':  {'param1': 'value1'},
    'model':         {'param2': 'value2'},
    'postprocessor': {'param3': 'value3'}
})
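The nested dictionary form can be understood as routing each inner dictionary to the component named by its key. The stand-in classes below sketch that routing without the library. `SketchComponent`, `SketchLabeler`, and the parameter names `max_length`, `dropout`, and `output_format` are hypothetical.

```python
class SketchComponent:
    """Stand-in for a pipeline component that stores parameters."""

    def __init__(self, **defaults):
        self.params = dict(defaults)

    def set_params(self, params):
        unknown = set(params) - set(self.params)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        self.params.update(params)


class SketchLabeler:
    """Stand-in labeler holding three components."""

    COMPONENTS = ("preprocessor", "model", "postprocessor")

    def __init__(self):
        self.preprocessor = SketchComponent(max_length=100)
        self.model = SketchComponent(dropout=0.1)
        self.postprocessor = SketchComponent(output_format="character_argmax")

    def set_params(self, params):
        # Reject unknown component names before changing anything
        for name in params:
            if name not in self.COMPONENTS:
                raise ValueError(f"unknown component: {name}")
        for name, component_params in params.items():
            getattr(self, name).set_params(component_params)


labeler = SketchLabeler()
labeler.set_params({
    "preprocessor": {"max_length": 50},
    "postprocessor": {"output_format": "ner"},
})
print(labeler.preprocessor.params, labeler.model.params, labeler.postprocessor.params)
```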

Build Your Own Data Labeler

The DataLabeler has 3 main components: preprocessor, model, and postprocessor. To create your own DataLabeler, each component must either be created or reused from an existing labeler.

Given a set of the 3 components, you can construct your own DataLabeler. One option is to swap out specific components of an existing labeler:

import dataprofiler as dp
from dataprofiler.labelers.character_level_cnn_model import \
    CharacterLevelCnnModel
from dataprofiler.labelers.data_processing import \
    StructCharPreprocessor, StructCharPostprocessor

model = CharacterLevelCnnModel(...)
preprocessor = StructCharPreprocessor(...)
postprocessor = StructCharPostprocessor(...)

data_labeler = dp.DataLabeler(labeler_type='structured')
data_labeler.set_preprocessor(preprocessor)
data_labeler.set_model(model)
data_labeler.set_postprocessor(postprocessor)

# check for basic compatibility between the processors and the model
data_labeler.check_pipeline()

Model Component

In order to create your own model component for data labeling, you can utilize the BaseModel class from dataprofiler.labelers.base_model and override the abstract class methods.

Reviewing CharacterLevelCnnModel from dataprofiler.labelers.character_level_cnn_model illustrates the functions which need an override.

  1. __init__: specifying default parameters and calling base __init__

  2. _validate_parameters: validating parameters given by user during setting

  3. _need_to_reconstruct_model: flag for when to reconstruct a model (e.g. a change in parameters or labels that requires model reconstruction)

  4. _construct_model: initial construction of the model given the parameters

  5. _reconstruct_model: updates model architecture for new label set while maintaining current model weights

  6. fit: mechanism for the model to learn given training data

  7. predict: mechanism for model to make predictions on data

  8. details: prints a summary of the model construction

  9. save_to_disk: saves model and model parameters to disk

  10. load_from_disk: loads model given a path on disk
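The shape of such a component can be sketched with a toy model that mirrors several of the method names listed above. The base class here is a stand-in written for this example and is not dataprofiler's BaseModel. The method signatures, the `default_label` parameter, and the file name `model.json` are assumptions. The construction and reconstruction hooks are omitted because the toy model has no architecture to rebuild.

```python
import json
import os
from abc import ABC, abstractmethod
from collections import Counter


class SketchBaseModel(ABC):
    """Stand-in base class; NOT dataprofiler's BaseModel."""

    def __init__(self, label_mapping, parameters):
        self.label_mapping = dict(label_mapping)
        self._validate_parameters(parameters)
        self._parameters = dict(parameters)

    @abstractmethod
    def _validate_parameters(self, parameters): ...

    @abstractmethod
    def fit(self, samples, labels): ...

    @abstractmethod
    def predict(self, samples): ...


class MajorityModel(SketchBaseModel):
    """Toy model: always predicts the most common training label."""

    def __init__(self, label_mapping, parameters=None):
        parameters = {"default_label": "UNKNOWN", **(parameters or {})}
        super().__init__(label_mapping, parameters)
        self._majority = self._parameters["default_label"]

    def _validate_parameters(self, parameters):
        if parameters["default_label"] not in self.label_mapping:
            raise ValueError("default_label must be a known label")

    def fit(self, samples, labels):
        if labels:
            self._majority = Counter(labels).most_common(1)[0][0]

    def predict(self, samples):
        return [self._majority for _ in samples]

    def details(self):
        print(f"MajorityModel(majority={self._majority!r})")

    def save_to_disk(self, dirpath):
        os.makedirs(dirpath, exist_ok=True)
        state = {"label_mapping": self.label_mapping,
                 "parameters": self._parameters,
                 "majority": self._majority}
        with open(os.path.join(dirpath, "model.json"), "w") as handle:
            json.dump(state, handle)

    @classmethod
    def load_from_disk(cls, dirpath):
        with open(os.path.join(dirpath, "model.json")) as handle:
            state = json.load(handle)
        model = cls(state["label_mapping"], state["parameters"])
        model._majority = state["majority"]
        return model


model = MajorityModel({"UNKNOWN": 0, "SSN": 1})
model.fit(["123-45-6789", "987-65-4321", "hello"], ["SSN", "SSN", "UNKNOWN"])
model.details()
```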

Preprocessor Component

In order to create your own preprocessor component for data labeling, you can utilize the BaseDataPreprocessor class from dataprofiler.labelers.data_processing and override the abstract class methods.

Reviewing StructCharPreprocessor from dataprofiler.labelers.data_processing illustrates the functions which need an override.

  1. __init__: passing parameters to the base class and executing any extraneous calculations to be saved as parameters

  2. _validate_parameters: validating parameters given by user during setting

  3. process: takes in the user data and converts it into a digestible, iterable format for the model

  4. set_params (optional): if a parameter requires processing before setting, a user can override this function to assist with setting the parameter

  5. _save_processor (optional): if a parameter is not JSON serializable, a user can override this function to assist in saving the processor and its parameters

  6. load_from_disk (optional): if any parameter is not JSON serializable, a user can override this function to assist in loading the processor
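As an illustration of these responsibilities, the stand-in class below validates its parameters, yields batches from `process`, and handles a parameter that is not JSON serializable (a set) when saving and loading. It does not subclass BaseDataPreprocessor. The class name, the `batch_size` and `strip_chars` parameters, and the `to_json`/`from_json` methods (which play the roles of `_save_processor` and `load_from_disk`) are hypothetical.

```python
import json


class BatchingPreprocessor:
    """Stand-in preprocessor; NOT a dataprofiler class."""

    def __init__(self, batch_size=2, strip_chars=("\t", "\n")):
        params = {"batch_size": batch_size, "strip_chars": set(strip_chars)}
        self._validate_parameters(params)
        self._parameters = params

    def _validate_parameters(self, parameters):
        size = parameters["batch_size"]
        if not isinstance(size, int) or size < 1:
            raise ValueError("batch_size must be a positive integer")

    def process(self, data):
        """Yield lists of cleaned strings, `batch_size` items at a time."""
        strip = self._parameters["strip_chars"]
        batch = []
        for item in data:
            batch.append("".join(ch for ch in str(item) if ch not in strip))
            if len(batch) == self._parameters["batch_size"]:
                yield batch
                batch = []
        if batch:  # final, possibly shorter, batch
            yield batch

    def to_json(self):
        """Serialize parameters; the set is stored as a sorted list."""
        params = dict(self._parameters)
        params["strip_chars"] = sorted(params["strip_chars"])
        return json.dumps(params)

    @classmethod
    def from_json(cls, text):
        """Rebuild the processor; __init__ converts the list back to a set."""
        return cls(**json.loads(text))


preprocessor = BatchingPreprocessor(batch_size=2, strip_chars={"\t"})
for batch in preprocessor.process(["a\tb", 12, "c"]):
    print(batch)
```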

Postprocessor Component

The postprocessor is nearly identical to the preprocessor, except that it processes the output of the model. In order to create your own postprocessor component for data labeling, you can utilize the BaseDataPostprocessor class from dataprofiler.labelers.data_processing and override the abstract class methods.

Reviewing StructCharPostprocessor from dataprofiler.labelers.data_processing illustrates the functions which need an override.

  1. __init__: passing parameters to the base class and executing any extraneous calculations to be saved as parameters

  2. _validate_parameters: validating parameters given by user during setting

  3. process: takes in the output of the model and processes it for output to the user

  4. set_params (optional): if a parameter requires processing before setting, a user can override this function to assist with setting the parameter

  5. _save_processor (optional): if a parameter is not JSON serializable, a user can override this function to assist in saving the processor and its parameters

  6. load_from_disk (optional): if any parameter is not JSON serializable, a user can override this function to assist in loading the processor
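A postprocessor's `process` step can be sketched as turning raw model output into labels a user can read. The stand-in below assumes the model emits one list of integer label indices per cell (one index per character) and reports the most frequent index as the cell's label. That aggregation rule, the class name `CellLabelPostprocessor`, and its parameters are assumptions, not the behavior of StructCharPostprocessor.

```python
from collections import Counter


class CellLabelPostprocessor:
    """Stand-in postprocessor; NOT a dataprofiler class."""

    def __init__(self, label_mapping, default_label="UNKNOWN"):
        params = {"label_mapping": dict(label_mapping),
                  "default_label": default_label}
        self._validate_parameters(params)
        self._parameters = params
        # Reverse lookup; the first label seen wins when indices are shared
        self._index_to_label = {}
        for label, index in label_mapping.items():
            self._index_to_label.setdefault(index, label)

    def _validate_parameters(self, parameters):
        if parameters["default_label"] not in parameters["label_mapping"]:
            raise ValueError("default_label must be in label_mapping")

    def process(self, char_predictions):
        """Return one label name per cell from per-character indices."""
        results = []
        for indices in char_predictions:
            if not indices:  # empty cell: fall back to the default label
                results.append(self._parameters["default_label"])
                continue
            top_index = Counter(indices).most_common(1)[0][0]
            results.append(self._index_to_label[top_index])
        return results


postprocessor = CellLabelPostprocessor({"UNKNOWN": 0, "SSN": 1})
# Made-up per-character indices for three cells (not model output)
print(postprocessor.process([[1, 1, 1, 0], [0, 0], []]))
```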