.. _data_labeling:

Labeler (Sensitive Data)
************************

In this library, the term *data labeling* refers to entity recognition.

Builtin to the data profiler is a classifier which evaluates the complex data types of the dataset.
For structured data, it determines the complex data type of each column. When
running the data profile, it uses the default data labeling model builtin to the
library. However, the data labeler allows users to train their own data labeler
as well.

*Data Labels* are determined per cell for structured data (column/row when 
the *profiler* is used) or at the character level for unstructured data. This
is a list of the default labels.

* UNKNOWN
* ADDRESS
* BAN (bank account number, 10-18 digits)
* CREDIT_CARD
* EMAIL_ADDRESS
* UUID 
* HASH_OR_KEY (md5, sha1, sha256, random hash, etc.)
* IPV4
* IPV6
* MAC_ADDRESS
* PERSON
* PHONE_NUMBER
* SSN
* URL
* US_STATE
* DRIVERS_LICENSE
* DATE
* TIME
* DATETIME
* INTEGER
* FLOAT
* QUANTITY
* ORDINAL


Identify Entities in Structured Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Makes predictions and identifying labels:

.. code-block:: python

    import dataprofiler as dp

    # load data and data labeler
    data = dp.Data("your_data.csv")
    data_labeler = dp.DataLabeler(labeler_type='structured')

    # make predictions and get labels per cell
    predictions = data_labeler.predict(data)

Identify Entities in Unstructured Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Predict which class characters belong to in unstructured text:

.. code-block:: python

    import dataprofiler as dp

    data_labeler = dp.DataLabeler(labeler_type='unstructured')

    # Example sample string, must be in an array (multiple arrays can be passed)
    sample = ["Help\tJohn Macklemore\tneeds\tfood.\tPlease\tCall\t555-301-1234."
              "\tHis\tssn\tis\tnot\t334-97-1234. I'm a BAN: 000043219499392912.\n"]

    # Prediction what class each character belongs to
    model_predictions = data_labeler.predict(
        sample, predict_options=dict(show_confidences=True))

    # Predictions / confidences are at the character level
    final_results = model_predictions["pred"]
    final_confidences = model_predictions["conf"]

It's also possible to change output formats, output similar to a **SpaCy** format:

.. code-block:: python

    import dataprofiler as dp

    data_labeler = dp.DataLabeler(labeler_type='unstructured', trainable=True)

    # Example sample string, must be in an array (multiple arrays can be passed)
    sample = ["Help\tJohn Macklemore\tneeds\tfood.\tPlease\tCall\t555-301-1234."
              "\tHis\tssn\tis\tnot\t334-97-1234. I'm a BAN: 000043219499392912.\n"]

    # Set the output to the NER format (start position, end position, label)
    data_labeler.set_params(
        { 'postprocessor': { 'output_format':'ner', 'use_word_level_argmax':True } } 
    )

    results = data_labeler.predict(sample)

    print(results)

Train a New Data Labeler
~~~~~~~~~~~~~~~~~~~~~~~~

Mechanism for training your own data labeler on their own set of structured data
(tabular):

.. code-block:: python
    
    import dataprofiler as dp

    # Will need one column with a default label of UNKNOWN
    data = dp.Data("your_file.csv")

    data_labeler = dp.train_structured_labeler(
        data=data,
        save_dirpath="/path/to/save/labeler",
        epochs=2
    )

    data_labeler.save_to_disk("my/save/path") # Saves the data labeler for reuse

Load an Existing Data Labeler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mechanism for loading an existing data_labeler:

.. code-block:: python

    import dataprofiler as dp

    data_labeler = dp.DataLabeler(
        labeler_type='structured', dirpath="/path/to/my/labeler")

    # get information about the parameters/inputs/output formats for the DataLabeler
    data_labeler.help()

Extending a Data Labeler with Transfer Learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Extending or changing labels of a data labeler w/ transfer learning:
Note: By default, **a labeler loaded will not be trainable**. In order to load a 
trainable DataLabeler, the user must set `trainable=True` or load a labeler 
using the `TrainableDataLabeler` class.

The following illustrates how to change the labels:

.. code-block:: python

    import dataprofiler as dp

    labels = ['label1', 'label2', ...]  # new label set can also be an encoding dict
    data = dp.Data("your_file.csv")  # contains data with new labels

    # load default structured Data Labeler w/ trainable set to True
    data_labeler = dp.DataLabeler(labeler_type='structured', trainable=True)

    # this will use transfer learning to retrain the data labeler on your new 
    # dataset and labels.
    # NOTE: data must be in an acceptable format for the preprocessor to interpret.
    #       please refer to the preprocessor/model for the expected data format.
    #       Currently, the DataLabeler cannot take in Tabular data, but requires 
    #       data to be ingested with two columns [X, y] where X is the samples and 
    #       y is the labels.
    model_results = data_labeler.fit(x=data['samples'], y=data['labels'], 
                                     validation_split=0.2, epochs=2, labels=labels)

    # final_results, final_confidences are a list of results for each epoch
    epoch_id = 0
    final_results = model_results[epoch_id]["pred"]
    final_confidences = model_results[epoch_id]["conf"]

The following illustrates how to extend the labels:

.. code-block:: python

    import dataprofiler as dp

    new_labels = ['label1', 'label2', ...]
    data = dp.Data("your_file.csv")  # contains data with new labels

    # load default structured Data Labeler w/ trainable set to True
    data_labeler = dp.DataLabeler(labeler_type='structured', trainable=True)

    # this will maintain current labels and model weights, but extend the model's 
    # labels
    for label in new_labels:
        data_labeler.add_label(label)
    
    # NOTE: a user can also add a label which maps to the same index as an existing 
    # label
    # data_labeler.add_label(label, same_as='<label_name>')

    # For a trainable model, the user must then train the model to be able to 
    # continue using the labeler since the model's graph has likely changed
    # NOTE: data must be in an acceptable format for the preprocessor to interpret.
    #       please refer to the preprocessor/model for the expected data format.
    #       Currently, the DataLabeler cannot take in Tabular data, but requires 
    #       data to be ingested with two columns [X, y] where X is the samples and 
    #       y is the labels.
    model_results = data_labeler.fit(x=data['samples'], y=data['labels'], 
                                     validation_split=0.2, epochs=2)

    # final_results, final_confidences are a list of results for each epoch
    epoch_id = 0
    final_results = model_results[epoch_id]["pred"]
    final_confidences = model_results[epoch_id]["conf"]


Changing pipeline parameters:

.. code-block:: python

    import dataprofiler as dp

    # load default Data Labeler
    data_labeler = dp.DataLabeler(labeler_type='structured')

    # change parameters of specific component
    data_labeler.preprocessor.set_params({'param1': 'value1'})

    # change multiple simultaneously.
    data_labeler.set_params({
        'preprocessor':  {'param1': 'value1'},
        'model':         {'param2': 'value2'},
        'postprocessor': {'param3': 'value3'}
    })


Build Your Own Data Labeler
===========================

The DataLabeler has 3 main components: preprocessor, model, and postprocessor. 
To create your own DataLabeler, each one would have to be created or an 
existing component can be reused.

Given a set of the 3 components, you can construct your own DataLabeler:

.. code-block:: python
    from dataprofiler.labelers.base_data_labeler import BaseDataLabeler, \
                                                        TrainableDataLabeler
    from dataprofiler.labelers.character_level_cnn_model import CharacterLevelCnnModel
    from dataprofiler.labelers.data_processing import \
         StructCharPreprocessor, StructCharPostprocessor

    # load a non-trainable data labeler
    model = CharacterLevelCnnModel(...)
    preprocessor = StructCharPreprocessor(...)
    postprocessor = StructCharPostprocessor(...)

    data_labeler = BaseDataLabeler.load_with_components(
        preprocessor=preprocessor, model=model, postprocessor=postprocessor)

    # check for basic compatibility between the processors and the model
    data_labeler.check_pipeline()


    # load trainable data labeler
    data_labeler = TrainableDataLabeler.load_with_components(
        preprocessor=preprocessor, model=model, postprocessor=postprocessor)

    # check for basic compatibility between the processors and the model
    data_labeler.check_pipeline()

Option for swapping out specific components of an existing labeler.

.. code-block:: python

    import dataprofiler as dp
    from dataprofiler.labelers.character_level_cnn_model import \
        CharacterLevelCnnModel
    from dataprofiler.labelers.data_processing import \
        StructCharPreprocessor, StructCharPostprocessor

    model = CharacterLevelCnnModel(...)
    preprocessor = StructCharPreprocessor(...)
    postprocessor = StructCharPostprocessor(...)
    
    data_labeler = dp.DataLabeler(labeler_type='structured')
    data_labeler.set_preprocessor(preprocessor)
    data_labeler.set_model(model)
    data_labeler.set_postprocessor(postprocessor)
    
    # check for basic compatibility between the processors and the model
    data_labeler.check_pipeline()


Model Component
~~~~~~~~~~~~~~~

In order to create your own model component for data labeling, you can utilize 
the `BaseModel` class from `dataprofiler.labelers.base_model` and
overriding the abstract class methods.

Reviewing `CharacterLevelCnnModel` from 
`dataprofiler.labelers.character_level_cnn_model` illustrates the functions 
which need an override. 

#. `__init__`: specifying default parameters and calling base `__init__`
#. `_validate_parameters`: validating parameters given by user during setting
#. `_need_to_reconstruct_model`: flag for when to reconstruct a model (i.e. 
   parameters change or labels change require a model reconstruction)
#. `_construct_model`: initial construction of the model given the parameters
#. `_reconstruct_model`: updates model architecture for new label set while 
   maintaining current model weights
#. `fit`: mechanism for the model to learn given training data
#. `predict`: mechanism for model to make predictions on data
#. `details`: prints a summary of the model construction
#. `save_to_disk`: saves model and model parameters to disk
#. `load_from_disk`: loads model given a path on disk
  
  
Preprocessor Component
~~~~~~~~~~~~~~~~~~~~~~

In order to create your own preprocessor component for data labeling, you can 
utilize the `BaseDataPreprocessor` class 
from `dataprofiler.labelers.data_processing` and override the abstract class 
methods.

Reviewing `StructCharPreprocessor` from 
`dataprofiler.labelers.data_processing` illustrates the functions which 
need an override.

#. `__init__`: passing parameters to the base class and executing any 
   extraneous calculations to be saved as parameters
#. `_validate_parameters`: validating parameters given by user during
   setting
#. `process`: takes in the user data and converts it into an digestible, 
   iterable format for the model
#. `set_params` (optional): if a parameter requires processing before setting,
   a user can override this function to assist with setting the parameter
#. `_save_processor` (optional): if a parameter is not JSON serializable, a 
   user can override this function to assist in saving the processor and its 
   parameters
#. `load_from_disk` (optional): if a parameter(s) is not JSON serializable, a 
   user can override this function to assist in loading the processor 

Postprocessor Component
~~~~~~~~~~~~~~~~~~~~~~~

The postprocessor is nearly identical to the preprocessor except it handles 
the output of the model for processing. In order to create your own 
postprocessor component for data  labeling, you can utilize the 
`BaseDataPostprocessor` class from  `dataprofiler.labelers.data_processing` 
and override the abstract class methods.

Reviewing `StructCharPostprocessor` from 
`dataprofiler.labelers.data_processing` illustrates the functions which 
need an override.

#. `__init__`: passing parameters to the base class and executing any 
   extraneous calculations to be saved as parameters
#. `_validate_parameters`: validating parameters given by user during
   setting
#. `process`: takes in the output of the model and processes for output to 
   the user
#. `set_params` (optional): if a parameter requires processing before setting,
   a user can override this function to assist with setting the parameter 
#. `_save_processor` (optional): if a parameter is not JSON serializable, a 
   user can override this function to assist in saving the processor and its 
   parameters
#. `load_from_disk` (optional): if a parameter(s) is not JSON serializable, a 
   user can override this function to assist in loading the processor