Building a Regex Data Labeler with your own Regex¶
This notebook teaches how to use the existing regex labeler, create your own, and utilize it for structured data profiling:
Loading and utilizing the pre-existing regex data labeler
Replacing the existing regex rules with your own.
Utilizing a regex data labeler inside of the structured profiler
First, let’s import the libraries needed for this example.
[ ]:
import os
import sys
import json
from pprint import pprint
import pandas as pd
try:
import dataprofiler as dp
except ImportError:
sys.path.insert(0, '../..')
import dataprofiler as dp
Loading and using the pre-existing regex data labeler¶
We can easily import the existing regex labeler via the load_from_library command from the dp.DataLabeler. This allows us to import models other than the default structured / unstructured labelers which exist in the library.
[ ]:
data_labeler = dp.DataLabeler.load_from_library('regex_model')
data_labeler.model.help()
[ ]:
pprint(data_labeler.label_mapping)
[ ]:
pprint(data_labeler.model._parameters['regex_patterns'])
Predicting with the regex labeler¶
In the prediction below, the default settings will split
the predictions by default as it’s aggregation function. In other words, if a string ‘123 Fake St.’ The first character would receive a vote for integer and for address giving both a 50% probability. This is because these regex functions are defined individually and a post prediction aggregation function must be used to get the results.
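To make the 50/50 split concrete, here is a minimal sketch of split-style vote aggregation (illustrative only; this is not the library's internal code, and the votes dict below is hypothetical):
[ ]:
# a minimal sketch of "split" aggregation: each label that matched a
# character gets an equal share of that character's probability
votes = {'integer': 1, 'address': 1}  # hypothetical votes for the character '1'
total_votes = sum(votes.values())
probabilities = {label: count / total_votes for label, count in votes.items()}
print(probabilities)  # {'integer': 0.5, 'address': 0.5}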
[ ]:
# evaluate a prediction using the default parameters
data_labeler.predict(['123 Fake St.'])
Replacing the regex rules in the existing labeler¶
We can achieve this by:

1. Setting the label mapping to the new labels
2. Setting the model parameters, which include: regex_patterns, default_label, ignore_case, and encapsulators

Here, regex_patterns is a dict of regex lists for each label, default_label is the expected default label for the regex, ignore_case tells the model whether to ignore case during its detection, and encapsulators are generic regex statements placed before (start) and after (end) each regex. Currently, the default model uses encapsulators to capture labels that appear within a cell rather than requiring a match on the entire cell (e.g. ' 123 ' will still capture 123 as digits).
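For illustration, encapsulators take roughly the following shape. The regex fragments below are hypothetical placeholders, not the model's actual defaults (the real values can be inspected in data_labeler.model._parameters):
[ ]:
# hypothetical encapsulators: 'start' is prepended and 'end' appended to
# every regex pattern so a match need not span the entire cell
example_encapsulators = {
    'start': r'(?<![\w.])',  # assumed: do not begin a match mid-token
    'end': r'(?![\w.])',     # assumed: do not end a match mid-token
}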
Below, we create 4 labels, where other is the default_label. Additionally, we enable case sensitivity so that upper and lower case letters are detected separately.
[ ]:
data_labeler.set_labels({'other': 0, 'digits':1, 'lowercase_char': 2, 'uppercase_chars': 3})
data_labeler.model.set_params(
regex_patterns={
'digits': [r'[+-]?[0-9]+'],
'lowercase_char': [r'[a-z]+'],
'uppercase_chars': [r'[A-Z]+'],
},
default_label='other',
ignore_case=False,
)
data_labeler.label_mapping
Predicting with the new regex labels¶
Here we notice the output of the predictions gives us a prediction per character for each regex. Note how, by default, it matches subtext due to the encapsulators: 123 is found to be digits, FAKE is found to be upper case, and the whitespace and St. are labeled other since no single regex matched them.
[ ]:
data_labeler.predict(['123 FAKE St.'])
Below we turn off case sensitivity and see how the aggregation function splits the votes for alphabetic characters between the lowercase_char and uppercase_chars labels.
[ ]:
data_labeler.model.set_params(ignore_case=True)
data_labeler.predict(['123 FAKE St.'])
For the rest of this notebook, we will just use a single regex which will capture both upper and lower case chars.
[ ]:
data_labeler.set_labels({'other': 0, 'digits':1, 'chars': 2})
data_labeler.model.set_params(
regex_patterns={
        'digits': [r'[+-]?[0-9]+'],
'chars': [r'[a-zA-Z]+'],
},
default_label='other',
ignore_case=False,
)
data_labeler.label_mapping
[ ]:
data_labeler.predict(['123 FAKE St.'])
Adjusting postprocessor properties¶
Below we can look at the possible postprocessor parameters to adjust the aggregation function to the desired output. The previous outputs by default used the split
aggregation function, however, below we will show the random
aggregation function which will randomly choose a label if multiple labels have a vote for a given character.
[ ]:
data_labeler.postprocessor.help()
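Conceptually, the random aggregation behaves like this minimal sketch (illustrative only; not the library's implementation):
[ ]:
import random

# hypothetical tied votes for a single character
votes = {'digits': 1, 'chars': 1}
chosen_label = random.choice(list(votes))  # pick one of the tied labels at random
print(chosen_label)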
[ ]:
data_labeler.postprocessor.set_params(aggregation_func='random')
data_labeler.predict(['123 FAKE St.'], predict_options=dict(show_confidences=True))
Integrating the new Regex labeler into Structured Profiling¶
While the labeler can be used alone, it is also possible to integrate it into the StructuredProfiler with a slight change to its postprocessor. The StructuredProfiler requires a labeler which outputs the confidence of each label for a given cell being processed. To convert the output of the RegexPostProcessor into said format, we will use the StructRegexPostProcessor. We can create the postprocessor and set the data_labeler's postprocessor to this value.
[ ]:
from dataprofiler.labelers.data_processing import StructRegexPostProcessor
postprocessor = StructRegexPostProcessor()
data_labeler.set_postprocessor(postprocessor)
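One way to think about the conversion: per-character votes are pooled into a single per-cell label distribution. A minimal sketch under that assumption (illustrative only; the actual logic lives in StructRegexPostProcessor):
[ ]:
# hypothetical per-character labels for a 4-character cell, pooled into
# one confidence distribution for the whole cell
char_labels = ['digits', 'digits', 'digits', 'other']
cell_confidence = {label: char_labels.count(label) / len(char_labels)
                   for label in sorted(set(char_labels))}
print(cell_confidence)  # {'digits': 0.75, 'other': 0.25}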
Below, we will see that the output is now one vote per sample.
[ ]:
data_labeler.predict(['123 FAKE St.', '123', 'FAKE'], predict_options=dict(show_confidences=True))
Setting the StructuredProfiler's DataLabeler¶
We can create a ProfilerOptions object and set the structured options to use the new data_labeler. We then run the StructuredProfiler with the specified options.
[ ]:
# create and set the option for the regex data labeler to be used at profile time
profile_options = dp.ProfilerOptions()
profile_options.set({'structured_options.data_labeler.data_labeler_object': data_labeler})
# profile the dataset using the suggested regex data labeler
data = pd.DataFrame(
[['123 FAKE St.', 123, 'this'],
[123 , -9, 'IS'],
['...' , +80, 'A'],
['123' , 202, 'raNDom'],
['test' , -1, 'TEST']],
dtype=object)
profiler = dp.Profiler(data, options=profile_options)
Below, we see that the first column is given 3 labels, as it received votes for multiple labels. However, the labeler was confident on the second and third columns, which is why it specified only digits and chars, respectively.
[ ]:
pprint(profiler.report(
dict(output_format='compact',
omit_keys=['data_stats.*.statistics',
'data_stats.*.categorical',
'data_stats.*.order',
'global_stats'])))
Saving the Data Labeler for future use¶
[ ]:
if not os.path.isdir('my_new_regex_labeler'):
os.mkdir('my_new_regex_labeler')
data_labeler.save_to_disk('my_new_regex_labeler')
Loading the saved Data Labeler¶
[ ]:
saved_labeler = dp.DataLabeler.load_from_disk('my_new_regex_labeler')
[ ]:
# ensure the parameters are what we saved
print("label_mapping:")
pprint(saved_labeler.label_mapping)
print("\nmodel parameters:")
pprint(saved_labeler.model._parameters)
print()
print("postprocessor: " + saved_labeler.postprocessor.__class__.__name__)
[ ]:
# predicting with the loaded labeler.
saved_labeler.predict(['test', '123'])