
Sensitive data detection using the Data Labeler component of the Data Profiler

In this example, we utilize the Data Labeler component of the Data Profiler to detect sensitive information in both structured and unstructured data. In addition, we show how to train the Data Labeler on a specific dataset with a different list of entities.

First, let’s import the libraries needed for this example.

[ ]:
import os
import sys
import json
import pandas as pd
sys.path.insert(0, '..')
import dataprofiler as dp

Structured data

We use the AWS honeypot dataset in the test folder for this example. First, we look at the data using the Data Reader class of the Data Profiler.

[ ]:
data = dp.Data("../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv")
df_data = data.data
df_data.head()

The data contains 16 columns, each of which has a different data label. Next, we use the Data Labeler of the Data Profiler to predict the label for each column in this tabular dataset. To use only the labeling functionality, the other options of the Data Profiler are disabled.

[ ]:
# set option to only run the data labeler
profile_options = dp.ProfilerOptions()
profile_options.set({"text.is_enabled": False,
                     "int.is_enabled": False,
                     "float.is_enabled": False,
                     "order.is_enabled": False,
                     "category.is_enabled": False,
                     "datetime.is_enabled": False,})

profile = dp.Profiler(data, profiler_options=profile_options)

The results are shown below:

[ ]:
# get the prediction from the data profiler
def get_structured_results(results):
    columns = []
    predictions = []
    for col in results['data_stats']:
        columns.append(col)
        predictions.append(results['data_stats'][col]['data_label'])

    df_results = pd.DataFrame({'Column': columns, 'Prediction': predictions})
    return df_results

results = profile.report()
print(get_structured_results(results))

The results show that the Data Profiler is able to detect sensitive information such as datetime, ipv4, and address.
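
Under the hood, the Profiler relies on a pre-trained structured Data Labeler. If you only need labels for a single column, the labeler can also be loaded and called directly. The sketch below is not part of the original tutorial: it assumes the default structured labeler and simply prints the first few per-value predictions; the exact shape of the output depends on the labeler's postprocessor.

[ ]:
# a minimal sketch: load the pre-trained structured labeler directly and
# predict labels for the values of one column (cast to strings first)
structured_labeler = dp.DataLabeler(labeler_type='structured')

sample_values = df_data.iloc[:, 0].astype(str).tolist()
column_predictions = structured_labeler.predict(sample_values)

# the output format depends on the labeler's postprocessor
print(column_predictions['pred'][:10])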

Train the Data Labeler from scratch

The Data Labeler can be trained from scratch with a new list of labels. Below, we show an example of training the Data Labeler on a dataset whose labels are given by the column names of that dataset.

[ ]:
# the column 'comment' is changed to UNKNOWN, as the data labeler requires at least one column with label UNKNOWN
df = data.data.rename({'comment': 'UNKNOWN'}, axis=1)

# split data into training and test sets
split_ratio = 0.2
df = df.sample(frac=1).reset_index(drop=True)
data_train = df[:int((1 - split_ratio) * len(df))]
data_test = df[int((1 - split_ratio) * len(df)):]

# train a new data labeler with column names as labels
if not os.path.exists('data_labeler_saved'):
    os.makedirs('data_labeler_saved')

data_labeler = dp.train_structured_labeler(
    data=data_train,
    save_dirpath="data_labeler_saved",
    epochs=10
)

The trained Data Labeler is then used by the Data Profiler to provide predictions on the new dataset.

[ ]:
# predict with the data labeler object
profile_options.set({'data_labeler.data_labeler_object': data_labeler})
profile = dp.Profiler(data_test, profiler_options=profile_options)

# get the prediction from the data profiler
results = profile.report()
print(get_structured_results(results))

Another way to use the trained Data Labeler is through the directory path of the saved labeler.

[ ]:
# predict with the data labeler loaded from path
profile_options.set({'data_labeler.data_labeler_dirpath': 'data_labeler_saved'})
profile = dp.Profiler(data_test, profiler_options=profile_options)

# get the prediction from the data profiler
results = profile.report()
print(get_structured_results(results))
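
Alternatively, the saved labeler can be reloaded from disk as an object and reused outside of the directory-path option. The sketch below is an assumption-laden illustration: it assumes dp.DataLabeler.load_from_disk accepts the directory used above, and it simply reuses the earlier object-based option.

[ ]:
# a minimal sketch: reload the saved labeler from disk as an object
# (assuming dp.DataLabeler.load_from_disk accepts the directory used above)
loaded_labeler = dp.DataLabeler.load_from_disk('data_labeler_saved')

# reuse it through the object-based option shown earlier
profile_options.set({'data_labeler.data_labeler_object': loaded_labeler})
profile = dp.Profiler(data_test, profiler_options=profile_options)
print(get_structured_results(profile.report()))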

Unstructured data

Besides structured data, the Data Profiler also detects sensitive information in unstructured text. We use a sample spam email from the Enron email dataset for this demo. As above, we start by investigating the content of the given email sample.

[ ]:
# load data
data = "Message-ID: <11111111.1111111111111.JavaMail.evans@thyme>\n" + \
        "Date: Fri, 10 Aug 2005 11:31:37 -0700 (PDT)\n" + \
        "From: w..smith@company.com\n" + \
        "To: john.smith@company.com\n" + \
        "Subject: RE: ABC\n" + \
        "Mime-Version: 1.0\n" + \
        "Content-Type: text/plain; charset=us-ascii\n" + \
        "Content-Transfer-Encoding: 7bit\n" + \
        "X-From: Smith, Mary W. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=SSMITH>\n" + \
        "X-To: Smith, John </O=ENRON/OU=NA/CN=RECIPIENTS/CN=JSMITH>\n" + \
        "X-cc: \n" + \
        "X-bcc: \n" + \
        "X-Folder: \SSMITH (Non-Privileged)\Sent Items\n" + \
        "X-Origin: Smith-S\n" + \
        "X-FileName: SSMITH (Non-Privileged).pst\n\n" + \
        "All I ever saw was the e-mail from the office.\n\n" + \
        "Mary\n\n" + \
        "-----Original Message-----\n" + \
        "From:   Smith, John  \n" + \
        "Sent:   Friday, August 10, 2005 13:07 PM\n" + \
        "To:     Smith, Mary W.\n" + \
        "Subject:        ABC\n\n" + \
        "Have you heard any more regarding the ABC sale? I guess that means that " + \
        "it's no big deal here, but you think they would have send something.\n\n\n" + \
        "John Smith\n" + \
        "123-456-7890\n"

# convert string data to list to feed into the data labeler
data = [data]
print(data[0])

By default, the Data Profiler predicts the results at the character level for unstructured text.

[ ]:
data_labeler = dp.DataLabeler(labeler_type='unstructured')

# make predictions and get labels per character
predictions = data_labeler.predict(data)

# display results
print(predictions['pred'])

In addition to the character-level results, the Data Profiler provides results at the word level following the standard NER (Named Entity Recognition) format, e.g., as utilized by spaCy.

[ ]:
# convert predictions to word level and set the output to the NER format:
# (start position, end position, label)
data_labeler.set_params(
    { 'postprocessor': { 'output_format':'ner', 'use_word_level_argmax':True } }
)

# make predictions and get word-level labels in NER format
predictions = data_labeler.predict(data)

# display results
print('\n')
print('=======================Prediction======================\n')
for pred in predictions['pred'][0]:
    print('{}: {}'.format(data[0][pred[0]: pred[1]], pred[2]))
    print('--------------------------------------------------------')

Here, the Data Profiler is able to identify sensitive information such as datetimes, email addresses, person names, and phone numbers in the email sample.
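
Because each prediction is a (start position, end position, label) tuple over the original string, the detected spans can easily be grouped by label for downstream review or redaction. The sketch below is illustrative only; the specific label names used (e.g., EMAIL_ADDRESS, PHONE_NUMBER, DATETIME) are assumptions and depend on the labeler's label mapping.

[ ]:
from collections import defaultdict

# a minimal sketch: group detected spans by their predicted label
entities_by_label = defaultdict(list)
for pred in predictions['pred'][0]:
    entities_by_label[pred[2]].append(data[0][pred[0]:pred[1]])

# label names below are illustrative; check the labeler's label mapping
# (e.g. data_labeler.label_mapping) for the actual set of labels
for label in ('EMAIL_ADDRESS', 'PHONE_NUMBER', 'DATETIME'):
    print(label, entities_by_label.get(label, []))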

In summary, the Data Profiler open source library can be used to scan for sensitive information in both structured and unstructured data across different file types. It supports multiple input formats and provides output at both the word and character levels. Users can also train the Data Labeler on their own datasets.