Unstructured Profilers¶
Data profiling is the process of examining a dataset and collecting statistical or informational summaries about it.
The Profiler class inside the DataProfiler is designed to generate data profiles; it ingests either a Data class or a Pandas DataFrame.
Currently, the Data class supports loading the following file formats:
Any delimited (CSV, TSV, etc.)
JSON object
Avro
Parquet
Text files
Pandas Series/Dataframe
Once the data is loaded, the Profiler can calculate statistics and predict the entities (via the Labeler) of every column (CSV) or key-value pair (JSON), as well as dataset-wide information such as the number of nulls, duplicates, etc.
This example looks specifically at the unstructured data types for unstructured profiling. This means that only text files, lists of strings, single-column pandas dataframes/series, or DataProfiler Data objects in string format will work with the unstructured profiler.
Reporting¶
One of the primary purposes of the Profiler is to quickly identify what is in the dataset. This can be useful for analyzing a dataset prior to use or for determining which columns could be useful for a given purpose.
In terms of reporting, there are multiple reporting options:
Pretty: Floats are rounded to four decimal places, and lists are shortened.
Compact: Similar to Pretty, but detailed statistics are removed.
Serializable: Output is JSON serializable and not prettified.
Flat: Nested output is returned as a flattened dictionary.
The Pretty and Compact reports are the two most commonly used reports and include global_stats and data_stats for the given dataset. global_stats contains overall properties of the data, such as samples used and file encoding. data_stats contains specific properties and statistics for each text sample.
For unstructured profiles, the report looks like this:
"global_stats": {
    "samples_used": int,
    "empty_line_count": int,
    "file_type": string,
    "encoding": string
},
"data_stats": {
    "data_label": {
        "entity_counts": {
            "word_level": dict(int),
            "true_char_level": dict(int),
            "postprocess_char_level": dict(int)
        },
        "times": dict(float)
    },
    "statistics": {
        "vocab": list(char),
        "words": list(string),
        "word_count": dict(int),
        "times": dict(float)
    }
}
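The "flat" report option collapses this nested structure into a single-level dictionary. A minimal sketch of that flattening idea, using a hypothetical `flatten` helper (illustrative only, not the library's implementation):

```python
def flatten(nested, parent_key=()):
    """Recursively flatten a nested dict into tuple-keyed entries."""
    flat = {}
    for key, value in nested.items():
        full_key = parent_key + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

# Toy report shaped like the schema above
report = {"global_stats": {"samples_used": 3, "encoding": "utf-8"},
          "data_stats": {"statistics": {"word_count": {"hello": 2}}}}
flat_report = flatten(report)
print(flat_report[("global_stats", "samples_used")])  # 3
```

Each key in the flat result is the full path through the nesting, which makes the report easy to load into tabular tools.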
[ ]:
import os
import sys
import json

try:
    sys.path.insert(0, '..')
    import dataprofiler as dp
except ImportError:
    import dataprofiler as dp

data_path = "../dataprofiler/tests/data"

# remove extra tf logging
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
[ ]:
data = dp.Data(os.path.join(data_path, "txt/discussion_reddit.txt"))
profile = dp.Profiler(data)
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))
Profiler Type¶
It should be noted that, in addition to reading the input data from text files, DataProfiler accepts the input data as a pandas dataframe, a pandas series, a list, or a Data object (when an unstructured format is selected), provided the Profiler is explicitly set to unstructured.
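Conceptually, choosing the profiler type is a dispatch on the input: an explicit choice wins, otherwise the input's shape suggests one. A toy sketch of that idea (the function name and logic here are assumptions for illustration, not DataProfiler's actual dispatch):

```python
def infer_profiler_type(data, profiler_type=None):
    """Toy dispatch: explicit profiler_type wins; else infer from input shape.
    Illustrative only -- not DataProfiler's internal logic."""
    if profiler_type is not None:
        return profiler_type
    # Raw text or a list of strings reads as unstructured
    if isinstance(data, str) or (isinstance(data, list)
                                 and all(isinstance(x, str) for x in data)):
        return "unstructured"
    # Anything columnar (dict of columns, DataFrame, ...) reads as structured
    return "structured"

print(infer_profiler_type(["some text", "more text"]))       # unstructured
print(infer_profiler_type({"col": [1, 2]}))                  # structured
print(infer_profiler_type({"col": [1, 2]}, "unstructured"))  # unstructured
```

This is why the CSV example below still produces an unstructured profile: the explicit `profiler_type='unstructured'` overrides what the file format would suggest.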
[ ]:
# run data profiler and get the report
import pandas as pd
data = dp.Data(os.path.join(data_path, "csv/SchoolDataSmall.csv"), options={"data_format": "records"})
profile = dp.Profiler(data, profiler_type='unstructured')
report = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(report, indent=4))
Profiler options¶
The DataProfiler has the ability to turn components on and off as needed. This is accomplished via the ProfilerOptions class.
For example, a user who doesn't require vocab information may wish to turn off the vocab functionality.
Below, let's remove the vocab count and set the stop words.
A full list of options is available in the Profiler section of the DataProfiler documentation.
[ ]:
data = dp.Data(os.path.join(data_path, "txt/discussion_reddit.txt"))
profile_options = dp.ProfilerOptions()
# Setting multiple options via set
profile_options.set({ "*.vocab.is_enabled": False, "*.is_case_sensitive": True })
# Set options via directly setting them
profile_options.unstructured_options.text.stop_words = ["These", "are", "stop", "words"]
profile = dp.Profiler(data, options=profile_options)
report = profile.report(report_options={"output_format": "compact"})
# Print the report
print(json.dumps(report, indent=4))
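Conceptually, the stop-word and case-sensitivity options act like the following pure-Python sketch; the helper name and tokenization are assumptions for illustration, not the library's internals:

```python
import re
from collections import Counter

def word_counts(text, stop_words, case_sensitive=True):
    """Count words, skipping stop words; toy version of the text statistics."""
    words = re.findall(r"\w+", text)
    if not case_sensitive:
        # Normalize both the tokens and the stop words
        words = [w.lower() for w in words]
        stop_words = {w.lower() for w in stop_words}
    else:
        stop_words = set(stop_words)
    return Counter(w for w in words if w not in stop_words)

counts = word_counts("These are stop words but profiling is not",
                     stop_words=["These", "are", "stop", "words"])
print(counts["profiling"])  # 1
print("These" in counts)    # False
```

With `case_sensitive=True`, "These" is filtered but a lowercase "these" would not be, which is why the example above sets both options together.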
Updating Profiles¶
Beyond just profiling, one of the unique aspects of the DataProfiler is the ability to update profiles. To update correctly, the schema (columns / keys) of the new data must match that of the existing profile.
[ ]:
# Load and profile a text file
data = dp.Data(os.path.join(data_path, "txt/sentence-3x.txt"))
profile = dp.Profiler(data)
# Update the profile with new data:
new_data = dp.Data(os.path.join(data_path, "txt/sentence-3x.txt"))
profile.update_profile(new_data)
# Take a peek at the data
print(data.data)
print(new_data.data)
# Report the compact version of the profile
report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report, indent=4))
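Under the hood, updating a profile amounts to folding new observations into running aggregates. A toy sketch of that idea (the class and its fields are assumptions for illustration, not DataProfiler's implementation):

```python
from collections import Counter

class ToyTextProfile:
    """Toy running aggregate -- illustrative, not the library's class."""
    def __init__(self):
        self.samples_used = 0
        self.word_count = Counter()

    def update(self, lines):
        # Fold new samples into the existing statistics
        self.samples_used += len(lines)
        for line in lines:
            self.word_count.update(line.split())

profile = ToyTextProfile()
profile.update(["this is a test", "this is another test"])
profile.update(["this is a test"])       # a second update accumulates
print(profile.samples_used)              # 3
print(profile.word_count["this"])        # 3
```

Because each statistic is accumulable, updating with new data never requires re-reading the old data.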
Merging Profiles¶
Merging profiles is an alternative method for updating profiles. In particular, multiple profiles can be generated separately, then added together with a simple + command: profile3 = profile1 + profile2
[ ]:
# Load a text file
data1 = dp.Data(os.path.join(data_path, "txt/sentence-3x.txt"))
profile1 = dp.Profiler(data1)
# Load another text file with the same schema
data2 = dp.Data(os.path.join(data_path, "txt/sentence-3x.txt"))
profile2 = dp.Profiler(data2)
# Merge the profiles
profile3 = profile1 + profile2
# Report the compact version of the profile
report = profile3.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))
As you can see, the update_profile function and the + operator function similarly. The reason the + operator is important is that it's possible to save and load profiles, which we cover next.
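Merging works because statistics like counts simply add together. A toy sketch of a mergeable profile with an `__add__` method (the class and fields are assumptions for illustration, not the library's implementation):

```python
from collections import Counter

class ToyProfile:
    """Toy mergeable profile -- illustrative, not DataProfiler's class."""
    def __init__(self, texts=()):
        self.samples_used = len(texts)
        self.word_count = Counter(w for t in texts for w in t.split())

    def __add__(self, other):
        # Merged statistics are just the sums of the parts
        merged = ToyProfile()
        merged.samples_used = self.samples_used + other.samples_used
        merged.word_count = self.word_count + other.word_count
        return merged

p1 = ToyProfile(["a b a"])
p2 = ToyProfile(["a c"])
p3 = p1 + p2
print(p3.word_count["a"])  # 3
print(p3.samples_used)     # 2
```

Since addition never touches the original inputs, the profiles being merged can come from different files, machines, or points in time.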
Differences in Data¶
The difference report can be applied to both structured and unstructured datasets.
Such reports can provide details on the differences between training and validation data, as in this pseudo example:
profiler_training = dp.Profiler(training_data)
profiler_testing = dp.Profiler(testing_data)
validation_report = profiler_training.diff(profiler_testing)
[ ]:
from pprint import pprint
# unstructured differences example
data_split_differences = profile1.diff(profile2)
pprint(data_split_differences)
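Conceptually, a diff walks the two reports key by key, subtracting numeric statistics and flagging matching values. A minimal sketch of that idea over plain stats dicts (the function name and output convention are assumptions for illustration, not DataProfiler's diff output):

```python
def diff_stats(stats1, stats2):
    """Toy per-key difference of two flat stats dicts.
    Illustrative only -- not DataProfiler's diff format."""
    diff = {}
    for key in stats1.keys() | stats2.keys():
        v1, v2 = stats1.get(key), stats2.get(key)
        if v1 == v2:
            diff[key] = "unchanged"
        elif isinstance(v1, (int, float)) and isinstance(v2, (int, float)):
            diff[key] = v1 - v2          # numeric stats subtract
        else:
            diff[key] = [v1, v2]         # otherwise show both values
    return diff

print(diff_stats({"samples_used": 10, "encoding": "utf-8"},
                 {"samples_used": 8, "encoding": "utf-8"}))
```

Marking identical values explicitly makes it easy to scan a large report for only the statistics that moved between splits.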
Saving and Loading a Profile¶
Not only can the Profiler create and update profiles, it's also possible to save, load, and then manipulate profiles.
[ ]:
# Load data
data = dp.Data(os.path.join(data_path, "txt/sentence-3x.txt"))
# Generate a profile
profile = dp.Profiler(data)
# Save a profile to disk for later (saves as pickle file)
profile.save(filepath="my_profile.pkl")
# Load a profile from disk
loaded_profile = dp.Profiler.load("my_profile.pkl")
# Report the compact version of the profile
report = profile.report(report_options={"output_format":"compact"})
print(json.dumps(report, indent=4))
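Since profiles are saved as pickle files, the save/load pattern is an ordinary pickle round-trip. A minimal stdlib sketch of the same pattern over a plain stats dict (illustrative; the filename is an assumption):

```python
import os
import pickle
import tempfile

stats = {"samples_used": 3, "word_count": {"sentence": 3}}

# Save to disk, then load it back -- same pattern as profile.save / Profiler.load
path = os.path.join(tempfile.gettempdir(), "toy_profile.pkl")
with open(path, "wb") as f:
    pickle.dump(stats, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded == stats)  # True
```

As with any pickle file, only load profiles from sources you trust, since unpickling can execute arbitrary code.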
With the ability to save and load profiles, profiles can be generated via multiple machines then merged. Further, profiles can be stored and later used in applications such as change point detection, synthetic data generation, and more.
[ ]:
# Load multiple files via the Data class
filenames = ["txt/sentence-3x.txt",
             "txt/sentence.txt"]
data_objects = []
for filename in filenames:
    data_objects.append(dp.Data(os.path.join(data_path, filename)))
print(data_objects)

# Generate and save profiles
for i in range(len(data_objects)):
    profile = dp.Profiler(data_objects[i])
    report = profile.report(report_options={"output_format": "compact"})
    print(json.dumps(report, indent=4))
    profile.save(filepath="data-" + str(i) + ".pkl")

# Load profiles and add them together
profile = None
for i in range(len(data_objects)):
    if profile is None:
        profile = dp.Profiler.load("data-" + str(i) + ".pkl")
    else:
        profile += dp.Profiler.load("data-" + str(i) + ".pkl")

# Report the compact version of the profile
report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(report, indent=4))