Data Profiler - What’s in your data?¶
This introductory Jupyter notebook demonstrates the basic usage of the Data Profiler. The library is designed to easily detect sensitive data and gather statistics on your datasets with just a few lines of code. The Data Profiler can handle several different data types including: CSV (or any delimited file), JSON, Parquet, AVRO, and text. Additionally, there are a plethora of options to customize your profile. This library also has the ability to update profiles from multiple batches of large datasets, or merge multiple profiles. In particular, this example covers the following:
- Basic usage of the Data Profiler
- The data reader class
- Profiler options
- Updating profiles and merging profiles
First, let’s import the libraries needed for this example.
[ ]:
import os
import sys
import json
import pandas as pd
import matplotlib.pyplot as plt
sys.path.insert(0, '..')
import dataprofiler as dp
data_path = "../dataprofiler/tests/data"
Basic Usage of the Data Profiler¶
This section shows a basic example of the Data Profiler. A CSV dataset is read using the data reader, then the resulting Data object is given to the Data Profiler to detect sensitive data and obtain the statistics.
[ ]:
# use data reader to read input data
data = dp.Data(os.path.join(data_path, "csv/aws_honeypot_marx_geo.csv"))
print(data.data.head())
# run data profiler and get the report
profile = dp.Profiler(data)
report = profile.report(report_options={"output_format":"compact"})
# print the report
print(json.dumps(report, indent=4))
The report includes global_stats and data_stats for the given dataset. The former contains overall properties of the data such as the number of rows/columns, null ratio, and duplicate ratio, while the latter contains specific properties and statistics for each column such as the detected data label, min, max, mean, variance, etc. In this example, the compact format of the report is used to shorten the full list of results. To get more detailed results, such as entity-level predictions from the Data Labeler component or histogram results, the pretty format should be used.
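For instance, reusing the profile object from the cell above, the report can be regenerated in the pretty format and a single section such as global_stats inspected on its own (a minimal sketch, assuming the same report structure shown above):
[ ]:
# regenerate the report in the "pretty" format and inspect only the global statistics
report_pretty = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report_pretty["global_stats"], indent=4))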
Data reader class¶
DataProfiler can detect multiple file types including CSV (or any delimited file), JSON, Parquet, AVRO, and text. The example below shows that it successfully detects data types from multiple categories regardless of the file extensions.
[ ]:
# use data reader to read input data with different file types
csv_files = [
"csv/aws_honeypot_marx_geo.csv",
"csv/all-strings-skip-header-author.csv", # csv files with the author/description on the first line
"csv/sparse-first-and-last-column-empty-first-row.txt", # csv file with the .txt extension
]
json_files = [
"json/complex_nested.json",
"json/honeypot_intentially_mislabeled_file.csv", # json file with the .csv extension
]
parquet_files = [
"parquet/nation.dict.parquet",
"parquet/nation.plain.intentionally_mislabled_file.csv", # parquet file with the .csv extension
]
avro_files = [
"avro/userdata1.avro",
"avro/userdata1_intentionally_mislabled_file.json", # avro file with the .json extension
]
text_files = [
"txt/discussion_reddit.txt",
]
all_files = {
"csv": csv_files,
"json": json_files,
"parquet": parquet_files,
"avro": avro_files,
"text": text_files
}
for file_type in all_files:
    print(file_type)
    for file in all_files[file_type]:
        data = dp.Data(os.path.join(data_path, file))
        print("{:<85} {:<15}".format(file, data.data_type))
    print("\n")
The Data class detects the file type and delegates to one of the following classes: CSVData, JSONData, ParquetData, AVROData, or TextData. Users can call these specific classes directly if desired. For example, below we provide a collection of files of different types, each processed by the corresponding data class.
[ ]:
# use individual data reader classes
from dataprofiler.data_readers.csv_data import CSVData
from dataprofiler.data_readers.json_data import JSONData
from dataprofiler.data_readers.parquet_data import ParquetData
from dataprofiler.data_readers.avro_data import AVROData
from dataprofiler.data_readers.text_data import TextData
csv_files = "csv/aws_honeypot_marx_geo.csv"
json_files = "json/complex_nested.json"
parquet_files = "parquet/nation.dict.parquet"
avro_files = "avro/userdata1.avro"
text_files = "txt/discussion_reddit.txt"
all_files = {
"csv": [csv_files, CSVData],
"json": [json_files, JSONData],
"parquet": [parquet_files, ParquetData],
"avro": [avro_files, AVROData],
"text": [text_files, TextData],
}
for file_type in all_files:
    file, data_reader = all_files[file_type]
    data = data_reader(os.path.join(data_path, file))
    print("File name {}\n".format(file))
    if file_type == "text":
        print(data.data[0][:1000])  # print the first 1000 characters
    else:
        print(data.data)
    print('===============================================================================')
In addition to reading the input data from multiple file types, the Data Profiler also accepts a pandas DataFrame directly as input.
[ ]:
# run data profiler and get the report
my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]], columns=["col_int", "col_float"])
profile = dp.Profiler(my_dataframe)
report = profile.report(report_options={"output_format":"compact"})
# Print the report
print(json.dumps(report, indent=4))
Structured Profiler vs. Unstructured Profiler¶
The profiler will infer what type of statistics to generate (structured or unstructured) based on the input. However, you can explicitly specify the profile type as well. Here is an example of explicitly calling the structured profiler and the unstructured profiler.
[ ]:
# Using the structured profiler
data = dp.Data(os.path.join(data_path, "csv/aws_honeypot_marx_geo.csv"))
profile = dp.Profiler(data, profiler_type='structured')
report = profile.report(report_options={"output_format": "pretty"})
print(json.dumps(report, indent=4))
# Using the unstructured profiler
my_dataframe = pd.DataFrame([["Sample1"],["Sample2"],["Sample3"]], columns=["Text_Samples"])
profile = dp.Profiler(my_dataframe, profiler_type='unstructured')
report = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(report, indent=4))
Profiler options¶
The Data Profiler can enable/disable statistics and modify features through profiler options. For example, if users only want the statistical information, they may turn off the Data Labeler functionality. Below, let's disable the histogram and the Data Labeler components while running the Data Profiler.
[ ]:
profile_options = dp.ProfilerOptions()
profile_options.set({"histogram_and_quantiles.is_enabled": False,
                     "data_labeler.is_enabled": False})
profile = dp.Profiler(my_dataframe, options=profile_options)
report = profile.report(report_options={"output_format":"pretty"})
# Print the report
print(json.dumps(report, indent=4))
Besides toggling features on and off, other options like the data labeler sample size or histogram bin method can be set directly and validated as shown here:
[ ]:
profile_options = dp.ProfilerOptions()
profile_options.structured_options.data_labeler.sample_size = 1
profile_options.structured_options.int.histogram_and_quantiles.bin_count_or_method = "rice"
# An error will raise if the options are set incorrectly.
profile_options.validate()
profile = dp.Profiler(my_dataframe, options=profile_options)
report = profile.report(report_options={"output_format":"pretty"})
# Print the report
print(json.dumps(report, indent=4))
Update profiles¶
One of the interesting features of the Data Profiler is the ability to update profiles from batches of data, which enables data streaming use cases. In this section, the original dataset is split into two batches of equal size, and the profile is updated with each batch sequentially.
After the update, we expect the resulting profile to give the same statistics as a profile computed from the full dataset. We will verify that through several properties in global_stats of the profiles, including column_count, row_count, row_is_null_ratio, and duplicate_row_count.
[ ]:
# read the input data and divide it into two equal halves
data = dp.Data(os.path.join(data_path, "csv/aws_honeypot_marx_geo.csv"))
df = data.data
df1 = df.iloc[:int(len(df)/2)]
df2 = df.iloc[int(len(df)/2):]
# Profile the first half
profile = dp.Profiler(df1)
# Update the profile with the second half
profile.update_profile(df2)
# Profile the full dataset for comparison
profile_full = dp.Profiler(df)
report = profile.report(report_options={"output_format":"compact"})
report_full = profile_full.report(report_options={"output_format":"compact"})
# print the report
print(json.dumps(report, indent=4))
print(json.dumps(report_full, indent=4))
You can see that the profiles are exactly the same whether the data is profiled in several batches or all at once.
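As a quick sanity check (a small sketch, assuming the global_stats keys listed above appear in both reports), the individual properties can also be compared side by side:
[ ]:
# compare selected global statistics between the batched profile and the full profile
for key in ["column_count", "row_count", "row_is_null_ratio", "duplicate_row_count"]:
    print("{:<25} {:<15} {:<15}".format(
        key, str(report["global_stats"][key]), str(report_full["global_stats"][key])))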
Merge profiles¶
In addition to profile updates, the Data Profiler provides merging functionality, which allows users to combine profiles computed in multiple locations. This enables the Data Profiler to be used in a distributed computing environment. Below, we assume that the two aforementioned halves of the original dataset come from two different machines. Each half is profiled on its own machine, and the resulting profiles are then merged.
As with the profile update, we expect the merged profile to give the same statistics as a profile computed from the full dataset.
[ ]:
# Profile the first half
profile1 = dp.Profiler(df1)
# Profile the second half
profile2 = dp.Profiler(df2)
# merge profiles
profile_merge = profile1 + profile2
# check results of the merged profile
report_merge = profile_merge.report(report_options={"output_format":"compact"})
# print the report
print(json.dumps(report_merge, indent=4))
print(json.dumps(report_full, indent=4))
You can see that the profiles are exactly the same!
Conclusion¶
We have walked through some basic examples of Data Profiler usage with different input data types and profiling options. We also worked with the update and merge functionality of the Data Profiler, which makes it applicable to data streaming and distributed environments. Interested users can try different datasets and functionalities as desired.