Profile Builder
Build a model for a dataset by identifying the type of each column along with its respective parameters.
- class dataprofiler.profilers.profile_builder.StructuredColProfiler(df_series=None, sample_size=None, min_sample_size=5000, sampling_ratio=0.2, min_true_samples=None, sample_ids=None, pool=None, options=None)
Bases:
object
Instantiate the StructuredColProfiler class for a given column.
- Parameters
df_series (pandas.core.series.Series) – Data to be profiled
sample_size (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
pool (multiprocessing.Pool) – pool utilized for multiprocessing
options (StructuredOptions Object) – Options for the structured profiler.
- update_column_profilers(clean_sampled_df, pool)
Calculates type statistics and labels the dataset
- Parameters
clean_sampled_df (pandas.Series) – sampled series with null values dropped
pool (multiprocessing.Pool) – pool utilized for multiprocessing
- diff(other_profile, options=None)
Finds the difference between two StructuredColProfilers and returns the report
- Parameters
other_profile (StructuredColProfiler) – structured column profiler to compare against this one
options (dict) – options to change results of the difference
- Returns
difference of the structured column
- Return type
dict
- report(remove_disabled_flag=False)
- property profile
- update_profile(df_series, sample_size=None, min_true_samples=None, sample_ids=None, pool=None)
Update the column profiler
- Parameters
df_series (pandas.core.series.Series) – Data to be profiled
sample_size (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
pool (multiprocessing.Pool) – pool utilized for multiprocessing
- static clean_data_and_get_base_stats(df_series, sample_size, null_values=None, min_true_samples=None, sample_ids=None)
Identifies null values in the column, returns them in a dictionary, and removes them from the column.
- Parameters
df_series (pandas.core.series.Series) – a given column
sample_size (int) – Number of samples to use in generating the profile
null_values (dict[str, re.FLAG]) – Dictionary mapping null values to regex flags, where each key is a null value to remove from the data and each value is the regex flag to apply
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
- Returns
updated column with nulls removed, and a dictionary of null parameters
- Return type
pd.Series, dict
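The null-handling step described for clean_data_and_get_base_stats can be pictured with a small sketch. This is not the library's implementation: it works on a plain Python list rather than a pandas Series, it ignores sampling, and the helper name `split_nulls` and the example null patterns are assumptions made for illustration. It only shows how a null_values mapping (pattern to regex flag) could produce both a cleaned column and a dictionary of null positions.

```python
import re


def split_nulls(values, null_values):
    """Separate null-like entries from a column (illustrative only).

    values: list of cell values (strings or None).
    null_values: dict mapping a regex pattern for a null value to the
        regex flag to apply when matching it.
    Returns (clean_values, null_positions), where null_positions maps each
    null string that was found to the list of indices where it occurred.
    """
    # Compile each pattern once with its own flag.
    compiled = [re.compile(pattern, flags) for pattern, flags in null_values.items()]

    clean_values = []
    null_positions = {}
    for index, value in enumerate(values):
        text = "None" if value is None else str(value)
        # fullmatch() requires the whole cell to match the null pattern.
        if value is None or any(rx.fullmatch(text) for rx in compiled):
            null_positions.setdefault(text, []).append(index)
        else:
            clean_values.append(value)
    return clean_values, null_positions


# Hypothetical null patterns: empty or whitespace-only cells, and a
# case-insensitive "nan".
example_nulls = {r"\s*": 0, r"nan": re.IGNORECASE}
clean, nulls = split_nulls(["1", "", "NaN", "7", None, "nan"], example_nulls)
```

Here `clean` keeps only the two real values, while `nulls` records where each null-like string appeared.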
- class dataprofiler.profilers.profile_builder.BaseProfiler(data, samples_per_update=None, min_true_samples=0, options=None)
Bases:
object
Instantiate the BaseProfiler class
- Parameters
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- Returns
Profiler
- diff(other_profile, options=None)
Finds the difference of two profiles
- Parameters
other_profile (BaseProfiler) – profile to compare against this one
- Returns
diff of the two profiles
- Return type
dict
- property profile
Returns the stored profiles for the given profiler.
- Returns
None
- report(report_options=None)
Returns the profile report based on all profiled data fed into the profiler. The user can specify the output_format: pretty, compact, serializable, or flat.
- Pretty: floats are rounded to four decimal places, and lists are shortened.
- Compact: similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
- Serializable: output is JSON serializable and not prettified.
- Flat: nested output is returned as a flattened dictionary.
- Variables
report_options – optional format changes to the report, given as dict(output_format=<FORMAT>)
- Returns
dictionary report
- Return type
dict
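To make the "pretty" and "flat" descriptions more concrete, the following is a minimal sketch of the two transformations applied to a plain nested dictionary. It is not the library's formatter: the helper names `round_floats` and `flatten`, the `.` key separator, and the hand-written sample report are all assumptions, and list shortening is left out.

```python
def round_floats(report, ndigits=4):
    """Return a copy of a nested dict with every float rounded."""
    rounded = {}
    for key, value in report.items():
        if isinstance(value, dict):
            rounded[key] = round_floats(value, ndigits)
        elif isinstance(value, float):
            rounded[key] = round(value, ndigits)
        else:
            rounded[key] = value
    return rounded


def flatten(report, parent_key="", sep="."):
    """Collapse a nested dict into a single level with joined keys."""
    flat = {}
    for key, value in report.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hand-written sample, not real profiler output.
sample_report = {
    "summary": {"samples_used": 3, "null_ratio": 0.3333333333},
    "columns": {"age": {"mean": 41.66666667}},
}
pretty_like = round_floats(sample_report)
flat_like = flatten(sample_report)
```

The rounded copy keeps the nesting intact, while the flattened copy has one entry per leaf value, keyed by its full path.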
- update_profile(data, sample_size=None, min_true_samples=None)
Update the profile for data provided. User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.
- Parameters
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- Returns
None
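Because update_profile returns None and mutates the profiler, a profile can be built up from several pieces of data before a report is requested. The helper below is a sketch of that pattern; the name `profile_in_batches` is hypothetical, and it relies only on the update_profile and report signatures documented on this page.

```python
def profile_in_batches(profiler, batches, sample_size=None):
    """Feed several batches into one profiler, then return its report.

    profiler: any object exposing update_profile() and report() with the
        signatures documented on this page.
    batches: iterable of data objects accepted by update_profile().
    sample_size: forwarded unchanged to every update_profile() call.
    """
    for batch in batches:
        # Each call folds one more batch into the existing profile.
        profiler.update_profile(batch, sample_size=sample_size)
    # Ask for the compact format once all batches have been profiled.
    return profiler.report(report_options={"output_format": "compact"})
```

The helper is duck-typed, so it can be exercised with a stand-in object in tests and with a real profiler instance when the library is installed.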
- save(filepath=None)
Save profiler to disk
- Parameters
filepath (String) – Path of file to save to
- Returns
None
- classmethod load(filepath)
Load profiler from disk
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
Union[StructuredProfiler, UnstructuredProfiler]
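Since save is an instance method and load is a classmethod, a save-then-reload round trip calls load on the class rather than on the saved object. The sketch below shows that calling pattern; the helper name `save_and_reload` is hypothetical, and nothing is assumed about the on-disk format.

```python
def save_and_reload(profiler, filepath):
    """Persist a profiler and load a fresh copy from the same path."""
    # save() is called on the instance that holds the profile.
    profiler.save(filepath=filepath)
    # load() is documented as a classmethod, so it is called on the class
    # and returns a new profiler object.
    return type(profiler).load(filepath)
```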
- class dataprofiler.profilers.profile_builder.UnstructuredProfiler(data, samples_per_update=None, min_true_samples=0, options=None)
Bases:
dataprofiler.profilers.profile_builder.BaseProfiler
Instantiate the UnstructuredProfiler class
- Parameters
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- Returns
UnstructuredProfiler
- diff(other_profile, options=None)
Finds the difference between two unstructured profiles and returns the report.
- Parameters
other_profile (UnstructuredProfiler) – profile to compare against this one
options (dict) – options to impact the results of the diff
- Returns
difference of the profiles
- Return type
dict
- report(report_options=None)
Returns the unstructured report based on all profiled data fed into the profiler. The user can specify the output_format: pretty, compact, serializable, or flat.
- Pretty: floats are rounded to four decimal places, and lists are shortened.
- Compact: similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
- Serializable: output is JSON serializable and not prettified.
- Flat: nested output is returned as a flattened dictionary.
- Variables
report_options – optional format changes to the report, given as dict(output_format=<FORMAT>)
- Returns
dictionary report
- Return type
dict
- save(filepath=None)
Save profiler to disk
- Parameters
filepath (String) – Path of file to save to
- Returns
None
- classmethod load(filepath)
Load profiler from disk
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
Union[StructuredProfiler, UnstructuredProfiler]
- property profile
Returns the stored profiles for the given profiler.
- Returns
None
- update_profile(data, sample_size=None, min_true_samples=None)
Update the profile for data provided. User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.
- Parameters
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- Returns
None
- class dataprofiler.profilers.profile_builder.StructuredProfiler(data, samples_per_update=None, min_true_samples=0, options=None)
Bases:
dataprofiler.profilers.profile_builder.BaseProfiler
Instantiate the StructuredProfiler class
- Parameters
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- Returns
StructuredProfiler
- diff(other_profile, options=None)
Finds the difference between two profiles and returns the report
- Parameters
other_profile (StructuredProfiler) – profile to compare against this one
options (dict) – options to change results of the difference
- Returns
difference of the profiles
- Return type
dict
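As a rough mental model of a dictionary-shaped difference report, the sketch below compares two nested dictionaries key by key. The function name `diff_reports`, the sample inputs, and the output conventions (the string "unchanged", numeric subtraction, value pairs) are choices made for this sketch only and should not be read as the library's actual diff format.

```python
def _is_number(value):
    """True for int and float, but not bool."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_reports(left, right):
    """Compare two nested report dicts key by key.

    Conventions chosen for this sketch only:
      - equal values become the string "unchanged"
      - two numbers become their difference (left - right)
      - anything else becomes a [left, right] pair
    A key missing on one side is treated as having the value None there.
    """
    result = {}
    for key in sorted(set(left) | set(right)):
        a, b = left.get(key), right.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            result[key] = diff_reports(a, b)
        elif a == b:
            result[key] = "unchanged"
        elif _is_number(a) and _is_number(b):
            result[key] = a - b
        else:
            result[key] = [a, b]
    return result


# Hand-written inputs, not real profiler reports.
first = {"row_count": 100, "columns": {"age": {"max": 90, "type": "int"}}}
second = {"row_count": 120, "columns": {"age": {"max": 90, "type": "float"}}}
difference = diff_reports(first, second)
```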
- report(report_options=None)
Returns the profile report based on all profiled data fed into the profiler. The user can specify the output_format: pretty, compact, serializable, or flat.
- Pretty: floats are rounded to four decimal places, and lists are shortened.
- Compact: similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
- Serializable: output is JSON serializable and not prettified.
- Flat: nested output is returned as a flattened dictionary.
- Variables
report_options – optional format changes to the report, given as dict(output_format=<FORMAT>)
- Returns
dictionary report
- Return type
dict
- save(filepath=None)
Save profiler to disk
- Parameters
filepath (String) – Path of file to save to
- Returns
None
- classmethod load(filepath)
Load profiler from disk
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
Union[StructuredProfiler, UnstructuredProfiler]
- property profile
Returns the stored profiles for the given profiler.
- Returns
None
- update_profile(data, sample_size=None, min_true_samples=None)
Update the profile for data provided. User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.
- Parameters
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- Returns
None
- class dataprofiler.profilers.profile_builder.Profiler(data, samples_per_update=None, min_true_samples=0, options=None, profiler_type=None)
Bases:
object
Factory class for instantiating Structured and Unstructured Profilers
- Parameters
data (Data class object) – Data to be profiled, type allowed depends on the profiler_type
samples_per_update (int) – Number of samples to use to generate profile
min_true_samples (int) – Min number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
profiler_type (str) – Type of Profiler ("structured"/"unstructured")
- Returns
BaseProfiler
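A factory class that hands back an instance of a different class can be written in Python by overriding `__new__`. The sketch below shows that general pattern with stand-in classes; it is not the library's code. The names `ProfilerFactory`, `StructuredStandIn`, and `UnstructuredStandIn` are hypothetical, and the rule used when profiler_type is None (plain strings are treated as unstructured) is an assumption made only so the example runs.

```python
class StructuredStandIn:
    """Placeholder for a structured profiler."""

    def __init__(self, data):
        self.data = data


class UnstructuredStandIn:
    """Placeholder for an unstructured profiler."""

    def __init__(self, data):
        self.data = data


class ProfilerFactory:
    """Returns an instance of another class instead of itself."""

    _registry = {
        "structured": StructuredStandIn,
        "unstructured": UnstructuredStandIn,
    }

    def __new__(cls, data, profiler_type=None):
        if profiler_type is None:
            # Hypothetical inference rule for this sketch: raw text is
            # unstructured, everything else is structured.
            profiler_type = "unstructured" if isinstance(data, str) else "structured"
        try:
            target = cls._registry[profiler_type]
        except KeyError:
            raise ValueError(
                'profiler_type must be "structured" or "unstructured"'
            ) from None
        # The returned object is not a ProfilerFactory, so Python does not
        # call ProfilerFactory.__init__ on it.
        return target(data)
```

Calling `ProfilerFactory(rows)` therefore yields a `StructuredStandIn`, and passing `profiler_type="unstructured"` forces the other class regardless of the data.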
- classmethod load(filepath)
Load profiler from disk
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
Union[StructuredProfiler, UnstructuredProfiler]