Profile Builder
Build a model for a dataset by identifying each column's type along with its respective parameters.
- class dataprofiler.profilers.profile_builder.StructuredColProfiler(df_series: Optional[pandas.core.series.Series] = None, sample_size: Optional[int] = None, min_sample_size: int = 5000, sampling_ratio: float = 0.2, min_true_samples: int = 0, sample_ids: Optional[numpy.ndarray] = None, pool: Optional[multiprocessing.pool.Pool] = None, options: Optional[dataprofiler.profilers.profiler_options.StructuredOptions] = None)¶
Bases:
object
For profiling structured data columns.
Instantiate the StructuredColProfiler class for a given column.
- Parameters
df_series (pandas.core.series.Series) – Data to be profiled
sample_size (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
pool (multiprocessing.Pool) – pool utilized for multiprocessing
options (StructuredOptions Object) – Options for the structured profiler.
- update_column_profilers(clean_sampled_df: pandas.core.series.Series, pool: Optional[multiprocessing.pool.Pool] = None) None ¶
Calculate type statistics and label dataset.
- Parameters
clean_sampled_df (pandas.Series) – sampled series with null values dropped
pool (multiprocessing.Pool) – pool utilized for multiprocessing
- diff(other_profile: dataprofiler.profilers.profile_builder.StructuredColProfiler, options: Optional[Dict] = None) Dict ¶
Find the difference between two StructuredColProfilers and return the report.
- Parameters
other_profile (StructuredColProfiler) – Structured column profile to compare against this one.
options (dict) – options to change results of the difference
- Returns
difference of the structured column
- Return type
dict
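As a rough illustration of what a difference report could look like, the sketch below compares two nested report dictionaries. The helper `diff_values` and its conventions (the string "unchanged" for equal values, a numeric delta for numbers, a two-item list otherwise) are assumptions made for this example, not the library's implementation.

```python
from numbers import Number


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


def diff_values(this, other):
    """Recursively compare two report values (hypothetical helper)."""
    if isinstance(this, dict) and isinstance(other, dict):
        # Keep the key order of `this`, then append keys only in `other`.
        keys = list(this) + [k for k in other if k not in this]
        return {k: diff_values(this.get(k), other.get(k)) for k in keys}
    if this == other:
        return "unchanged"
    if _is_number(this) and _is_number(other):
        return this - other
    # Anything else that differs is reported as a [this, other] pair.
    return [this, other]


this_report = {"data_type": "int", "statistics": {"min": 1, "max": 10}}
other_report = {"data_type": "float", "statistics": {"min": 1, "max": 7.5}}
column_diff = diff_values(this_report, other_report)
```

Here `column_diff` keeps the nesting of the inputs, so a caller can look up the change for a single statistic by following the same keys as in the original reports.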
- report(remove_disabled_flag: bool = False) collections.OrderedDict ¶
Return profile.
- property profile: Dict¶
Return a report.
- update_profile(df_series: pandas.core.series.Series, sample_size: Optional[int] = None, min_true_samples: Optional[int] = None, sample_ids: Optional[numpy.ndarray] = None, pool: Optional[multiprocessing.pool.Pool] = None) None ¶
Update the column profiler.
- Parameters
df_series (pandas.core.series.Series) – Data to be profiled
sample_size (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
pool (multiprocessing.Pool) – pool utilized for multiprocessing
- static clean_data_and_get_base_stats(df_series: pandas.core.series.Series, sample_size: int, null_values: Optional[Dict[str, Union[re.RegexFlag, int]]] = None, min_true_samples: Optional[int] = None, sample_ids: Optional[numpy.ndarray] = None) Tuple[pandas.core.series.Series, Dict] ¶
Identify null characters and return them in a dictionary.
Remove any nulls in column.
- Parameters
df_series (pandas.core.series.Series) – a given column
sample_size (int) – Number of samples to use in generating the profile
null_values (dict[str, re.RegexFlag]) – Dictionary mapping each null value to a regex flag: the key is the null value to remove from the data, and the value is the regex flag to apply when matching it
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
- Returns
updated column with null removed and dictionary of null parameters
- Return type
pd.Series, dict
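To make the `null_values` mapping concrete, here is a minimal standard-library sketch that works on a plain list instead of a pandas Series. The function `clean_values`, the literal (escaped) matching of keys, and the shape of the returned dictionary are all assumptions for illustration; the actual method may match and report nulls differently.

```python
import re


def clean_values(values, null_values):
    """Drop null-like entries and record the indices where each occurred.

    `null_values` maps a null string to the regex flag used to match it.
    Keys are treated as literal strings here (an assumption of this sketch).
    """
    cleaned = []
    null_locations = {}
    for index, value in enumerate(values):
        text = str(value).strip()
        matched = None
        for null, flag in null_values.items():
            if re.fullmatch(re.escape(null), text, flag):
                matched = null
                break
        if matched is None:
            cleaned.append(value)
        else:
            null_locations.setdefault(matched, []).append(index)
    return cleaned, null_locations


column = ["1", "NaN", "", "3", "null"]
nulls = {"": 0, "nan": re.IGNORECASE, "null": 0}
cleaned, null_locations = clean_values(column, nulls)
```

The flag matters per key: "NaN" is caught only because the "nan" entry carries `re.IGNORECASE`, while a flag of 0 means an exact, case-sensitive match.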
- class dataprofiler.profilers.profile_builder.BaseProfiler(data: Optional[dataprofiler.data_readers.data.Data], samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.BaseOption] = None)¶
Bases:
object
Abstract class for profiling data.
Instantiate the BaseProfiler class.
- Parameters
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- Returns
Profiler
- diff(other_profile: dataprofiler.profilers.profile_builder.BaseProfiler, options: Optional[Dict] = None) Dict ¶
Find the difference of two profiles.
- Parameters
other_profile (BaseProfiler) – profile to compare against this one.
- Returns
diff of the two profiles
- Return type
dict
- property profile: dataprofiler.profilers.column_profile_compilers.BaseCompiler¶
Return the stored profiles for the given profiler.
- Returns
BaseCompiler
- report(report_options: Optional[Dict] = None) Dict ¶
Return profile report based on all profiled data fed into the profiler.
- The user can specify the output_format: (pretty, compact, serializable, flat).
- Pretty: floats are rounded to four decimal places, and lists are shortened.
- Compact: Similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
- Serializable: Output is JSON serializable and not prettified.
- Flat: Nested output is returned as a flattened dictionary.
- Variables
report_options – optional format changes to the report, given as dict(output_format=<FORMAT>)
- Returns
dictionary report
- Return type
dict
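The "pretty" and "flat" formats described above can be pictured with two small helpers. This is only a sketch: `prettify`, `flatten`, the "." key separator, and the three-item list limit are assumptions chosen for the example and are not taken from the library.

```python
def prettify(value, max_items=3):
    """Round floats to four decimals and shorten long lists (illustrative)."""
    if isinstance(value, float):
        return round(value, 4)
    if isinstance(value, dict):
        return {k: prettify(v, max_items) for k, v in value.items()}
    if isinstance(value, list):
        head = [prettify(v, max_items) for v in value[:max_items]]
        return head + ["..."] if len(value) > max_items else head
    return value


def flatten(report, parent_key="", sep="."):
    """Collapse a nested dictionary into a single level (illustrative)."""
    flat = {}
    for key, value in report.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


raw_report = {
    "global_stats": {"samples_used": 5, "times": {"clean": 0.123456}},
    "values": [1, 2, 3, 4, 5],
}
pretty_report = prettify(raw_report)
flat_report = flatten(raw_report)
```

In this sketch the flat form trades readability for easy lookup: every statistic is reachable through one string key such as "global_stats.times.clean".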
- update_profile(data: Union[dataprofiler.data_readers.base_data.BaseData, pandas.core.frame.DataFrame, pandas.core.series.Series], sample_size: Optional[int] = None, min_true_samples: Optional[int] = None) None ¶
Update the profile for data provided.
User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.
- Parameters
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- Returns
None
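One way to read the interaction between `sample_size` and `min_true_samples` is that sampling continues past `sample_size` until enough non-null values have been seen. The sketch below illustrates that reading on a plain list; the function `sample_non_null`, its `seed` parameter, and the stopping rule are assumptions, and the library's actual sampling strategy may differ.

```python
import random


def sample_non_null(values, sample_size, min_true_samples=0, seed=0):
    """Sample indices in random order until both thresholds are met.

    Returns the sorted sampled indices and the number of non-null values
    among them. Stops early only when `sample_size` rows were taken AND at
    least `min_true_samples` of them are non-null.
    """
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)  # seeded for repeatability
    sampled = []
    true_count = 0
    for index in order:
        if len(sampled) >= sample_size and true_count >= min_true_samples:
            break
        sampled.append(index)
        if values[index] is not None:
            true_count += 1
    return sorted(sampled), true_count


data = [None, None, 1, None, 2, 3]
indices, non_null = sample_non_null(data, sample_size=2, min_true_samples=3)
```

With only three non-null values in `data`, asking for `min_true_samples=3` forces the sketch to keep drawing until all three have been found, even though `sample_size` is 2.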
- save(filepath: Optional[str] = None) None ¶
Save profiler to disk.
- Parameters
filepath (String) – Path of file to save to
- Returns
None
- classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler ¶
Load profiler from disk.
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
BaseProfiler
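The save/load pair follows a common round-trip pattern: an instance method writes state to a file, and a classmethod rebuilds an object from it. The sketch below shows that pattern with `pickle` on a stand-in class. `ToyProfiler`, its `total_samples` attribute, and the timestamped default filename are hypothetical; the on-disk format used by the real profilers is not specified in this section.

```python
import os
import pickle
import tempfile
from datetime import datetime


class ToyProfiler:
    """Stand-in class used only to illustrate a save/load round trip."""

    def __init__(self, total_samples=0):
        self.total_samples = total_samples

    def save(self, filepath=None):
        # Assumed default: a timestamped name in the working directory.
        if filepath is None:
            stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            filepath = f"profile-{stamp}.pkl"
        with open(filepath, "wb") as handle:
            pickle.dump(self.__dict__, handle)

    @classmethod
    def load(cls, filepath):
        # Only unpickle files from a trusted source.
        with open(filepath, "rb") as handle:
            state = pickle.load(handle)
        profiler = cls.__new__(cls)
        profiler.__dict__.update(state)
        return profiler


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_profile.pkl")
    ToyProfiler(total_samples=42).save(path)
    restored = ToyProfiler.load(path)
```

Because `load` is a classmethod, it is called on the class rather than on an existing instance, which mirrors the `load(filepath)` signature documented above.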
- class dataprofiler.profilers.profile_builder.UnstructuredProfiler(data: dataprofiler.data_readers.data.Data, samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.BaseOption] = None)¶
Bases:
dataprofiler.profilers.profile_builder.BaseProfiler
For profiling unstructured data.
Instantiate the UnstructuredProfiler class.
- Parameters
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- Returns
UnstructuredProfiler
- diff(other_profile: dataprofiler.profilers.profile_builder.UnstructuredProfiler, options: Optional[Dict] = None) Dict ¶
Find the difference between two unstructured profiles and return the report.
- Parameters
other_profile (UnstructuredProfiler) – profile to compare against this one.
options (dict) – options to impact the results of the diff
- Returns
difference of the profiles
- Return type
dict
- report(report_options: Optional[Dict] = None) Dict ¶
Return unstructured report based on all profiled data fed into profiler.
- The user can specify the output_format: (pretty, compact, serializable, flat).
- Pretty: floats are rounded to four decimal places, and lists are shortened.
- Compact: Similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
- Serializable: Output is JSON serializable and not prettified.
- Flat: Nested output is returned as a flattened dictionary.
- Variables
report_options – optional format changes to the report, given as dict(output_format=<FORMAT>)
- Returns
dictionary report
- Return type
dict
- save(filepath: Optional[str] = None) None ¶
Save profiler to disk.
- Parameters
filepath (String) – Path of file to save to
- Returns
None
- classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler ¶
Load profiler from disk.
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
BaseProfiler
- property profile: dataprofiler.profilers.column_profile_compilers.BaseCompiler¶
Return the stored profiles for the given profiler.
- Returns
BaseCompiler
- update_profile(data: Union[dataprofiler.data_readers.base_data.BaseData, pandas.core.frame.DataFrame, pandas.core.series.Series], sample_size: Optional[int] = None, min_true_samples: Optional[int] = None) None ¶
Update the profile for data provided.
User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.
- Parameters
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- Returns
None
- class dataprofiler.profilers.profile_builder.StructuredProfiler(data: dataprofiler.data_readers.data.Data, samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.BaseOption] = None)¶
Bases:
dataprofiler.profilers.profile_builder.BaseProfiler
For profiling structured data.
Instantiate the StructuredProfiler class.
- Parameters
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- Returns
StructuredProfiler
- diff(other_profile: dataprofiler.profilers.profile_builder.StructuredProfiler, options: Optional[Dict] = None) Dict ¶
Find the difference between 2 Profiles and return the report.
- Parameters
other_profile (StructuredProfiler) – profile to compare against this one
options (dict) – options to change results of the difference
- Returns
difference of the profiles
- Return type
dict
- report(report_options: Optional[Dict] = None) Dict ¶
Return a report.
- save(filepath: Optional[str] = None) None ¶
Save profiler to disk.
- Parameters
filepath (String) – Path of file to save to
- Returns
None
- classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler ¶
Load profiler from disk.
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
BaseProfiler
- property profile: dataprofiler.profilers.column_profile_compilers.BaseCompiler¶
Return the stored profiles for the given profiler.
- Returns
BaseCompiler
- update_profile(data: Union[dataprofiler.data_readers.base_data.BaseData, pandas.core.frame.DataFrame, pandas.core.series.Series], sample_size: Optional[int] = None, min_true_samples: Optional[int] = None) None ¶
Update the profile for data provided.
User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.
- Parameters
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- Returns
None
- class dataprofiler.profilers.profile_builder.Profiler(data: dataprofiler.data_readers.data.Data, samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.ProfilerOptions] = None, profiler_type: Optional[str] = None)¶
Bases:
object
For profiling data.
Instantiate Structured and Unstructured Profilers.
This is a factory class.
- Parameters
data (Data class object) – Data to be profiled, type allowed depends on the profiler_type
samples_per_update (int) – Number of samples to use to generate profile
min_true_samples (int) – Min number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
profiler_type (str) – Type of Profiler ("graph"/"structured"/"unstructured")
- Returns
Union[GraphProfiler, StructuredProfiler, UnstructuredProfiler]
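A factory class of this kind can be written by overriding `__new__` so that constructing the factory returns an instance of a different class. The sketch below shows the pattern with stand-in classes. `ProfilerFactory`, `StructuredSketch`, `UnstructuredSketch`, and the rule used to infer the type from the data are all hypothetical and do not describe how `Profiler` actually chooses a profiler.

```python
class _SketchBase:
    def __init__(self, data):
        self.data = data


class StructuredSketch(_SketchBase):
    """Stand-in for a structured profiler."""


class UnstructuredSketch(_SketchBase):
    """Stand-in for an unstructured profiler."""


class ProfilerFactory:
    """Factory: calling it returns one of the sketch classes above."""

    _TYPES = {
        "structured": StructuredSketch,
        "unstructured": UnstructuredSketch,
    }

    def __new__(cls, data, profiler_type=None):
        if profiler_type is None:
            # Assumed inference rule: a list of row dicts is "structured",
            # anything else is "unstructured".
            is_rows = isinstance(data, list) and all(
                isinstance(row, dict) for row in data
            )
            profiler_type = "structured" if is_rows else "unstructured"
        if profiler_type not in cls._TYPES:
            raise ValueError(f"Unknown profiler_type: {profiler_type!r}")
        # The returned object is not a ProfilerFactory instance, so
        # ProfilerFactory.__init__ is never invoked on it.
        return cls._TYPES[profiler_type](data)


table_profiler = ProfilerFactory([{"age": 31}, {"age": 47}])
text_profiler = ProfilerFactory("free-form text")
```

Passing `profiler_type` explicitly skips the inference step, which is useful when the data could reasonably be read either way.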
- classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler ¶
Load profiler from disk.
- Parameters
filepath (String) – Path of file to load from
- Returns
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type
BaseProfiler