dataprofiler.profilers.profile_builder module
Build a model for a dataset by identifying each column's type along with its respective parameters.
- class dataprofiler.profilers.profile_builder.StructuredColProfiler(df_series: Series | None = None, sample_size: int | None = None, min_sample_size: int = 5000, sampling_ratio: float = 0.2, min_true_samples: int = 0, sample_ids: ndarray | None = None, pool: Pool | None = None, column_index: int | None = None, options: StructuredOptions | None = None)
For profiling structured data columns.
Instantiate the StructuredColProfiler class for a given column.
- Parameters:
df_series (pandas.core.series.Series) – Data to be profiled
sample_size (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
pool (multiprocessing.Pool) – pool utilized for multiprocessing
column_index (int) – index of the given column
options (StructuredOptions Object) – Options for the structured profiler.
- update_column_profilers(clean_sampled_df: Series, pool: Pool | None = None) → None
Calculate type statistics and label dataset.
- Parameters:
clean_sampled_df (pandas.Series) – sampled series with null values dropped
pool (multiprocessing.Pool) – pool utilized for multiprocessing
- diff(other_profile: StructuredColProfiler, options: dict | None = None) → dict
Find the difference between two StructuredColProfilers and return the report.
- Parameters:
other_profile (StructuredColProfiler) – structured column profiler to compare against this one.
options (dict) – options to change results of the difference
- Returns:
difference of the structured column
- Return type:
dict
- report(remove_disabled_flag: bool = False) → OrderedDict
Return profile.
- classmethod load_from_dict(data, config: dict | None = None) → StructuredColProfiler
Parse attributes from a JSON dictionary into self.
- Parameters:
data (dict[string, Any]) – dictionary with attributes and values.
config (Dict | None) – config for loading structured column profiler
- Returns:
Profiler with attributes populated.
- Return type:
StructuredColProfiler
- property profile: dict
Return a report.
- update_profile(df_series: Series, sample_size: int | None = None, min_true_samples: int | None = None, sample_ids: ndarray | None = None, pool: Pool | None = None) → None
Update the column profiler.
- Parameters:
df_series (pandas.core.series.Series) – Data to be profiled
sample_size (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
pool (multiprocessing.Pool) – pool utilized for multiprocessing
- static clean_data_and_get_base_stats(df_series: pd.Series, sample_size: int, null_values: dict[str, re.RegexFlag | int] = None, min_true_samples: int = None, sample_ids: np.ndarray | list[list[int]] | None = None) → tuple[pd.Series, dict]
Identify null characters and return them in a dictionary.
Remove any nulls in the column.
- Parameters:
df_series (pandas.core.series.Series) – a given column
sample_size (int) – Number of samples to use in generating the profile
null_values (Dict[str, Union[re.RegexFlag, int]]) – Dictionary mapping null values to regex flags, where each key is a null value to remove from the data and each value is the regex flag to apply
min_true_samples (int) – Minimum number of samples required for the profiler
sample_ids (list(list)) – Randomized list of sample indices
- Returns:
updated column with nulls removed and a dictionary of null parameters
- Return type:
pd.Series, dict
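The null-handling contract described above (a mapping from null token to regex flag, returning the cleaned column plus a record of what was removed) can be sketched with the standard library alone. This is an illustration, not the library's implementation: `drop_nulls` and its return shape are hypothetical, it works on a plain list rather than a pandas Series, and it ignores sampling entirely.

```python
import re


def drop_nulls(values, null_values):
    """Split values into non-null entries and a map of null pattern -> row indices.

    `null_values` maps a regex pattern to the flags used to compile it,
    mirroring the shape of the `null_values` parameter described above.
    """
    compiled = {pat: re.compile(pat, flags) for pat, flags in null_values.items()}
    cleaned = []    # (original index, value) pairs that are not null
    null_rows = {}  # pattern -> list of row indices that matched it
    for idx, value in enumerate(values):
        text = str(value)
        for pat, rx in compiled.items():
            if rx.fullmatch(text):
                null_rows.setdefault(pat, []).append(idx)
                break
        else:
            # No null pattern matched, so the value is kept.
            cleaned.append((idx, value))
    return cleaned, null_rows


column = ["1", "nan", "", "7", "NaN"]
cleaned, null_rows = drop_nulls(column, {"": 0, "nan": re.IGNORECASE})
```

Keeping the original row index next to each surviving value makes it possible to relate the cleaned data back to the source column, in the same spirit as the dictionary of null parameters the method returns.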
- class dataprofiler.profilers.profile_builder.BaseProfiler(data: Data | None, samples_per_update: int = None, min_true_samples: int = 0, options: BaseOption = None)
Abstract class for profiling data.
Instantiate the BaseProfiler class.
- Parameters:
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- diff(other_profile: BaseProfiler, options: dict | None = None) → dict
Find the difference of two profiles.
- Parameters:
other_profile (BaseProfiler) – profile to compare against this one.
- Returns:
diff of the two profiles
- Return type:
dict
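To make the idea of a profile diff concrete, here is a minimal sketch that compares two nested report dictionaries. Everything in it is an assumption for illustration: `diff_reports` is a hypothetical helper, and the output conventions (numeric difference, `"unchanged"`, or a `[left, right]` pair) are choices made for this sketch, not a description of what `diff` actually returns.

```python
def diff_reports(left, right):
    """Recursively compare two nested report dictionaries."""
    result = {}
    # Visit keys of `left` first, then keys that only exist in `right`.
    for key in list(left) + [k for k in right if k not in left]:
        if key not in right:
            result[key] = [left[key], None]
        elif key not in left:
            result[key] = [None, right[key]]
        else:
            a, b = left[key], right[key]
            if isinstance(a, dict) and isinstance(b, dict):
                result[key] = diff_reports(a, b)
            elif a == b:
                result[key] = "unchanged"
            elif all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in (a, b)
            ):
                result[key] = a - b  # numeric difference, left minus right
            else:
                result[key] = [a, b]
    return result


first = {"row_count": 100, "stats": {"min": 1, "max": 9}, "data_type": "int"}
second = {"row_count": 80, "stats": {"min": 1, "max": 12}, "data_type": "float"}
difference = diff_reports(first, second)
```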
- property profile: BaseCompiler | list[StructuredColProfiler]
Return the stored profiles for the given profiler.
- Returns:
BaseCompiler | list[StructuredColProfiler]
- report(report_options: dict | None = None) → dict
Return profile report based on all profiled data fed into the profiler.
The user can specify the output format (pretty, compact, serializable, or flat):
- Pretty: floats are rounded to four decimal places, and lists are shortened.
- Compact: similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
- Serializable: output is JSON serializable and not prettified.
- Flat: nested output is returned as a flattened dictionary.
- Variables:
report_options – optional format changes to the report, given as dict(output_format=<FORMAT>)
- Returns:
dictionary report
- Return type:
dict
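Two of the output formats above are easy to picture with a small standalone sketch: rounding floats to four decimal places (as in "pretty") and flattening nested keys (as in "flat"). The helpers `prettify` and `flatten`, the sample report keys, and the `.` separator are all assumptions made for illustration; the library's actual key naming and list handling are not shown here.

```python
def prettify(report, ndigits=4):
    """Return a copy of a nested report with floats rounded to `ndigits` places."""
    pretty = {}
    for key, value in report.items():
        if isinstance(value, dict):
            pretty[key] = prettify(value, ndigits)
        elif isinstance(value, float):
            pretty[key] = round(value, ndigits)
        else:
            pretty[key] = value
    return pretty


def flatten(report, parent_key="", sep="."):
    """Collapse a nested report into a single-level dictionary."""
    flat = {}
    for key, value in report.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


sample_report = {
    "global_stats": {"row_count": 3, "null_ratio": 0.123456789},
    "data_stats": {"mean": 2.0},
}
pretty_report = prettify(sample_report)
flat_report = flatten(sample_report)
```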
- classmethod load_from_dict(data, config: dict | None = None) → BaseProfilerT
Parse attributes from a JSON dictionary into self.
- Parameters:
data (dict[string, Any]) – dictionary with attributes and values.
config (Dict | None) – config for overriding data params when loading from dict
- Returns:
Profiler with attributes populated.
- Return type:
BaseProfilerT
- update_profile(data: data_readers.base_data.BaseData | pd.DataFrame | pd.Series, sample_size: int = None, min_true_samples: int = None) → None
Update the profile for data provided.
The user can specify the sample size used to profile the data, as well as the minimum number of non-null samples to profile.
- Parameters:
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
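One plausible reading of how `sample_size` and `min_true_samples` interact is sketched below: draw `sample_size` rows at random, then keep drawing until enough non-null rows are included or the data runs out. The source does not spell out the library's sampling strategy, so `draw_sample`, its `seed` argument, and the extension rule are assumptions made purely for illustration.

```python
import random


def draw_sample(values, sample_size, min_true_samples=0, seed=0):
    """Pick row indices at random, extending the draw until enough are non-null."""
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)

    chosen = order[:sample_size]
    true_count = sum(values[i] is not None for i in chosen)

    # Keep adding rows until the non-null quota is met or the data is exhausted.
    for idx in order[sample_size:]:
        if true_count >= min_true_samples:
            break
        chosen.append(idx)
        true_count += values[idx] is not None
    return sorted(chosen)


rows = [None, 1, None, 2, None, 3, None, 4]
plain_sample = draw_sample(rows, sample_size=2)
padded_sample = draw_sample(rows, sample_size=2, min_true_samples=3)
```

With `min_true_samples` left at zero the draw stops at `sample_size` rows; with a quota of three non-null rows it grows until the quota is satisfied.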
- save(filepath: str | None = None, save_method: str = 'pickle') → None
Save profiler to disk.
- Parameters:
filepath (String) – Path of file to save to
save_method (String) – The desired saving method (must be “pickle” or “json”)
- classmethod load(filepath: str, load_method: str | None = None) → BaseProfiler
Load profiler from disk.
- Parameters:
filepath (String) – Path of file to load from
load_method (Optional[String]) – The desired loading method, default = None
- Returns:
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type:
BaseProfiler
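The `save` / `load` pair with its "pickle" and "json" methods follows a common round-trip pattern, sketched here on a toy class using only the standard library. `TinyProfile` is hypothetical, and the handling of `load_method=None` (try pickle, then fall back to JSON) is an assumption of this sketch; the source does not say how the library resolves a `None` load method.

```python
import json
import os
import pickle
import tempfile


class TinyProfile:
    """Toy stand-in for a profiler that can be written to and read from disk."""

    def __init__(self, row_count=0):
        self.row_count = row_count

    def save(self, filepath, save_method="pickle"):
        if save_method == "pickle":
            with open(filepath, "wb") as fh:
                pickle.dump(self.__dict__, fh)
        elif save_method == "json":
            with open(filepath, "w", encoding="utf-8") as fh:
                json.dump(self.__dict__, fh)
        else:
            raise ValueError('save_method must be "pickle" or "json"')

    @classmethod
    def load(cls, filepath, load_method=None):
        if load_method in (None, "pickle"):
            try:
                with open(filepath, "rb") as fh:
                    return cls(**pickle.load(fh))
            except (pickle.UnpicklingError, EOFError):
                if load_method == "pickle":
                    raise
                # Assumption: with no explicit method, fall back to JSON.
        with open(filepath, "r", encoding="utf-8") as fh:
            return cls(**json.load(fh))


restored = {}
with tempfile.TemporaryDirectory() as tmp:
    original = TinyProfile(row_count=42)
    for method in ("pickle", "json"):
        path = os.path.join(tmp, f"profile.{method}")
        original.save(path, save_method=method)
        restored[method] = TinyProfile.load(path).row_count
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code; JSON avoids that risk at the cost of supporting fewer Python types.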
- class dataprofiler.profilers.profile_builder.UnstructuredProfiler(data: Data, samples_per_update: int | None = None, min_true_samples: int = 0, options: BaseOption | None = None)
For profiling unstructured data.
Instantiate the UnstructuredProfiler class.
- Parameters:
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- diff(other_profile: UnstructuredProfiler, options: dict | None = None) → dict
Find the difference between two unstructured profiles and return the report.
- Parameters:
other_profile (UnstructuredProfiler) – profile to compare against this one.
options (dict) – options to impact the results of the diff
- Returns:
difference of the profiles
- Return type:
dict
- property profile: BaseCompiler
Return the stored profiles for the given profiler.
- Returns:
BaseCompiler
- report(report_options: dict | None = None) → dict
Return unstructured report based on all profiled data fed into the profiler.
The user can specify the output format (pretty, compact, serializable, or flat):
- Pretty: floats are rounded to four decimal places, and lists are shortened.
- Compact: similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.
- Serializable: output is JSON serializable and not prettified.
- Flat: nested output is returned as a flattened dictionary.
- Variables:
report_options – optional format changes to the report, given as dict(output_format=<FORMAT>)
- Returns:
dictionary report
- Return type:
dict
- classmethod load_from_dict(data, config: dict | None = None)
Parse attributes from a JSON dictionary into self.
- Parameters:
data (dict[string, Any]) – dictionary with attributes and values.
config (Dict | None) – config for loading profiler params from dictionary
- Raises:
- save(filepath: str | None = None, save_method: str = 'pickle') → None
Save profiler to disk.
- Parameters:
filepath (String) – Path of file to save to
save_method (String) – The desired saving method (“pickle” | “json”)
- classmethod load(filepath: str, load_method: str | None = None) → BaseProfiler
Load profiler from disk.
- Parameters:
filepath (String) – Path of file to load from
load_method (Optional[String]) – The desired loading method, default = None
- Returns:
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type:
BaseProfiler
- update_profile(data: data_readers.base_data.BaseData | pd.DataFrame | pd.Series, sample_size: int = None, min_true_samples: int = None) → None
Update the profile for data provided.
The user can specify the sample size used to profile the data, as well as the minimum number of non-null samples to profile.
- Parameters:
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- class dataprofiler.profilers.profile_builder.StructuredProfiler(data: Data, samples_per_update: int | None = None, min_true_samples: int = 0, options: BaseOption | None = None)
For profiling structured data.
Instantiate the StructuredProfiler class.
- Parameters:
data (Data class object) – Data to be profiled
samples_per_update (int) – Number of samples to use in generating profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
- diff(other_profile: StructuredProfiler, options: dict | None = None) → dict
Find the difference between two profiles and return the report.
- Parameters:
other_profile (StructuredProfiler) – profile to compare against this one.
options (dict) – options to change results of the difference
- Returns:
difference of the profiles
- Return type:
dict
- property profile: list[StructuredColProfiler]
Return the stored profiles for the given profiler.
- Returns:
list[StructuredColProfiler]
- report(report_options: dict | None = None) → dict
Return a report.
- classmethod load_from_dict(data, config: dict | None = None) → StructuredProfiler
Parse attributes from a JSON dictionary into self.
- Parameters:
data (dict[string, Any]) – dictionary with attributes and values.
config (Dict | None) – config for loading profiler params from dictionary
- Returns:
Profiler with attributes populated.
- Return type:
StructuredProfiler
- save(filepath: str | None = None, save_method: str = 'pickle') → None
Save profiler to disk.
- Parameters:
filepath (String) – Path of file to save to
save_method (String) – The desired saving method (must be “pickle” or “json”)
- classmethod load(filepath: str, load_method: str | None = None) → BaseProfiler
Load profiler from disk.
- Parameters:
filepath (String) – Path of file to load from
load_method (Optional[String]) – The desired loading method, default = None
- Returns:
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type:
BaseProfiler
- update_profile(data: data_readers.base_data.BaseData | pd.DataFrame | pd.Series, sample_size: int = None, min_true_samples: int = None) → None
Update the profile for data provided.
The user can specify the sample size used to profile the data, as well as the minimum number of non-null samples to profile.
- Parameters:
data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled
sample_size (int) – number of samples to profile from the data
min_true_samples (int) – minimum number of non-null samples to profile
- class dataprofiler.profilers.profile_builder.Profiler(data: Data, samples_per_update: int = None, min_true_samples: int = 0, options: ProfilerOptions = None, profiler_type: str = None)
For profiling data.
Instantiate Structured and Unstructured Profilers.
This is a factory class.
- Parameters:
data (Data class object) – Data to be profiled, type allowed depends on the profiler_type
samples_per_update (int) – Number of samples to use to generate profile
min_true_samples (int) – Minimum number of samples required for the profiler
options (ProfilerOptions Object) – Options for the profiler.
profiler_type (str) – Type of Profiler (“graph”, “structured”, or “unstructured”)
- Returns:
Union[GraphProfiler, StructuredProfiler, UnstructuredProfiler]
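Because `Profiler` is described as a factory class, instantiating it yields one of the concrete profilers rather than a `Profiler` instance. The pattern can be sketched in plain Python by overriding `__new__`. The classes below are hypothetical stand-ins, the graph case is left out, and the rule used when `profiler_type` is omitted (a list means structured, anything else unstructured) is an assumption of this sketch only.

```python
class StructuredSketch:
    """Hypothetical stand-in for a structured profiler."""

    def __init__(self, data):
        self.data = data


class UnstructuredSketch:
    """Hypothetical stand-in for an unstructured profiler."""

    def __init__(self, data):
        self.data = data


class ProfilerSketch:
    """Factory: calling it returns an instance of one of the classes above."""

    _REGISTRY = {
        "structured": StructuredSketch,
        "unstructured": UnstructuredSketch,
    }

    def __new__(cls, data, profiler_type=None):
        if profiler_type is None:
            # Assumption: a list of rows is tabular, anything else is free text.
            profiler_type = "structured" if isinstance(data, list) else "unstructured"
        try:
            target = cls._REGISTRY[profiler_type]
        except KeyError:
            raise ValueError(f"unsupported profiler_type: {profiler_type!r}") from None
        # Returning a different class means ProfilerSketch.__init__ never runs.
        return target(data)


table_profiler = ProfilerSketch([[1, "a"], [2, "b"]])
text_profiler = ProfilerSketch("free-form text")
forced_profiler = ProfilerSketch("a,b\n1,2", profiler_type="structured")
```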
- classmethod load(filepath: str, load_method: str | None = None) → BaseProfiler
Load profiler from disk.
- Parameters:
filepath (String) – Path of file to load from
load_method (Optional[String]) – The desired loading method, default = None
- Returns:
Profiler being loaded, StructuredProfiler or UnstructuredProfiler
- Return type:
BaseProfiler