Profile Builder

Build a model of a dataset by identifying each column's type along with its respective parameters.

class dataprofiler.profilers.profile_builder.StructuredColProfiler(df_series: Optional[pandas.core.series.Series] = None, sample_size: Optional[int] = None, min_sample_size: int = 5000, sampling_ratio: float = 0.2, min_true_samples: int = 0, sample_ids: Optional[numpy.ndarray] = None, pool: Optional[multiprocessing.pool.Pool] = None, options: Optional[dataprofiler.profilers.profiler_options.StructuredOptions] = None)

Bases: object

For profiling structured data columns.

Instantiate the StructuredColProfiler class for a given column.

Parameters
  • df_series (pandas.core.series.Series) – Data to be profiled

  • sample_size (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • sample_ids (list(list)) – Randomized list of sample indices

  • pool (multiprocessing.Pool) – pool utilized for multiprocessing

  • options (StructuredOptions Object) – Options for the structured profiler.
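The signature also exposes min_sample_size and sampling_ratio, which the parameter list above does not describe. The following is a minimal sketch of how such a pair could determine a default sample size. It assumes the sample is a fixed fraction of the column with a lower bound. The helper name default_sample_size and the rule itself are assumptions for illustration, not confirmed library behavior.

```python
def default_sample_size(n_rows: int,
                        min_sample_size: int = 5000,
                        sampling_ratio: float = 0.2) -> int:
    """Hypothetical helper: pick how many rows to sample from a column."""
    # Small columns are profiled in full.
    if n_rows <= min_sample_size:
        return n_rows
    # Otherwise take a fixed fraction, but never fewer than the minimum.
    return max(int(sampling_ratio * n_rows), min_sample_size)


for rows in (1_000, 10_000, 100_000):
    print(rows, default_sample_size(rows))
```

Under this assumed rule, passing sample_size explicitly would simply bypass the computation.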

update_column_profilers(clean_sampled_df: pandas.core.series.Series, pool: Optional[multiprocessing.pool.Pool] = None) None

Calculate type statistics and label dataset.

Parameters
  • clean_sampled_df (pandas.Series) – sampled series with null values dropped

  • pool (multiprocessing.pool.Pool) – pool utilized for multiprocessing

diff(other_profile: dataprofiler.profilers.profile_builder.StructuredColProfiler, options: Optional[Dict] = None) Dict

Find the difference between two structured column profiles and return the report.

Parameters
  • other_profile (StructuredColProfiler) – Structured col finding the difference with this one.

  • options (dict) – options to change results of the difference

Returns

difference of the structured column

Return type

dict
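To make the idea of a difference report concrete, here is a minimal sketch that compares two report-like dictionaries key by key. The function diff_reports, the sample keys, and the conventions used (numeric subtraction, the string "unchanged", a two-item list for differing values) are assumptions for illustration and may not match what diff actually returns.

```python
from typing import Any, Dict


def diff_reports(this: Dict[str, Any], other: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: compare two report-like dictionaries key by key."""
    result: Dict[str, Any] = {}
    for key in sorted(set(this) | set(other)):
        a, b = this.get(key), other.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            result[key] = diff_reports(a, b)   # recurse into nested sections
        elif isinstance(a, (int, float)) and isinstance(b, (int, float)):
            result[key] = a - b                # numeric difference
        elif a == b:
            result[key] = "unchanged"
        else:
            result[key] = [a, b]               # keep both differing values
    return result


# Illustrative inputs only; real reports contain many more fields.
col_a = {"data_type": "int", "statistics": {"min": 1, "max": 10}}
col_b = {"data_type": "float", "statistics": {"min": 0, "max": 10}}
col_diff = diff_reports(col_a, col_b)
print(col_diff)
```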

report(remove_disabled_flag: bool = False) collections.OrderedDict

Return profile.

property profile: Dict

Return a report.

update_profile(df_series: pandas.core.series.Series, sample_size: Optional[int] = None, min_true_samples: Optional[int] = None, sample_ids: Optional[numpy.ndarray] = None, pool: Optional[multiprocessing.pool.Pool] = None) None

Update the column profiler.

Parameters
  • df_series (pandas.core.series.Series) – Data to be profiled

  • sample_size (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • sample_ids (list(list)) – Randomized list of sample indices

  • pool (multiprocessing.Pool) – pool utilized for multiprocessing
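The interplay of sample_size, min_true_samples, and sample_ids can be pictured with a short sketch. It walks a shuffled index order and keeps reading until both the requested sample has been inspected and enough non-null values have been collected. The helper sample_non_null and its stopping rule are assumptions, not the library's implementation.

```python
import random
from typing import List, Optional, Sequence


def sample_non_null(values: Sequence[Optional[str]],
                    sample_size: int,
                    min_true_samples: int = 0,
                    seed: int = 0) -> List[str]:
    """Hypothetical helper: sample values, topping up until enough are non-null."""
    order = list(range(len(values)))
    random.Random(seed).shuffle(order)   # randomized sample indices
    kept: List[str] = []
    for seen, idx in enumerate(order, start=1):
        if values[idx] is not None:
            kept.append(values[idx])
        # Stop once the requested sample was inspected and enough
        # non-null values were collected.
        if seen >= sample_size and len(kept) >= min_true_samples:
            break
    return kept


column = ["a", None, "b", None, None, "c"]
topped_up = sample_non_null(column, sample_size=2, min_true_samples=3)
print(sorted(topped_up))
```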

static clean_data_and_get_base_stats(df_series: pandas.core.series.Series, sample_size: int, null_values: Optional[Dict[str, Union[re.RegexFlag, int]]] = None, min_true_samples: Optional[int] = None, sample_ids: Optional[numpy.ndarray] = None) Tuple[pandas.core.series.Series, Dict]

Identify null characters and return them in a dictionary.

Remove any nulls in column.

Parameters
  • df_series (pandas.core.series.Series) – a given column

  • sample_size (int) – Number of samples to use in generating the profile

  • null_values (dict[str, re.FLAG]) – Dictionary mapping null values to regex flag where the key represents the null value to remove from the data and the flag represents the regex flag to apply

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • sample_ids (list(list)) – Randomized list of sample indices

Returns

updated column with null removed and dictionary of null parameters

Return type

pd.Series, dict
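As an illustration of the null_values mapping described above, the sketch below uses a plain list of strings in place of a pandas Series. It treats each key as a literal that must match an entire entry, applies the paired regex flag, and returns the cleaned values together with a dictionary of the positions where nulls were found. The function name, the default mapping, and the shape of the returned dictionary are assumptions for illustration.

```python
import re
from typing import Dict, List, Optional, Tuple


def clean_and_collect_nulls(
    values: List[str],
    null_values: Optional[Dict[str, int]] = None,
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Hypothetical helper: drop null-like entries and record where they were."""
    if null_values is None:
        # Assumed defaults, for illustration only.
        null_values = {"": 0, "nan": re.IGNORECASE, "null": re.IGNORECASE}
    cleaned: List[str] = []
    null_positions: Dict[str, List[int]] = {}
    for index, value in enumerate(values):
        for null, flags in null_values.items():
            # Each key is escaped and must match the whole entry.
            if re.fullmatch(re.escape(null), value, flags):
                null_positions.setdefault(value, []).append(index)
                break
        else:
            cleaned.append(value)   # no null pattern matched
    return cleaned, null_positions


cleaned, nulls = clean_and_collect_nulls(["1", "NaN", "", "3", "null"])
print(cleaned, nulls)
```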

class dataprofiler.profilers.profile_builder.BaseProfiler(data: Optional[dataprofiler.data_readers.data.Data], samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.BaseOption] = None)

Bases: object

Abstract class for profiling data.

Instantiate the BaseProfiler class.

Parameters
  • data (Data class object) – Data to be profiled

  • samples_per_update (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • options (ProfilerOptions Object) – Options for the profiler.

Returns

Profiler

diff(other_profile: dataprofiler.profilers.profile_builder.BaseProfiler, options: Optional[Dict] = None) Dict

Find the difference of two profiles.

Parameters

other_profile (BaseProfiler) – profile to find the difference with.

Returns

diff of the two profiles

Return type

dict

property profile: dataprofiler.profilers.column_profile_compilers.BaseCompiler

Return the stored profiles for the given profiler.

Returns

BaseCompiler

report(report_options: Optional[Dict] = None) Dict

Return profile report based on all profiled data fed into the profiler.

User can specify the output_formats: (pretty, compact, serializable, flat).

  • Pretty: floats are rounded to four decimal places, and lists are shortened.

  • Compact: Similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.

  • Serializable: Output is JSON serializable and not prettified.

  • Flat: Nested output is returned as a flattened dictionary.

Variables

report_options – optional format changes to the report dict(output_format=<FORMAT>)

Returns

dictionary report

Return type

dict
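A format is requested by passing a dictionary such as dict(output_format="flat") as report_options. To show roughly what two of these formats imply, the sketch below rounds floats to four decimal places and flattens a nested dictionary. The helper names, the "." separator, and the sample report keys are assumptions for illustration; the real output may be structured differently.

```python
from typing import Any, Dict


def prettify(value: Any, ndigits: int = 4) -> Any:
    """Round floats in a nested report, in the spirit of the pretty format."""
    if isinstance(value, float):
        return round(value, ndigits)
    if isinstance(value, dict):
        return {key: prettify(item, ndigits) for key, item in value.items()}
    if isinstance(value, list):
        return [prettify(item, ndigits) for item in value]
    return value


def flatten(report: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Collapse nested dictionaries into one level, as a flat format might."""
    flat: Dict[str, Any] = {}
    for key, item in report.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(item, dict):
            flat.update(flatten(item, name, sep))
        else:
            flat[name] = item
    return flat


# Illustrative report; the keys are placeholders.
sample_report = {"global_stats": {"samples_used": 3},
                 "data_stats": {"mean": 1.23456789}}
pretty_report = prettify(sample_report)
flat_report = flatten(sample_report)
print(pretty_report)
print(flat_report)
```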

update_profile(data: Union[dataprofiler.data_readers.base_data.BaseData, pandas.core.frame.DataFrame, pandas.core.series.Series], sample_size: Optional[int] = None, min_true_samples: Optional[int] = None) None

Update the profile for data provided.

User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.

Parameters
  • data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled

  • sample_size (int) – number of samples to profile from the data

  • min_true_samples (int) – minimum number of non-null samples to profile

Returns

None

save(filepath: Optional[str] = None) None

Save profiler to disk.

Parameters

filepath (String) – Path of file to save to

Returns

None

classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler

Load profiler from disk.

Parameters

filepath (String) – Path of file to load from

Returns

Profiler being loaded, StructuredProfiler or UnstructuredProfiler

Return type

BaseProfiler
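The save and load pair follows a familiar round-trip pattern: an instance method writes state to disk, and a classmethod rebuilds an object from it. The sketch below shows that pattern with a stand-in class. The class TinyProfiler, its single attribute, and the use of pickle are assumptions; the source does not state the on-disk format.

```python
import os
import pickle
import tempfile


class TinyProfiler:
    """Hypothetical stand-in for a profiler; not part of the library."""

    def __init__(self, total_samples: int = 0) -> None:
        self.total_samples = total_samples

    def save(self, filepath: str) -> None:
        # Persist only plain attributes so the file is easy to reload.
        with open(filepath, "wb") as outfile:
            pickle.dump({"total_samples": self.total_samples}, outfile)

    @classmethod
    def load(cls, filepath: str) -> "TinyProfiler":
        with open(filepath, "rb") as infile:
            state = pickle.load(infile)
        return cls(**state)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "profile.pkl")
    TinyProfiler(total_samples=42).save(path)
    restored = TinyProfiler.load(path)

print(restored.total_samples)
```

If a pickle-based format is used, files should only be loaded from trusted sources, since unpickling can execute arbitrary code.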

class dataprofiler.profilers.profile_builder.UnstructuredProfiler(data: dataprofiler.data_readers.data.Data, samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.BaseOption] = None)

Bases: dataprofiler.profilers.profile_builder.BaseProfiler

For profiling unstructured data.

Instantiate the UnstructuredProfiler class.

Parameters
  • data (Data class object) – Data to be profiled

  • samples_per_update (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • options (ProfilerOptions Object) – Options for the profiler.

Returns

UnstructuredProfiler

diff(other_profile: dataprofiler.profilers.profile_builder.UnstructuredProfiler, options: Optional[Dict] = None) Dict

Find the difference between 2 unstructured profiles and return the report.

Parameters
  • other_profile (UnstructuredProfiler) – profile finding the difference with this one.

  • options (dict) – options to impact the results of the diff

Returns

difference of the profiles

Return type

dict

report(report_options: Optional[Dict] = None) Dict

Return unstructured report based on all profiled data fed into profiler.

User can specify the output_formats: (pretty, compact, serializable, flat).

  • Pretty: floats are rounded to four decimal places, and lists are shortened.

  • Compact: Similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.

  • Serializable: Output is JSON serializable and not prettified.

  • Flat: Nested output is returned as a flattened dictionary.

Variables

report_options – optional format changes to the report dict(output_format=<FORMAT>)

Returns

dictionary report

Return type

dict

save(filepath: Optional[str] = None) None

Save profiler to disk.

Parameters

filepath (String) – Path of file to save to

Returns

None

classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler

Load profiler from disk.

Parameters

filepath (String) – Path of file to load from

Returns

Profiler being loaded, StructuredProfiler or UnstructuredProfiler

Return type

BaseProfiler

property profile: dataprofiler.profilers.column_profile_compilers.BaseCompiler

Return the stored profiles for the given profiler.

Returns

BaseCompiler

update_profile(data: Union[dataprofiler.data_readers.base_data.BaseData, pandas.core.frame.DataFrame, pandas.core.series.Series], sample_size: Optional[int] = None, min_true_samples: Optional[int] = None) None

Update the profile for data provided.

User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.

Parameters
  • data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled

  • sample_size (int) – number of samples to profile from the data

  • min_true_samples (int) – minimum number of non-null samples to profile

Returns

None

class dataprofiler.profilers.profile_builder.StructuredProfiler(data: dataprofiler.data_readers.data.Data, samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.BaseOption] = None)

Bases: dataprofiler.profilers.profile_builder.BaseProfiler

For profiling structured data.

Instantiate the StructuredProfiler class.

Parameters
  • data (Data class object) – Data to be profiled

  • samples_per_update (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • options (ProfilerOptions Object) – Options for the profiler.

Returns

StructuredProfiler

diff(other_profile: dataprofiler.profilers.profile_builder.StructuredProfiler, options: Optional[Dict] = None) Dict

Find the difference between 2 Profiles and return the report.

Parameters
  • other_profile (StructuredProfiler) – profile finding the difference with this one

  • options (dict) – options to change results of the difference

Returns

difference of the profiles

Return type

dict

report(report_options: Optional[Dict] = None) Dict

Return a report.

save(filepath: Optional[str] = None) None

Save profiler to disk.

Parameters

filepath (String) – Path of file to save to

Returns

None

classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler

Load profiler from disk.

Parameters

filepath (String) – Path of file to load from

Returns

Profiler being loaded, StructuredProfiler or UnstructuredProfiler

Return type

BaseProfiler

property profile: dataprofiler.profilers.column_profile_compilers.BaseCompiler

Return the stored profiles for the given profiler.

Returns

BaseCompiler

update_profile(data: Union[dataprofiler.data_readers.base_data.BaseData, pandas.core.frame.DataFrame, pandas.core.series.Series], sample_size: Optional[int] = None, min_true_samples: Optional[int] = None) None

Update the profile for data provided.

User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.

Parameters
  • data (Union[data_readers.base_data.BaseData, pandas.DataFrame, pandas.Series]) – data to be profiled

  • sample_size (int) – number of samples to profile from the data

  • min_true_samples (int) – minimum number of non-null samples to profile

Returns

None

class dataprofiler.profilers.profile_builder.Profiler(data: dataprofiler.data_readers.data.Data, samples_per_update: Optional[int] = None, min_true_samples: int = 0, options: Optional[dataprofiler.profilers.profiler_options.ProfilerOptions] = None, profiler_type: Optional[str] = None)

Bases: object

For profiling data.

Instantiate Structured and Unstructured Profilers.

This is a factory class.

Parameters
  • data (Data class object) – Data to be profiled, type allowed depends on the profiler_type

  • samples_per_update (int) – Number of samples to use to generate profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • options (ProfilerOptions Object) – Options for the profiler.

  • profiler_type (str) – Type of Profiler ("graph"/"structured"/"unstructured")

Returns

Union[GraphProfiler, StructuredProfiler, UnstructuredProfiler]
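Because this is a factory class, constructing it yields an instance of a different class chosen from profiler_type. One common way to achieve that in Python is to override __new__, as sketched below. All class names here are hypothetical stand-ins, and the rule used to infer the type when profiler_type is omitted is an assumption for illustration.

```python
from typing import Any, Optional


class StructuredSketch:
    """Hypothetical stand-in for a structured profiler."""


class UnstructuredSketch:
    """Hypothetical stand-in for an unstructured profiler."""


class ProfilerFactory:
    """Hypothetical factory: calling it returns one of the classes above."""

    def __new__(cls, data: Any, profiler_type: Optional[str] = None):
        if profiler_type is None:
            # Assumed inference rule: a list of row dictionaries is tabular.
            is_table = isinstance(data, list) and all(
                isinstance(row, dict) for row in data
            )
            profiler_type = "structured" if is_table else "unstructured"
        if profiler_type == "structured":
            return StructuredSketch()
        if profiler_type == "unstructured":
            return UnstructuredSketch()
        raise ValueError(f"Unsupported profiler_type: {profiler_type!r}")


table_profiler = ProfilerFactory([{"a": 1}, {"a": 2}])
text_profiler = ProfilerFactory("free text")
print(type(table_profiler).__name__, type(text_profiler).__name__)
```

Since __new__ returns an object that is not an instance of ProfilerFactory, Python skips ProfilerFactory.__init__, so all setup belongs in the returned class.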

classmethod load(filepath: str) dataprofiler.profilers.profile_builder.BaseProfiler

Load profiler from disk.

Parameters

filepath (String) – Path of file to load from

Returns

Profiler being loaded, StructuredProfiler or UnstructuredProfiler

Return type

BaseProfiler