Profile Builder

Build a model for a dataset by identifying the type of each column along with its respective parameters.

class dataprofiler.profilers.profile_builder.StructuredDataProfile(df_series=None, sample_size=None, min_sample_size=5000, sampling_ratio=0.2, min_true_samples=None, sample_ids=None, pool=None, options=None)

Bases: object

Instantiate the Structured Profiler class for a given column.

Parameters
  • df_series (pandas.core.series.Series) – Data to be profiled

  • sample_size (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • sample_ids (list(list)) – Randomized list of sample indices

  • pool (multiprocessing.Pool) – pool utilized for multiprocessing

  • options (StructuredOptions Object) – Options for the structured profiler.
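The signature above also accepts min_sample_size and sampling_ratio, which the parameter list does not describe. The sketch below shows one plausible way such values could combine with sample_size and sample_ids. The helper names choose_sample_size and make_sample_ids are hypothetical, and the rule "take a ratio of the rows, but no fewer than the minimum" is an assumption for illustration, not confirmed library behaviour. It also uses a flat list of indices for simplicity, whereas the documented type of sample_ids is list(list).

```python
import random


def choose_sample_size(n_rows, sample_size=None,
                       min_sample_size=5000, sampling_ratio=0.2):
    """Hypothetical helper: decide how many rows of a column to sample."""
    if sample_size is None:
        # Assumed rule: a fixed ratio of the rows, but at least min_sample_size.
        sample_size = max(int(sampling_ratio * n_rows), min_sample_size)
    # Never request more rows than the column contains.
    return min(sample_size, n_rows)


def make_sample_ids(n_rows, seed=0):
    """Hypothetical helper: a shuffled list of row indices."""
    ids = list(range(n_rows))
    random.Random(seed).shuffle(ids)
    return ids


n_rows = 100_000
size = choose_sample_size(n_rows)
sample_ids = make_sample_ids(n_rows)[:size]
print(size, len(sample_ids))
```

With the defaults shown in the signature, a column smaller than min_sample_size would be sampled in full under this assumed rule.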

update_column_profilers(clean_sampled_df, pool)

Calculate type statistics and label the dataset.

Parameters
  • clean_sampled_df (pandas.core.series.Series) – Sampled series with null values dropped

  • pool (multiprocessing.Pool) – pool utilized for multiprocessing

property profile

update_profile(df_series, sample_size=None, min_true_samples=None, sample_ids=None, pool=None)

Update the column profiler

Parameters
  • df_series (pandas.core.series.Series) – Data to be profiled

  • sample_size (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • sample_ids (list(list)) – Randomized list of sample indices

  • pool (multiprocessing.Pool) – pool utilized for multiprocessing

static clean_data_and_get_base_stats(df_series, sample_size, min_true_samples=None, sample_ids=None)

Identify null values in the column, return them in a dictionary, and remove them from the column.

Parameters
  • df_series (pandas.core.series.Series) – a given column

  • sample_size (int) – Number of samples to use in generating the profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • sample_ids (list(list)) – Randomized list of sample indices

Returns

Updated column with nulls removed, and a dictionary of null parameters

Return type

pd.Series, dict
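As a rough illustration of the cleaning step described above, the following sketch splits a column into its non-null values and a dictionary recording where each kind of null was found. It uses a plain Python list in place of a pandas Series so that it runs without dependencies. The function name clean_and_get_base_stats, the NULL_STRINGS set, and the keys of the returned dictionary are assumptions made for this example; the library's actual null list and dictionary layout are not given in this section.

```python
from collections import defaultdict

# Assumed set of strings treated as null (compared case-insensitively).
NULL_STRINGS = {"", "nan", "none", "null"}


def clean_and_get_base_stats(values):
    """Return (non-null values, dict describing the nulls that were removed)."""
    null_types_index = defaultdict(list)
    cleaned = []
    for idx, value in enumerate(values):
        # Normalise each entry to a string token so nulls can be grouped by type.
        token = "None" if value is None else str(value).strip()
        if value is None or token.lower() in NULL_STRINGS:
            null_types_index[token].append(idx)
        else:
            cleaned.append(value)
    base_stats = {
        "sample_size": len(values),
        "null_count": len(values) - len(cleaned),
        "null_types_index": dict(null_types_index),
    }
    return cleaned, base_stats


column = ["12", "", "7", None, "NaN", "3"]
cleaned, stats = clean_and_get_base_stats(column)
print(cleaned)
print(stats)
```

Keeping the row indices of each null type, rather than only a count, makes it possible to report which rows were dropped.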

class dataprofiler.profilers.profile_builder.Profiler(data, samples_per_update=None, min_true_samples=0, profiler_options=None)

Bases: object

Instantiate the Profiler class

Parameters
  • data (Data class object) – Data to be profiled

  • samples_per_update (int) – Number of samples to use in generating profile

  • min_true_samples (int) – Minimum number of samples required for the profiler

  • profiler_options (ProfilerOptions Object) – Options for the profiler.

Returns

Profiler

property profile

report(report_options=None)

update_profile(data, sample_size=None, min_true_samples=None)

Update the profile for data provided. User can specify the sample size to profile the data with. Additionally, the user can specify the minimum number of non-null samples to profile.

Parameters
  • data (Union[data_readers.base_data.BaseData, pandas.DataFrame]) – data to be profiled

  • sample_size (int) – number of samples to profile from the data

  • min_true_samples (int) – minimum number of non-null samples to profile

Returns

None
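To make the idea of updating a profile in batches concrete, here is a minimal sketch of a profile that accumulates statistics across repeated update_profile calls. RunningColumnProfile is a hypothetical stand-in, not the library's class. Its reading of the two parameters (inspect the first sample_size rows, then keep reading until min_true_samples non-null values have been seen) is an assumption chosen for illustration.

```python
class RunningColumnProfile:
    """Hypothetical stand-in that accumulates statistics across batches."""

    def __init__(self):
        self.sample_size = 0
        self.null_count = 0
        self.total = 0.0

    def update_profile(self, values, sample_size=None, min_true_samples=None):
        # Assumed behaviour: read `sample_size` rows, then continue only
        # until `min_true_samples` non-null values have been collected.
        limit = len(values) if sample_size is None else min(sample_size, len(values))
        needed = min_true_samples or 0
        seen = non_null = 0
        for value in values:
            if seen >= limit and non_null >= needed:
                break
            seen += 1
            if value is None:
                self.null_count += 1
            else:
                non_null += 1
                self.total += value
        self.sample_size += seen

    @property
    def profile(self):
        non_null = self.sample_size - self.null_count
        return {
            "sample_size": self.sample_size,
            "null_count": self.null_count,
            "mean": self.total / non_null if non_null else None,
        }


col = RunningColumnProfile()
col.update_profile([1.0, None, 3.0])
col.update_profile([None, None, 5.0, 7.0], sample_size=2, min_true_samples=1)
print(col.profile)
```

In the second call the first two rows are both null, so the sketch reads one more row to satisfy min_true_samples before stopping.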

save(filepath=None)

Save profiler to disk

Parameters

filepath (String) – Path of file to save to

Returns

None

static load(filepath)

Load profiler from disk

Parameters

filepath (String) – Path of file to load from

Returns

Profiler loaded from the file
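The save and load pair can be pictured as a serialization round trip. The sketch below uses pickle from the standard library. Whether the library itself serializes with pickle is not stated in this section, so treat that as an assumption. TinyProfiler and the default file name profile.pkl are hypothetical, and the stand-in's load returns the restored object as a design choice of this sketch.

```python
import os
import pickle
import tempfile


class TinyProfiler:
    """Hypothetical stand-in with the same save/load shape as the methods above."""

    def __init__(self, profile=None):
        self.profile = profile or {}

    def save(self, filepath=None):
        # Assumed default file name; the real default is not given here.
        if filepath is None:
            filepath = "profile.pkl"
        with open(filepath, "wb") as outfile:
            pickle.dump(self.profile, outfile)

    @staticmethod
    def load(filepath):
        # Static, so it can be called without an existing instance.
        with open(filepath, "rb") as infile:
            return TinyProfiler(pickle.load(infile))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profile.pkl")
    TinyProfiler({"row_count": 3}).save(path)
    restored = TinyProfiler.load(path)

print(restored.profile)
```

Because pickle can execute arbitrary code while loading, a file saved this way should only be loaded from a trusted source.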