Profiler Utils

Contains functions for profilers.

dataprofiler.profilers.profiler_utils.recursive_dict_update(d: dict, update_d: dict) → dict

Recursively update nested dictionaries, updating d with update_d.

Parameters
  • d – dict which gets updated with update_d

  • update_d – dict to update d with

Returns

updated dict
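
A minimal sketch of the recursive update described above. It assumes that nested dicts are merged key by key, that any other value in update_d overwrites the one in d, and that d is modified in place. The name recursive_dict_update_sketch is hypothetical; this is an illustration, not the library's implementation.

```python
def recursive_dict_update_sketch(d: dict, update_d: dict) -> dict:
    """Update d in place with update_d, descending into nested dicts."""
    for key, value in update_d.items():
        if isinstance(value, dict) and isinstance(d.get(key), dict):
            # Both sides hold a dict: merge them instead of overwriting.
            recursive_dict_update_sketch(d[key], value)
        else:
            # Otherwise the value from update_d wins.
            d[key] = value
    return d


options = {"stats": {"min": True, "max": True}, "name": "col"}
recursive_dict_update_sketch(options, {"stats": {"max": False}})
print(options)
```

Note that a plain dict.update() would replace the whole "stats" entry, dropping the "min" key; the recursive version keeps it.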

class dataprofiler.profilers.profiler_utils.KeyDict

Bases: collections.defaultdict

Helper class for shuffle_in_chunks.

Allows keys that are missing to become the values for that key. From: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/

clear() → None. Remove all items from D.
copy() → a shallow copy of D.
default_factory

Factory for default value called by __missing__().

fromkeys(value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D’s items
keys() → a set-like object providing a view on D’s keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, default is returned if given, otherwise KeyError is raised

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values() → an object providing a view on D’s values

dataprofiler.profilers.profiler_utils.shuffle_in_chunks(data_length: int, chunk_size: int) → Generator[list, None, Any]

Create shuffled indexes in chunks.

This avoids the cost of creating all indexes up front; only the indexes that are actually needed are generated. Initial code idea from: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/

Parameters
  • data_length – length of data to be shuffled

  • chunk_size – size of shuffled chunks

Returns

list of shuffled indices of chunk size
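
The lazy shuffle can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it uses the standard-library random module, a hypothetical seed parameter, and the hypothetical names KeyDictSketch and shuffle_in_chunks_sketch. The dictionary returns a missing key as its own value, so an index only occupies memory once it has been swapped.

```python
import random
from collections import defaultdict


class KeyDictSketch(defaultdict):
    """Dictionary whose missing keys evaluate to the key itself."""

    def __missing__(self, key):
        # The key is returned but not stored, so untouched indexes cost nothing.
        return key


def shuffle_in_chunks_sketch(data_length, chunk_size, seed=None):
    """Yield lists of shuffled indexes, chunk_size at a time."""
    if data_length <= 0 or chunk_size <= 0:
        return
    rng = random.Random(seed)
    indices = KeyDictSketch()
    j = 0
    while j < data_length:
        size = min(chunk_size, data_length - j)
        chunk = []
        for _ in range(size):
            # Fisher-Yates step: swap position j with a random position >= j.
            k = rng.randrange(j, data_length)
            indices[j], indices[k] = indices[k], indices[j]
            chunk.append(indices[j])
            j += 1
        yield chunk


chunks = list(shuffle_in_chunks_sketch(10, 4, seed=0))
print(chunks)
```

Consuming only the first chunk performs only chunk_size swaps, which is the saving the description refers to.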

dataprofiler.profilers.profiler_utils.warn_on_profile(col_profile: str, e: Exception) → None

Issue a warning if a given profile raises an error (typically from TensorFlow).

Parameters
  • col_profile (str) – Name of the column profile

  • e (Exception) – Error message from profiler error

dataprofiler.profilers.profiler_utils.partition(data: list, chunk_size: int) → Generator[list, None, Any]

Create a generator which returns data in specified chunk size.

Parameters
  • data (list, dataframe, etc.) – data to partition

  • chunk_size (int) – size of partition to return
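
A minimal sketch of such a generator, assuming the data supports len() and slicing. The name partition_sketch is hypothetical.

```python
def partition_sketch(data, chunk_size):
    """Yield consecutive slices of data, each at most chunk_size long."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


parts = list(partition_sketch([1, 2, 3, 4, 5], 2))
print(parts)
```

The last slice is shorter when the length of the data is not a multiple of chunk_size.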

dataprofiler.profilers.profiler_utils.auto_multiprocess_toggle(data: pandas.core.frame.DataFrame, num_rows_threshold: int = 750000, num_cols_threshold: int = 20) → bool

Automate multiprocessing toggle depending on dataset sizes.

Parameters
  • data (pandas.DataFrame) – a dataset

  • num_rows_threshold (int) – threshold for number of rows to use multiprocess

  • num_cols_threshold (int) – threshold for number of columns to use multiprocess

Returns

recommended option.multiprocess.is_enabled value

Return type

bool

dataprofiler.profilers.profiler_utils.suggest_pool_size(data_size: int = None, cols: int = None) → int | None

Suggest the pool size based on resources.

Parameters
  • data_size (int) – size of the dataset

  • cols (int) – columns of the dataset

Returns

suggested pool size

Return type

int

dataprofiler.profilers.profiler_utils.generate_pool(max_pool_size: int = None, data_size: int = None, cols: int = None) → tuple[Pool | None, int | None]

Generate a multiprocessing pool to allocate functions to.

Parameters
  • max_pool_size (Union[int, None]) – Max number of processes assigned to the pool

  • data_size (int) – size of the dataset

  • cols (int) – columns of the dataset

Returns
  • pool – Multiprocessing pool to allocate processes to

  • cpu_count – Number of processes (CPU bound) to utilize

Return type

tuple of (multiprocessing.Pool, int)

dataprofiler.profilers.profiler_utils.overlap(x1: int | Any, x2: int | Any, y1: int | Any, y2: int | Any) → bool

Return True iff [x1:x2] overlaps with [y1:y2].
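
One way to express this check, sketched under the assumptions that both ranges are closed (endpoints included) and that x1 <= x2 and y1 <= y2. The name overlap_sketch is hypothetical, and the library may treat endpoints or non-integer inputs differently.

```python
def overlap_sketch(x1, x2, y1, y2):
    """Return True if the closed ranges [x1, x2] and [y1, y2] share a point."""
    # Two ranges overlap exactly when each one starts before the other ends.
    return x1 <= y2 and y1 <= x2


print(overlap_sketch(0, 5, 3, 8))  # ranges share 3..5
print(overlap_sketch(0, 2, 3, 4))  # ranges are disjoint
```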

dataprofiler.profilers.profiler_utils.add_nested_dictionaries(first_dict: dict, second_dict: dict) → dict

Merge two dictionaries, adding together the values of keys they share.

Parameters
  • first_dict (dict) – dictionary to be merged

  • second_dict (dict) – dictionary to be merged

Returns

merged dictionary
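
A minimal sketch of this kind of merge. It assumes that values under a shared key are either both dicts (merged recursively) or both support +, and that keys present in only one input are copied unchanged. The name add_nested_dictionaries_sketch is hypothetical.

```python
def add_nested_dictionaries_sketch(first_dict: dict, second_dict: dict) -> dict:
    """Return a new dict that sums the values of shared keys."""
    merged = {}
    for key, value in first_dict.items():
        if key not in second_dict:
            merged[key] = value
        elif isinstance(value, dict):
            # Shared nested dicts are merged recursively.
            merged[key] = add_nested_dictionaries_sketch(value, second_dict[key])
        else:
            merged[key] = value + second_dict[key]
    for key, value in second_dict.items():
        if key not in first_dict:
            merged[key] = value
    return merged


counts = add_nested_dictionaries_sketch(
    {"a": 1, "nested": {"x": 2}},
    {"a": 4, "nested": {"x": 3, "y": 1}, "b": 7},
)
print(counts)
```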

dataprofiler.profilers.profiler_utils.biased_skew(df_series: pandas.core.series.Series) → numpy.float64

Calculate the biased estimator for skewness of the given data.

The definition is formalized as g_1 here:

https://en.wikipedia.org/wiki/Skewness#Sample_skewness

Parameters

df_series (pandas Series) – data to get skewness of, assuming floats

Returns

biased skewness

Return type

np.float64
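
For reference, g_1 is the third central moment divided by the second central moment raised to the power 3/2, with both moments computed using a divisor of n. The sketch below works on a plain list of floats rather than a pandas Series and assumes the data is not constant; the name biased_skew_sketch is hypothetical.

```python
def biased_skew_sketch(values):
    """Return g_1 = m3 / m2**1.5 using biased (divide-by-n) central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


print(biased_skew_sketch([1.0, 2.0, 3.0]))       # symmetric data
print(biased_skew_sketch([0.0, 0.0, 0.0, 4.0]))  # long right tail
```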

dataprofiler.profilers.profiler_utils.biased_kurt(df_series: pandas.core.series.Series) → numpy.float64

Calculate the biased estimator for kurtosis of the given data.

The definition is formalized as g_2 here:

https://en.wikipedia.org/wiki/Kurtosis#A_natural_but_biased_estimator

Parameters

df_series (pandas Series) – data to get kurtosis of, assuming floats

Returns

biased kurtosis

Return type

np.float64
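
For reference, g_2 is the fourth central moment divided by the squared second central moment, minus 3 (the excess kurtosis), with both moments computed using a divisor of n. The sketch below works on a plain list of floats rather than a pandas Series and assumes the data is not constant; the name biased_kurt_sketch is hypothetical.

```python
def biased_kurt_sketch(values):
    """Return g_2 = m4 / m2**2 - 3 using biased (divide-by-n) central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


print(biased_kurt_sketch([0.0, 0.0, 0.0, 4.0]))
```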

class dataprofiler.profilers.profiler_utils.Subtractable(*args, **kwargs)

Bases: Protocol

Protocol for annotating subtractable types.

dataprofiler.profilers.profiler_utils.find_diff_of_numbers(stat1: int | float | np.float64 | np.int64 | None, stat2: int | float | np.float64 | np.int64 | None) → Any
dataprofiler.profilers.profiler_utils.find_diff_of_numbers(stat1: T | None, stat2: T | None) → Any

Find the difference between two stats.

If there is no difference, return “unchanged”. For ints/floats, return stat1 - stat2.

Parameters
  • stat1 (Union[int, float, np.float64, np.int64, None]) – the first statistical input

  • stat2 (Union[int, float, np.float64, np.int64, None]) – the second statistical input

Returns

the difference of the stats
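
The documented behavior can be sketched as below. The sketch assumes both inputs are numbers; how None is handled is not described here, so it is left out. The name find_diff_of_numbers_sketch is hypothetical.

```python
def find_diff_of_numbers_sketch(stat1, stat2):
    """Return "unchanged" when the stats are equal, else stat1 - stat2."""
    diff = stat1 - stat2
    return "unchanged" if diff == 0 else diff


print(find_diff_of_numbers_sketch(5, 5))
print(find_diff_of_numbers_sketch(3, 10))
```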

dataprofiler.profilers.profiler_utils.find_diff_of_strings_and_bools(stat1: str | bool | None, stat2: str | bool | None) → list[str | bool | None] | str

Find the difference between two stats.

If there is no difference, return “unchanged”. For strings and bools, return list containing [stat1, stat2].

Parameters
  • stat1 (Union[str, bool]) – the first statistical input

  • stat2 (Union[str, bool]) – the second statistical input

Returns

the difference of the stats

dataprofiler.profilers.profiler_utils.find_diff_of_lists_and_sets(stat1: list | set | None, stat2: list | set | None) → list[list | set | None] | str

Find the difference between two stats.

If there is no difference, return “unchanged”. Otherwise, remove duplicates and return [unique values of stat1, shared values, unique values of stat2].

Parameters
  • stat1 (Union[list, set]) – the first statistical input

  • stat2 (Union[list, set]) – the second statistical input

Returns

the difference of the stats
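
A sketch of the three-part result using set operations. It assumes both inputs are non-empty collections of hashable, sortable values, compares them after removing duplicates, and sorts each part to make the output deterministic; the library may preserve a different ordering. The name find_diff_of_lists_and_sets_sketch is hypothetical.

```python
def find_diff_of_lists_and_sets_sketch(stat1, stat2):
    """Return "unchanged" or [only in stat1, shared, only in stat2]."""
    set1, set2 = set(stat1), set(stat2)
    if set1 == set2:
        return "unchanged"
    return [sorted(set1 - set2), sorted(set1 & set2), sorted(set2 - set1)]


print(find_diff_of_lists_and_sets_sketch([1, 2, 2, 3], [3, 4]))
```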

dataprofiler.profilers.profiler_utils.find_diff_of_dates(stat1: datetime.datetime | None, stat2: datetime.datetime | None) → list | str | None

Find the difference between two dates.

If there is no difference, return “unchanged”. For dates, return the difference in time.

Because only days can be stored as negative values internally for timedelta objects, the output for these negative values is less readable due to the combination of signs in the default output. This returns a readable output for timedelta that accounts for potential negative differences.

Parameters
  • stat1 (datetime.datetime object) – the first statistical input

  • stat2 (datetime.datetime object) – the second statistical input

Returns

difference in stats

Return type

Union[List, str]
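
The readability problem with negative timedelta values can be seen directly in the standard library. The sketch below assumes both inputs are datetime objects and formats the result with an explicit sign; the exact output format used by the library is an assumption, and the name readable_date_diff_sketch is hypothetical.

```python
from datetime import datetime, timedelta


def readable_date_diff_sketch(stat1, stat2):
    """Return "unchanged" or a signed, human-readable time difference."""
    diff = stat1 - stat2
    if diff == timedelta(0):
        return "unchanged"
    if diff < timedelta(0):
        # Format the magnitude and add the sign ourselves.
        return "-" + str(abs(diff))
    return "+" + str(diff)


earlier = datetime(2024, 1, 1, 12)
later = datetime(2024, 1, 1, 13)
print(str(earlier - later))                       # default: "-1 day, 23:00:00"
print(readable_date_diff_sketch(earlier, later))  # "-1:00:00"
```

The default string mixes a negative day count with a positive time of day, which is why the magnitude is formatted separately from the sign.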

dataprofiler.profilers.profiler_utils.find_diff_of_dicts(dict1: dict | None, dict2: dict | None) → dict | str

Find the difference between two dicts.

For each key in each dict, return “unchanged” if there’s no difference, otherwise return the difference. Assume that if the two dictionaries share the same key, their values are the same type.

Parameters
  • dict1 (dict) – the first dict

  • dict2 (dict) – the second dict

Returns

Difference in the keys of each dict

Return type

dict

dataprofiler.profilers.profiler_utils.find_diff_of_matrices(matrix1: np.ndarray | None, matrix2: np.ndarray | None) → np.ndarray | str | None

Find the difference between two matrices.

Parameters
  • matrix1 (list(list(float))) – the first matrix

  • matrix2 (list(list(float))) – the second matrix

Returns

Difference in the matrix

Return type

list(list(float))

dataprofiler.profilers.profiler_utils.find_diff_of_dicts_with_diff_keys(dict1: dict | None, dict2: dict | None) → list[dict] | str

Find the difference between two dicts.

For each key in each dict, return “unchanged” if there’s no difference, otherwise return the difference. Assume that if the two dictionaries share the same key, their values are the same type.

Parameters
  • dict1 (dict) – the first dict

  • dict2 (dict) – the second dict

Returns

Difference in the keys of each dict

Return type

list

dataprofiler.profilers.profiler_utils.get_memory_size(data: list | np.ndarray | DataFrame, unit: str = 'M') → float

Get memory size of the input data.

Parameters
  • data (Union[list, numpy.array, pandas.DataFrame]) – list or array of data

  • unit (string) – memory size unit (B, K, M, or G)

Returns

memory size of the input data

dataprofiler.profilers.profiler_utils.method_timeit(method: Optional[Callable] = None, name: Optional[str] = None) → Callable

Measure execution time of provided method.

Record time into times dictionary.

Parameters
  • method (Callable) – method to time

  • name (str) – key argument for the times dictionary
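
A sketch of a timing decorator of this shape. It assumes the decorated method belongs to an object that exposes a times dictionary, and that the method name is used as the key when name is not given. The names method_timeit_sketch and ToyProfiler are hypothetical.

```python
import functools
import time
from collections import defaultdict


def method_timeit_sketch(method=None, name=None):
    """Decorator that adds a method's run time to self.times."""

    def decorator(func):
        key = name if name is not None else func.__name__

        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            start = time.perf_counter()
            result = func(self, *args, **kwargs)
            # Accumulate, so repeated calls add up under the same key.
            self.times[key] += time.perf_counter() - start
            return result

        return wrapper

    # Support both @method_timeit_sketch and @method_timeit_sketch(name=...).
    if callable(method):
        return decorator(method)
    return decorator


class ToyProfiler:
    def __init__(self):
        self.times = defaultdict(float)

    @method_timeit_sketch(name="sum")
    def total(self, values):
        return sum(values)

    @method_timeit_sketch
    def count(self, values):
        return len(values)


profiler = ToyProfiler()
profiler.total([1, 2, 3])
profiler.count([1, 2, 3])
print(sorted(profiler.times))
```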

dataprofiler.profilers.profiler_utils.perform_chi_squared_test_for_homogeneity(categories1: dict, sample_size1: int, categories2: dict, sample_size2: int) → dict[str, int | float | None]

Perform a Chi Squared test for homogeneity between two groups.

Parameters
  • categories1 (dict) – Categories and respective counts of the first group

  • sample_size1 (int) – Number of samples in first group

  • categories2 (dict) – Categories and respective counts of the second group

  • sample_size2 (int) – Number of samples in second group

Returns

Results of the chi squared test

Return type

dict
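
The test statistic itself can be sketched with the standard library alone. The sketch assumes both sample sizes are positive, treats a category missing from one group as a count of zero, and omits the p-value, which would need a chi-squared distribution function. The function name and the result keys "statistic" and "df" are hypothetical and may not match the keys the library returns.

```python
def chi_squared_homogeneity_sketch(categories1, sample_size1, categories2, sample_size2):
    """Return the chi-squared statistic and degrees of freedom for two groups."""
    total = sample_size1 + sample_size2
    statistic = 0.0
    used_categories = 0
    for category in set(categories1) | set(categories2):
        observed1 = categories1.get(category, 0)
        observed2 = categories2.get(category, 0)
        column_total = observed1 + observed2
        if column_total == 0:
            continue  # a category nobody has carries no information
        used_categories += 1
        for observed, size in ((observed1, sample_size1), (observed2, sample_size2)):
            # Expected count if both groups shared one distribution.
            expected = size * column_total / total
            statistic += (observed - expected) ** 2 / expected
    # (rows - 1) * (columns - 1) with two rows (groups).
    return {"statistic": statistic, "df": used_categories - 1}


result = chi_squared_homogeneity_sketch({"a": 30, "b": 10}, 40, {"a": 10, "b": 30}, 40)
print(result)
```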

dataprofiler.profilers.profiler_utils.chunk(lst: list, size: int) → Iterator[tuple]

Split a list into chunks of the given size.

Parameters
  • lst (list) – List to chunk

  • size (int) – Size of each chunk

Returns

Iterator that produces tuples of each chunk

Return type

Iterator[Tuple]
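
One common way to build such an iterator uses itertools.islice. This is a sketch assuming size is a positive integer; the name chunk_sketch is hypothetical.

```python
from itertools import islice


def chunk_sketch(lst, size):
    """Return an iterator of tuples, each holding up to size items of lst."""
    iterator = iter(lst)
    # Keep taking size items until islice yields an empty tuple.
    return iter(lambda: tuple(islice(iterator, size)), ())


pieces = list(chunk_sketch([1, 2, 3, 4, 5], 2))
print(pieces)
```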

dataprofiler.profilers.profiler_utils.merge(top_profile: profile_builder.BaseProfiler, other_profile: profile_builder.BaseProfiler = None) → profile_builder.BaseProfiler

Merge two Profiles.

Parameters
  • top_profile (Profile) – First profile

  • other_profile (Profile) – Second profile

Returns

Merge of two profile objects

Return type

Profile

dataprofiler.profilers.profiler_utils.merge_profile_list(list_of_profiles: list[profile_builder.BaseProfiler], pool_count: int = 5) → profile_builder.BaseProfiler

Merge list of profiles into a single profile.

Parameters
  • list_of_profiles (list) – list of profiles to merge

  • pool_count (int) – number of processes in the pool used to merge the profiles

Returns

Single profile that is the merge of all profiles in the list_of_profiles list.

Return type

Profile

dataprofiler.profilers.profiler_utils.reload_labeler_from_options_or_get_new(data_labeler_load_attr: dict, config: dict | None = None) → BaseDataLabeler | None

Load a data labeler if data_labeler_load_attr requires one, reusing an existing labeler from config when possible.

Parameters
  • data_labeler_load_attr (dict[string, dict]) – dictionary with attributes and values.

  • config (dict[string, dict]) – config for loading classes to reuse an existing labeler

Returns

Data labeler, either newly loaded or reused from config.

Return type

BaseDataLabeler or None