dataprofiler.profilers.profiler_utils module¶
Contains functions for profilers.
- dataprofiler.profilers.profiler_utils.recursive_dict_update(d: dict, update_d: dict) dict ¶
Recursively update nested dictionaries, updating d with update_d.
- Parameters:
d – dict which gets updated with update_d
update_d – dict to update d with
- Returns:
updated dict
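A minimal usage sketch of the expected behavior; the result shown assumes nested dicts are merged key-by-key rather than replaced wholesale, so "x" survives the update:
>>> from dataprofiler.profilers import profiler_utils
>>> d = {"a": {"x": 1, "y": 2}, "b": 3}
>>> profiler_utils.recursive_dict_update(d, {"a": {"y": 20}, "c": 4})
{'a': {'x': 1, 'y': 20}, 'b': 3, 'c': 4}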
- class dataprofiler.profilers.profiler_utils.KeyDict¶
Bases:
defaultdict
Helper class for shuffle_in_chunks.
Allows missing keys to act as their own values. From: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/
- clear() None. Remove all items from D. ¶
- copy() a shallow copy of D. ¶
- default_factory¶
Factory for default value called by __missing__().
- fromkeys(value=None, /)¶
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)¶
Return the value for key if key is in the dictionary, else default.
- items() a set-like object providing a view on D's items ¶
- keys() a set-like object providing a view on D's keys ¶
- pop(k[, d]) v, remove specified key and return the corresponding value. ¶
If key is not found, default is returned if given, otherwise KeyError is raised
- popitem()¶
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- setdefault(key, default=None, /)¶
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update([E, ]**F) None. Update D from dict/iterable E and F. ¶
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k].
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v.
In either case, this is followed by: for k in F: D[k] = F[k].
- values() an object providing a view on D's values ¶
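An illustrative sketch of the behavior that makes the lazy shuffle work: a key that was never assigned evaluates to itself, so a KeyDict can stand in for an identity permutation that is only materialized where entries have been swapped:
>>> from dataprofiler.profilers import profiler_utils
>>> kd = profiler_utils.KeyDict()
>>> kd[3] = 7   # an explicitly swapped entry
>>> kd[3]
7
>>> kd[5]       # a missing key acts as its own value
5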
- dataprofiler.profilers.profiler_utils.shuffle_in_chunks(data_length: int, chunk_size: int) Generator[list[int], None, Any] ¶
Create shuffled indexes in chunks.
This avoids the cost of creating all indexes up front by generating only what is needed. Initial code idea from: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/
- Parameters:
data_length – length of data to be shuffled
chunk_size – size of shuffled chunks
- Returns:
list of at most chunk_size shuffled indices
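A usage sketch: consumed to exhaustion, the chunks together should form a permutation of range(data_length), each chunk holding at most chunk_size indices:
>>> from dataprofiler.profilers import profiler_utils
>>> indices = []
>>> for chunk_ixs in profiler_utils.shuffle_in_chunks(data_length=10, chunk_size=4):
...     indices.extend(chunk_ixs)   # e.g. [7, 2, 9, 0], then [4, 1, 3, 8], then [6, 5]
>>> sorted(indices)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]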
- dataprofiler.profilers.profiler_utils.warn_on_profile(col_profile: str, e: Exception) None ¶
Emit a warning if a given profile errors (typically TensorFlow).
- Parameters:
col_profile (str) – Name of the column profile
e (Exception) – Error message from profiler error
- dataprofiler.profilers.profiler_utils.partition(data: list, chunk_size: int) Generator[list, None, Any] ¶
Create a generator that yields data in chunks of the specified size.
- Parameters:
data (list, dataframe, etc.) – data to partition
chunk_size (int) – size of partition to return
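A usage sketch; it is assumed here that the final chunk is simply shorter when chunk_size does not divide the data evenly:
>>> from dataprofiler.profilers import profiler_utils
>>> list(profiler_utils.partition(list(range(7)), chunk_size=3))
[[0, 1, 2], [3, 4, 5], [6]]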
- dataprofiler.profilers.profiler_utils.auto_multiprocess_toggle(data: DataFrame, num_rows_threshold: int = 750000, num_cols_threshold: int = 20) bool ¶
Automate multiprocessing toggle depending on dataset sizes.
- Parameters:
data (pandas.DataFrame) – a dataset
num_rows_threshold (int) – threshold for number of rows to use multiprocess
num_cols_threshold (int) – threshold for number of columns to use multiprocess
- Returns:
recommended option.multiprocess.is_enabled value
- Return type:
bool
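A usage sketch, assuming the toggle recommends multiprocessing only once a threshold is crossed:
>>> import pandas as pd
>>> from dataprofiler.profilers import profiler_utils
>>> small_df = pd.DataFrame({"a": range(100)})
>>> profiler_utils.auto_multiprocess_toggle(small_df)   # well under both thresholds
False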
- dataprofiler.profilers.profiler_utils.suggest_pool_size(data_size: int = None, cols: int = None) int | None ¶
Suggest the pool size based on resources.
- Parameters:
data_size (int) – size of the dataset
cols (int) – columns of the dataset
- Returns:
suggested pool size
- Return type:
int
- dataprofiler.profilers.profiler_utils.generate_pool(max_pool_size: int = None, data_size: int = None, cols: int = None) tuple[Pool | None, int | None] ¶
Generate a multiprocessing pool to allocate functions to.
- Parameters:
max_pool_size (Union[int, None]) – Max number of processes assigned to the pool
data_size (int) – size of the dataset
cols (int) – columns of the dataset
- Return pool:
Multiprocessing pool to allocate processes to
- Rtype pool:
multiprocessing.Pool
- Return cpu_count:
Number of processes (cpu bound) to utilize
- Rtype cpu_count:
int
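A usage sketch based on the documented return tuple; the pool may be None, so callers should guard before using it:
>>> from dataprofiler.profilers import profiler_utils
>>> pool, cpu_count = profiler_utils.generate_pool(max_pool_size=4)
>>> if pool is not None:
...     pool.close()   # release the worker processes when done
...     pool.join()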
- dataprofiler.profilers.profiler_utils.overlap(x1: int | Any, x2: int | Any, y1: int | Any, y2: int | Any) bool ¶
Return True iff [x1:x2] overlaps with [y1:y2].
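For example:
>>> from dataprofiler.profilers import profiler_utils
>>> profiler_utils.overlap(0, 5, 3, 8)   # the ranges share 3..5
True
>>> profiler_utils.overlap(0, 2, 5, 8)   # disjoint ranges
False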
- dataprofiler.profilers.profiler_utils.add_nested_dictionaries(first_dict: dict, second_dict: dict) dict ¶
Merge two dictionaries together and add values together.
- Parameters:
first_dict (dict) – dictionary to be merged
second_dict (dict) – dictionary to be merged
- Returns:
merged dictionary
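A minimal sketch of the expected additive merge, inferred from the description above: shared numeric leaves are summed and nested dicts are merged recursively:
>>> from dataprofiler.profilers import profiler_utils
>>> profiler_utils.add_nested_dictionaries(
...     {"a": 1, "b": {"x": 2}},
...     {"a": 10, "b": {"x": 5, "y": 1}})
{'a': 11, 'b': {'x': 7, 'y': 1}}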
- dataprofiler.profilers.profiler_utils.biased_skew(df_series: Series) float64 ¶
Calculate the biased estimator for skewness of the given data.
- The definition is formalized as g_1 here:
https://en.wikipedia.org/wiki/Skewness#Sample_skewness
- Parameters:
df_series (pandas Series) – data to get skewness of, assuming floats
- Returns:
biased skewness
- Return type:
np.float64
- dataprofiler.profilers.profiler_utils.biased_kurt(df_series: Series) float64 ¶
Calculate the biased estimator for kurtosis of the given data.
- The definition is formalized as g_2 here:
https://en.wikipedia.org/wiki/Kurtosis#A_natural_but_biased_estimator
- Parameters:
df_series (pandas Series) – data to get kurtosis of, assuming floats
- Returns:
biased kurtosis
- Return type:
np.float64
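Both estimators follow the standard biased moment formulas, g_1 = m_3 / m_2^(3/2) and g_2 = m_4 / m_2^2 - 3, where m_k is the k-th central sample moment. A sketch computing them by hand, which should agree with the helpers up to floating-point error:
>>> import numpy as np
>>> import pandas as pd
>>> from dataprofiler.profilers import profiler_utils
>>> x = np.array([1.0, 2.0, 2.0, 3.0, 9.0])
>>> d = x - x.mean()
>>> m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
>>> g1 = m3 / m2**1.5    # biased skewness, g_1
>>> g2 = m4 / m2**2 - 3  # biased excess kurtosis, g_2
>>> bool(np.isclose(g1, profiler_utils.biased_skew(pd.Series(x))))
True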
- class dataprofiler.profilers.profiler_utils.Subtractable(*args, **kwargs)¶
Bases:
Protocol
Protocol for annotating subtractable types.
- dataprofiler.profilers.profiler_utils.find_diff_of_numbers(stat1: int | float | np.float64 | np.int64 | None, stat2: int | float | np.float64 | np.int64 | None) Any ¶
- dataprofiler.profilers.profiler_utils.find_diff_of_numbers(stat1: T | None, stat2: T | None) Any
Find the difference between two stats.
If there is no difference, return “unchanged”. For ints/floats, returns stat1 - stat2.
- Parameters:
stat1 (Union[int, float, np.float64, np.int64, None]) – the first statistical input
stat2 (Union[int, float, np.float64, np.int64, None]) – the second statistical input
- Returns:
the difference of the stats
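For example, per the behavior described above:
>>> from dataprofiler.profilers import profiler_utils
>>> profiler_utils.find_diff_of_numbers(5, 3)
2
>>> profiler_utils.find_diff_of_numbers(4, 4)
'unchanged'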
- dataprofiler.profilers.profiler_utils.find_diff_of_strings_and_bools(stat1: str | bool | None, stat2: str | bool | None) list[str | bool | None] | str ¶
Find the difference between two stats.
If there is no difference, return “unchanged”. For strings and bools, return list containing [stat1, stat2].
- Parameters:
stat1 (Union[str, bool]) – the first statistical input
stat2 (Union[str, bool]) – the second statistical input
- Returns:
the difference of the stats
- dataprofiler.profilers.profiler_utils.find_diff_of_lists_and_sets(stat1: list | set | None, stat2: list | set | None) list[list | set | None] | str ¶
Find the difference between two stats.
If there is no difference, return “unchanged”. Remove duplicates and return [unique values of stat1, shared values, unique values of stat2].
- Parameters:
stat1 (Union[list, set]) – the first statistical input
stat2 (Union[list, set]) – the second statistical input
- Returns:
the difference of the stats
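An illustrative sketch; since duplicates are removed via sets, the ordering inside each sublist is not guaranteed:
>>> from dataprofiler.profilers import profiler_utils
>>> profiler_utils.find_diff_of_lists_and_sets([1, 2, 3], [2, 3, 4])
[[1], [2, 3], [4]]
>>> profiler_utils.find_diff_of_lists_and_sets([1, 2], [2, 1])
'unchanged'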
- dataprofiler.profilers.profiler_utils.find_diff_of_dates(stat1: datetime.datetime | None, stat2: datetime.datetime | None) list | str | None ¶
Find the difference between two dates.
If there is no difference, return “unchanged”. For dates, return the difference in time.
Because timedelta objects store negative values only in the days field, the default string output for a negative difference mixes signs and is hard to read. This returns a readable representation of the timedelta that accounts for potential negative differences.
- Parameters:
stat1 (datetime.datetime object) – the first statistical input
stat2 (datetime.datetime object) – the second statistical input
- Returns:
difference in stats
- Return type:
Union[List, str]
- dataprofiler.profilers.profiler_utils.find_diff_of_dicts(dict1: dict | None, dict2: dict | None) dict | str ¶
Find the difference between two dicts.
For each key in each dict, return “unchanged” if there’s no difference, otherwise return the difference. Assume that if the two dictionaries share the same key, their values are the same type.
- Parameters:
dict1 (dict) – the first dict
dict2 (dict) – the second dict
- Returns:
Difference in the keys of each dict
- Return type:
dict
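An illustrative sketch of the per-key behavior, with each value pair diffed like find_diff_of_numbers above:
>>> from dataprofiler.profilers import profiler_utils
>>> profiler_utils.find_diff_of_dicts(
...     {"count": 10, "name": "a"},
...     {"count": 12, "name": "a"})
{'count': -2, 'name': 'unchanged'}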
- dataprofiler.profilers.profiler_utils.find_diff_of_matrices(matrix1: np.ndarray | None, matrix2: np.ndarray | None) np.ndarray | str | None ¶
Find the difference between two matrices.
- Parameters:
matrix1 (list(list(float))) – the first matrix
matrix2 (list(list(float))) – the second matrix
- Returns:
Difference in the matrix
- Return type:
list(list(float))
- dataprofiler.profilers.profiler_utils.find_diff_of_dicts_with_diff_keys(dict1: dict | None, dict2: dict | None) list[dict] | str ¶
Find the difference between two dicts.
For each key in each dict, return “unchanged” if there’s no difference, otherwise return the difference. Assume that if the two dictionaries share the same key, their values are the same type.
- Parameters:
dict1 (dict) – the first dict
dict2 (dict) – the second dict
- Returns:
Difference in the keys of each dict
- Return type:
list
- dataprofiler.profilers.profiler_utils.get_memory_size(data: list | np.ndarray | DataFrame, unit: str = 'M') float ¶
Get memory size of the input data.
- Parameters:
data (Union[list, numpy.array, pandas.DataFrame]) – list or array of data
unit (string) – memory size unit (B, K, M, or G)
- Returns:
memory size of the input data
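A usage sketch:
>>> from dataprofiler.profilers import profiler_utils
>>> data = ["hello world"] * 1000
>>> size_kb = profiler_utils.get_memory_size(data, unit="K")   # size in kilobytes
>>> size_kb > 0
True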
- dataprofiler.profilers.profiler_utils.method_timeit(method: Callable | None = None, name: str | None = None) Callable ¶
Measure the execution time of the provided method.
Record the elapsed time in the times dictionary under the given name.
- Parameters:
method (Callable) – method to time
name (str) – key argument for the times dictionary
- dataprofiler.profilers.profiler_utils.perform_chi_squared_test_for_homogeneity(categories1: dict, sample_size1: int, categories2: dict, sample_size2: int) dict[str, int | float | None] ¶
Perform a Chi Squared test for homogeneity between two groups.
- Parameters:
categories1 (dict) – Categories and respective counts of the first group
sample_size1 (int) – Number of samples in first group
categories2 (dict) – Categories and respective counts of the second group
sample_size2 (int) – Number of samples in second group
- Returns:
Results of the chi squared test
- Return type:
dict
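A usage sketch; the exact keys of the returned dict (test statistic, degrees of freedom, p-value) are assumed here rather than confirmed by this page:
>>> from dataprofiler.profilers import profiler_utils
>>> results = profiler_utils.perform_chi_squared_test_for_homogeneity(
...     {"red": 30, "blue": 70}, 100,
...     {"red": 50, "blue": 50}, 100)
>>> isinstance(results, dict)   # expected to hold statistic / p-value style entries
True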
- dataprofiler.profilers.profiler_utils.chunk(lst: list, size: int) Iterator[tuple] ¶
Split a list into chunks of the given size.
- Parameters:
lst (list) – List to chunk
size (int) – Size of each chunk
- Returns:
Iterator that produces tuples of each chunk
- Return type:
Iterator[Tuple]
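A usage sketch; it is assumed here that a short final tuple is emitted as-is rather than padded:
>>> from dataprofiler.profilers import profiler_utils
>>> list(profiler_utils.chunk([1, 2, 3, 4, 5], size=2))
[(1, 2), (3, 4), (5,)]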
- dataprofiler.profilers.profiler_utils.merge(top_profile: profile_builder.BaseProfiler, other_profile: profile_builder.BaseProfiler = None) profile_builder.BaseProfiler ¶
Merge two Profiles.
- Parameters:
top_profile (Profile) – First profile
other_profile (Profile) – Second profile
- Returns:
Merge of two profile objects
- Return type:
Profile
- dataprofiler.profilers.profiler_utils.merge_profile_list(list_of_profiles: list[profile_builder.BaseProfiler], pool_count: int = 5) profile_builder.BaseProfiler ¶
Merge list of profiles into a single profile.
- Parameters:
list_of_profiles (list) – list of profiles to be merged
pool_count (int) – number of pool processes to use for the merge
- Returns:
Single profile that is the merge of all profiles in the list_of_profiles list.
- Return type:
Profile
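A usage sketch, assuming dp.Profiler accepts a pandas DataFrame directly and the profiles share a schema:
>>> import pandas as pd
>>> import dataprofiler as dp
>>> from dataprofiler.profilers import profiler_utils
>>> df = pd.DataFrame({"a": range(100), "b": ["x"] * 100})
>>> halves = [dp.Profiler(df.iloc[:50]), dp.Profiler(df.iloc[50:])]
>>> merged = profiler_utils.merge_profile_list(halves, pool_count=2)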
- dataprofiler.profilers.profiler_utils.reload_labeler_from_options_or_get_new(data_labeler_load_attr: dict, config: dict | None = None) BaseDataLabeler | None ¶
Load a data labeler if required by data_labeler_load_attr, reusing one from config when possible.
- Parameters:
data_labeler_load_attr (dict[string, dict]) – dictionary with attributes and values.
config (dict[string, dict]) – config for loading classes to reuse an existing labeler
- Returns:
the loaded or reused data labeler
- Return type:
BaseDataLabeler | None