Utils

dataprofiler.profilers.utils.dict_merge(dct, merge_dct)

Recursive dict merge. Inspired by dict.update(); instead of updating only top-level keys, dict_merge recurses into dicts nested to an arbitrary depth, updating keys. The merge_dct is merged into dct.

Parameters
  • dct – dict onto which the merge is executed

  • merge_dct – dict merged into dct

Returns

None
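For illustration, the recursive merge described above can be sketched as follows (an illustrative re-implementation of the documented behavior, not the library's actual code):

```python
def dict_merge(dct, merge_dct):
    """Recursively merge merge_dct into dct, descending into nested dicts."""
    for key, value in merge_dct.items():
        if key in dct and isinstance(dct[key], dict) and isinstance(value, dict):
            dict_merge(dct[key], value)  # recurse instead of overwriting the sub-dict
        else:
            dct[key] = value

base = {"a": 1, "nested": {"x": 1, "y": 2}}
dict_merge(base, {"nested": {"y": 3, "z": 4}})
# base now equals {"a": 1, "nested": {"x": 1, "y": 3, "z": 4}}
```

Note that, as documented, the function mutates dct in place and returns None.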

class dataprofiler.profilers.utils.KeyDict

Bases: collections.defaultdict

Helper class for shuffle_in_chunks. Allows a missing key to become its own value for that key. From: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/

clear() → None. Remove all items from D.
copy() → a shallow copy of D.
default_factory

Factory for default value called by __missing__().

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D’s items
keys() → a set-like object providing a view on D’s keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem()

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None. Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values() → an object providing a view on D’s values
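The class's one piece of custom behavior — a missing key becomes its own value — can be sketched by overriding __missing__ on a defaultdict subclass (an illustrative re-implementation, not the library's source):

```python
from collections import defaultdict


class KeyDict(defaultdict):
    """defaultdict in which a missing key defaults to the key itself."""

    def __missing__(self, key):
        self[key] = key   # store the key as its own value on first access
        return key


d = KeyDict()
d[3] = 7      # explicitly set entries behave like a normal dict
print(d[3])   # 7
print(d[5])   # 5 — the missing key became its own value, and is now stored
```

This is what lets shuffle_in_chunks treat an untouched index as if the full index array existed, without ever materializing it.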
dataprofiler.profilers.utils.shuffle_in_chunks(data_length, chunk_size)

A generator for creating shuffled indexes in chunks. This avoids the cost of creating all indexes up front; only what is needed is generated. Initial code idea from: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/

Parameters
  • data_length – length of data to be shuffled

  • chunk_size – size of shuffled chunks

Returns

list of shuffled indices of length chunk_size
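The lazy Fisher–Yates idea can be sketched as below. A sparse dict stands in for the KeyDict helper (a missing index is treated as its own value), so only touched positions are ever stored; this is an illustrative reconstruction, not the library's actual implementation:

```python
import random


def shuffle_in_chunks(data_length, chunk_size):
    """Yield chunks of a lazily shuffled permutation of range(data_length)."""
    values = {}                       # sparse stand-in for the KeyDict helper

    def lookup(i):
        return values.get(i, i)       # a missing key is its own value

    chunk = []
    for i in range(data_length):
        j = random.randrange(i, data_length)          # pick from the unshuffled tail
        values[i], values[j] = lookup(j), lookup(i)   # Fisher-Yates swap
        chunk.append(values[i])
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:                          # final, possibly short, chunk
        yield chunk


chunks = list(shuffle_in_chunks(10, 4))
# chunk lengths are 4, 4, 2; together the chunks form a permutation of range(10)
```

Because indexes are produced chunk by chunk, a caller sampling only the first few chunks never pays for shuffling the whole index range.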

dataprofiler.profilers.utils.warn_on_profile(col_profile, e)

Issues a warning if a given profile errors (typically TensorFlow).

Parameters
  • col_profile (str) – Name of the column profile

  • e (Exception) – Error message from profiler error

dataprofiler.profilers.utils.partition(data, chunk_size)

Creates a generator which returns the data in the specified chunk size.

Parameters
  • data (list, dataframe, etc.) – data to partition

  • chunk_size (int) – size of partition to return
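A minimal sketch of this chunking generator, assuming data supports len() and slicing (illustrative only):

```python
def partition(data, chunk_size):
    """Yield successive chunk_size-length slices of data."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


result = list(partition(list(range(7)), 3))
# [[0, 1, 2], [3, 4, 5], [6]] — the last chunk may be shorter
```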

dataprofiler.profilers.utils.suggest_pool_size(data_size=None, cols=None)

Suggest the pool size based on available resources.

Parameters
  • data_size (int) – size of the dataset

  • cols (int) – columns of the dataset

Returns

suggested pool size

Return type

int
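One plausible heuristic is sketched below. The capping policy here is a hypothetical assumption for illustration; the library's actual formula (and how it weighs data_size) may differ:

```python
import os


def suggest_pool_size(data_size=None, cols=None):
    """Suggest a process count capped by CPU count and (hypothetically) column count."""
    pool_size = os.cpu_count() or 1
    if cols is not None:
        pool_size = min(pool_size, cols)  # no benefit to more workers than columns
    return max(pool_size, 1)


suggest_pool_size(data_size=100_000, cols=2)  # at most 2 on any machine
```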

dataprofiler.profilers.utils.generate_pool(max_pool_size=None, data_size=None, cols=None)

Generate a multiprocessing pool to allocate functions to.

Parameters
  • max_pool_size (Union[int, None]) – Max number of processes assigned to the pool

  • data_size (int) – size of the dataset

  • cols (int) – columns of the dataset

Returns

pool – multiprocessing pool to allocate processes to; cpu_count – number of processes (CPU-bound) to utilize

Return type

multiprocessing.Pool, int

dataprofiler.profilers.utils.overlap(x1, x2, y1, y2)

Return True iff [x1:x2] overlaps with [y1:y2]
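The interval check can be sketched in one comparison. This sketch treats both ranges as closed intervals; whether the library counts touching endpoints as overlap is an assumption here:

```python
def overlap(x1, x2, y1, y2):
    """True iff the closed ranges [x1, x2] and [y1, y2] share at least one point."""
    return x1 <= y2 and y1 <= x2


overlap(0, 5, 3, 8)  # True: [0, 5] and [3, 8] share [3, 5]
overlap(0, 2, 3, 8)  # False: the ranges are disjoint
```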

dataprofiler.profilers.utils.add_nested_dictionaries(first_dict, second_dict)

Merges two dictionaries and adds together the values of shared keys.

Parameters
  • first_dict (dict) – dictionary to be merged

  • second_dict (dict) – dictionary to be merged

Returns

merged dictionary
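The additive merge can be sketched as below (an illustrative re-implementation; the library's handling of mixed or missing value types may differ):

```python
def add_nested_dictionaries(first_dict, second_dict):
    """Merge two dicts, summing values of shared keys and recursing into nested dicts."""
    merged = {}
    for key in first_dict.keys() | second_dict.keys():
        a, b = first_dict.get(key), second_dict.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            merged[key] = add_nested_dictionaries(a, b)   # recurse into sub-dicts
        elif a is not None and b is not None:
            merged[key] = a + b                           # shared key: add values
        else:
            merged[key] = a if a is not None else b       # key present on one side only
    return merged


added = add_nested_dictionaries({"a": 1, "n": {"x": 2}}, {"a": 4, "n": {"x": 3}, "b": 5})
# equals {"a": 5, "n": {"x": 5}, "b": 5}
```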

dataprofiler.profilers.utils.biased_skew(df_series)

Calculates the biased estimator for skewness of the given data. The definition is formalized as g_1.

Parameters

df_series (pandas Series) – data to get skewness of, assuming floats

Returns

biased skewness

Return type

float
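The biased estimator g_1 = m_3 / m_2^{3/2}, with m_k the k-th sample central moment, can be computed as follows (a plain-Python sketch of the standard definition, not the library's vectorized code):

```python
def biased_skew(values):
    """Biased sample skewness g_1 = m3 / m2**1.5 (m_k = k-th central moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


biased_skew([1.0, 2.0, 3.0, 4.0, 5.0])  # 0.0 — symmetric data has zero skewness
```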

dataprofiler.profilers.utils.biased_kurt(df_series)

Calculates the biased estimator for kurtosis of the given data. The definition is formalized as g_2.

Parameters

df_series (pandas Series) – data to get kurtosis of, assuming floats

Returns

biased kurtosis

Return type

float
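Analogously, the biased excess kurtosis g_2 = m_4 / m_2^2 − 3 can be sketched as (again an illustrative plain-Python version of the standard definition):

```python
def biased_kurt(values):
    """Biased excess kurtosis g_2 = m4 / m2**2 - 3 (m_k = k-th central moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


biased_kurt([1.0, 1.0, -1.0, -1.0])  # -2.0 — a two-point distribution has minimum kurtosis
```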

dataprofiler.profilers.utils.find_diff_of_numbers(stat1, stat2)

Finds the difference between two stats. If there is no difference, returns “unchanged”. For ints/floats, returns stat1 - stat2.

Parameters
  • stat1 (Union[int, float]) – the first statistical input

  • stat2 (Union[int, float]) – the second statistical input

Returns

the difference of the stats
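The documented behavior reduces to a single comparison (illustrative sketch):

```python
def find_diff_of_numbers(stat1, stat2):
    """Return stat1 - stat2, or "unchanged" when the stats are equal."""
    if stat1 == stat2:
        return "unchanged"
    return stat1 - stat2


find_diff_of_numbers(10, 4)     # 6
find_diff_of_numbers(3.5, 3.5)  # "unchanged"
```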

dataprofiler.profilers.utils.find_diff_of_strings(stat1, stat2)

Finds the difference between two stats. If there is no difference, returns “unchanged”. For strings, returns list containing [stat1, stat2].

Parameters
  • stat1 (str) – the first statistical input

  • stat2 (str) – the second statistical input

Returns

the difference of the stats
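An illustrative sketch of the string case:

```python
def find_diff_of_strings(stat1, stat2):
    """Return [stat1, stat2], or "unchanged" when the strings match."""
    if stat1 == stat2:
        return "unchanged"
    return [stat1, stat2]


find_diff_of_strings("int", "float")  # ["int", "float"]
find_diff_of_strings("int", "int")    # "unchanged"
```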

dataprofiler.profilers.utils.find_diff_of_lists_and_sets(stat1, stat2)

Finds the difference between two stats. If there is no difference, returns “unchanged”. Removes duplicates and returns [unique values of stat1, shared values, unique values of stat2].

Parameters
  • stat1 (Union[list, set]) – the first statistical input

  • stat2 (Union[list, set]) – the second statistical input

Returns

the difference of the stats
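The three-part split described above can be sketched with set operations. The sub-lists are sorted here for readability; the library's element ordering is an assumption and may differ:

```python
def find_diff_of_lists_and_sets(stat1, stat2):
    """Return [unique to stat1, shared, unique to stat2], or "unchanged" if equal."""
    set1, set2 = set(stat1), set(stat2)   # also removes duplicates, as documented
    if set1 == set2:
        return "unchanged"
    return [sorted(set1 - set2), sorted(set1 & set2), sorted(set2 - set1)]


find_diff_of_lists_and_sets([1, 2, 2, 3], {3, 4})
# [[1, 2], [3], [4]]
```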