Utils¶
- dataprofiler.profilers.utils.dict_merge(dct, merge_dct)¶
Recursive dict merge. Inspired by dict.update(), but instead of updating only top-level keys, dict_merge recurses down into dicts nested to an arbitrary depth, updating keys. The merge_dct is merged into dct.
- Parameters
dct – dict onto which the merge is executed
merge_dct – dct merged into dct
- Returns
None
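The recursive behavior can be sketched as a minimal reimplementation (for illustration only, not the library's exact code):

```python
# Sketch of a recursive in-place dict merge matching the documented behavior.
def dict_merge(dct, merge_dct):
    """Recursively merge merge_dct into dct in place."""
    for key, value in merge_dct.items():
        if key in dct and isinstance(dct[key], dict) and isinstance(value, dict):
            dict_merge(dct[key], value)  # recurse into nested dicts
        else:
            dct[key] = value  # overwrite or add leaf values

base = {"a": 1, "nested": {"x": 1, "y": 2}}
dict_merge(base, {"nested": {"y": 20, "z": 30}, "b": 2})
# base is now {"a": 1, "nested": {"x": 1, "y": 20, "z": 30}, "b": 2}
```

Note that nested dicts are updated key by key rather than replaced wholesale, which is the point of the recursion.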
- class dataprofiler.profilers.utils.KeyDict¶
Bases:
collections.defaultdict
Helper class for sample_in_chunks. Allows keys that are missing to become the values for that key. From: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/
- clear() None. Remove all items from D. ¶
- copy() a shallow copy of D. ¶
- default_factory¶
Factory for default value called by __missing__().
- fromkeys(value=None, /)¶
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)¶
Return the value for key if key is in the dictionary, else default.
- items() a set-like object providing a view on D’s items ¶
- keys() a set-like object providing a view on D’s keys ¶
- pop(k[, d]) v, remove specified key and return the corresponding value. ¶
If key is not found, d is returned if given, otherwise KeyError is raised
- popitem()¶
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- setdefault(key, default=None, /)¶
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update([E, ]**F) None. Update D from dict/iterable E and F. ¶
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
- values() an object providing a view on D’s values ¶
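The idea described above can be sketched as a tiny defaultdict subclass whose __missing__ returns the key itself:

```python
from collections import defaultdict

# Minimal sketch: a missing key acts as its own value.
class KeyDict(defaultdict):
    def __missing__(self, key):
        return key

d = KeyDict()
d[3] = 7
print(d[3])  # 7 -- explicitly mapped
print(d[5])  # 5 -- missing key falls back to itself
```

Because __missing__ only returns the key (rather than inserting it), lookups of unmapped keys do not grow the dictionary.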
- dataprofiler.profilers.utils.shuffle_in_chunks(data_length, chunk_size)¶
A generator for creating shuffled indexes in chunks. This avoids the cost of creating all indexes up front by generating only what is needed. Initial code idea from: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/
- Parameters
data_length – length of data to be shuffled
chunk_size – size of shuffled chunks
- Returns
list of shuffled indices of chunk size
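A simplified sketch of the lazy Fisher-Yates idea behind this generator (the library's version differs in details, e.g. it uses KeyDict for the lazy index map):

```python
import random

# Sketch: lazy Fisher-Yates shuffle, yielding indices chunk by chunk.
# Only positions actually touched by a swap are materialized in the dict.
def shuffle_in_chunks(data_length, chunk_size):
    indices = {}  # position -> swapped-in index, filled lazily
    for chunk_start in range(0, data_length, chunk_size):
        chunk = []
        for i in range(chunk_start, min(chunk_start + chunk_size, data_length)):
            j = random.randint(i, data_length - 1)  # pick from the unshuffled tail
            # classic Fisher-Yates swap of positions i and j
            indices[i], indices[j] = indices.get(j, j), indices.get(i, i)
            chunk.append(indices[i])
        yield chunk

chunks = list(shuffle_in_chunks(10, 4))
# chunk lengths are 4, 4, 2; concatenated they form a permutation of range(10)
```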
- dataprofiler.profilers.utils.warn_on_profile(col_profile, e)¶
Returns a warning if a given profile errors (typically TensorFlow).
- Parameters
col_profile (str) – Name of the column profile
e (Exception) – Error message from profiler error
- dataprofiler.profilers.utils.partition(data, chunk_size)¶
Creates a generator which returns the data in the specified chunk size.
- Parameters
data (list, dataframe, etc) – list, dataframe, etc
chunk_size (int) – size of partition to return
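Assuming data supports len() and slicing, the chunking can be sketched as:

```python
# Sketch: yield the data in successive slices of chunk_size.
def partition(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

print(list(partition([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

The final chunk may be shorter than chunk_size, as shown above.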
- dataprofiler.profilers.utils.suggest_pool_size(data_size=None, cols=None)¶
Suggest the pool size based on resources
- Parameters
data_size (int) – size of the dataset
cols (int) – columns of the dataset
- Return suggested_pool_size
suggested pool size
- Rtype suggested_pool_size
int
- dataprofiler.profilers.utils.generate_pool(max_pool_size=None, data_size=None, cols=None)¶
Generate a multiprocessing pool to allocate functions to.
- Parameters
max_pool_size (Union[int, None]) – Max number of processes assigned to the pool
data_size (int) – size of the dataset
cols (int) – columns of the dataset
- Return pool
Multiprocessing pool to allocate processes to
- Rtype pool
multiprocessing.Pool
- Return cpu_count
Number of processes (cpu bound) to utilize
- Rtype cpu_count
int
- dataprofiler.profilers.utils.overlap(x1, x2, y1, y2)¶
Return True iff [x1:x2] overlaps with [y1:y2]
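Treating [x1:x2] and [y1:y2] as slice-style ranges (exclusive upper bounds), the check can be sketched as follows; the library may additionally validate input types:

```python
# Sketch: slice-style ranges overlap when each starts before the other ends.
def overlap(x1, x2, y1, y2):
    return max(x1, y1) < min(x2, y2)

print(overlap(0, 5, 3, 8))  # True  -- indices 3 and 4 are shared
print(overlap(0, 3, 3, 6))  # False -- [0:3] ends where [3:6] begins
```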
- dataprofiler.profilers.utils.add_nested_dictionaries(first_dict, second_dict)¶
Merges two dictionaries, adding the values of shared keys together.
- Parameters
first_dict (dict) – dictionary to be merged
second_dict (dict) – dictionary to be merged
- Returns
merged dictionary
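One way to sketch the additive merge (a hypothetical reimplementation: shared leaf values are summed, shared dict values are merged recursively, unshared keys are copied over):

```python
# Sketch of an additive recursive dictionary merge.
def add_nested_dictionaries(first_dict, second_dict):
    merged = {}
    for key in first_dict.keys() | second_dict.keys():
        if key in first_dict and key in second_dict:
            if isinstance(first_dict[key], dict):
                # recurse into shared nested dicts
                merged[key] = add_nested_dictionaries(first_dict[key], second_dict[key])
            else:
                merged[key] = first_dict[key] + second_dict[key]  # sum shared leaves
        else:
            merged[key] = first_dict.get(key, second_dict.get(key))
    return merged

result = add_nested_dictionaries({"a": 1, "n": {"x": 2}},
                                 {"a": 3, "n": {"x": 4}, "b": 5})
# {"a": 4, "n": {"x": 6}, "b": 5}
```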
- dataprofiler.profilers.utils.biased_skew(df_series)¶
Calculates the biased estimator for skewness of the given data. The definition is formalized as g_1.
- Parameters
df_series (pandas Series) – data to get skewness of, assuming floats
- Returns
biased skewness
- Return type
float
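The g_1 estimator can be sketched in pure Python (the library operates on a pandas Series) using the central sample moments m_k = (1/n) * sum((x_i - mean)**k):

```python
# Sketch of the biased skewness estimator g_1 = m_3 / m_2**1.5.
def biased_skew(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

print(biased_skew([1.0, 2.0, 3.0, 4.0, 5.0]))  # 0.0 -- symmetric data has zero skew
```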
- dataprofiler.profilers.utils.biased_kurt(df_series)¶
Calculates the biased estimator for kurtosis of the given data. The definition is formalized as g_2.
- Parameters
df_series (pandas Series) – data to get kurtosis of, assuming floats
- Returns
biased kurtosis
- Return type
float
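Analogously to the skewness sketch, g_2 is the excess-kurtosis form m_4 / m_2**2 - 3:

```python
# Sketch of the biased excess-kurtosis estimator g_2 = m_4 / m_2**2 - 3.
def biased_kurt(values):
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

print(biased_kurt([1.0, 2.0, 3.0, 4.0, 5.0]))  # ~ -1.3, flatter than a normal curve
```

The -3 term makes a normal distribution's kurtosis come out as zero.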
- dataprofiler.profilers.utils.find_diff_of_numbers(stat1, stat2)¶
Finds the difference between two stats. If there is no difference, returns “unchanged”. For ints/floats, returns stat1 - stat2.
- Parameters
stat1 (Union[int, float]) – the first statistical input
stat2 (Union[int, float]) – the second statistical input
- Returns
the difference of the stats
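The documented contract is small enough to sketch directly:

```python
# Sketch: equal stats report "unchanged", otherwise return stat1 - stat2.
def find_diff_of_numbers(stat1, stat2):
    if stat1 == stat2:
        return "unchanged"
    return stat1 - stat2

print(find_diff_of_numbers(10, 4))     # 6
print(find_diff_of_numbers(3.5, 3.5))  # unchanged
```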
- dataprofiler.profilers.utils.find_diff_of_strings_and_bools(stat1, stat2)¶
Finds the difference between two stats. If there is no difference, returns “unchanged”. For strings and bools, returns list containing [stat1, stat2].
- Parameters
stat1 (Union[str, bool]) – the first statistical input
stat2 (Union[str, bool]) – the second statistical input
- Returns
the difference of the stats
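A sketch of the documented contract for strings and bools:

```python
# Sketch: equal stats report "unchanged", otherwise pair them in a list.
def find_diff_of_strings_and_bools(stat1, stat2):
    if stat1 == stat2:
        return "unchanged"
    return [stat1, stat2]

print(find_diff_of_strings_and_bools("red", "blue"))  # ['red', 'blue']
print(find_diff_of_strings_and_bools(True, True))     # unchanged
```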
- dataprofiler.profilers.utils.find_diff_of_lists_and_sets(stat1, stat2)¶
Finds the difference between two stats. If there is no difference, returns “unchanged”. Removes duplicates and returns [unique values of stat1, shared values, unique values of stat2].
- Parameters
stat1 (Union[list, set]) – the first statistical input
stat2 (Union[list, set]) – the second statistical input
- Returns
the difference of the stats
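A sketch of the documented contract; the library's ordering within each sub-list may differ, and the sorting here is only for deterministic output:

```python
# Sketch: deduplicate, then report [unique to stat1, shared, unique to stat2].
def find_diff_of_lists_and_sets(stat1, stat2):
    s1, s2 = set(stat1), set(stat2)
    if s1 == s2:
        return "unchanged"
    return [sorted(s1 - s2), sorted(s1 & s2), sorted(s2 - s1)]

print(find_diff_of_lists_and_sets([1, 2, 2, 3], [2, 3, 4]))
# [[1], [2, 3], [4]]
```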
- dataprofiler.profilers.utils.find_diff_of_dates(stat1, stat2)¶
Finds the difference between two dates. If there is no difference, returns “unchanged”. For dates, returns the difference in time.
Because only days can be stored as negative values internally for timedelta objects, the output for these negative values is less readable due to the combination of signs in the default output. This returns a readable output for timedelta that accounts for potential negative differences.
- Parameters
stat1 (datetime.datetime object) – the first statistical input
stat2 (datetime.datetime object) – the second statistical input
- Returns
Difference in stats
- Return type
str
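The sign handling described above can be sketched by formatting the absolute timedelta and prepending the sign, which avoids Python's hard-to-read default such as "-1 day, 23:00:00":

```python
import datetime

# Sketch: format the absolute timedelta and prepend a minus sign when negative.
def find_diff_of_dates(stat1, stat2):
    if stat1 == stat2:
        return "unchanged"
    diff = stat1 - stat2
    if diff.days >= 0:
        return str(diff)
    return "-" + str(-diff)  # negate first so the formatting stays readable

a = datetime.datetime(2021, 6, 1, 12, 0)
b = datetime.datetime(2021, 6, 2, 11, 0)
print(find_diff_of_dates(b, a))  # 23:00:00
print(find_diff_of_dates(a, b))  # -23:00:00
```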
- dataprofiler.profilers.utils.find_diff_of_dicts(dict1, dict2)¶
Finds the difference between two dicts. For each key in each dict, returns “unchanged” if there’s no difference, otherwise returns the difference. Assumes that if the two dictionaries share the same key, their values are the same type.
- Parameters
dict1 (dict) – the first dict
dict2 (dict) – the second dict
- Returns
Difference in the keys of each dict
- Return type
dict
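A heavily simplified sketch: the real function dispatches each key's values to the type-specific find_diff_* helpers above, whereas here non-equal values are merely paired as [value1, value2] (keys are sorted only for deterministic output):

```python
# Simplified sketch of a per-key dict diff.
def find_diff_of_dicts(dict1, dict2):
    diff = {}
    for key in sorted(dict1.keys() | dict2.keys()):
        v1, v2 = dict1.get(key), dict2.get(key)
        diff[key] = "unchanged" if v1 == v2 else [v1, v2]
    return diff

print(find_diff_of_dicts({"a": 1, "b": 2}, {"a": 1, "b": 3}))
# {'a': 'unchanged', 'b': [2, 3]}
```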
- dataprofiler.profilers.utils.find_diff_of_matrices(matrix1, matrix2)¶
Finds the difference between two matrices.
- Parameters
matrix1 (list(list(float))) – the first matrix
matrix2 (list(list(float))) – the second matrix
- Returns
Difference in the matrix
- Return type
list(list(float))
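Assuming an element-wise difference and the same "unchanged" convention as the other find_diff helpers (the docstring does not spell this out), a sketch might look like:

```python
# Sketch: element-wise matrix difference; "unchanged" when every entry matches.
def find_diff_of_matrices(matrix1, matrix2):
    diff = [[a - b for a, b in zip(row1, row2)]
            for row1, row2 in zip(matrix1, matrix2)]
    if all(x == 0 for row in diff for x in row):
        return "unchanged"
    return diff

print(find_diff_of_matrices([[1.0, 2.0]], [[1.0, 1.0]]))  # [[0.0, 1.0]]
```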
- dataprofiler.profilers.utils.find_diff_of_dicts_with_diff_keys(dict1, dict2)¶
Finds the difference between two dicts. For each key in each dict, returns “unchanged” if there’s no difference, otherwise returns the difference. Assumes that if the two dictionaries share the same key, their values are the same type.
- Parameters
dict1 (dict) – the first dict
dict2 (dict) – the second dict
- Returns
Difference in the keys of each dict
- Return type
list
- dataprofiler.profilers.utils.get_memory_size(data, unit='M')¶
Get memory size of the input data
- Parameters
data (Union[list, numpy.array, pandas.DataFrame]) – list or array of data
unit (string) – memory size unit (B, K, M, or G)
- Returns
memory size of the input data
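The unit conversion can be illustrated with a rough sketch; the real function measures pandas/numpy payloads, so the per-item sys.getsizeof sum below is only an approximation, but the units B, K, M, and G are successive powers of 1024:

```python
import sys

# Rough sketch: sum per-item sizes, then convert bytes to the requested unit.
def get_memory_size(data, unit="M"):
    unit_map = {"B": 0, "K": 1, "M": 2, "G": 3}  # exponents of 1024
    total_bytes = sum(sys.getsizeof(item) for item in data)
    return total_bytes / 1024 ** unit_map[unit]

print(get_memory_size([b"x" * 2048], unit="K"))  # roughly 2 KiB plus object overhead
```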
- dataprofiler.profilers.utils.method_timeit(method=None, name=None)¶
Measure the execution time of the provided method and record it into the times dictionary.
- Parameters
method (Callable) – method to time
name (str) – key argument for the times dictionary