Utils¶
- dataprofiler.profilers.utils.dict_merge(dct, merge_dct)¶
Recursive dict merge. Inspired by dict.update(); instead of updating only top-level keys, dict_merge recurses down into dicts nested to an arbitrary depth, updating keys. The merge_dct is merged into dct.
- Parameters
dct – dict onto which the merge is executed
merge_dct – dict merged into dct
- Returns
None
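For example, dict_merge can deep-merge nested option dictionaries without dropping sibling keys (a minimal sketch; the nested keys and values are arbitrary):

from dataprofiler.profilers import utils

dct = {"options": {"histogram": {"bins": 10}, "null_values": True}}
merge_dct = {"options": {"histogram": {"bins": 50}}}

utils.dict_merge(dct, merge_dct)  # modifies dct in place, returns None
# dct is now {"options": {"histogram": {"bins": 50}, "null_values": True}}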
- class dataprofiler.profilers.utils.KeyDict¶
Bases: collections.defaultdict
Helper class for shuffle_in_chunks. Allows keys that are missing to become the values for that key. From: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/
- clear() → None. Remove all items from D.¶
- copy() → a shallow copy of D.¶
- default_factory¶
Factory for default value called by __missing__().
- fromkeys(iterable, value=None, /)¶
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)¶
Return the value for key if key is in the dictionary, else default.
- items() → a set-like object providing a view on D’s items¶
- keys() → a set-like object providing a view on D’s keys¶
- pop(k[, d]) → v, remove specified key and return the corresponding value.¶
If key is not found, d is returned if given, otherwise KeyError is raised.
- popitem()¶
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- setdefault(key, default=None, /)¶
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update([E, ]**F) → None. Update D from dict/iterable E and F.¶
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
- values() → an object providing a view on D’s values¶
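A minimal sketch of the behavior described above (a missing key acts as its own value, which is what the lazy Fisher-Yates shuffle in shuffle_in_chunks relies on):

from dataprofiler.profilers.utils import KeyDict

indices = KeyDict()
print(indices[7])   # 7 -- a missing key yields the key itself
indices[7] = 3      # explicit assignments still take precedence
print(indices[7])   # 3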
- dataprofiler.profilers.utils.shuffle_in_chunks(data_length, chunk_size)¶
A generator for creating shuffled indexes in chunks. This reduces the cost of having to create all indexes at once, creating only what is needed. Initial code idea from: https://www.drmaciver.com/2018/01/lazy-fisher-yates-shuffling-for-precise-rejection-sampling/
- Parameters
data_length – length of data to be shuffled
chunk_size – size of shuffled chunks
- Returns
list of shuffled indices of chunk size
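A usage sketch, assuming each yielded chunk is a list of unique shuffled indices and the generator stops once data_length indices have been produced:

from dataprofiler.profilers import utils

seen = []
for chunk in utils.shuffle_in_chunks(data_length=10, chunk_size=4):
    seen.extend(chunk)  # each chunk holds up to chunk_size indices

print(sorted(seen))  # [0, 1, ..., 9] -- every index appears exactly once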
- dataprofiler.profilers.utils.warn_on_profile(col_profile, e)¶
Returns a warning if a given profile errors (typically tensorflow).
- Parameters
col_profile (str) – name of the column profile
e (Exception) – error raised by the profiler
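A sketch of the intended use, wrapping a single column's profiling step so a failure warns instead of aborting the whole run (profile_column and the column name are hypothetical stand-ins):

from dataprofiler.profilers import utils

try:
    profile_column("age")  # hypothetical profiling call that may raise
except Exception as e:
    utils.warn_on_profile("age", e)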
- dataprofiler.profilers.utils.partition(data, chunk_size)¶
Creates a generator which returns the data in the specified chunk size.
- Parameters
data (list, dataframe, etc.) – data to be partitioned
chunk_size (int) – size of partition to return
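For instance, over a plain list (a minimal sketch; chunks are yielded in order and the last one may be shorter):

from dataprofiler.profilers import utils

for chunk in utils.partition(list(range(7)), chunk_size=3):
    print(chunk)
# [0, 1, 2]
# [3, 4, 5]
# [6]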
- dataprofiler.profilers.utils.suggest_pool_size(data_size=None, cols=None)¶
Suggest the pool size based on resources.
- Parameters
data_size (int) – size of the dataset
cols (int) – number of columns in the dataset
- Return suggested_pool_size
suggested pool size
- Rtype suggested_pool_size
int
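A usage sketch (the data_size and cols values are arbitrary; the suggestion also depends on the CPUs available on the machine):

from dataprofiler.profilers import utils

pool_size = utils.suggest_pool_size(data_size=100000, cols=20)
print(pool_size)  # e.g. 4 on a 4-core machine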
- dataprofiler.profilers.utils.generate_pool(max_pool_size=None, data_size=None, cols=None)¶
Generate a multiprocessing pool to allocate functions to.
- Parameters
max_pool_size (int) – max number of processes assigned to the pool
data_size (int) – size of the dataset
cols (int) – number of columns in the dataset
- Return pool
multiprocessing pool to allocate processes to
- Rtype pool
multiprocessing.Pool
- Return cpu_count
number of processes (CPU bound) to utilize
- Rtype cpu_count
int
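A usage sketch, assuming the function returns the documented (pool, cpu_count) pair; the None check is defensive in case no pool is created:

from dataprofiler.profilers import utils

if __name__ == "__main__":
    pool, cpu_count = utils.generate_pool(max_pool_size=4, data_size=100000, cols=20)
    if pool is not None:
        try:
            # distribute any picklable work across the pool's processes
            results = pool.map(len, [[1, 2], [3], [4, 5, 6]])
        finally:
            pool.close()
            pool.join()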