Data Processing¶
-
class
dataprofiler.labelers.data_processing.
AutoSubRegistrationMeta
(clsname, bases, attrs)¶ Bases:
abc.ABCMeta
-
mro
()¶ Return a type’s method resolution order.
-
register
(subclass)¶ Register a virtual subclass of an ABC.
Returns the subclass, to allow usage as a class decorator.
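Since the metaclass derives from abc.ABCMeta, register should behave like the standard-library method. The sketch below uses only the abc module; Processor and Passthrough are hypothetical names for illustration and are not part of dataprofiler.

```python
import abc


class Processor(abc.ABC):
    """Hypothetical ABC standing in for a processor base class."""

    @abc.abstractmethod
    def process(self, data):
        ...


@Processor.register  # register() returns the class, so it works as a decorator
class Passthrough:
    def process(self, data):
        return data


# A virtual subclass passes issubclass/isinstance checks ...
is_sub = issubclass(Passthrough, Processor)
is_inst = isinstance(Passthrough(), Processor)
# ... but the ABC does not appear in its method resolution order.
in_mro = Processor in Passthrough.mro()
```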
-
-
class
dataprofiler.labelers.data_processing.
BaseDataProcessor
(**parameters)¶ Bases:
object
Abstract Data processing class.
-
processor_type
= None¶
-
classmethod
get_class
(class_name)¶
-
abstract classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
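A minimal sketch of how get_parameters and set_params could interact. SketchProcessor and its _parameters attribute are hypothetical, and rejecting unknown names with ValueError is an assumption; the library's actual behaviour may differ.

```python
class SketchProcessor:
    """Hypothetical stand-in; not the dataprofiler implementation."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list, return every parameter; otherwise only those requested.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Assumption: names that were never defined are rejected rather than
        # silently added.
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)


proc = SketchProcessor(max_length=3400, pad_label="PAD")
proc.set_params(max_length=100)
subset = proc.get_parameters(["max_length"])
```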
-
abstract
process
(*args)¶ Data processing function.
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
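The save/load pair can be pictured as a parameter round trip through a directory. This is only a sketch under stated assumptions: SketchProcessor, the JSON encoding, and the file name processor_parameters.json are all hypothetical and may not match what the library writes.

```python
import json
import os
import tempfile


class SketchProcessor:
    """Hypothetical stand-in; not the dataprofiler implementation."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def save_to_disk(self, dirpath):
        # Hypothetical file name; the real library may use a different layout.
        path = os.path.join(dirpath, "processor_parameters.json")
        with open(path, "w") as fp:
            json.dump(self._parameters, fp)

    @classmethod
    def load_from_disk(cls, dirpath):
        path = os.path.join(dirpath, "processor_parameters.json")
        with open(path) as fp:
            return cls(**json.load(fp))


with tempfile.TemporaryDirectory() as tmp:
    SketchProcessor(pad_label="PAD", max_length=10).save_to_disk(tmp)
    restored = SketchProcessor.load_from_disk(tmp)
```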
-
-
class
dataprofiler.labelers.data_processing.
BaseDataPreprocessor
(**parameters)¶ Bases:
dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract Data preprocessing class.
-
processor_type
= 'preprocessor'¶
-
abstract
process
(data, labels, label_mapping, batch_size)¶ Data preprocessing function.
-
classmethod
get_class
(class_name)¶
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
abstract classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
-
class
dataprofiler.labelers.data_processing.
BaseDataPostprocessor
(**parameters)¶ Bases:
dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract Data postprocessing class.
-
processor_type
= 'postprocessor'¶
-
abstract
process
(data, results, label_mapping)¶ Data postprocessing function.
-
classmethod
get_class
(class_name)¶
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
abstract classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
-
class
dataprofiler.labelers.data_processing.
DirectPassPreprocessor
¶ Bases:
dataprofiler.labelers.data_processing.BaseDataPreprocessor
Initialize the DirectPassPreprocessor class
-
classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
process
(data, labels=None, label_mapping=None, batch_size=None)¶ Data preprocessing function.
-
classmethod
get_class
(class_name)¶
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
processor_type
= 'preprocessor'¶
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class
dataprofiler.labelers.data_processing.
CharPreprocessor
(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_split=0, flatten_separator=' ', is_separate_at_max_len=False)¶ Bases:
dataprofiler.labelers.data_processing.BaseDataPreprocessor
Initialize the CharPreprocessor class
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
-
classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
process
(data, labels=None, label_mapping=None, batch_size=32)¶ Flatten batches of data
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[None, dict]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Return batch_data
A dict containing samples of size batch_size
- Rtype batch_data
dict
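To illustrate what flattening means here, the sketch below greedily joins samples with a separator into buffers of at most max_length characters. It is a simplified illustration, not the library's algorithm: flatten_samples is a hypothetical helper, it ignores labels, batching, and flatten_split, and it always cuts over-long samples at max_length.

```python
def flatten_samples(samples, max_length, separator=" "):
    """Greedily join samples into buffers of at most max_length characters."""
    buffers, current = [], ""
    for sample in samples:
        candidate = sample if not current else current + separator + sample
        if len(candidate) <= max_length:
            current = candidate
            continue
        if current:
            buffers.append(current)
        # A sample longer than max_length is cut at max_length
        # (comparable to is_separate_at_max_len=True).
        while len(sample) > max_length:
            buffers.append(sample[:max_length])
            sample = sample[max_length:]
        current = sample
    if current:
        buffers.append(current)
    return buffers


flat = flatten_samples(["ab", "cd", "efghij", "k"], max_length=5)
```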
-
classmethod
get_class
(class_name)¶
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
processor_type
= 'preprocessor'¶
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class
dataprofiler.labelers.data_processing.
CharPostprocessor
(default_label='UNKNOWN', pad_label='PAD', flatten_separator=' ', use_word_level_argmax=False, output_format='character_argmax', separators=(' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent=0.75)¶ Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the CharPostprocessor class
- Parameters
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity
output_format (str) – either character_argmax or NER: character_argmax is a list of encodings for each character in the input text; NER is the dict format which specifies start, end, and label for each entity in a sentence
separators (tuple(str)) – list of characters to use for separating words within the character predictions
word_level_min_percent (float) – threshold on generating dominant word_level labeling
-
classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
static
convert_to_NER_format
(predictions, label_mapping, default_label, pad_label)¶ Converts word level predictions to specified format
- Parameters
predictions (list) – predictions
label_mapping (dict) – labels and corresponding integers
default_label (str) – default label in label_mapping
pad_label (str) – pad label in label_mapping
- Returns
formatted predictions
- Return type
list
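One way to picture the conversion is collapsing runs of identical character labels into (start, end, label) entries while skipping the default and pad labels. The helper to_ner_format below is a hypothetical sketch; the tuple layout and the exclusive end index are assumptions, not the library's documented output.

```python
def to_ner_format(predictions, label_mapping, default_label, pad_label):
    """Collapse runs of equal character labels into (start, end, label)."""
    int_to_label = {index: label for label, index in label_mapping.items()}
    skip = {label_mapping[default_label], label_mapping[pad_label]}
    output = []
    for sample in predictions:
        entities, start = [], 0
        for pos in range(1, len(sample) + 1):
            # Close the current run at the end of the sample or on a label change.
            if pos == len(sample) or sample[pos] != sample[start]:
                if sample[start] not in skip:
                    entities.append((start, pos, int_to_label[sample[start]]))
                start = pos
        output.append(entities)
    return output


label_mapping = {"PAD": 0, "UNKNOWN": 1, "NAME": 2, "DATE": 3}
ner = to_ner_format([[1, 2, 2, 2, 1, 3, 3]], label_mapping, "UNKNOWN", "PAD")
```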
-
static
match_sentence_lengths
(data, results, flatten_separator, inplace=True)¶ Converts the results from the model into the same ragged data shapes as the original data.
- Parameters
data (numpy.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
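The idea of restoring ragged shapes can be sketched as slicing one flat prediction list back into per-sample pieces and skipping the separator positions. unflatten_predictions is a hypothetical helper working on plain lists; the real method operates on the results dict and may handle multiple buffers.

```python
def unflatten_predictions(samples, flat_pred, flatten_separator):
    """Slice one flat prediction list back into per-sample lists."""
    ragged, cursor = [], 0
    for sample in samples:
        ragged.append(flat_pred[cursor:cursor + len(sample)])
        # Skip the predictions made for the separator characters.
        cursor += len(sample) + len(flatten_separator)
    return ragged


samples = ["ab", "cde"]
flat_text = " ".join(samples)      # "ab cde"
flat_pred = [2, 2, 1, 3, 3, 3]     # one label per character of flat_text
ragged = unflatten_predictions(samples, flat_pred, " ")
```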
-
process
(data, results, label_mapping)¶ Conducts the processing on the data given the predictions, label_mapping, and default_label.
- Parameters
data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – labels and corresponding integers
- Returns
dict of predictions and if they exist, confidences
-
classmethod
get_class
(class_name)¶
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
processor_type
= 'postprocessor'¶
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class
dataprofiler.labelers.data_processing.
StructCharPreprocessor
(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', is_separate_at_max_len=False)¶ Bases:
dataprofiler.labelers.data_processing.CharPreprocessor
Initialize the StructCharPreprocessor class
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
-
classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
convert_to_unstructured_format
(data, labels)¶ Converts the list of data samples into the CharPreprocessor required input data format.
- Parameters
data (numpy.ndarray) – list of strings
labels (list) – labels for each input character
- Returns
data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), … (num_samples in data)]
-
process
(data, labels=None, label_mapping=None, batch_size=32)¶ Process structured data for being processed by the CharacterLevelCnnModel.
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Return batch_data
A dict containing samples of size batch_size
- Rtype batch_data
dict
-
classmethod
get_class
(class_name)¶
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
processor_type
= 'preprocessor'¶
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class
dataprofiler.labelers.data_processing.
StructCharPostprocessor
(default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', random_state=None)¶ Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the StructCharPostprocessor class
- Parameters
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
-
classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
static
match_sentence_lengths
(data, results, flatten_separator, inplace=True)¶ Converts the results from the model into the same ragged data shapes as the original data.
- Parameters
data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
-
convert_to_structured_analysis
(sentences, results, label_mapping, default_label, pad_label)¶ Converts unstructured results to a structured column analysis, assuming the column was flattened into a single sample. This takes the mode of all character predictions except for the separator labels. In case of a tie, choose any label other than background; if several labels remain tied, choose randomly among them.
- Parameters
sentences (list(str)) – samples which were predicted upon
results (dict) – character predictions for each sample return from model
label_mapping (dict) – maps labels to their encoded integers
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
- Returns
prediction value for a single column
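The mode-with-tie-breaking rule described above can be sketched for a single sample as follows. column_label is a hypothetical helper: it assumes separator predictions were already removed, takes the background label as an integer, and returns background for an empty input, none of which is confirmed by the library.

```python
import random
from collections import Counter


def column_label(char_preds, background, rng):
    """Return the most common label; break ties away from background."""
    if not char_preds:
        return background
    counts = Counter(char_preds)
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    # On a tie, drop the background label if anything else is available.
    non_background = [label for label in tied if label != background]
    if len(tied) > 1 and non_background:
        tied = non_background
    # A remaining tie is settled at random.
    return tied[0] if len(tied) == 1 else rng.choice(tied)


rng = random.Random(0)
clear_winner = column_label([1, 1, 1, 2], background=1, rng=rng)
tie_with_background = column_label([1, 1, 2, 2], background=1, rng=rng)
tie_without_background = column_label([2, 2, 3, 3, 1], background=1, rng=rng)
```

Passing a seeded random.Random, as the random_state parameter suggests, keeps the random tie-break reproducible.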
-
process
(data, results, label_mapping)¶ Postprocessing of CharacterLevelCnnModel results when given structured data processed by StructCharPreprocessor.
- Parameters
data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – maps labels to their encoded integers
- Returns
dict of predictions and if they exist, confidences
- Return type
dict
-
classmethod
get_class
(class_name)¶
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
processor_type
= 'postprocessor'¶
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.
-
class
dataprofiler.labelers.data_processing.
RegexPostProcessor
(aggregation_func='split', priority_order=None, random_state=None)¶ Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the RegexPostProcessor class
- Parameters
aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)
priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
-
classmethod
help
()¶ Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
-
static
priority_prediction
(results, entity_priority_order)¶ Aggregation function using priority of regex to give entity determination.
- Parameters
results (dict) – regex from model in format: dict(pred=…, conf=…)
entity_priority_order (np.ndarray) – list of entity priorities (lowest has higher priority)
- Returns
aggregated predictions
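Priority aggregation can be pictured as picking, for each position, the matched entity with the lowest priority value. The sketch below uses a simplified input (a list of matched entity indices per position) rather than the dict(pred=…, conf=…) format; pick_by_priority, the input shape, and the default fallback are all assumptions.

```python
def pick_by_priority(matches, entity_priority_order, default=0):
    """For each position, keep the matched entity with the lowest priority value."""
    chosen = []
    for candidates in matches:
        if not candidates:
            # No regex matched at this position: fall back to the default entity.
            chosen.append(default)
        else:
            chosen.append(min(candidates, key=lambda e: entity_priority_order[e]))
    return chosen


# entity_priority_order[e] is the priority of entity e (lower wins).
priority = [0, 2, 1]
picked = pick_by_priority([[1, 2], [], [2]], priority)
```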
-
static
split_prediction
(results)¶ Splits the prediction across votes.
- Parameters
results (dict) – regex from model in format: dict(pred=…, conf=…)
- Returns
aggregated predictions
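Splitting across votes can be read as giving each entity a share of confidence proportional to the votes it received at a position. split_votes is a hypothetical helper on a plain list of vote counts; the real method works on the results dict, and returning all zeros when nothing matched is an assumption.

```python
def split_votes(votes):
    """Turn per-entity vote counts at one position into confidence shares."""
    total = sum(votes)
    if total == 0:
        # Assumption: with no votes, no entity receives any confidence.
        return [0.0] * len(votes)
    return [count / total for count in votes]


shares = split_votes([0, 1, 1])
no_match = split_votes([0, 0, 0])
```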
-
process
(data, labels=None, label_mapping=None, batch_size=None)¶ Data postprocessing function.
-
classmethod
get_class
(class_name)¶
-
get_parameters
(param_list=None)¶ Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
-
classmethod
load_from_disk
(dirpath)¶ Loads a data processor from a given path on disk
-
classmethod
load_from_library
(name)¶ Loads a data processor from within the library
-
processor_type
= 'postprocessor'¶
-
save_to_disk
(dirpath)¶ Saves a data processor to a path on disk.
-
set_params
(**kwargs)¶ Given kwargs, set the parameters if they exist.