Data Processing¶
- class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname, bases, attrs)¶
Bases:
abc.ABCMeta
- mro()¶
Return a type’s method resolution order.
- register(subclass)¶
Register a virtual subclass of an ABC.
Returns the subclass, to allow usage as a class decorator.
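The name of this metaclass suggests that subclasses are registered automatically when they are defined, which would let `get_class(class_name)` look a processor up by name. The following is a minimal sketch of that general pattern, not the library's implementation; `AutoRegisterMeta`, `BaseProcessor`, `_registry`, and `UpperCaseProcessor` are hypothetical names introduced for illustration.

```python
import abc


class AutoRegisterMeta(abc.ABCMeta):
    """Hypothetical metaclass that records every class it creates by name."""

    def __new__(mcs, clsname, bases, attrs):
        new_class = super().__new__(mcs, clsname, bases, attrs)
        # Store the class in a shared registry so it can be found by name later.
        new_class._registry[clsname] = new_class
        return new_class


class BaseProcessor(metaclass=AutoRegisterMeta):
    _registry = {}  # shared by all subclasses (an assumption of this sketch)

    @classmethod
    def get_class(cls, class_name):
        # Return the registered class, or None if the name is unknown.
        return cls._registry.get(class_name)


class UpperCaseProcessor(BaseProcessor):
    def process(self, data):
        return [s.upper() for s in data]
```

Defining `UpperCaseProcessor` is enough to make it discoverable through `BaseProcessor.get_class("UpperCaseProcessor")`; no explicit registration call is needed.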
- class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters)¶
Bases:
object
Abstract Data processing class.
- processor_type = None¶
- classmethod get_class(class_name)¶
- abstract classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- abstract process(*args)¶
Data processing function.
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
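To illustrate how `get_parameters` and `set_params` relate to the `**parameters` passed at construction, here is a small stand-in class. It is a sketch under stated assumptions: `SketchProcessor` and its `_parameters` attribute are hypothetical, and raising `ValueError` for unknown names is only one possible reading of "set the parameters if they exist".

```python
class SketchProcessor:
    """Hypothetical stand-in; not the library's implementation."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list, return every parameter; otherwise only those requested.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Assumption: only parameters that already exist may be updated.
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)


processor = SketchProcessor(max_length=3400, pad_label="PAD")
processor.set_params(max_length=100)
subset = processor.get_parameters(["max_length"])
```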
- class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract Data preprocessing class.
- processor_type = 'preprocessor'¶
- abstract process(data, labels, label_mapping, batch_size)¶
Data preprocessing function.
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- abstract classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract Data postprocessing class.
- processor_type = 'postprocessor'¶
- abstract process(data, results, label_mapping)¶
Data postprocessing function.
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- abstract classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.DirectPassPreprocessor¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPreprocessor
Initialize the DirectPassPreprocessor class
- classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- process(data, labels=None, label_mapping=None, batch_size=None)¶
Data preprocessing function.
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- processor_type = 'preprocessor'¶
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.CharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_split=0, flatten_separator=' ', is_separate_at_max_len=False)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPreprocessor
Initialize the CharPreprocessor class
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate fraction of the output characters that come from flattened samples, a value in [0, 1]. Once the running flattened fraction exceeds flatten_split, any leftover sample and all subsequent samples are left non-flattened until the fraction drops below flatten_split again
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
- classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- process(data, labels=None, label_mapping=None, batch_size=32)¶
Flatten batches of data
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[None, dict]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Return batch_data
A dict containing samples of size batch_size
- Rtype batch_data
dict
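The flattening described above can be sketched roughly as follows. This is an illustration only: `flatten_and_batch` is a hypothetical helper, it ignores `labels`, `label_mapping`, and `flatten_split`, it always hard-splits over-long strings (as if `is_separate_at_max_len` were true), and the `"samples"` key of the yielded dict is an assumption.

```python
def flatten_and_batch(data, max_length=10, flatten_separator=" ", batch_size=2):
    """Greedily join strings into samples of at most max_length characters,
    then yield them in batches of batch_size. Illustrative only."""
    samples, buffer = [], ""
    for text in data:
        candidate = text if not buffer else buffer + flatten_separator + text
        if len(candidate) <= max_length:
            buffer = candidate
            continue
        if buffer:
            samples.append(buffer)
        # Hard-split strings longer than max_length (assumed behaviour).
        while len(text) > max_length:
            samples.append(text[:max_length])
            text = text[max_length:]
        buffer = text
    if buffer:
        samples.append(buffer)
    for start in range(0, len(samples), batch_size):
        yield {"samples": samples[start:start + batch_size]}


batches = list(flatten_and_batch(["abc", "def", "ghijklmnopqr", "st"]))
```

Here the first two strings fit into one flattened sample, the 12-character string is cut at `max_length`, and its remainder is joined with the last string.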
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- processor_type = 'preprocessor'¶
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.CharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator=' ', use_word_level_argmax=False, output_format='character_argmax', separators=(' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent=0.75)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the CharPostprocessor class
- Parameters
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity
output_format (str) – either character_argmax or NER, where character_argmax is a list of encodings for each character in the input text and NER is a dict format which specifies start, end, and label for each entity in a sentence
separators (tuple(str)) – list of characters to use for separating words within the character predictions
word_level_min_percent (float) – threshold on generating dominant word_level labeling
- classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- static convert_to_NER_format(predictions, label_mapping, default_label, pad_label)¶
Converts word level predictions to specified format
- Parameters
predictions (list) – predictions
label_mapping (dict) – labels and corresponding integers
default_label (str) – default label in label_mapping
pad_label (str) – pad label in label_mapping
- Returns
formatted predictions
- Return type
list
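One way to picture the conversion from per-character predictions to entity spans is the sketch below. It is not the library's code: `to_entity_spans` is a hypothetical helper, it assumes predictions are integer ids for a single sample, that the end index is exclusive, and that runs labelled with the default or pad label are dropped.

```python
def to_entity_spans(char_predictions, label_mapping,
                    default_label="UNKNOWN", pad_label="PAD"):
    """Collapse runs of identical label ids into (start, end, label) tuples."""
    id_to_label = {idx: label for label, idx in label_mapping.items()}
    ignored = {label_mapping[default_label], label_mapping[pad_label]}
    spans, start = [], 0
    for pos in range(1, len(char_predictions) + 1):
        # Close the current run at the end of input or when the label changes.
        at_end = pos == len(char_predictions)
        if at_end or char_predictions[pos] != char_predictions[start]:
            label_id = char_predictions[start]
            if label_id not in ignored:
                spans.append((start, pos, id_to_label[label_id]))
            start = pos
    return spans


mapping = {"PAD": 0, "UNKNOWN": 1, "NAME": 2, "DATE": 3}
spans = to_entity_spans([1, 2, 2, 2, 1, 3, 3], mapping)
```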
- static match_sentence_lengths(data, results, flatten_separator, inplace=True)¶
Converts the results from the model into the same ragged data shapes as the original data.
- Parameters
data (numpy.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
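The idea of restoring ragged shapes can be sketched as follows, assuming a single flattened sample whose predictions are a flat list with one entry per character. `split_to_original_lengths` is a hypothetical helper, not part of the library.

```python
def split_to_original_lengths(data, flat_pred, flatten_separator=" "):
    """Cut one flat list of per-character predictions back into one list per
    original string, skipping the positions taken by the separator."""
    ragged, cursor = [], 0
    for text in data:
        ragged.append(flat_pred[cursor:cursor + len(text)])
        cursor += len(text) + len(flatten_separator)
    return ragged


# Predictions for the flattened text "ab cde"; index 2 is the separator.
ragged_pred = split_to_original_lengths(["ab", "cde"], [1, 1, 0, 2, 2, 2])
```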
- process(data, results, label_mapping)¶
Conducts the processing on the data given the predictions, label_mapping, and default_label.
- Parameters
data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – labels and corresponding integers
- Returns
dict of predictions and if they exist, confidences
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- processor_type = 'postprocessor'¶
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', is_separate_at_max_len=False)¶
Bases:
dataprofiler.labelers.data_processing.CharPreprocessor
Initialize the StructCharPreprocessor class
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
- classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- convert_to_unstructured_format(data, labels)¶
Converts the list of data samples into the CharPreprocessor required input data format.
- Parameters
data (numpy.ndarray) – list of strings
labels (list) – labels for each input character
- Returns
data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), … (num_samples in data)]
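A rough sketch of building that text and entity list is shown below. It assumes one label per sample (the return format lists num_samples entities) and exclusive end offsets; `to_unstructured` is a hypothetical helper, not the library's method.

```python
def to_unstructured(data, labels, separator="\x01" * 5):
    """Join samples with the separator and record each sample's span."""
    text = separator.join(data)
    entities, cursor = [], 0
    for sample, label in zip(data, labels):
        entities.append((cursor, cursor + len(sample), label))
        # Skip over the sample and the separator that follows it.
        cursor += len(sample) + len(separator)
    return text, entities


text, entities = to_unstructured(["ab", "cde"], ["NAME", "DATE"], separator="|")
```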
- process(data, labels=None, label_mapping=None, batch_size=32)¶
Prepares structured data for processing by the CharacterLevelCnnModel.
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Return batch_data
A dict containing samples of size batch_size
- Rtype batch_data
dict
- classmethod get_class(class_name)¶
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- processor_type = 'preprocessor'¶
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', is_pred_labels=True, random_state=None)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the StructCharPostprocessor class
- Parameters
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_pred_labels (bool) – (default: true) if true, will convert the model indexes to the label strings given the label_mapping
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
- classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- static match_sentence_lengths(data, results, flatten_separator, inplace=True)¶
Converts the results from the model into the same ragged data shapes as the original data.
- Parameters
data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
- convert_to_structured_analysis(sentences, results, label_mapping, default_label, pad_label)¶
Converts unstructured results to a structured column analysis, assuming the column was flattened into a single sample. This takes the mode of all character predictions except for the separator labels. In case of a tie, any label other than background is chosen; if several labels remain, one is chosen at random.
- Parameters
sentences (list(str)) – samples which were predicted upon
results (dict) – character predictions for each sample returned from the model
label_mapping (dict) – maps labels to their encoded integers
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
- Returns
prediction value for a single column
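The mode-with-tie-break rule described above can be sketched for a single cell as follows. `column_label`, its arguments, and the fallback to the background label for empty input are assumptions made for illustration.

```python
import random
from collections import Counter


def column_label(char_predictions, background_id, ignored_ids=(), rng=None):
    """Pick the most common label id, preferring non-background on ties."""
    if rng is None:
        rng = random.Random(0)
    counts = Counter(p for p in char_predictions if p not in ignored_ids)
    if not counts:
        return background_id  # assumption: nothing to vote on
    top = max(counts.values())
    tied = sorted(label for label, n in counts.items() if n == top)
    if len(tied) > 1:
        # Prefer any non-background label when the vote is tied.
        tied = [label for label in tied if label != background_id] or tied
    # Remaining ties are broken at random.
    return rng.choice(tied)


winner = column_label([1, 1, 2, 2, 0], background_id=1, ignored_ids={0})
```

In this call labels 1 (background) and 2 each receive two votes, so the non-background label is preferred.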
- process(data, results, label_mapping)¶
Postprocessing of CharacterLevelCnnModel results when given structured data processed by StructCharPreprocessor.
- Parameters
data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – maps labels to their encoded integers
- Returns
dict of predictions and if they exist, confidences
- Return type
dict
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- processor_type = 'postprocessor'¶
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func='split', priority_order=None, random_state=None)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the RegexPostProcessor class
- Parameters
aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)
priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
- classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- static priority_prediction(results, entity_priority_order)¶
Aggregation function using priority of regex to give entity determination.
- Parameters
results (dict) – regex results from the model in the format dict(pred=…, conf=…)
entity_priority_order (np.ndarray) – list of entity priorities (the lowest value has the highest priority)
- Returns
aggregated predictions
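As a loose illustration of priority-based aggregation, the sketch below picks, for each character, the candidate entity with the best rank. The input layout is an assumption: `candidates_per_char` is a hypothetical list of candidate entity ids per character, and `entity_priority_order[entity_id]` is read as that entity's rank, with a lower rank winning. The real format of `results` is not described here.

```python
def pick_by_priority(candidates_per_char, entity_priority_order):
    """For each character, keep the candidate entity with the lowest rank."""
    return [
        min(candidates, key=lambda entity: entity_priority_order[entity])
        for candidates in candidates_per_char
    ]


# Ranks indexed by entity id: entity 2 has the best (lowest) rank.
ranks = [9, 1, 0, 2]
chosen = pick_by_priority([[2, 1], [3], [1, 3]], ranks)
```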
- static split_prediction(results)¶
Splits the prediction across votes.
- Parameters
results (dict) – regex results from the model in the format dict(pred=…, conf=…)
- Returns
aggregated predictions
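One plausible reading of splitting a prediction across votes is to share the confidence equally among the entities that matched. The sketch below assumes a row of 0/1 votes per character; `split_votes` is a hypothetical helper and the vote layout is an assumption, not the documented format of `results`.

```python
def split_votes(votes):
    """Turn a row of 0/1 entity votes into confidences that sum to 1."""
    total = sum(votes)
    if total == 0:
        # No entity matched; leave all confidences at zero.
        return [0.0] * len(votes)
    return [vote / total for vote in votes]


shared = split_votes([0, 1, 1, 0])
```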
- process(data, labels=None, label_mapping=None, batch_size=None)¶
Data postprocessing function.
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- processor_type = 'postprocessor'¶
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- class dataprofiler.labelers.data_processing.StructRegexPostProcessor(random_state=None)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Initialize the StructRegexPostProcessor class
- Parameters
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
- set_params(**kwargs)¶
Given kwargs, set the parameters if they exist.
- classmethod help()¶
Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.
- Returns
None
- classmethod get_class(class_name)¶
- get_parameters(param_list=None)¶
Returns a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath)¶
Loads a data processor from a given path on disk
- classmethod load_from_library(name)¶
Loads a data processor from within the library
- process(data, labels=None, label_mapping=None, batch_size=None)¶
Data postprocessing function.
- processor_type = 'postprocessor'¶
- save_to_disk(dirpath)¶
Saves a data processor to a path on disk.