Data Processing¶

class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname, bases, attrs)¶

Bases: abc.ABCMeta

mro()¶: Return a type’s method resolution order.

register(subclass)¶

Returns the subclass, to allow usage as a class decorator.
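Because register returns the class it is given, it can be applied with decorator syntax. The following is a minimal sketch using only the standard library's abc module; the Shape and Circle names are hypothetical and are not part of dataprofiler.

```python
from abc import ABC


class Shape(ABC):
    """Hypothetical abstract base class used only for this illustration."""


@Shape.register  # register() returns the class, so the decorator leaves it intact
class Circle:
    """Does not inherit from Shape, but is registered as a virtual subclass."""


print(issubclass(Circle, Shape))    # True
print(isinstance(Circle(), Shape))  # True
print(Shape in Circle.__mro__)      # False: virtual subclasses do not change the MRO
```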

class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters)¶

Bases: object

Abstract Data processing class.

processor_type = None¶

classmethod get_class(class_name)¶
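get_class is not described on this page. The name AutoSubRegistrationMeta suggests that subclasses are recorded as they are defined and can later be looked up by name. The sketch below illustrates that pattern under this assumption; the registry dict, the lower-casing of names, and the AutoRegisterMeta, BaseProcessor, and UpperCaser classes are all hypothetical and may not match the library's behavior.

```python
import abc


class AutoRegisterMeta(abc.ABCMeta):
    """Hypothetical metaclass: records every class it creates in a registry."""

    registry = {}

    def __new__(mcls, clsname, bases, attrs):
        new_class = super().__new__(mcls, clsname, bases, attrs)
        mcls.registry[clsname.lower()] = new_class
        return new_class


class BaseProcessor(metaclass=AutoRegisterMeta):
    @classmethod
    def get_class(cls, class_name):
        # Look up a previously defined class by its case-insensitive name.
        return AutoRegisterMeta.registry.get(class_name.lower())


class UpperCaser(BaseProcessor):
    def process(self, text):
        return text.upper()


found = BaseProcessor.get_class("UpperCaser")
print(found is UpperCaser)
print(BaseProcessor.get_class("missing"))
```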

abstract classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.
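The get_parameters and set_params pair can be pictured as reads and guarded writes on a dict of settings. This is a minimal sketch with a hypothetical ParamHolder class, not the library's implementation; in particular, whether unknown names raise an error or are ignored is not stated here, and raising ValueError is an assumption of the sketch.

```python
class ParamHolder:
    """Hypothetical stand-in for a processor that stores its settings in a dict."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list, return a copy of every parameter.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Only update names that already exist (assumption: unknown names raise).
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)


holder = ParamHolder(max_length=3400, pad_label="PAD")
holder.set_params(max_length=100)
print(holder.get_parameters(["max_length"]))
```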

abstract process(*args)¶: Data processing function.

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.
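A save and load round trip of this kind can be sketched with a JSON file inside the target directory. The JsonBackedProcessor class, the file name, and the JSON format are assumptions made for illustration; the on-disk layout the library actually uses is not documented on this page.

```python
import json
import os
import tempfile


class JsonBackedProcessor:
    """Hypothetical processor whose parameters round-trip through a JSON file."""

    FILE_NAME = "parameters.json"  # hypothetical file name

    def __init__(self, **parameters):
        self.parameters = dict(parameters)

    def save_to_disk(self, dirpath):
        # Write the parameters into the given directory.
        os.makedirs(dirpath, exist_ok=True)
        with open(os.path.join(dirpath, self.FILE_NAME), "w") as fp:
            json.dump(self.parameters, fp)

    @classmethod
    def load_from_disk(cls, dirpath):
        # Rebuild an instance from the parameters stored in the directory.
        with open(os.path.join(dirpath, cls.FILE_NAME)) as fp:
            return cls(**json.load(fp))


with tempfile.TemporaryDirectory() as tmp_dir:
    JsonBackedProcessor(pad_label="PAD", max_length=3400).save_to_disk(tmp_dir)
    restored = JsonBackedProcessor.load_from_disk(tmp_dir)

print(restored.parameters)
```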

class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters)¶

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data preprocessing class.

processor_type = 'preprocessor'¶

abstract process(data, labels, label_mapping, batch_size)¶: Data preprocessing function.

classmethod get_class(class_name)¶

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

abstract classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters)¶

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data postprocessing class.

processor_type = 'postprocessor'¶

abstract process(data, results, label_mapping)¶: Data postprocessing function.

classmethod get_class(class_name)¶

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

abstract classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.DirectPassPreprocessor¶

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Initialize the DirectPassPreprocessor class

classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

process(data, labels=None, label_mapping=None, batch_size=None)¶: Data preprocessing function.

classmethod get_class(class_name)¶

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

processor_type = 'preprocessor'¶

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.CharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_split=0, flatten_separator=' ', is_separate_at_max_len=False)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Initialize the CharPreprocessor class

Parameters

max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate split of the output between flattened and non-flattened characters, a value in [0, 1]. When the current flattened split exceeds the flatten_split value, any leftover sample or subsequent samples are left non-flattened until the current flattened split falls below the flatten_split value
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
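The difference between the two is_separate_at_max_len modes can be illustrated with a small splitter. This is a sketch under stated assumptions, not the library's algorithm: split_sample is a hypothetical helper, and "nearest separator" is taken to mean the last separator that still fits inside the max_length window.

```python
def split_sample(text, max_length, separator=" ", is_separate_at_max_len=False):
    """Split text into chunks of at most max_length characters."""
    chunks = []
    start = 0
    while len(text) - start > max_length:
        end = start + max_length
        if not is_separate_at_max_len:
            # Back up to the last separator inside the window, if there is one.
            cut = text.rfind(separator, start, end)
            if cut > start:
                end = cut + len(separator)
        chunks.append(text[start:end])
        start = end
    chunks.append(text[start:])
    return chunks


print(split_sample("alpha beta gamma", 8))
# ['alpha ', 'beta ', 'gamma']
print(split_sample("alpha beta gamma", 8, is_separate_at_max_len=True))
# ['alpha be', 'ta gamma']
```

In both modes, joining the chunks reproduces the original text.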

classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

process(data, labels=None, label_mapping=None, batch_size=32)¶

Flatten batches of data

Parameters

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[None, dict]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns

batch_data, a dict containing samples of size batch_size

Return type

dicts
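The plural return type suggests that one dict is produced per batch. A minimal sketch of that batching pattern follows; iter_batches and the "samples" key are hypothetical, and the real method also flattens and encodes the data, which is omitted here.

```python
def iter_batches(samples, batch_size=32):
    """Yield dicts holding at most batch_size samples each."""
    for start in range(0, len(samples), batch_size):
        # "samples" is a hypothetical key chosen for this illustration.
        yield {"samples": samples[start:start + batch_size]}


batches = list(iter_batches(["a", "b", "c"], batch_size=2))
print(batches)
# [{'samples': ['a', 'b']}, {'samples': ['c']}]
```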

classmethod get_class(class_name)¶

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

processor_type = 'preprocessor'¶

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.CharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator=' ', use_word_level_argmax=False, output_format='character_argmax', separators=(' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent=0.75)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Initialize the CharPostprocessor class

Parameters

default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity
output_format (str) – either character_argmax or NER. character_argmax is a list of encodings for each character in the input text; NER is the dict format which specifies start, end, label for each entity in a sentence
separators (tuple(str)) – list of characters to use for separating words within the character predictions
word_level_min_percent (float) – threshold on generating dominant word_level labeling
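One way to read use_word_level_argmax together with word_level_min_percent is that each word takes its dominant character label only when that label covers a large enough share of the word. The sketch below implements that reading; word_level_labels is a hypothetical helper, the fallback to default_label is an assumption, and the library's exact rule may differ.

```python
from collections import Counter


def word_level_labels(text, char_labels, separators=(" ", ","),
                      min_percent=0.75, default_label=1):
    """Relabel each word with its dominant character label, if dominant enough."""
    output = list(char_labels)
    start = 0
    # Append a sentinel separator so the final word is also processed.
    for idx, char in enumerate(text + separators[0]):
        if char not in separators:
            continue
        if idx > start:
            label, count = Counter(char_labels[start:idx]).most_common(1)[0]
            share = count / (idx - start)
            # Assumption: words without a dominant label fall back to the default.
            word_label = label if share >= min_percent else default_label
            output[start:idx] = [word_label] * (idx - start)
        start = idx + 1
    return output


print(word_level_labels("john doe", [2, 2, 2, 1, 1, 3, 1, 1]))
# [2, 2, 2, 2, 1, 1, 1, 1]
```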

classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

static convert_to_NER_format(predictions, label_mapping, default_label, pad_label)¶

Converts word level predictions to specified format

Parameters

predictions (list) – predictions
label_mapping (dict) – labels and corresponding integers
default_label (str) – default label in label_mapping
pad_label (str) – pad label in label_mapping

Returns

formatted predictions

Return type

list
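The conversion to a start, end, label form can be sketched as collapsing runs of identical character labels into spans while skipping the default and pad labels. to_ner_spans is a hypothetical helper operating on a single sample, and the exclusive end index is an assumption; it is not the library's implementation.

```python
def to_ner_spans(char_predictions, label_mapping,
                 default_label="UNKNOWN", pad_label="PAD"):
    """Collapse per-character label ids into (start, end, label) spans."""
    id_to_label = {idx: name for name, idx in label_mapping.items()}
    skip = {label_mapping[default_label], label_mapping[pad_label]}
    spans = []
    start = 0
    for pos in range(1, len(char_predictions) + 1):
        # Close the current run when the label changes or the input ends.
        if pos == len(char_predictions) or char_predictions[pos] != char_predictions[start]:
            label_id = char_predictions[start]
            if label_id not in skip:
                spans.append((start, pos, id_to_label[label_id]))
            start = pos
    return spans


mapping = {"PAD": 0, "UNKNOWN": 1, "NAME": 2, "DATE": 3}
print(to_ner_spans([2, 2, 2, 2, 1, 3, 3, 0], mapping))
# [(0, 4, 'NAME'), (5, 7, 'DATE')]
```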

static match_sentence_lengths(data, results, flatten_separator, inplace=True)¶

Converts the results from the model into the same ragged data shapes as the original data.

Parameters

data (numpy.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)
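Restoring ragged shapes amounts to cutting a flat prediction sequence at the original sample lengths and dropping the positions that belong to the separator. The sketch below shows this for a single flattened sequence; split_to_lengths is a hypothetical helper, and the real method operates on the results dict (pred and, if present, conf).

```python
def split_to_lengths(samples, flat_predictions, flatten_separator=" "):
    """Cut one flat prediction list back into one list per original sample."""
    sep_len = len(flatten_separator)
    per_sample = []
    cursor = 0
    for sample in samples:
        per_sample.append(flat_predictions[cursor:cursor + len(sample)])
        # Skip the predictions made for the separator between samples.
        cursor += len(sample) + sep_len
    return per_sample


# "ab" and "cde" flattened with a one-character separator give "ab cde".
print(split_to_lengths(["ab", "cde"], [2, 2, 1, 3, 3, 3]))
# [[2, 2], [3, 3, 3]]
```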

process(data, results, label_mapping)¶

Conducts the processing on the data given the predictions, label_mapping, and default_label.

Parameters

data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – labels and corresponding integers

Returns

dict of predictions and if they exist, confidences

classmethod get_class(class_name)¶

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

processor_type = 'postprocessor'¶

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', is_separate_at_max_len=False)¶

Bases: dataprofiler.labelers.data_processing.CharPreprocessor

Initialize the StructCharPreprocessor class

Parameters

max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

convert_to_unstructured_format(data, labels)¶

Converts the list of data samples into the CharPreprocessor required input data format.

Parameters

data (numpy.ndarray) – list of strings
labels (list) – labels for each input character

Returns

data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), … (num_samples in data)]
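The text and entities format described above can be sketched by joining the cells with the separator and recording one span per cell. to_unstructured is a hypothetical helper; plain tuples and an exclusive end index are assumptions made for the illustration.

```python
def to_unstructured(cells, labels, separator="\x01" * 5):
    """Join cells into one text and record one (start, end, label) span per cell."""
    text = separator.join(cells)
    entities = []
    start = 0
    for cell, label in zip(cells, labels):
        end = start + len(cell)
        entities.append((start, end, label))
        # The next cell begins after the separator.
        start = end + len(separator)
    return text, entities


text, entities = to_unstructured(["123", "Bob"], ["ID", "NAME"], separator="|")
print(text)      # 123|Bob
print(entities)  # [(0, 3, 'ID'), (4, 7, 'NAME')]
```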

process(data, labels=None, label_mapping=None, batch_size=32)¶

Prepares structured data for processing by the CharacterLevelCnnModel.

Parameters

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns

batch_data, a dict containing samples of size batch_size

Return type

dict

classmethod get_class(class_name)¶

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

processor_type = 'preprocessor'¶

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', random_state=None)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Initialize the StructCharPostprocessor class

Parameters

default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
random_state (random.Random) – random state used to randomly select a prediction when two labels are equally likely for a given sample.

classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

static match_sentence_lengths(data, results, flatten_separator, inplace=True)¶

Converts the results from the model into the same ragged data shapes as the original data.

Parameters

data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins two samples together when flattening
inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)

convert_to_structured_analysis(sentences, results, label_mapping, default_label, pad_label)¶

Converts unstructured results to a structured column analysis, assuming the column was flattened into a single sample. This takes the mode of all character predictions except for the separator labels. In case of a tie, it chooses any label other than background; if several labels remain, it chooses randomly among them.

Parameters

sentences (list(str)) – samples which were predicted upon
results (dict) – character predictions for each sample return from model
label_mapping (dict) – maps labels to their encoded integers
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label

Returns

prediction value for a single column
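The mode with tie-breaking rule can be sketched as follows. column_label is a hypothetical helper working on label names rather than encoded integers, separator_label is an assumed way of marking separator positions, and returning the background label for empty input is also an assumption.

```python
import random
from collections import Counter


def column_label(char_labels, background_label, separator_label=None,
                 random_state=None):
    """Pick one label for a column from its character-level predictions."""
    rng = random_state if random_state is not None else random.Random()
    counts = Counter(label for label in char_labels if label != separator_label)
    if not counts:
        return background_label
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    # On a tie, prefer any label other than the background label.
    non_background = [label for label in tied if label != background_label]
    candidates = non_background or tied
    return candidates[0] if len(candidates) == 1 else rng.choice(candidates)


print(column_label(["NAME", "NAME", "UNKNOWN", "UNKNOWN"], "UNKNOWN"))  # NAME
print(column_label(["DATE", "DATE", "NAME"], "UNKNOWN"))                # DATE
```

Passing a seeded random.Random as random_state makes the random choice between tied non-background labels reproducible.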

process(data, results, label_mapping)¶

Postprocessing of CharacterLevelCnnModel results when given structured data processed by StructCharPreprocessor.

Parameters

data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – maps labels to their encoded integers

Returns

dict of predictions and if they exist, confidences

Return type

dict

classmethod get_class(class_name)¶

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

processor_type = 'postprocessor'¶

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func='split', priority_order=None, random_state=None)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Initialize the RegexPostProcessor class

Parameters

aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)
priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set
random_state (random.Random) – random state used to randomly select a prediction when two labels are equally likely for a given sample.

classmethod help()¶

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns: None

static priority_prediction(results, entity_priority_order)¶

Aggregation function using priority of regex to give entity determination.

Parameters

results (dict) – regex output from the model in the format: dict(pred=…, conf=…)
entity_priority_order (np.ndarray) – list of entity priorities (the lowest value has the highest priority)

Returns

aggregated predictions
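Priority aggregation can be pictured as choosing, among the labels whose regexes matched, the one with the lowest priority value. The sketch below assumes a simplified input of 0/1 match flags per label for each row; pick_by_priority, this input shape, and returning None when nothing matched are all assumptions, not the library's data format.

```python
def pick_by_priority(match_flags, entity_priority_order):
    """For each row, return the index of the matched label with the lowest priority value."""
    predictions = []
    for row in match_flags:
        matched = [idx for idx, flag in enumerate(row) if flag]
        if matched:
            # Lowest priority value wins.
            predictions.append(min(matched, key=lambda idx: entity_priority_order[idx]))
        else:
            # Assumption: no match yields no prediction.
            predictions.append(None)
    return predictions


flags = [[1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 0, 0]]
print(pick_by_priority(flags, entity_priority_order=[3, 1, 2]))
# [1, 2, 0, None]
```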

static split_prediction(results)¶

Splits the prediction across votes.

Parameters

results (dict) – regex from model in format: dict(pred=…, conf=…)

Returns

aggregated predictions
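Splitting a prediction across votes can be read as dividing the confidence equally among all labels that matched. The sketch below assumes 0/1 match flags per label for each row; split_votes and this input shape are hypothetical and may differ from the library's results format.

```python
def split_votes(match_flags):
    """Share each row's confidence equally among the labels that matched."""
    confidences = []
    for row in match_flags:
        total = sum(row)
        # Rows with no match keep a confidence of zero for every label.
        confidences.append([flag / total if total else 0.0 for flag in row])
    return confidences


print(split_votes([[1, 1, 0], [0, 1, 0], [0, 0, 0]]))
# [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```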

process(data, labels=None, label_mapping=None, batch_size=None)¶: Data postprocessing function.

classmethod get_class(class_name)¶

get_parameters(param_list=None)¶

Returns a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath)¶: Loads a data processor from a given path on disk

classmethod load_from_library(name)¶: Loads a data processor from within the library

processor_type = 'postprocessor'¶

save_to_disk(dirpath)¶: Saves a data processor to a path on disk.

set_params(**kwargs)¶: Given kwargs, set the parameters if they exist.