Data Processing

class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname, bases, attrs)

Bases: abc.ABCMeta

mro()

Return a type’s method resolution order.

register(subclass)

Register a virtual subclass of an ABC.

Returns the subclass, to allow usage as a class decorator.
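Since `register` and `mro` are inherited from `abc.ABCMeta`, their behaviour can be illustrated with the standard library alone. The sketch below is not taken from the library; `Processor` and `ExternalProcessor` are hypothetical names used only to show `register` acting as a class decorator.

```python
import abc


class Processor(abc.ABC):
    """Hypothetical abstract base class used only for this illustration."""

    @abc.abstractmethod
    def process(self, data):
        ...


# register() returns the class it is given, so it can be used as a decorator.
@Processor.register
class ExternalProcessor:
    """Does not inherit from Processor; it is registered as a virtual subclass."""

    def process(self, data):
        return data


# A virtual subclass passes issubclass/isinstance checks...
is_virtual_subclass = issubclass(ExternalProcessor, Processor)
# ...but the ABC does not appear in its method resolution order.
in_mro = Processor in ExternalProcessor.mro()
print(is_virtual_subclass, in_mro)
```

Virtual subclasses do not inherit any methods from the ABC, so registration only affects type checks.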

class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters)

Bases: object

Abstract Data processing class.

processor_type = None
classmethod get_class(class_name)
abstract classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

set_params(**kwargs)

Given kwargs, set the parameters if they exist.
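The interplay of `get_parameters` and `set_params` can be sketched with a small stand-in class. This is an illustration under assumptions, not the library's implementation: `ParamStore` is a hypothetical name, and rejecting unknown parameter names with `ValueError` is an assumed behaviour.

```python
class ParamStore:
    """Hypothetical stand-in; not the library's BaseDataProcessor."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list, return every parameter; otherwise only those requested.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Assumption: names that do not already exist are rejected.
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)


store = ParamStore(max_length=3400, pad_label="PAD")
store.set_params(max_length=100)
subset = store.get_parameters(["max_length"])
```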

abstract process(*args)

Data processing function.

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

save_to_disk(dirpath)

Saves a data processor to a path on disk.
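A save and load round trip might look like the following sketch. The on-disk layout is an assumption made for illustration (one JSON file named after `processor_type`), and `DiskBackedProcessor` is a hypothetical class, not part of the library.

```python
import json
import os
import tempfile


class DiskBackedProcessor:
    """Hypothetical stand-in that persists its parameters as JSON."""

    processor_type = "preprocessor"

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def save_to_disk(self, dirpath):
        # Assumed layout: a single JSON file named after the processor type.
        path = os.path.join(dirpath, self.processor_type + "_parameters.json")
        with open(path, "w") as fp:
            json.dump(self._parameters, fp)

    @classmethod
    def load_from_disk(cls, dirpath):
        # Rebuild the processor from the parameters stored by save_to_disk.
        path = os.path.join(dirpath, cls.processor_type + "_parameters.json")
        with open(path) as fp:
            return cls(**json.load(fp))


with tempfile.TemporaryDirectory() as dirpath:
    DiskBackedProcessor(max_length=100, pad_label="PAD").save_to_disk(dirpath)
    restored = DiskBackedProcessor.load_from_disk(dirpath)
```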

class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters)

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data preprocessing class.

processor_type = 'preprocessor'
abstract process(data, labels, label_mapping, batch_size)

Data preprocessing function.

classmethod get_class(class_name)
get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

abstract classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters)

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data postprocessing class.

processor_type = 'postprocessor'
abstract process(data, results, label_mapping)

Data postprocessing function.

classmethod get_class(class_name)
get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

abstract classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.DirectPassPreprocessor

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Initialize the DirectPassPreprocessor class

classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

process(data, labels=None, label_mapping=None, batch_size=None)

Data preprocessing function.

classmethod get_class(class_name)
get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

processor_type = 'preprocessor'
save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.CharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_split=0, flatten_separator=' ', is_separate_at_max_len=False)

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Initialize the CharPreprocessor class

Parameters
  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_split (float) – approximate fraction of the output that consists of flattened rather than non-flattened characters, a value in [0, 1]. Once the current flattened fraction exceeds flatten_split, any leftover sample and subsequent samples are left non-flattened until the flattened fraction drops below flatten_split again

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
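The effect of `is_separate_at_max_len` can be sketched as follows. This is a simplified illustration, not the library's algorithm: `split_sample` is a hypothetical helper, and "nearest separator" is assumed here to mean the last separator that occurs before the `max_length` limit.

```python
def split_sample(text, max_length, separator=" ", is_separate_at_max_len=False):
    """Split text into chunks of at most max_length characters."""
    chunks = []
    while len(text) > max_length:
        cut, skip = max_length, 0
        if not is_separate_at_max_len:
            # Prefer the last separator inside the window, if there is one.
            sep_index = text.rfind(separator, 0, max_length)
            if sep_index > 0:
                cut, skip = sep_index, len(separator)
        chunks.append(text[:cut])
        # Continue after the cut, dropping the separator when one was used.
        text = text[cut + skip:]
    chunks.append(text)
    return chunks


hard_split = split_sample("alpha beta gamma", 8, is_separate_at_max_len=True)
soft_split = split_sample("alpha beta gamma", 8, is_separate_at_max_len=False)
print(hard_split)  # ['alpha be', 'ta gamma']
print(soft_split)  # ['alpha', 'beta', 'gamma']
```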

classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

process(data, labels=None, label_mapping=None, batch_size=32)

Flatten batches of data

Parameters
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[None, dict]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns

batch_data, a dict containing samples of size batch_size

Return type

dicts

classmethod get_class(class_name)
get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

processor_type = 'preprocessor'
save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.CharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator=' ', use_word_level_argmax=False, output_format='character_argmax', separators=(' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent=0.75)

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Initialize the CharPostprocessor class

Parameters
  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity

  • output_format (str) – either character_argmax or NER. character_argmax is a list of encodings for each character in the input text; NER is a dict format that specifies start, end, and label for each entity in a sentence

  • separators (tuple(str)) – list of characters to use for separating words within the character predictions

  • word_level_min_percent (float) – threshold on generating dominant word_level labeling

classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

static convert_to_NER_format(predictions, label_mapping, default_label, pad_label)

Converts word level predictions to specified format

Parameters
  • predictions (list) – predictions

  • label_mapping (dict) – labels and corresponding integers

  • default_label (str) – default label in label_mapping

  • pad_label (str) – pad label in label_mapping

Returns

formatted predictions

Return type

list
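One way to picture the conversion is to collapse runs of identical character-level predictions into entity spans. The sketch below is an assumption-laden illustration, not the library's code: `to_ner_format` is a hypothetical helper, entities are represented as `(start, end, label)` tuples with an exclusive `end`, and spans labeled with the default or pad label are dropped.

```python
def to_ner_format(predictions, label_mapping, default_label, pad_label):
    # Invert the mapping so encoded integers can be turned back into labels.
    reverse = {index: label for label, index in label_mapping.items()}
    ignored = {label_mapping[default_label], label_mapping[pad_label]}
    formatted = []
    for sample in predictions:
        entities, start = [], 0
        for pos in range(1, len(sample) + 1):
            # Close the current run when the label changes or the sample ends.
            if pos == len(sample) or sample[pos] != sample[start]:
                if sample[start] not in ignored:
                    entities.append((start, pos, reverse[sample[start]]))
                start = pos
        formatted.append(entities)
    return formatted


mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2, "NAME": 3}
ner = to_ner_format([[1, 1, 3, 3, 3, 1, 2, 2]], mapping, "UNKNOWN", "PAD")
print(ner)  # [[(2, 5, 'NAME'), (6, 8, 'ADDRESS')]]
```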

static match_sentence_lengths(data, results, flatten_separator, inplace=True)

Converts the results from the model into the same ragged data shapes as the original data.

Parameters
  • data (numpy.ndarray) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • flatten_separator (str) – string that joins two samples together when flattening

  • inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)
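The reshaping step can be sketched as slicing one flattened prediction sequence back into per-sample pieces. This is a simplified illustration, not the library's implementation: `match_lengths` is a hypothetical helper, and each value in `results` is assumed to be a single flat list covering all samples joined by `flatten_separator`.

```python
import copy


def match_lengths(data, results, flatten_separator, inplace=True):
    if not inplace:
        # Work on a copy so the caller's dict is left untouched.
        results = copy.deepcopy(results)
    for key in list(results):
        flat = results[key]
        ragged, start = [], 0
        for sample in data:
            ragged.append(flat[start:start + len(sample)])
            # Skip the entries that belong to the separator between samples.
            start += len(sample) + len(flatten_separator)
        results[key] = ragged
    return results


original = {"pred": [1, 1, 1, 0, 2, 2]}
reshaped = match_lengths(["abc", "de"], original, " ", inplace=False)
print(reshaped)  # {'pred': [[1, 1, 1], [2, 2]]}
```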

process(data, results, label_mapping)

Conducts the processing on the data given the predictions, label_mapping, and default_label.

Parameters
  • data (np.ndarray) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • label_mapping (dict) – labels and corresponding integers

Returns

dict of predictions and if they exist, confidences

classmethod get_class(class_name)
get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

processor_type = 'postprocessor'
save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length=3400, default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', is_separate_at_max_len=False)

Bases: dataprofiler.labelers.data_processing.CharPreprocessor

Initialize the StructCharPreprocessor class

Parameters
  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

convert_to_unstructured_format(data, labels)

Converts the list of data samples into the CharPreprocessor required input data format.

Parameters
  • data (numpy.ndarray) – list of strings

  • labels (list) – labels for each input character

Returns

data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), …(num_samples in data)]
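The return format above can be sketched by joining the samples with the separator and recording where each sample lands in the joined text. This is an illustration under assumptions, not the library's code: `to_unstructured` is a hypothetical helper, one label per sample is assumed, and entities are plain `(start, end, label)` tuples with an exclusive `end`.

```python
def to_unstructured(data, labels, separator="\x01" * 5):
    # Join all samples into a single text, as the format above describes.
    text = separator.join(data)
    entities, start = [], 0
    for sample, label in zip(data, labels):
        end = start + len(sample)
        entities.append((start, end, label))
        # The next sample begins after the separator.
        start = end + len(separator)
    return text, entities


text, entities = to_unstructured(
    ["123 Main St", "Alice"], ["ADDRESS", "NAME"], separator="|"
)
print(text)      # 123 Main St|Alice
print(entities)  # [(0, 11, 'ADDRESS'), (12, 17, 'NAME')]
```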

process(data, labels=None, label_mapping=None, batch_size=32)

Prepares structured data for processing by the CharacterLevelCnnModel.

Parameters
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[dict, None]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns

batch_data, a dict containing samples of size batch_size

Return type

dict

classmethod get_class(class_name)
classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

processor_type = 'preprocessor'
save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label='UNKNOWN', pad_label='PAD', flatten_separator='\x01\x01\x01\x01\x01', random_state=None)

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Initialize the StructCharPostprocessor class

Parameters
  • default_label (str) – Key for label_mapping that is the default label

  • pad_label (str) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

static match_sentence_lengths(data, results, flatten_separator, inplace=True)

Converts the results from the model into the same ragged data shapes as the original data.

Parameters
  • data (np.ndarray) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • flatten_separator (str) – string that joins two samples together when flattening

  • inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)

convert_to_structured_analysis(sentences, results, label_mapping, default_label, pad_label)

Converts unstructured results to a structured column analysis, assuming the column was flattened into a single sample. This takes the mode of all character predictions except for the separator labels. In case of a tie, choose any label other than background; if several such labels remain, choose randomly among them.

Parameters
  • sentences (list(str)) – samples which were predicted upon

  • results (dict) – character predictions for each sample return from model

  • label_mapping (dict) – maps labels to their encoded integers

  • default_label (str) – Key for label_mapping that is the default label

  • pad_label (str) – Key for label_mapping that is the pad label

Returns

prediction value for a single column
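The mode-with-tie-breaking rule described for this method can be sketched as below. This is not the library's implementation: `column_prediction` is a hypothetical helper that works on one flat list of encoded character predictions, with `background` and `ignored` (for example pad or separator encodings) passed in explicitly.

```python
import random
from collections import Counter


def column_prediction(char_preds, background, ignored=(), rng=None):
    if rng is None:
        rng = random.Random()
    # Count every prediction except the ignored encodings.
    counts = Counter(p for p in char_preds if p not in ignored)
    if not counts:
        return background
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    # On a tie, drop the background label if any other label is available.
    non_background = [label for label in tied if label != background]
    candidates = non_background or tied
    if len(candidates) == 1:
        return candidates[0]
    # Several non-background labels remain: pick one at random.
    return rng.choice(candidates)


tie_with_background = column_prediction([1, 1, 2, 2, 0], background=1, ignored={0})
clear_winner = column_prediction([3, 3, 1], background=1)
print(tie_with_background, clear_winner)  # 2 3
```

Passing a seeded `random.Random` as `rng` makes the random tie-break reproducible.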

process(data, results, label_mapping)

Postprocessing of CharacterLevelCnnModel results when given structured data processed by StructCharPreprocessor.

Parameters
  • data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • label_mapping (dict) – maps labels to their encoded integers

Returns

dict of predictions and if they exist, confidences

Return type

dict

classmethod get_class(class_name)
get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

processor_type = 'postprocessor'
save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.

class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func='split', priority_order=None, random_state=None)

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Initialize the RegexPostProcessor class

Parameters
  • aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)

  • priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set

  • random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help()

Help function describing alterable parameters, input data formats for preprocessors, and output data formats for postprocessors.

Returns

None

static priority_prediction(results, entity_priority_order)

Aggregation function using priority of regex to give entity determination.

Parameters
  • results (dict) – regex from model in format: dict(pred=…, conf=…)

  • entity_priority_order (np.ndarray) – list of entity priorities (lowest has higher priority)

Returns

aggregated predictions
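Priority-based aggregation can be pictured as follows. The sketch rests on assumptions that the text above does not confirm: each row holds 0/1 flags marking which labels' regexes matched a character, `entity_priority_order[j]` is the priority value of label `j` with smaller values winning, and index 0 is used when nothing matched. `pick_by_priority` is a hypothetical helper.

```python
def pick_by_priority(match_rows, entity_priority_order, fallback=0):
    predictions = []
    for row in match_rows:
        # Indices of the labels whose regex matched this character.
        matched = [j for j, hit in enumerate(row) if hit]
        # Among matched labels, the smallest priority value wins.
        best = min(matched, key=lambda j: entity_priority_order[j], default=fallback)
        predictions.append(best)
    return predictions


rows = [[1, 0, 1], [1, 1, 0], [0, 0, 0]]
chosen = pick_by_priority(rows, entity_priority_order=[3, 1, 2])
print(chosen)  # [2, 1, 0]
```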

static split_prediction(results)

Splits the prediction across votes.

Parameters

results (dict) – regex from model in format: dict(pred=…, conf=…)

Returns

aggregated predictions
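Splitting a prediction across votes can be sketched as giving each matched label an equal share for every character. This is an illustration under assumptions, not the library's code: `split_votes` is a hypothetical helper, and each row is assumed to hold 0/1 flags marking which labels matched.

```python
def split_votes(match_rows):
    confidences = []
    for row in match_rows:
        total = sum(row)
        # Each matched label receives an equal share; unmatched rows stay at 0.
        confidences.append([hit / total if total else 0.0 for hit in row])
    return confidences


shares = split_votes([[1, 0, 1], [0, 1, 0], [0, 0, 0]])
print(shares)  # [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```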

process(data, labels=None, label_mapping=None, batch_size=None)

Data postprocessing function.

classmethod get_class(class_name)
get_parameters(param_list=None)

Returns a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath)

Loads a data processor from a given path on disk

classmethod load_from_library(name)

Loads a data processor from within the library

processor_type = 'postprocessor'
save_to_disk(dirpath)

Saves a data processor to a path on disk.

set_params(**kwargs)

Given kwargs, set the parameters if they exist.