Data Processing

Contains pre-built processors for data labeling/processing.

class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname: str, bases: tuple, attrs: dict)

Bases: abc.ABCMeta

For registering subclasses.

Create AutoSubRegistration object.

mro()

Return a type’s method resolution order.

register(subclass)

Register a virtual subclass of an ABC.

Returns the subclass, to allow usage as a class decorator.
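The registration pattern can be illustrated with a small standalone sketch. It is not the library's code: the names RegistryMeta, Base, Upper, _registry and _register are hypothetical, and keying the registry by lowercase class name is an assumption made for the example.

```python
import abc


class RegistryMeta(abc.ABCMeta):
    """Hypothetical metaclass that records every class it creates."""

    def __new__(mcls, clsname, bases, attrs):
        new_class = super().__new__(mcls, clsname, bases, attrs)
        # Hand the freshly created class to its own registration hook.
        new_class._register(new_class)
        return new_class


class Base(metaclass=RegistryMeta):
    # Shared by all subclasses because it is only defined here.
    _registry = {}

    @classmethod
    def _register(cls, subclass):
        # Assumption: classes are looked up by lowercase name.
        cls._registry[subclass.__name__.lower()] = subclass

    @classmethod
    def get_class(cls, class_name):
        # Returns None when no class with that name was registered.
        return cls._registry.get(class_name.lower())


class Upper(Base):
    """Defining the subclass is enough to register it."""
```

With this sketch, Base.get_class("upper") returns the Upper class without any explicit registration call, which is the convenience a registering metaclass is meant to provide.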

class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters: Any)

Bases: object

Abstract Data processing class.

Initialize BaseDataProcessor object.

processor_type: str = None
classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

abstract classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.
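As a rough illustration of how get_parameters and set_params fit together, the following standalone sketch stores parameters in a dict. ParamHolder and its _parameters attribute are hypothetical, and both the "None returns everything" behavior and the rejection of unknown names are assumptions, not confirmed library behavior.

```python
from typing import Any, Optional


class ParamHolder:
    """Hypothetical stand-in for a processor that keeps parameters in a dict."""

    def __init__(self, **parameters: Any) -> None:
        self._parameters = dict(parameters)

    def get_parameters(self, param_list: Optional[list] = None) -> dict:
        # Assumption: None means "return every stored parameter".
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs: Any) -> None:
        # Assumption: only parameters that already exist may be updated.
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)


holder = ParamHolder(max_length=10, pad_label="PAD")
holder.set_params(max_length=20)
subset = holder.get_parameters(["max_length"])
```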

abstract process(*args: Any) Any

Process data.

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

save_to_disk(dirpath: str) None

Save data processor to a path on disk.

class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters: Any)

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data preprocessing class.

Initialize BaseDataPreprocessor object.

processor_type: str = 'preprocessor'
abstract process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]

Preprocess data.

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

abstract classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters: Any)

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data postprocessing class.

Initialize BaseDataPostprocessor object.

processor_type: str = 'postprocessor'
abstract process(data: numpy.ndarray, results: dict, label_mapping: dict) dict

Postprocess data.

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

abstract classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.DirectPassPreprocessor

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing data.

Initialize the DirectPassPreprocessor class.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) tuple[np.ndarray, np.ndarray] | np.ndarray

Preprocess data.

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False, **kwargs: Any)

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing char data.

Initialize the CharPreprocessor class.

Parameters
  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
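The effect of is_separate_at_max_len can be pictured with a standalone sketch. The helper split_sample is hypothetical and is not the library's implementation; it assumes that "nearest separator" means the last separator that still fits inside the max_length window, with a hard cut when none is found.

```python
def split_sample(text, max_length, separator=" ", is_separate_at_max_len=False):
    """Split text into chunks of at most max_length characters."""
    chunks = []
    start = 0
    while len(text) - start > max_length:
        end = start + max_length
        if not is_separate_at_max_len:
            # Look for the last separator inside the window.
            cut = text.rfind(separator, start, end)
            if cut > start:
                # Keep the separator at the end of the chunk.
                end = cut + len(separator)
        chunks.append(text[start:end])
        start = end
    chunks.append(text[start:])
    return chunks


soft = split_sample("aaa bbb ccc", max_length=5)
hard = split_sample("aaa bbb ccc", max_length=5, is_separate_at_max_len=True)
```

In this sketch, soft keeps words whole where possible, while hard cuts strictly every max_length characters.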

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]

Flatten batches of data.

Parameters
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[None, dict]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns

batch_data: a dict containing samples of size batch_size

Return type

dict
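Because process is declared as a generator, callers consume its output one batch at a time. The following standalone sketch shows only the batching idea; iter_batches is a hypothetical helper and performs none of the flattening or label encoding that the real method is responsible for.

```python
def iter_batches(samples, batch_size=32):
    """Yield consecutive lists of at most batch_size samples."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly smaller, batch.
    if batch:
        yield batch


batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
```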

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharEncodedPreprocessor(encoding_map: dict[str, int] | None = None, max_length: int = 5000, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False)

Bases: dataprofiler.labelers.data_processing.CharPreprocessor

Subclass of CharPreprocessor for preprocessing char encoded data.

Initialize the CharEncodedPreprocessor class.

Parameters
  • encoding_map (dict) – char to int encoding map

  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]

Process structured data for being processed by CharacterLevelCnnModel.

Parameters
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[dict, None]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns

batch_data: a dict containing samples of size batch_size

Return type

dict

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = ' ', use_word_level_argmax: bool = False, output_format: str = 'character_argmax', separators: tuple = (' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent: float = 0.75)

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing char data.

Initialize the CharPostprocessor class.

Parameters
  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity

  • output_format (str) – output format, either character_argmax or NER. character_argmax is a list of encodings for each character in the input text; NER is a dict format that specifies the start, end, and label for each entity in a sentence

  • separators (tuple(str)) – list of characters to use for separating words within the character predictions

  • word_level_min_percent (float) – threshold on generating dominant word_level labeling

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

static convert_to_NER_format(predictions: list, label_mapping: dict, default_label: str, pad_label: str) list

Convert word level predictions to specified format.

Parameters
  • predictions (list) – predictions

  • label_mapping (dict) – labels and corresponding integers

  • default_label (str) – default label in label_mapping

  • pad_label (str) – pad label in label_mapping

Returns

formatted predictions

Return type

list
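One way to picture the conversion is as a run-length grouping of per-character label ids. The sketch below is standalone and hypothetical (to_ner_format is not the library function); it assumes predictions hold one list of integer ids per sample, that the end index is exclusive, and that runs of the default and pad labels are dropped.

```python
def to_ner_format(predictions, label_mapping, default_label, pad_label):
    """Group per-character label ids into (start, end, label) tuples."""
    # Invert the mapping so integer predictions become label names again.
    id_to_label = {idx: label for label, idx in label_mapping.items()}
    skip = {label_mapping[default_label], label_mapping[pad_label]}
    formatted = []
    for sample in predictions:
        entities = []
        start = 0
        for pos in range(1, len(sample) + 1):
            # Close the current run when the label changes or the sample ends.
            if pos == len(sample) or sample[pos] != sample[start]:
                if sample[start] not in skip:
                    entities.append((start, pos, id_to_label[sample[start]]))
                start = pos
        formatted.append(entities)
    return formatted


mapping = {"PAD": 0, "UNKNOWN": 1, "NAME": 2}
ner = to_ner_format([[1, 2, 2, 2, 1, 1]], mapping, "UNKNOWN", "PAD")
```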

static match_sentence_lengths(data: numpy.ndarray, results: dict, flatten_separator: str, inplace: bool = True) dict

Convert results from model into same ragged data shapes as original data.

Parameters
  • data (numpy.ndarray) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • flatten_separator (str) – string which joins two samples together when flattening

  • inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)
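The reshaping step can be sketched as slicing one flat prediction list back into per-sample pieces. This standalone example is hypothetical: match_lengths is not the library function, and it assumes a single flat list of predictions covering flatten_separator.join(data), with predictions for separator characters discarded.

```python
def match_lengths(data, flat_pred, flatten_separator):
    """Cut flat character predictions into one list per original sample."""
    ragged = []
    pos = 0
    for sample in data:
        ragged.append(flat_pred[pos:pos + len(sample)])
        # Skip the predictions made for the separator itself.
        pos += len(sample) + len(flatten_separator)
    return ragged


samples = ["ab", "cde"]
flat = [1, 1, 0, 2, 2, 2]  # predictions for "ab cde"
ragged = match_lengths(samples, flat, " ")
```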

process(data: numpy.ndarray, results: dict, label_mapping: dict) dict

Conduct processing on data given predictions, label_mapping, and default_label.

Parameters
  • data (Union[np.ndarray, pd.DataFrame]) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • label_mapping (dict) – labels and corresponding integers

Returns

dict of predictions and if they exist, confidences

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_separate_at_max_len: bool = False)

Bases: dataprofiler.labelers.data_processing.CharPreprocessor

Subclass of CharPreprocessor for preprocessing struct char data.

Initialize the StructCharPreprocessor class.

Parameters
  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

convert_to_unstructured_format(data: np.ndarray, labels: list[str] | npt.NDArray[np.str_] | None) tuple[str, list[tuple[int, int, str]] | None]

Convert data samples list to StructCharPreprocessor required input data format.

Parameters
  • data (numpy.ndarray) – list of strings

  • labels (Optional[Union[List[str], npt.NDArray[np.str_]]]) – labels for each input character

Returns

data in the following format: (text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), …(num_samples in data)])

Return type

Tuple[str, Optional[List[Tuple[int, int, str]]]]
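The described output format can be reproduced with a short standalone sketch. The helper to_unstructured is hypothetical and takes the separator as an argument rather than reading it from the processor's parameters; it assumes one entity per sample with an exclusive end index.

```python
def to_unstructured(data, labels, separator):
    """Join samples into one text and record each sample's character span."""
    text = separator.join(data)
    if labels is None:
        return text, None
    entities = []
    start = 0
    for sample, label in zip(data, labels):
        end = start + len(sample)
        entities.append((start, end, label))
        # The next sample begins after the separator.
        start = end + len(separator)
    return text, entities


text, entities = to_unstructured(["ab", "cde"], ["A", "B"], "|")
```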

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]

Process structured data for being processed by CharacterLevelCnnModel.

Parameters
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[dict, None]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns

batch_data: a dict containing samples of size batch_size

Return type

dict

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_pred_labels: bool = True, random_state: random.Random | int | list | tuple | None = None)

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct char data.

Initialize the StructCharPostprocessor class.

Parameters
  • default_label (str) – Key for label_mapping that is the default label

  • pad_label (str) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_pred_labels (bool) – (default: true) if true, will convert the model indexes to the label strings given the label_mapping

  • random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

static match_sentence_lengths(data: numpy.ndarray, results: dict, flatten_separator: str, inplace: bool = True) dict

Convert results from model into same ragged data shapes as original data.

Parameters
  • data (np.ndarray) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • flatten_separator (str) – string which joins two samples together when flattening

  • inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)

convert_to_structured_analysis(sentences: numpy.ndarray, results: dict, label_mapping: dict, default_label: str, pad_label: str) dict

Convert unstructured results to a structured column analysis.

This assumes the column was flattened into a single sample, and takes the mode of all character predictions except for the separator labels. In case of a tie, choose anything but background; otherwise, randomly choose between the remaining labels.

Parameters
  • sentences (numpy.ndarray) – samples which were predicted upon

  • results (dict) – character predictions for each sample return from model

  • label_mapping (dict) – maps labels to their encoded integers

  • default_label (str) – Key for label_mapping that is the default label

  • pad_label (str) – Key for label_mapping that is the pad label

Returns

prediction value for a single column
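The mode-with-tie-breaking rule can be sketched in isolation. The function column_label and its arguments are hypothetical, and the sketch works on a plain list of integer label ids rather than the model's result dict; ignore_ids stands in for the separator and pad labels.

```python
import random
from collections import Counter


def column_label(char_preds, background_id, ignore_ids=(), rng=None):
    """Pick one label for a column from its per-character predictions."""
    if rng is None:
        rng = random.Random(0)
    counts = Counter(p for p in char_preds if p not in ignore_ids)
    if not counts:
        return background_id
    top = max(counts.values())
    tied = sorted(label for label, n in counts.items() if n == top)
    if len(tied) > 1:
        # On a tie, prefer any label other than background.
        tied = [label for label in tied if label != background_id] or tied
    # A remaining tie is broken randomly.
    return tied[0] if len(tied) == 1 else rng.choice(tied)


label = column_label([1, 1, 2, 2, 0], background_id=1, ignore_ids={0})
```

Passing a seeded random.Random instance as rng keeps the random tie-break reproducible, which mirrors the purpose of the random_state parameter described for this class.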

process(data: numpy.ndarray, results: dict, label_mapping: dict) dict

Postprocess CharacterLevelCnnModel results when given structured data.

Said structured data is processed by StructCharPreprocessor.

Parameters
  • data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • label_mapping (dict) – maps labels to their encoded integers

Returns

dict of predictions and if they exist, confidences

Return type

dict

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func: str = 'split', priority_order: list | np.ndarray | None = None, random_state: random.Random | int | list | tuple | None = None)

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing regex data.

Initialize the RegexPostProcessor class.

Parameters
  • aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)

  • priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set

  • random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

static priority_prediction(results: dict, entity_priority_order: numpy.ndarray) None

Use priority of regex to give entity determination.

Parameters
  • results (dict) – regex from model in format: dict(pred=…, conf=…)

  • entity_priority_order (np.ndarray) – list of entity priorities (a lower value means higher priority)

Returns

None

static split_prediction(results: dict) None

Split the prediction across votes.

Parameters

results (dict) – regex from model in format: dict(pred=…, conf=…)

Returns

None
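The two aggregation strategies can be contrasted with a standalone sketch. Everything here is an assumption made for illustration: the votes are modeled as one list of per-label vote counts per character, entity_priority_order[label] is read as that label's priority value, and the functions return new values instead of modifying a results dict in place as the documented methods do.

```python
def split_prediction(votes):
    """Share confidence across voted labels and take the argmax per character."""
    pred, conf = [], []
    for char_votes in votes:
        total = sum(char_votes)
        # Assumption: every character received at least one vote.
        char_conf = [v / total for v in char_votes]
        conf.append(char_conf)
        pred.append(char_conf.index(max(char_conf)))
    return pred, conf


def priority_prediction(votes, entity_priority_order):
    """Among voted labels, pick the one with the lowest priority value."""
    pred = []
    for char_votes in votes:
        voted = [label for label, v in enumerate(char_votes) if v > 0]
        pred.append(min(voted, key=lambda label: entity_priority_order[label]))
    return pred


votes = [[1, 0, 1], [0, 1, 0]]
split_pred, split_conf = split_prediction(votes)
prio_pred = priority_prediction(votes, entity_priority_order=[2, 1, 0])
```

In this sketch the first character has votes for labels 0 and 2: splitting gives each a confidence of 0.5, while the priority order resolves the tie in favor of label 2.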

process(data: numpy.ndarray, results: dict, label_mapping: dict) dict

Postprocess data.

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructRegexPostProcessor(random_state: random.Random | int | list | tuple | None = None)

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct regex data.

Initialize the StructRegexPostProcessor class.

Parameters

random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

set_params(**kwargs: Any) None

Given kwargs, set the parameters if they exist.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

process(data: numpy.ndarray, results: dict, label_mapping: dict) dict

Postprocess data.

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

class dataprofiler.labelers.data_processing.ColumnNameModelPostprocessor

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing column name model output.

Initialize the ColumnNameModelPostprocessor class.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns

None

process(data: np.ndarray, results: dict, label_mapping: dict[str, int] | None = None) dict

Postprocess data.

classmethod get_class(class_name: str) type[Processor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters

param_list (list) – list of parameters to retrieve from the model.

Returns

dict of parameters

classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.