dataprofiler.labelers.data_processing module

Contains pre-built processors for data labeling/processing.

class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname: str, bases: tuple[type, ...], attrs: dict[str, object])

Bases: ABCMeta

For registering subclasses.

Create AutoSubRegistration object.

mro()

Return a type’s method resolution order.

register(subclass)

Register a virtual subclass of an ABC.

Returns the subclass, to allow usage as a class decorator.
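The registration pattern can be illustrated with a minimal, self-contained sketch. It does not use the library's own code: `AutoRegisterMeta`, `BaseProcessor`, `UpperCaseProcessor`, and the `_registry` dict are hypothetical names, and keying the registry by lowercased class name is an assumption.

```python
from abc import ABCMeta, abstractmethod


class AutoRegisterMeta(ABCMeta):
    """Hypothetical stand-in: records every class created with this metaclass."""

    def __new__(mcs, clsname, bases, attrs):
        new_class = super().__new__(mcs, clsname, bases, attrs)
        # Store the class under its lowercased name (an assumption of this sketch).
        new_class._registry[clsname.lower()] = new_class
        return new_class


class BaseProcessor(metaclass=AutoRegisterMeta):
    _registry = {}  # shared by the base class and every subclass

    @classmethod
    def get_class(cls, class_name):
        # Returns None when no class was registered under that name.
        return cls._registry.get(class_name.lower())

    @abstractmethod
    def process(self, data):
        ...


class UpperCaseProcessor(BaseProcessor):
    def process(self, data):
        return [s.upper() for s in data]
```

Defining `UpperCaseProcessor` is enough to make it discoverable through `BaseProcessor.get_class("UpperCaseProcessor")`; no explicit registration call is needed.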

class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters: Any)

Bases: object

Abstract Data processing class.

Initialize BaseDataProcessor object.

processor_type: str
classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

abstract classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.
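The contract of `get_parameters` and `set_params` can be sketched with a stand-in class. `ParameterHolder` is hypothetical and is not the library's implementation; in particular, raising `ValueError` for an unknown parameter name is an assumption about what "if they exist" means.

```python
class ParameterHolder:
    """Hypothetical illustration of a parameter get/set contract."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list, return a copy of every parameter.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Assumption: names that were never declared are rejected.
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)


holder = ParameterHolder(max_length=3400, pad_label="PAD")
holder.set_params(max_length=100)
```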

abstract process(*args: Any, **kwargs: Any) Any

Process data.

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

save_to_disk(dirpath: str) None

Save data processor to a path on disk.

class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters: Any)

Bases: BaseDataProcessor

Abstract Data preprocessing class.

Initialize BaseDataPreprocessor object.

processor_type: str = 'preprocessor'
abstract process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] | tuple[np.ndarray, np.ndarray] | np.ndarray

Preprocess data.

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

abstract classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters: Any)

Bases: BaseDataProcessor

Abstract Data postprocessing class.

Initialize BaseDataPostprocessor object.

processor_type: str = 'postprocessor'
abstract process(data: ndarray, results: dict, label_mapping: dict[str, int]) dict

Postprocess data.

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

abstract classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.DirectPassPreprocessor

Bases: BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing data.

Initialize the DirectPassPreprocessor class.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) tuple[np.ndarray, np.ndarray] | np.ndarray

Preprocess data.

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False, **kwargs: Any)

Bases: BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing char data.

Initialize the CharPreprocessor class.

Parameters:
  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
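The effect of `max_length` together with `is_separate_at_max_len` can be pictured with a small sketch. `split_sample` is a hypothetical helper, not the library's algorithm; it assumes that "nearest separator" means the last separator found inside the current window.

```python
def split_sample(text, max_length, separator=" ", separate_at_max_len=False):
    """Split text into chunks of at most max_length characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if not separate_at_max_len and end < len(text):
            # Back up to the last separator inside the window, if there is one.
            cut = text.rfind(separator, start, end)
            if cut > start:
                end = cut + len(separator)
        chunks.append(text[start:end])
        start = end
    return chunks


by_separator = split_sample("hello world foo", 8)
hard_cut = split_sample("hello world foo", 8, separate_at_max_len=True)
```

Here `by_separator` is `["hello ", "world ", "foo"]`, while `hard_cut` is `["hello wo", "rld foo"]`. In both cases joining the chunks reproduces the input.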

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]

Flatten batches of data.

Parameters:
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[None, dict]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns:

batch_data, a dict containing samples of size batch_size

Return type:

dict
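The role of `batch_size` in a generator-style `process` can be sketched as follows. `batch_samples` is a hypothetical helper that only shows the batching step; it does no flattening or label handling, and yielding a shorter final batch is an assumption.

```python
def batch_samples(samples, batch_size=32):
    """Yield consecutive groups of at most batch_size samples."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Assumption: a final, shorter batch is still emitted.
    if batch:
        yield batch


batches = list(batch_samples(["a", "b", "c", "d", "e"], batch_size=2))
```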

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharEncodedPreprocessor(encoding_map: dict[str, int] | None = None, max_length: int = 5000, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False)

Bases: CharPreprocessor

Subclass of CharPreprocessor for preprocessing char encoded data.

Initialize the CharEncodedPreprocessor class.

Parameters:
  • encoding_map (dict) – char to int encoding map

  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
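A character-to-integer `encoding_map` might be applied roughly as in the sketch below. `encode_chars` is hypothetical, and the `unknown_index` and `pad_index` values are assumptions made for illustration, not values taken from the library.

```python
def encode_chars(text, encoding_map, max_length, unknown_index=1, pad_index=0):
    """Map each character to an int, truncating or padding to max_length."""
    # Characters missing from the map fall back to unknown_index.
    encoded = [encoding_map.get(ch, unknown_index) for ch in text[:max_length]]
    # Right-pad so every sample has the same length.
    encoded += [pad_index] * (max_length - len(encoded))
    return encoded


encoded = encode_chars("abz", {"a": 2, "b": 3}, max_length=5)
```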

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]

Process structured data for being processed by CharacterLevelCnnModel.

Parameters:
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[dict, None]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns:

batch_data, a dict containing samples of size batch_size

Return type:

dict

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = ' ', use_word_level_argmax: bool = False, output_format: str = 'character_argmax', separators: tuple[str, ...] = (' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent: float = 0.75)

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing char data.

Initialize the CharPostprocessor class.

Parameters:
  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity

  • output_format (str) – either character_argmax or NER. character_argmax is a list of encodings, one for each character in the input text; NER is a dict format that specifies start, end, and label for each entity in a sentence

  • separators (tuple(str)) – list of characters to use for separating words within the character predictions

  • word_level_min_percent (float) – threshold on generating dominant word_level labeling

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

static convert_to_NER_format(predictions: list[list], label_mapping: dict[str, int], default_label: str, pad_label: str) list[list]

Convert word level predictions to specified format.

Parameters:
  • predictions (list) – predictions

  • label_mapping (dict) – labels and corresponding integers

  • default_label (str) – default label in label_mapping

  • pad_label (str) – pad label in label_mapping

Returns:

formatted predictions

Return type:

list
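The idea of turning per-character predictions into entity spans can be sketched as below. `to_ner_spans` is a hypothetical helper operating on one sample, not the library's implementation; the exclusive `end` index and the skipping of default and pad labels are assumptions.

```python
def to_ner_spans(char_labels, label_mapping, default_label="UNKNOWN", pad_label="PAD"):
    """Group runs of identical label ints into (start, end, label) tuples."""
    reverse = {index: label for label, index in label_mapping.items()}
    skip = {label_mapping[default_label], label_mapping[pad_label]}
    spans = []
    start = 0
    for i in range(1, len(char_labels) + 1):
        # Close the current run when the label changes or the input ends.
        if i == len(char_labels) or char_labels[i] != char_labels[start]:
            if char_labels[start] not in skip:
                spans.append((start, i, reverse[char_labels[start]]))
            start = i
    return spans


mapping = {"PAD": 0, "UNKNOWN": 1, "NAME": 2, "DATE": 3}
spans = to_ner_spans([1, 2, 2, 2, 1, 3, 3], mapping)
```

With this input, `spans` is `[(1, 4, "NAME"), (5, 7, "DATE")]`.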

static match_sentence_lengths(data: ndarray, results: dict, flatten_separator: str, inplace: bool = True) dict

Convert results from model into same ragged data shapes as original data.

Parameters:
  • data (numpy.ndarray) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • flatten_separator (str) – string which joins two samples together when flattening

  • inplace (bool) – flag to modify results in place

Returns:

dict(pred=…) or dict(pred=…, conf=…)
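Restoring the ragged shape of the original data can be pictured with the sketch below. `unflatten_predictions` is a hypothetical helper; it assumes every sample was joined into a single flattened string with `flatten_separator` and that there is exactly one prediction per character, separators included.

```python
def unflatten_predictions(samples, flat_pred, flatten_separator=" "):
    """Slice a flat prediction list back into one list per original sample."""
    ragged = []
    position = 0
    for sample in samples:
        ragged.append(flat_pred[position:position + len(sample)])
        # Skip this sample's characters plus the separator that followed it.
        position += len(sample) + len(flatten_separator)
    return ragged


# "ab" and "cde" flattened with " " give "ab cde": six characters, six predictions.
ragged = unflatten_predictions(["ab", "cde"], [1, 1, 0, 2, 2, 2])
```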

process(data: ndarray, results: dict, label_mapping: dict[str, int]) dict

Conduct processing on data given predictions, label_mapping, and default_label.

Parameters:
  • data (Union[np.ndarray, pd.DataFrame]) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • label_mapping (dict) – labels and corresponding integers

Returns:

dict of predictions and if they exist, confidences

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_separate_at_max_len: bool = False)

Bases: CharPreprocessor

Subclass of CharPreprocessor for preprocessing struct char data.

Initialize the StructCharPreprocessor class.

Parameters:
  • max_length (int) – Maximum char length in a sample.

  • default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label

  • pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

convert_to_unstructured_format(data: np.ndarray, labels: list[str] | npt.NDArray[np.str_] | None) tuple[str, list[tuple[int, int, str]] | None]

Convert data samples list to StructCharPreprocessor required input data format.

Parameters:
  • data (numpy.ndarray) – list of strings

  • labels (Optional[Union[List[str], npt.NDArray[np.str_]]]) – labels for each input character

Returns:

data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), … (num_samples in data)]

Return type:

Tuple[str, Optional[List[Tuple[int, int, str]]]]
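The text/entities format described above can be sketched with a small stand-in. `to_unstructured` is hypothetical and not the library's method; emitting one entity per sample regardless of its label, and using exclusive `end` offsets, are assumptions.

```python
def to_unstructured(samples, labels=None, separator="\x01" * 5):
    """Join samples into one string and compute one entity span per sample."""
    text = separator.join(samples)
    if labels is None:
        return text, None
    entities = []
    start = 0
    for sample, label in zip(samples, labels):
        end = start + len(sample)
        entities.append((start, end, label))
        # The next sample begins after the separator.
        start = end + len(separator)
    return text, entities


text, entities = to_unstructured(["ab", "cde"], ["A", "B"], separator="|")
```

Here `text` is `"ab|cde"` and `entities` is `[(0, 2, "A"), (3, 6, "B")]`.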

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]

Process structured data for being processed by CharacterLevelCnnModel.

Parameters:
  • data (numpy.ndarray) – List of strings to create embeddings for

  • labels (numpy.ndarray) – labels for each input character

  • label_mapping (Union[dict, None]) – maps labels to their encoded integers

  • batch_size (int) – Number of samples in the batch of data

Returns:

batch_data, a dict containing samples of size batch_size

Return type:

dict

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'preprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_pred_labels: bool = True, random_state: random.Random | int | list | tuple | None = None)

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct char data.

Initialize the StructCharPostprocessor class.

Parameters:
  • default_label (str) – Key for label_mapping that is the default label

  • pad_label (str) – Key for label_mapping that is the pad label

  • flatten_separator (str) – separator used to put between flattened samples.

  • is_pred_labels (bool) – (default: true) if true, will convert the model indexes to the label strings given the label_mapping

  • random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

static match_sentence_lengths(data: ndarray, results: dict, flatten_separator: str, inplace: bool = True) dict

Convert results from model into same ragged data shapes as original data.

Parameters:
  • data (np.ndarray) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • flatten_separator (str) – string which joins two samples together when flattening

  • inplace (bool) – flag to modify results in place

Returns:

dict(pred=…) or dict(pred=…, conf=…)

convert_to_structured_analysis(sentences: ndarray, results: dict, label_mapping: dict[str, int], default_label: str, pad_label: str) dict

Convert unstructured results to a structured column analysis.

This assumes the column was flattened into a single sample, and takes the mode of all character predictions except for the separator labels. In case of a tie, choose anything but background; otherwise choose randomly among the remaining labels.

Parameters:
  • sentences (numpy.ndarray) – samples which were predicted upon

  • results (dict) – character predictions for each sample return from model

  • label_mapping (dict) – maps labels to their encoded integers

  • default_label (str) – Key for label_mapping that is the default label

  • pad_label (str) – Key for label_mapping that is the pad label

Returns:

prediction value for a single column
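The mode-with-tie-breaking rule can be sketched as follows. `column_label` is a hypothetical helper working on label strings rather than on model output; treating `"UNKNOWN"` as the background label and `"PAD"` as the label to ignore are assumptions.

```python
import random
from collections import Counter


def column_label(char_labels, background="UNKNOWN", ignore=("PAD",), rng=None):
    """Pick the most frequent label, preferring non-background labels on ties."""
    rng = rng or random.Random(0)
    counts = Counter(label for label in char_labels if label not in ignore)
    if not counts:
        return background
    top = max(counts.values())
    tied = sorted(label for label, n in counts.items() if n == top)
    if len(tied) > 1:
        # On a tie, drop the background label before choosing.
        tied = [label for label in tied if label != background]
    # Any remaining tie is broken at random.
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

For example, `column_label(["UNKNOWN", "UNKNOWN", "A", "A"])` returns `"A"` because the background label loses the tie.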

process(data: ndarray, results: dict, label_mapping: dict[str, int]) dict

Postprocess CharacterLevelCnnModel results when given structured data.

Said structured data is processed by StructCharPreprocessor.

Parameters:
  • data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler

  • results (dict) – dict of model character level predictions and confs

  • label_mapping (dict) – maps labels to their encoded integers

Returns:

dict of predictions and if they exist, confidences

Return type:

dict

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func: str = 'split', priority_order: list | np.ndarray | None = None, random_state: random.Random | int | list | tuple | None = None)

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing regex data.

Initialize the RegexPostProcessor class.

Parameters:
  • aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)

  • priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set

  • random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

static priority_prediction(results: dict, entity_priority_order: ndarray) None

Use priority of regex to give entity determination.

Parameters:
  • results (dict) – regex from model in format: dict(pred=…, conf=…)

  • entity_priority_order (np.ndarray) – list of entity priorities (a lower value means higher priority)

Returns:

None
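Priority-based selection can be illustrated with a short sketch. `pick_by_priority` is hypothetical and returns a value instead of modifying `results` in place; it assumes `votes` holds one entry per label and `priority_order` holds one priority value per label, which may differ from the library's actual data layout.

```python
def pick_by_priority(votes, priority_order):
    """Return the index of the matched label with the lowest priority value."""
    matched = [i for i, vote in enumerate(votes) if vote > 0]
    if not matched:
        return None
    # A lower priority value wins.
    return min(matched, key=lambda i: priority_order[i])


# Labels 1 and 2 both matched; label 2 has the lower priority value.
winner = pick_by_priority([0, 1, 1], [2, 1, 0])
```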

static split_prediction(results: dict) None

Split the prediction across votes.

Parameters:

results (dict) – regex from model in format: dict(pred=…, conf=…)

Returns:

None
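One way to read "split the prediction across votes" is to normalize the votes for a sample so they sum to 1. The sketch below shows that reading only; `split_votes` is a hypothetical helper, it returns a new list instead of modifying `results` in place, and the library's exact behavior is not confirmed here.

```python
def split_votes(votes):
    """Share a single prediction equally in proportion to the votes received."""
    total = sum(votes)
    if total == 0:
        # No regex matched, so there is nothing to split.
        return [0.0] * len(votes)
    return [vote / total for vote in votes]


shares = split_votes([1, 1, 0, 0])
```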

process(data: ndarray, results: dict, label_mapping: dict[str, int]) dict

Postprocess data.

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructRegexPostProcessor(random_state: random.Random | int | list | tuple | None = None)

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct regex data.

Initialize the StructRegexPostProcessor class.

Parameters:

random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

set_params(**kwargs: Any) None

Given kwargs, set the parameters if they exist.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

process(data: ndarray, results: dict, label_mapping: dict[str, int]) dict

Postprocess data.

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

class dataprofiler.labelers.data_processing.ColumnNameModelPostprocessor

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing column name model data.

Initialize the ColumnNameModelPostprocessor class.

classmethod help() None

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:

None

process(data: np.ndarray, results: dict, label_mapping: dict[str, int] | None = None) dict

Postprocess data.

classmethod get_class(class_name: str) type[BaseDataProcessor] | None

Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (list) – list of parameters to retrieve from the model.

Returns:

dict of parameters

classmethod load_from_disk(dirpath: str) Processor

Load data processor from a given path on disk.

classmethod load_from_library(name: str) BaseDataProcessor

Load data processor from within the library.

processor_type: str = 'postprocessor'
save_to_disk(dirpath: str) None

Save data processor to a path on disk.

set_params(**kwargs: Any) None

Set the parameters if they exist given kwargs.