dataprofiler.labelers.data_processing module¶

Contains pre-built processors for data labeling/processing.

class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname: str, bases: tuple[type, ...], attrs: dict[str, object])¶

Bases: ABCMeta

For registering subclasses.

Create AutoSubRegistration object.

mro()¶: Return a type’s method resolution order.

register(subclass)¶

Register a virtual subclass of an ABC. Returns the subclass, to allow usage as a class decorator.
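As a rough illustration of metaclass-based registration, the sketch below records every class created with a metaclass in a name-keyed registry and then looks a class up by name, which is the kind of lookup get_class exposes. The names RegistryMeta, Processor, and UpperCaseProcessor are hypothetical and are not part of dataprofiler; the library's actual implementation may differ.

```python
from abc import ABCMeta


class RegistryMeta(ABCMeta):
    """Hypothetical metaclass that records each class it creates."""

    registry = {}

    def __new__(mcs, clsname, bases, attrs):
        new_class = super().__new__(mcs, clsname, bases, attrs)
        # Key by lowercased name so lookups are case-insensitive (an assumption).
        mcs.registry[clsname.lower()] = new_class
        return new_class


class Processor(metaclass=RegistryMeta):
    @classmethod
    def get_class(cls, class_name):
        """Return the registered class with this name, or None if unknown."""
        return RegistryMeta.registry.get(class_name.lower())


class UpperCaseProcessor(Processor):
    def process(self, text):
        return text.upper()


found = Processor.get_class("UpperCaseProcessor")
```

Defining the subclass is enough to make it discoverable; no explicit registration call is needed in this sketch.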

class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters: Any)¶

Bases: object

Abstract Data processing class.

Initialize BaseDataProcessor object.

processor_type: str¶

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

abstract classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.
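A minimal sketch of how a get_parameters / set_params pair could behave, assuming parameters are held in a plain dict. ParamHolder and its _parameters attribute are hypothetical, and rejecting unknown names with ValueError is an assumption about what "if they exist" means, not confirmed library behavior.

```python
class ParamHolder:
    """Hypothetical stand-in for a processor that stores named parameters."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list given, return a copy of every stored parameter.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Assumption: only parameters that already exist may be updated.
        unknown = sorted(name for name in kwargs if name not in self._parameters)
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)


holder = ParamHolder(max_length=3400, pad_label="PAD")
holder.set_params(max_length=100)
subset = holder.get_parameters(["max_length"])
```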

abstract process(*args: Any, **kwargs: Any) → Any¶: Process data.

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.
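The save/load pair can be pictured as a round trip through a directory. The sketch below assumes the processor state is a flat dict serialized as JSON; the JsonBackedProcessor class and the file name are hypothetical, and the on-disk layout dataprofiler actually uses is not described in this section.

```python
import json
import os
import tempfile


class JsonBackedProcessor:
    """Hypothetical processor whose state is a flat dict of parameters."""

    FILENAME = "processor_parameters.json"  # hypothetical file name

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def save_to_disk(self, dirpath):
        # Write the parameters as JSON inside the given directory.
        with open(os.path.join(dirpath, self.FILENAME), "w") as fp:
            json.dump(self._parameters, fp)

    @classmethod
    def load_from_disk(cls, dirpath):
        # Rebuild a processor from the JSON written by save_to_disk.
        with open(os.path.join(dirpath, cls.FILENAME)) as fp:
            return cls(**json.load(fp))


with tempfile.TemporaryDirectory() as dirpath:
    JsonBackedProcessor(max_length=100, pad_label="PAD").save_to_disk(dirpath)
    restored = JsonBackedProcessor.load_from_disk(dirpath)
```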

class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters: Any)¶

Bases: BaseDataProcessor

Abstract Data preprocessing class.

Initialize BaseDataPreprocessor object.

processor_type: str = 'preprocessor'¶

abstract process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] | tuple[np.ndarray, np.ndarray] | np.ndarray¶: Preprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

abstract classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters: Any)¶

Bases: BaseDataProcessor

Abstract Data postprocessing class.

Initialize BaseDataPostprocessor object.

processor_type: str = 'postprocessor'¶

abstract process(data: ndarray, results: dict, label_mapping: dict[str, int]) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

abstract classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.DirectPassPreprocessor¶

Bases: BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing data.

Initialize the DirectPassPreprocessor class.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → tuple[np.ndarray, np.ndarray] | np.ndarray¶: Preprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False, **kwargs: Any)¶

Bases: BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing char data.

Initialize the CharPreprocessor class.

Parameters:

max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]¶

Flatten batches of data.

Parameters:

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[None, dict]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns:

batch_data: a dict containing samples of size batch_size

Return type:

dicts
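To make the is_separate_at_max_len option concrete, here is a minimal sketch of splitting one long string into chunks of at most max_length characters, either exactly at the limit or at the nearest preceding separator. The function split_sample is hypothetical and only illustrates the idea; it is not the library's batching code and it ignores flatten_split and padding.

```python
def split_sample(text, max_length, separator=" ", is_separate_at_max_len=False):
    """Split text into chunks no longer than max_length characters."""
    chunks = []
    start = 0
    while len(text) - start > max_length:
        end = start + max_length
        if not is_separate_at_max_len:
            # Prefer cutting just after the last separator inside the window.
            cut = text.rfind(separator, start, end)
            if cut > start:
                end = cut + len(separator)
        chunks.append(text[start:end])
        start = end
    chunks.append(text[start:])
    return chunks


hard = split_sample("alpha beta gamma", 8, is_separate_at_max_len=True)
soft = split_sample("alpha beta gamma", 8, is_separate_at_max_len=False)
```

In both modes the chunks concatenate back to the original text; only the cut positions differ.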

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharEncodedPreprocessor(encoding_map: dict[str, int] | None = None, max_length: int = 5000, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False)¶

Bases: CharPreprocessor

Subclass of CharPreprocessor for preprocessing char encoded data.

Initialize the CharEncodedPreprocessor class.

Parameters:

encoding_map (dict) – char to int encoding map
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

process(data, labels, label_mapping, batch_size)¶

Process structured data for consumption by CharacterLevelCnnModel.

Parameters:

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns:

batch_data: a dict containing samples of size batch_size

Return type:

dict
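As an illustration of what a char-to-int encoding_map does, the sketch below encodes one string and pads it to a fixed length. The function encode_sample and its unknown_index and pad_index arguments are hypothetical; how the library treats characters missing from the map, and whether it truncates, is an assumption here.

```python
def encode_sample(text, encoding_map, max_length, unknown_index, pad_index):
    """Map each character to an int, then pad the result to max_length."""
    # Assumption: text longer than max_length is truncated.
    encoded = [encoding_map.get(char, unknown_index) for char in text[:max_length]]
    # Assumption: short samples are right-padded with pad_index.
    encoded.extend([pad_index] * (max_length - len(encoded)))
    return encoded


encoding_map = {"a": 1, "b": 2}
encoded = encode_sample("ab?", encoding_map, max_length=5,
                        unknown_index=99, pad_index=0)
```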

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = ' ', use_word_level_argmax: bool = False, output_format: str = 'character_argmax', separators: tuple[str, ...] = (' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent: float = 0.75)¶

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing char data.

Initialize the CharPostprocessor class.

Parameters:

default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity
output_format (str) – either character_argmax or NER: character_argmax is a list of encodings for each character in the input text; NER is a dict format that specifies start, end, and label for each entity in a sentence
separators (tuple(str)) – list of characters to use for separating words within the character predictions
word_level_min_percent (float) – threshold on generating dominant word_level labeling
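One plausible reading of separators and word_level_min_percent is sketched below: character labels are grouped into words at separator characters, and a word takes its dominant label only when that label covers at least the given share of the word. The function word_level_labels is hypothetical, and falling back to the default label below the threshold is an assumption, not confirmed library behavior.

```python
from collections import Counter


def word_level_labels(text, char_labels, separators=(" ",),
                      default_label="UNKNOWN", min_percent=0.75):
    """Relabel every character of a word with the word's dominant label."""
    out = list(char_labels)
    start = 0
    for end in range(len(text) + 1):
        if end == len(text) or text[end] in separators:
            if end > start:
                label, count = Counter(char_labels[start:end]).most_common(1)[0]
                # Assumption: below the threshold, fall back to the default label.
                if count / (end - start) < min_percent:
                    label = default_label
                out[start:end] = [label] * (end - start)
            start = end + 1
    return out


U = "UNKNOWN"
labels = ["X", "X", "X", U, U, "Y", "Y", U, U]
relabeled = word_level_labels("abcd efgh", labels)
```

Here the first word is 75% "X" and becomes entirely "X", while the second word is only 50% "Y" and falls back to the default label.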

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

static convert_to_NER_format(predictions: list[list], label_mapping: dict[str, int], default_label: str, pad_label: str) → list[list]¶

Convert word level predictions to specified format.

Parameters:

predictions (list) – predictions
label_mapping (dict) – labels and corresponding integers
default_label (str) – default label in label_mapping
pad_label (str) – pad label in label_mapping

Returns:

formatted predictions

Return type:

list
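The conversion from per-character predictions to start, end, label entries can be sketched as a run-length pass that skips the default and pad labels. The function char_predictions_to_spans is hypothetical, it handles a single sample, and treating end as exclusive is an assumption; the library's exact output layout may differ.

```python
def char_predictions_to_spans(char_pred, label_mapping,
                              default_label="UNKNOWN", pad_label="PAD"):
    """Collapse runs of equal label indexes into (start, end, label) tuples."""
    index_to_label = {index: label for label, index in label_mapping.items()}
    ignored = {label_mapping[default_label], label_mapping[pad_label]}
    spans = []
    start = None
    current = None
    for position, label_index in enumerate(char_pred):
        if label_index != current:
            # Close the previous run unless it was background or padding.
            if current is not None and current not in ignored:
                spans.append((start, position, index_to_label[current]))
            start, current = position, label_index
    if current is not None and current not in ignored:
        spans.append((start, len(char_pred), index_to_label[current]))
    return spans


label_mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2, "PHONE": 3}
spans = char_predictions_to_spans([1, 2, 2, 2, 1, 3, 3, 0, 0], label_mapping)
```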

static match_sentence_lengths(data: ndarray, results: dict, flatten_separator: str, inplace: bool = True) → dict¶

Convert results from model into same ragged data shapes as original data.

Parameters:

data (numpy.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string that joins samples together when flattening
inplace (bool) – flag to modify results in place

Returns:

dict(pred=…) or dict(pred=…, conf=…)
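Restoring the ragged shape amounts to slicing the flat prediction list at each sample boundary and dropping the positions that belong to the separator. The sketch below assumes the samples were joined with flatten_separator and that there is one prediction per character, separators included; unflatten_predictions is a hypothetical helper, not the library method.

```python
def unflatten_predictions(samples, flat_pred, flatten_separator=" "):
    """Slice one flat per-character list back into one list per sample."""
    ragged = []
    offset = 0
    for sample in samples:
        ragged.append(flat_pred[offset:offset + len(sample)])
        # Skip the predictions made on the separator characters.
        offset += len(sample) + len(flatten_separator)
    return ragged


# "ab" + " " + "cde" gives six characters and six predictions.
ragged = unflatten_predictions(["ab", "cde"], [1, 1, 0, 2, 2, 2])
```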

process(data: ndarray, results: dict, label_mapping: dict[str, int]) → dict¶

Conduct processing on data given predictions, label_mapping, and default_label.

Parameters:

data (Union[np.ndarray, pd.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – labels and corresponding integers

Returns:

dict of predictions and if they exist, confidences

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_separate_at_max_len: bool = False)¶

Bases: CharPreprocessor

Subclass of CharPreprocessor for preprocessing struct char data.

Initialize the StructCharPreprocessor class.

Parameters:

max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for preprocessors.

Returns:: None

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

convert_to_unstructured_format(data: np.ndarray, labels: list[str] | npt.NDArray[np.str_] | None) → tuple[str, list[tuple[int, int, str]] | None]¶

Convert data samples list to StructCharPreprocessor required input data format.

Parameters:

data (numpy.ndarray) – list of strings
labels (Optional[Union[List[str], npt.NDArray[np.str_]]]) – labels for each input character

Returns:

data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), … (num_samples in data)]

Return type:

Tuple[str, Optional[List[Tuple[int, int, str]]]]
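The return format above can be reproduced with a short sketch that joins the samples with a separator and records one entity per sample. The function to_unstructured is hypothetical, and the assumptions that each sample carries exactly one label and that end is exclusive are mine, not taken from the library.

```python
def to_unstructured(samples, labels, separator):
    """Join samples into one text and build (start, end, label) entities."""
    entities = [] if labels is not None else None
    offset = 0
    for index, sample in enumerate(samples):
        if entities is not None:
            entities.append((offset, offset + len(sample), labels[index]))
        # The next sample starts after this one plus the separator.
        offset += len(sample) + len(separator)
    return separator.join(samples), entities


separator = "\x01" * 5
text, entities = to_unstructured(
    ["123 Main St", "555-1234"], ["ADDRESS", "PHONE"], separator
)
```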

process(data, labels, label_mapping, batch_size)¶

Process structured data for consumption by CharacterLevelCnnModel.

Parameters:

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns:

batch_data: a dict containing samples of size batch_size

Return type:

dict

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_pred_labels: bool = True, random_state: random.Random | int | list | tuple | None = None)¶

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct char data.

Initialize the StructCharPostprocessor class.

Parameters:

default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_pred_labels (bool) – (default: true) if true, will convert the model indexes to the label strings given the label_mapping
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

static match_sentence_lengths(data: ndarray, results: dict, flatten_separator: str, inplace: bool = True) → dict¶

Convert results from model into same ragged data shapes as original data.

Parameters:

data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string that joins samples together when flattening
inplace (bool) – flag to modify results in place

Returns:

dict(pred=…) or dict(pred=…, conf=…)

convert_to_structured_analysis(sentences: ndarray, results: dict, label_mapping: dict[str, int], default_label: str, pad_label: str) → dict¶

Convert unstructured results to a structured column analysis.

This assumes the column was flattened into a single sample, and takes the mode of all character predictions except for the separator labels. In case of a tie, choose anything but background; otherwise, randomly choose among the remaining labels.

Parameters:

sentences (numpy.ndarray) – samples which were predicted upon
results (dict) – character predictions for each sample returned from the model
label_mapping (dict) – maps labels to their encoded integers
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label

Returns:

prediction value for a single column
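The mode-with-tie-breaking rule described above can be sketched for a single sample as follows. The function sample_label is hypothetical, it works on label strings rather than encoded integers, and it assumes separator positions have already been removed and the input is non-empty.

```python
import random
from collections import Counter


def sample_label(char_labels, background_label="UNKNOWN", rng=None):
    """Return the most common label, preferring non-background on ties."""
    rng = rng or random.Random(0)
    counts = Counter(char_labels)
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    if len(tied) > 1:
        # On a tie, drop the background label if anything else remains.
        tied = [label for label in tied if label != background_label] or tied
    # Any remaining tie is broken at random.
    return rng.choice(tied)


clear_winner = sample_label(["PHONE", "PHONE", "UNKNOWN"])
tie_with_background = sample_label(["UNKNOWN", "UNKNOWN", "PHONE", "PHONE"])
tie_between_labels = sample_label(["PHONE", "SSN"])
```

Passing a seeded random.Random, as the random_state parameter suggests, keeps the random tie-break reproducible.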

process(data: ndarray, results: dict, label_mapping: dict[str, int]) → dict¶

Postprocess CharacterLevelCnnModel results when given structured data.

Said structured data is processed by StructCharPreprocessor.

Parameters:

data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – maps labels to their encoded integers

Returns:

dict of predictions and if they exist, confidences

Return type:

dict

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

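class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func, priority_order, random_state)¶

Bases: BaseDataPostprocessor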

Subclass of BaseDataPostprocessor for postprocessing regex data.

Initialize the RegexPostProcessor class.

Parameters:

aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)
priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

static priority_prediction(results: dict, entity_priority_order: ndarray) → None¶

Use priority of regex to give entity determination.

Parameters:

results (dict) – regex results from the model in the format: dict(pred=…, conf=…)
entity_priority_order (np.ndarray) – list of entity priorities (the lowest value has the highest priority)

Returns:

None

static split_prediction(results: dict) → None¶

Split the prediction across votes.

Parameters:: results (dict) – regex results from the model in the format: dict(pred=…, conf=…)
Returns:: None
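The split and priority aggregation modes can be illustrated on the votes for a single character, where each position in the list is one entity and a nonzero value means a regex for that entity matched. Both functions below are hypothetical sketches that return values, whereas the documented methods modify results in place and return None; the shapes of pred and conf in the real results dict are not specified in this section.

```python
def split_votes(votes):
    """Share confidence equally across every entity that received a vote."""
    total = sum(votes)
    if total == 0:
        return [0.0] * len(votes)
    return [vote / total for vote in votes]


def priority_pick(votes, entity_priority_order):
    """Pick the matched entity whose priority value is lowest."""
    matched = [index for index, vote in enumerate(votes) if vote]
    if not matched:
        return None
    return min(matched, key=lambda index: entity_priority_order[index])


confidences = split_votes([1, 0, 1, 0])
winner = priority_pick([1, 0, 1], entity_priority_order=[2, 0, 1])
```

In the priority example, entities 0 and 2 both matched; entity 2 wins because its priority value (1) is lower than that of entity 0 (2).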

process(data: ndarray, results: dict, label_mapping: dict[str, int]) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructRegexPostProcessor(random_state: random.Random | int | list | tuple | None = None)¶

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct regex data.

Initialize the StructRegexPostProcessor class.

Parameters:: random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

set_params(**kwargs: Any) → None¶: Given kwargs, set the parameters if they exist.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

process(data: ndarray, results: dict, label_mapping: dict[str, int]) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

class dataprofiler.labelers.data_processing.ColumnNameModelPostprocessor¶

Bases: BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing column name model data.

Initialize the ColumnNameModelPostprocessor class.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns:: None

process(data: np.ndarray, results: dict, label_mapping: dict[str, int] | None = None) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters:: param_list (list) – list of parameters to retrieve from the model.
Returns:: dict of parameters

classmethod load_from_disk(dirpath: str) → Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.