Data Processing¶

Contains pre-built processors for data labeling/processing.

class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname: str, bases: tuple, attrs: dict)¶

Bases: abc.ABCMeta

For registering subclasses.

Create AutoSubRegistration object.

mro()¶: Return a type’s method resolution order.

register(subclass)¶

Register a virtual subclass of an ABC. Returns the subclass, to allow usage as a class decorator.
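The registration pattern can be illustrated with a minimal sketch. This is not the dataprofiler implementation: RegistryMeta, Base, Upper, and the _registry dict are hypothetical names. The sketch only assumes that the metaclass records each new class by name so that a lookup such as get_class can find it later.

```python
import abc


class RegistryMeta(abc.ABCMeta):
    """Hypothetical metaclass that records every class it creates."""

    def __new__(mcs, clsname, bases, attrs):
        new_class = super().__new__(mcs, clsname, bases, attrs)
        # Store the class in the shared registry, keyed by class name.
        new_class._registry[clsname] = new_class
        return new_class


class Base(metaclass=RegistryMeta):
    _registry = {}  # shared by all subclasses (illustrative only)

    @classmethod
    def get_class(cls, class_name):
        """Return the registered class, or None if the name is unknown."""
        return cls._registry.get(class_name)


class Upper(Base):
    def process(self, text):
        return text.upper()
```

Defining Upper is enough to make it discoverable by name; no explicit registration call is needed in this sketch.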

class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters: Any)¶

Bases: object

Abstract Data processing class.

Initialize BaseDataProcessor object.

processor_type: str¶

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

abstract classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.
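The get_parameters / set_params pair can be pictured with a small stand-in class. SketchProcessor is hypothetical and does not import dataprofiler; in particular, raising ValueError for unknown names is an assumption of this sketch, not documented behavior.

```python
class SketchProcessor:
    """Hypothetical stand-in; not the dataprofiler class."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list, return a copy of every stored parameter.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Only parameters that already exist may be updated (assumption).
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self._parameters.update(kwargs)
```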

abstract process(*args: Any, **kwargs: Any) → Any¶: Process data.

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters: Any)¶

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data preprocessing class.

Initialize BaseDataPreprocessor object.

processor_type: str = 'preprocessor'¶

abstract process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] | tuple[np.ndarray, np.ndarray] | np.ndarray¶: Preprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

abstract classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters: Any)¶

Bases: dataprofiler.labelers.data_processing.BaseDataProcessor

Abstract Data postprocessing class.

Initialize BaseDataPostprocessor object.

processor_type: str = 'postprocessor'¶

abstract process(data: numpy.ndarray, results: dict, label_mapping: dict) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

abstract classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.DirectPassPreprocessor¶

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing data.

Initialize the DirectPassPreprocessor class.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → tuple[np.ndarray, np.ndarray] | np.ndarray¶: Preprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False, **kwargs: Any)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPreprocessor

Subclass of BaseDataPreprocessor for preprocessing char data.

Initialize the CharPreprocessor class.

Parameters

max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate share of output characters that are flattened rather than non-flattened, in the range [0, 1]. When the current flattened share exceeds flatten_split, any leftover sample and subsequent samples are left non-flattened until the share falls below flatten_split again
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
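The effect of max_length and is_separate_at_max_len can be sketched in plain Python. split_sample is a hypothetical helper, not a dataprofiler function, and it assumes that "nearest separator" means the last separator found inside the current max_length window.

```python
def split_sample(text, max_length, separator=" ", is_separate_at_max_len=False):
    """Split text into chunks of at most max_length characters."""
    chunks = []
    start = 0
    while len(text) - start > max_length:
        end = start + max_length
        if not is_separate_at_max_len:
            # Look for the last separator inside the current window.
            cut = text.rfind(separator, start, end)
            if cut > start:
                end = cut + len(separator)
        chunks.append(text[start:end])
        start = end
    chunks.append(text[start:])
    return chunks
```

With is_separate_at_max_len=True every chunk except the last has exactly max_length characters; otherwise chunks end just after a separator whenever one is available in the window.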

classmethod help() → None¶

Describe alterable parameters.

Input data formats. Output data formats for postprocessors.

Returns: None

process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]¶

Flatten batches of data.

Parameters

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[None, dict]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns

batch_data, a dict containing samples of size batch_size

Return type

dicts
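Because process is declared as a generator, callers receive one batch at a time. The batching step alone might look like the following sketch; batches is a hypothetical helper and leaves out the flattening and label encoding that the real preprocessor performs.

```python
def batches(samples, batch_size=32):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]
```

The final batch may be shorter than batch_size when the number of samples is not a multiple of it.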

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharEncodedPreprocessor(encoding_map: dict[str, int] | None = None, max_length: int = 5000, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False)¶

Bases: dataprofiler.labelers.data_processing.CharPreprocessor

Subclass of CharPreprocessor for preprocessing char encoded data.

Initialize the CharEncodedPreprocessor class.

Parameters

encoding_map (dict) – char to int encoding map
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate share of output characters that are flattened rather than non-flattened, in the range [0, 1]. When the current flattened share exceeds flatten_split, any leftover sample and subsequent samples are left non-flattened until the share falls below flatten_split again
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

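process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]¶

Process structured data for use by CharacterLevelCnnModel.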

Parameters

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns

batch_data, a dict containing samples of size batch_size

Return type

dict

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod help() → None¶

Describe alterable parameters.

Input data formats. Output data formats for postprocessors.

Returns: None

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.CharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = ' ', use_word_level_argmax: bool = False, output_format: str = 'character_argmax', separators: tuple = (' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent: float = 0.75)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing char data.

Initialize the CharPostprocessor class.

Parameters

default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity
output_format (str) – either character_argmax or NER. character_argmax is a list of encodings for each character in the input text; NER is the dict format that specifies start, end, and label for each entity in a sentence
separators (tuple(str)) – list of characters to use for separating words within the character predictions
word_level_min_percent (float) – threshold on generating dominant word_level labeling
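One possible reading of use_word_level_argmax, separators, and word_level_min_percent is sketched below. word_level_labels is a hypothetical helper; the rule that a word falls back to the default label when no label reaches the threshold is an assumption of this sketch, not confirmed library behavior.

```python
from collections import Counter


def word_level_labels(text, char_labels, separators=(" ", ","),
                      min_percent=0.75, default_label="UNKNOWN"):
    """Relabel each word with its dominant label when it is dominant enough."""
    out = list(char_labels)
    start = 0
    for i in range(len(text) + 1):
        # A word ends at a separator or at the end of the text.
        if i == len(text) or text[i] in separators:
            if i > start:
                label, count = Counter(out[start:i]).most_common(1)[0]
                dominant = count / (i - start) >= min_percent
                out[start:i] = [label if dominant else default_label] * (i - start)
            start = i + 1
    return out
```

Separator characters keep whatever label they already had; only the characters inside each word are rewritten.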

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

static convert_to_NER_format(predictions: list, label_mapping: dict, default_label: str, pad_label: str) → list¶

Convert word level predictions to specified format.

Parameters

predictions (list) – predictions
label_mapping (dict) – labels and corresponding integers
default_label (str) – default label in label_mapping
pad_label (str) – pad label in label_mapping

Returns

formatted predictions

Return type

list
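The idea of turning per-character predictions into start, end, label entries can be sketched as follows. to_ner_spans is a hypothetical helper working on a single sample, and the end-exclusive offsets are an assumption of this sketch rather than the documented output of convert_to_NER_format.

```python
def to_ner_spans(char_preds, label_mapping, default_label, pad_label):
    """Group consecutive equal predictions into (start, end, label) spans."""
    reverse = {idx: label for label, idx in label_mapping.items()}
    ignore = {label_mapping[default_label], label_mapping[pad_label]}
    spans = []
    start = 0
    for i in range(1, len(char_preds) + 1):
        # Close the current run at the end or when the prediction changes.
        if i == len(char_preds) or char_preds[i] != char_preds[start]:
            if char_preds[start] not in ignore:
                spans.append((start, i, reverse[char_preds[start]]))
            start = i
    return spans
```

Runs labeled with the default or pad label are dropped, so only real entities appear in the result.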

static match_sentence_lengths(data: numpy.ndarray, results: dict, flatten_separator: str, inplace: bool = True) → dict¶

Convert results from model into same ragged data shapes as original data.

Parameters

data (numpy.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins samples together when flattening
inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)
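Restoring the ragged shape amounts to cutting the flat prediction sequence at the original sample boundaries and skipping the separator positions. The sketch below shows this for a single flattened sequence; unflatten is a hypothetical helper and ignores the conf entry and the inplace flag.

```python
def unflatten(samples, flat_pred, flatten_separator):
    """Cut one flat prediction list back into one list per original sample."""
    per_sample = []
    pos = 0
    for sample in samples:
        per_sample.append(flat_pred[pos:pos + len(sample)])
        # Skip the predictions made for the separator characters.
        pos += len(sample) + len(flatten_separator)
    return per_sample
```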

process(data: numpy.ndarray, results: dict, label_mapping: dict) → dict¶

Conduct processing on data given predictions, label_mapping, and default_label.

Parameters

data (Union[np.ndarray, pd.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – labels and corresponding integers

Returns

dict of predictions and if they exist, confidences

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_separate_at_max_len: bool = False)¶

Bases: dataprofiler.labelers.data_processing.CharPreprocessor

Subclass of CharPreprocessor for preprocessing struct char data.

Initialize the StructCharPreprocessor class.

Parameters

max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

convert_to_unstructured_format(data: np.ndarray, labels: list[str] | npt.NDArray[np.str_] | None) → tuple[str, list[tuple[int, int, str]] | None]¶

Convert data samples list to StructCharPreprocessor required input data format.

Parameters

data (numpy.ndarray) – list of strings
labels (Optional[Union[List[str], npt.NDArray[np.str_]]]) – labels for each input character

Returns

data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), …(num_samples in data)]

Return type

Tuple[str, Optional[List[Tuple[int, int, str]]]]
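The text and entities layout described above can be reproduced with a short sketch. to_unstructured is a hypothetical helper that does not call dataprofiler; it assumes one label per sample and end-exclusive offsets.

```python
def to_unstructured(samples, labels=None, separator="\x01" * 5):
    """Join samples into one text and build (start, end, label) entities."""
    text = separator.join(samples)
    if labels is None:
        return text, None
    entities = []
    start = 0
    for sample, label in zip(samples, labels):
        end = start + len(sample)
        entities.append((start, end, label))
        # The next sample begins after the separator.
        start = end + len(separator)
    return text, entities
```

For example, to_unstructured(["ab", "cde"], ["A", "B"], separator="|") gives the text "ab|cde" with one entity per sample.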

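process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) → Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]¶

Process structured data for use by CharacterLevelCnnModel.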

Parameters

data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data

Returns

batch_data, a dict containing samples of size batch_size

Return type

dict

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'preprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_pred_labels: bool = True, random_state: random.Random | int | list | tuple | None = None)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct char data.

Initialize the StructCharPostprocessor class.

Parameters

default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_pred_labels (bool) – (default: true) if true, will convert the model indexes to the label strings given the label_mapping
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

static match_sentence_lengths(data: numpy.ndarray, results: dict, flatten_separator: str, inplace: bool = True) → dict¶

Convert results from model into same ragged data shapes as original data.

Parameters

data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string which joins samples together when flattening
inplace (bool) – flag to modify results in place

Returns

dict(pred=…) or dict(pred=…, conf=…)

convert_to_structured_analysis(sentences: numpy.ndarray, results: dict, label_mapping: dict, default_label: str, pad_label: str) → dict¶

Convert unstructured results to a structured column analysis.

This assumes the column was flattened into a single sample, and takes the mode of all character predictions except for the separator labels. In case of a tie, any label other than background is chosen; if several such labels remain, one of them is chosen at random.

Parameters

sentences (numpy.ndarray) – samples which were predicted upon
results (dict) – character predictions for each sample returned from the model
label_mapping (dict) – maps labels to their encoded integers
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label

Returns

prediction value for a single column
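The mode-with-tie-break rule can be sketched for a single sample as follows. column_label is a hypothetical helper operating on label strings; treating the pad label as the one to exclude, and seeding the fallback random generator with 0, are assumptions of this sketch.

```python
import random
from collections import Counter


def column_label(char_labels, default_label="UNKNOWN", pad_label="PAD", rng=None):
    """Pick one label for a sample from its character-level labels."""
    rng = rng or random.Random(0)
    counts = Counter(lab for lab in char_labels if lab != pad_label)
    if not counts:
        return default_label
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    # On a tie, prefer any label other than the background label.
    non_default = [label for label in tied if label != default_label]
    candidates = non_default or tied
    return candidates[0] if len(candidates) == 1 else rng.choice(candidates)
```

Passing an explicit random.Random instance as rng makes the random tie-break reproducible, which mirrors the purpose of the random_state parameter.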

process(data: numpy.ndarray, results: dict, label_mapping: dict) → dict¶

Postprocess CharacterLevelCnnModel results when given structured data.

Said structured data is processed by StructCharPreprocessor.

Parameters

data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – maps labels to their encoded integers

Returns

dict of predictions and if they exist, confidences

Return type

dict

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

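class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func: str = 'split', priority_order: list | np.ndarray | None = None, random_state: random.Random | int | list | tuple | None = None)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor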

Subclass of BaseDataPostprocessor for postprocessing regex data.

Initialize the RegexPostProcessor class.

Parameters

aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)
priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

static priority_prediction(results: dict, entity_priority_order: numpy.ndarray) → None¶

Use priority of regex to give entity determination.

Parameters

results (dict) – regex from model in format: dict(pred=…, conf=…)
entity_priority_order (np.ndarray) – list of entity priorities (the lowest value has the highest priority)

Returns

None
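Priority-based selection can be pictured with plain lists. pick_by_priority is a hypothetical helper: it assumes each row holds one match flag per label and returns the chosen label index instead of modifying results in place, so it illustrates the rule rather than the library's data layout.

```python
def pick_by_priority(votes, priority_order):
    """Return, per row, the matched label index with the lowest priority value."""
    picks = []
    for row in votes:
        matched = [i for i, hit in enumerate(row) if hit]
        # Rows with no match get None (an assumption of this sketch).
        picks.append(min(matched, key=lambda i: priority_order[i]) if matched else None)
    return picks
```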

static split_prediction(results: dict) → None¶

Split the prediction across votes.

Parameters: results (dict) – regex from model in format: dict(pred=…, conf=…)
Returns: None
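One way to read "split the prediction across votes" is to share the confidence equally among the labels that matched. The sketch below follows that reading; split_votes is a hypothetical helper, and the per-row normalization is an assumption rather than the documented behavior of split_prediction.

```python
def split_votes(votes):
    """Turn per-label match counts into confidences that sum to 1 per row."""
    confs = []
    for row in votes:
        total = sum(row)
        # Leave all-zero rows untouched to avoid dividing by zero.
        confs.append([v / total for v in row] if total else list(row))
    return confs
```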

process(data: numpy.ndarray, results: dict, label_mapping: dict) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.

class dataprofiler.labelers.data_processing.StructRegexPostProcessor(random_state: random.Random | int | list | tuple | None = None)¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing struct regex data.

Initialize the StructRegexPostProcessor class.

Parameters: random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.

set_params(**kwargs: Any) → None¶: Given kwargs, set the parameters if they exist.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

process(data: numpy.ndarray, results: dict, label_mapping: dict) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

class dataprofiler.labelers.data_processing.ColumnNameModelPostprocessor¶

Bases: dataprofiler.labelers.data_processing.BaseDataPostprocessor

Subclass of BaseDataPostprocessor for postprocessing column name model data.

Initialize the ColumnNameModelPostprocessor class.

classmethod help() → None¶

Describe alterable parameters.

Input data formats for preprocessors. Output data formats for postprocessors.

Returns: None

process(data: np.ndarray, results: dict, label_mapping: dict[str, int] | None = None) → dict¶: Postprocess data.

classmethod get_class(class_name: str) → type[BaseDataProcessor] | None¶: Get class of BaseDataProcessor object.

get_parameters(param_list: list[str] | None = None) → dict¶

Return a dict of parameters from the model given a list.

Parameters: param_list (list) – list of parameters to retrieve from the model.
Returns: dict of parameters

classmethod load_from_disk(dirpath: str) → dataprofiler.labelers.data_processing.Processor¶: Load data processor from a given path on disk.

classmethod load_from_library(name: str) → dataprofiler.labelers.data_processing.BaseDataProcessor¶: Load data processor from within the library.

processor_type: str = 'postprocessor'¶

save_to_disk(dirpath: str) → None¶: Save data processor to a path on disk.

set_params(**kwargs: Any) → None¶: Set the parameters if they exist given kwargs.