Data Processing¶
Contains pre-built processors for data labeling/processing.
- class dataprofiler.labelers.data_processing.AutoSubRegistrationMeta(clsname: str, bases: tuple, attrs: dict)¶
Bases:
abc.ABCMeta
For registering subclasses.
Create AutoSubRegistration object.
- mro()¶
Return a type’s method resolution order.
- register(subclass)¶
Register a virtual subclass of an ABC.
Returns the subclass, to allow usage as a class decorator.
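The registration idea can be illustrated with a minimal, standalone sketch. It does not reproduce the library's implementation: `AutoRegisterMeta`, `BaseProcessor`, `_registry`, and `UpperCaser` are hypothetical names, and skipping abstract classes is an assumption made for this example.

```python
import abc
import inspect


class AutoRegisterMeta(abc.ABCMeta):
    """Hypothetical metaclass that records each concrete subclass by name."""

    def __new__(mcs, clsname, bases, attrs):
        new_class = super().__new__(mcs, clsname, bases, attrs)
        # Assumption: only concrete classes are registered; abstract ones are skipped.
        if not inspect.isabstract(new_class):
            new_class._registry[clsname] = new_class
        return new_class


class BaseProcessor(metaclass=AutoRegisterMeta):
    """Hypothetical abstract base that owns the shared registry."""

    _registry = {}

    @abc.abstractmethod
    def process(self, data):
        """Process data."""

    @classmethod
    def get_class(cls, class_name):
        # Return the registered class, or None when the name is unknown.
        return cls._registry.get(class_name)


class UpperCaser(BaseProcessor):
    """Concrete subclass; defining it is enough to register it."""

    def process(self, data):
        return [text.upper() for text in data]


found = BaseProcessor.get_class("UpperCaser")
missing = BaseProcessor.get_class("DoesNotExist")
```

Because registration happens at class-creation time, no explicit call is needed after defining a subclass.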
- class dataprofiler.labelers.data_processing.BaseDataProcessor(**parameters: Any)¶
Bases:
object
Abstract Data processing class.
Initialize BaseDataProcessor object.
- processor_type: str¶
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- abstract classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- abstract process(*args: Any, **kwargs: Any) Any ¶
Process data.
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
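The parameter accessors described above can be pictured with a small standalone sketch. `ParamHolder` is a hypothetical stand-in, not the library class, and raising `ValueError` on unknown names is an assumption; the source only says parameters are set "if they exist".

```python
class ParamHolder:
    """Hypothetical stand-in for a processor that keeps its settings in a dict."""

    def __init__(self, **parameters):
        self._parameters = dict(parameters)

    def get_parameters(self, param_list=None):
        # With no list, return a copy of every parameter;
        # otherwise return only the requested names.
        if param_list is None:
            return dict(self._parameters)
        return {name: self._parameters[name] for name in param_list}

    def set_params(self, **kwargs):
        # Assumption for this sketch: unknown names are rejected
        # so that typos fail loudly instead of creating new settings.
        unknown = [name for name in kwargs if name not in self._parameters]
        if unknown:
            raise ValueError(f"unknown parameter(s): {unknown}")
        self._parameters.update(kwargs)


holder = ParamHolder(max_length=3400, pad_label="PAD")
holder.set_params(max_length=100)
subset = holder.get_parameters(["max_length"])
```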
- class dataprofiler.labelers.data_processing.BaseDataPreprocessor(**parameters: Any)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract Data preprocessing class.
Initialize BaseDataPreprocessor object.
- processor_type: str = 'preprocessor'¶
- abstract process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] | tuple[np.ndarray, np.ndarray] | np.ndarray ¶
Preprocess data.
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- abstract classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.BaseDataPostprocessor(**parameters: Any)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataProcessor
Abstract Data postprocessing class.
Initialize BaseDataPostprocessor object.
- processor_type: str = 'postprocessor'¶
- abstract process(data: numpy.ndarray, results: dict, label_mapping: dict) dict ¶
Postprocess data.
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- abstract classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.DirectPassPreprocessor¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPreprocessor
Subclass of BaseDataPreprocessor for preprocessing data.
Initialize the DirectPassPreprocessor class.
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) tuple[np.ndarray, np.ndarray] | np.ndarray ¶
Preprocess data.
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'preprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.CharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False, **kwargs: Any)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPreprocessor
Subclass of BaseDataPreprocessor for preprocessing char data.
Initialize the CharPreprocessor class.
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] ¶
Flatten batches of data.
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[None, dict]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Returns
batch_data: a dict containing samples of size batch_size
- Return type
dict
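To make the roles of max_length, flatten_separator, and batch_size concrete, here is a simplified standalone sketch of flattening followed by batching. It is not the library's algorithm: `flatten_samples` and `batched` are hypothetical helpers, and flatten_split, is_separate_at_max_len, and label handling are deliberately left out.

```python
def flatten_samples(samples, max_length, separator=" "):
    """Greedily join samples with `separator` into chunks of at most max_length chars.

    A single sample longer than max_length is kept whole in this sketch.
    """
    chunks = []
    parts = []   # samples collected for the chunk being built
    length = 0   # length of separator.join(parts)
    for sample in samples:
        extra = len(sample) + (len(separator) if parts else 0)
        if parts and length + extra > max_length:
            # The next sample does not fit: close the current chunk.
            chunks.append(separator.join(parts))
            parts, length = [], 0
            extra = len(sample)
        parts.append(sample)
        length += extra
    if parts:
        chunks.append(separator.join(parts))
    return chunks


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = flatten_samples(["abc", "de", "fghij", "k"], max_length=6)
batches = list(batched(chunks, batch_size=2))
```

Yielding batches from a generator mirrors the Generator return annotation in the signature above and avoids materializing every batch at once.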
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'preprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.CharEncodedPreprocessor(encoding_map: dict[str, int] | None = None, max_length: int = 5000, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_split: float = 0, flatten_separator: str = ' ', is_separate_at_max_len: bool = False)¶
Bases:
dataprofiler.labelers.data_processing.CharPreprocessor
Subclass of CharPreprocessor for preprocessing char encoded data.
Initialize the CharEncodedPreprocessor class.
- Parameters
encoding_map (dict) – char to int encoding map
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_split (float) – approximate output of split between flattened and non-flattened characters, value between [0, 1]. When the current flattened split becomes more than the flatten_split value, any leftover sample or subsequent samples will be non-flattened until the current flattened split is below the flatten_split value
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
- process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] ¶
Prepare structured data for processing by CharacterLevelCnnModel.
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Returns
batch_data: a dict containing samples of size batch_size
- Return type
dict
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'preprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.CharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = ' ', use_word_level_argmax: bool = False, output_format: str = 'character_argmax', separators: tuple = (' ', ',', ';', "'", '"', ':', '\n', '\t', '.'), word_level_min_percent: float = 0.75)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Subclass of BaseDataPostprocessor for postprocessing char data.
Initialize the CharPostprocessor class.
- Parameters
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
use_word_level_argmax (bool) – whether to require the argmax value of each character in a word to determine the word’s entity
output_format (str) – output format, either character_argmax or NER: character_argmax is a list of encodings for each character in the input text; NER is a dict format that specifies start, end, and label for each entity in a sentence
separators (tuple(str)) – tuple of characters to use for separating words within the character predictions
word_level_min_percent (float) – threshold on generating dominant word_level labeling
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- static convert_to_NER_format(predictions: list, label_mapping: dict, default_label: str, pad_label: str) list ¶
Convert word level predictions to specified format.
- Parameters
predictions (list) – predictions
label_mapping (dict) – labels and corresponding integers
default_label (str) – default label in label_mapping
pad_label (str) – pad label in label_mapping
- Returns
formatted predictions
- Return type
list
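The conversion can be sketched as grouping runs of identical predictions into spans. This is an illustration only: `char_preds_to_entities` is a hypothetical function, it works on per-character integer predictions, and the end-exclusive `(start, end, label)` tuples are an assumption, since the source does not state the exact output layout.

```python
def char_preds_to_entities(predictions, label_mapping, default_label, pad_label):
    """Turn per-character label indices into (start, end, label) spans per sample."""
    # Invert the mapping so indices can be turned back into label strings
    # (assumes the mapping is one-to-one).
    index_to_label = {index: label for label, index in label_mapping.items()}
    skipped = {label_mapping[default_label], label_mapping[pad_label]}

    formatted = []
    for sample in predictions:
        entities = []
        start = 0
        for pos in range(1, len(sample) + 1):
            # Close the current run at the end of the sample or when the label changes.
            if pos == len(sample) or sample[pos] != sample[start]:
                if sample[start] not in skipped:
                    entities.append((start, pos, index_to_label[sample[start]]))
                start = pos
        formatted.append(entities)
    return formatted


mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2, "PHONE": 3}
spans = char_preds_to_entities(
    [[1, 2, 2, 2, 1, 3, 3], [1, 1]], mapping, "UNKNOWN", "PAD"
)
```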
- static match_sentence_lengths(data: numpy.ndarray, results: dict, flatten_separator: str, inplace: bool = True) dict ¶
Convert results from model into same ragged data shapes as original data.
- Parameters
data (numpy.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string that joins samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
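A simplified view of restoring the original ragged shapes: walk the flattened predictions and slice out one segment per original sample, skipping the separator positions. `unflatten_predictions` is a hypothetical helper, and the assumption that all samples were flattened into a single sequence is made only for this sketch.

```python
def unflatten_predictions(data, flat_pred, separator):
    """Slice one flat prediction sequence back into per-sample pieces.

    Assumes flat_pred has one entry per character of separator.join(data).
    """
    ragged = []
    pos = 0
    for sample in data:
        ragged.append(flat_pred[pos:pos + len(sample)])
        # Skip past this sample and the separator that follows it.
        pos += len(sample) + len(separator)
    return ragged


ragged = unflatten_predictions(["ab", "cde"], [1, 1, 0, 2, 2, 2], separator=" ")
```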
- process(data: numpy.ndarray, results: dict, label_mapping: dict) dict ¶
Conduct processing on data given predictions, label_mapping, and default_label.
- Parameters
data (Union[np.ndarray, pd.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – labels and corresponding integers
- Returns
dict of predictions and, if they exist, confidences
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'postprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.StructCharPreprocessor(max_length: int = 3400, default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_separate_at_max_len: bool = False)¶
Bases:
dataprofiler.labelers.data_processing.CharPreprocessor
Subclass of CharPreprocessor for preprocessing struct char data.
Initialize the StructCharPreprocessor class.
- Parameters
max_length (int) – Maximum char length in a sample.
default_label (string (could be int, char, etc.)) – Key for label_mapping that is the default label
pad_label (string (could be int, char, etc.)) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_separate_at_max_len (bool) – if true, separates at max_length, otherwise at nearest separator
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- convert_to_unstructured_format(data: np.ndarray, labels: list[str] | npt.NDArray[np.str_] | None) tuple[str, list[tuple[int, int, str]] | None] ¶
Convert data samples list to StructCharPreprocessor required input data format.
- Parameters
data (numpy.ndarray) – list of strings
labels (Optional[Union[List[str], npt.NDArray[np.str_]]]) – labels for each input character
- Returns
data in the following format: text="<SAMPLE><SEPARATOR><SAMPLE>…", entities=[(start=<INT>, end=<INT>, label="<LABEL>"), … (num_samples in data)]
- Return type
Tuple[str, Optional[List[Tuple[int, int, str]]]]
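The documented return format can be illustrated with a short standalone sketch that joins the samples and records one entity span per sample. `to_unstructured` is a hypothetical function; whether the library omits some entities (for example default-label ones) is not stated in the source and is not modeled here.

```python
def to_unstructured(samples, labels, separator="\x01" * 5):
    """Join samples into one string and build (start, end, label) spans."""
    text = separator.join(samples)
    if labels is None:
        return text, None

    entities = []
    start = 0
    for sample, label in zip(samples, labels):
        end = start + len(sample)
        entities.append((start, end, label))
        # The next sample begins after the separator.
        start = end + len(separator)
    return text, entities


text, entities = to_unstructured(["ab", "cde"], ["A", "B"], separator="|")
```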
- process(data: np.ndarray, labels: np.ndarray | None = None, label_mapping: dict[str, int] | None = None, batch_size: int = 32) Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] ¶
Prepare structured data for processing by CharacterLevelCnnModel.
- Parameters
data (numpy.ndarray) – List of strings to create embeddings for
labels (numpy.ndarray) – labels for each input character
label_mapping (Union[dict, None]) – maps labels to their encoded integers
batch_size (int) – Number of samples in the batch of data
- Returns
batch_data: a dict containing samples of size batch_size
- Return type
dict
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'preprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.StructCharPostprocessor(default_label: str = 'UNKNOWN', pad_label: str = 'PAD', flatten_separator: str = '\x01\x01\x01\x01\x01', is_pred_labels: bool = True, random_state: random.Random | int | list | tuple | None = None)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Subclass of BaseDataPostprocessor for postprocessing struct char data.
Initialize the StructCharPostprocessor class.
- Parameters
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
flatten_separator (str) – separator used to put between flattened samples.
is_pred_labels (bool) – (default: true) if true, will convert the model indexes to the label strings given the label_mapping
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- static match_sentence_lengths(data: numpy.ndarray, results: dict, flatten_separator: str, inplace: bool = True) dict ¶
Convert results from model into same ragged data shapes as original data.
- Parameters
data (np.ndarray) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
flatten_separator (str) – string that joins samples together when flattening
inplace (bool) – flag to modify results in place
- Returns
dict(pred=…) or dict(pred=…, conf=…)
- convert_to_structured_analysis(sentences: numpy.ndarray, results: dict, label_mapping: dict, default_label: str, pad_label: str) dict ¶
Convert unstructured results to a structured column analysis.
This assumes the column was flattened into a single sample, and takes the mode of all character predictions except for the separator labels. In case of a tie, choose anything but background; otherwise randomly choose between the remaining labels.
- Parameters
sentences (numpy.ndarray) – samples which were predicted upon
results (dict) – character predictions for each sample returned from the model
label_mapping (dict) – maps labels to their encoded integers
default_label (str) – Key for label_mapping that is the default label
pad_label (str) – Key for label_mapping that is the pad label
- Returns
prediction value for a single column
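The mode-with-tie-breaking rule described above can be sketched for a single sample as follows. `pick_label` is a hypothetical function operating on label strings; treating the default label as "background" and seeding a fallback `random.Random(0)` are assumptions of this sketch.

```python
import random
from collections import Counter


def pick_label(char_labels, default_label, ignore_labels=(), rng=None):
    """Return the most common label, preferring non-default labels on ties."""
    if rng is None:
        rng = random.Random(0)  # assumption: fixed seed for reproducibility
    counts = Counter(label for label in char_labels if label not in ignore_labels)
    if not counts:
        return default_label
    top = max(counts.values())
    # Sort so the random choice does not depend on insertion order.
    tied = sorted(label for label, count in counts.items() if count == top)
    if len(tied) > 1:
        # On a tie, drop the default (background) label when something else remains.
        tied = [label for label in tied if label != default_label] or tied
    return rng.choice(tied)


tie_result = pick_label(["PHONE", "PHONE", "UNKNOWN", "UNKNOWN"], "UNKNOWN")
mode_result = pick_label(["A", "A", "B"], "UNKNOWN")
```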
- process(data: numpy.ndarray, results: dict, label_mapping: dict) dict ¶
Postprocess CharacterLevelCnnModel results when given structured data.
Said structured data is processed by StructCharPreprocessor.
- Parameters
data (Union[numpy.ndarray, pandas.DataFrame]) – original input data to the data labeler
results (dict) – dict of model character level predictions and confs
label_mapping (dict) – maps labels to their encoded integers
- Returns
dict of predictions and, if they exist, confidences
- Return type
dict
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'postprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.RegexPostProcessor(aggregation_func: str = 'split', priority_order: list | np.ndarray | None = None, random_state: random.Random | int | list | tuple | None = None)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Subclass of BaseDataPostprocessor for postprocessing regex data.
Initialize the RegexPostProcessor class.
- Parameters
aggregation_func (str) – aggregation function to apply to regex model output (split, random, priority)
priority_order (Union[list, numpy.ndarray]) – if priority is set as the aggregation function, the order in which entities are given priority must be set
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- static priority_prediction(results: dict, entity_priority_order: numpy.ndarray) None ¶
Use priority of regex to give entity determination.
- Parameters
results (dict) – regex results from the model in the format: dict(pred=…, conf=…)
entity_priority_order (np.ndarray) – list of entity priorities (lowest value has highest priority)
- Returns
None
- static split_prediction(results: dict) None ¶
Split the prediction across votes.
- Parameters
results (dict) – regex results from the model in the format: dict(pred=…, conf=…)
- Returns
None
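The two aggregation strategies can be contrasted with a plain-Python sketch. It assumes each row holds 0/1 regex matches per label, which the source does not specify; `split_votes` and `priority_pick` are hypothetical helpers that return new lists, whereas the documented methods return None.

```python
def split_votes(vote_rows):
    """Share confidence equally among the labels that matched in each row."""
    shared = []
    for row in vote_rows:
        total = sum(row)
        shared.append([vote / total if total else 0.0 for vote in row])
    return shared


def priority_pick(vote_rows, priorities):
    """Pick, per row, the matched label index with the lowest priority value."""
    picks = []
    for row in vote_rows:
        matched = [index for index, vote in enumerate(row) if vote]
        if matched:
            picks.append(min(matched, key=lambda index: priorities[index]))
        else:
            picks.append(None)  # nothing matched in this row
    return picks


votes = [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
shared = split_votes(votes)
picked = priority_pick(votes, priorities=[2, 0, 1])
```

Splitting keeps every matching label visible in the confidences, while priority selection commits to a single label per row.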
- process(data: numpy.ndarray, results: dict, label_mapping: dict) dict ¶
Postprocess data.
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'postprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.
- class dataprofiler.labelers.data_processing.StructRegexPostProcessor(random_state: random.Random | int | list | tuple | None = None)¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Subclass of BaseDataPostprocessor for postprocessing struct regex data.
Initialize the StructRegexPostProcessor class.
- Parameters
random_state (random.Random) – random state setting to be used for randomly selecting a prediction when two labels have equal opportunity for a given sample.
- set_params(**kwargs: Any) None ¶
Given kwargs, set the parameters if they exist.
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- process(data: numpy.ndarray, results: dict, label_mapping: dict) dict ¶
Postprocess data.
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'postprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- class dataprofiler.labelers.data_processing.ColumnNameModelPostprocessor¶
Bases:
dataprofiler.labelers.data_processing.BaseDataPostprocessor
Subclass of BaseDataPostprocessor for postprocessing column name model data.
Initialize the ColumnNameModelPostprocessor class.
- classmethod help() None ¶
Describe alterable parameters.
Input data formats for preprocessors. Output data formats for postprocessors.
- Returns
None
- process(data: np.ndarray, results: dict, label_mapping: dict[str, int] | None = None) dict ¶
Postprocess data.
- classmethod get_class(class_name: str) type[BaseDataProcessor] | None ¶
Get class of BaseDataProcessor object.
- get_parameters(param_list: list[str] | None = None) dict ¶
Return a dict of parameters from the model given a list.
- Parameters
param_list (list) – list of parameters to retrieve from the model.
- Returns
dict of parameters
- classmethod load_from_disk(dirpath: str) dataprofiler.labelers.data_processing.Processor ¶
Load data processor from a given path on disk.
- classmethod load_from_library(name: str) dataprofiler.labelers.data_processing.BaseDataProcessor ¶
Load data processor from within the library.
- processor_type: str = 'postprocessor'¶
- save_to_disk(dirpath: str) None ¶
Save data processor to a path on disk.
- set_params(**kwargs: Any) None ¶
Set the parameters if they exist given kwargs.