dataprofiler.labelers.base_data_labeler module

Contains abstract classes from which labeler classes will inherit.

class dataprofiler.labelers.base_data_labeler.BaseDataLabeler(dirpath: str | None = None, load_options: dict | None = None)

Bases: object

Parent class for data labeler objects.

Initialize DataLabeler class.

Parameters:

dirpath – path to data labeler
load_options – optional arguments to use when loading, e.g. the class to use for the model or processors

help() → None

Describe alterable parameters.

Lists the input data formats for preprocessors and the output data formats for postprocessors.

Returns: None

property label_mapping: dict

Retrieve the label encodings.

Returns: dictionary mapping labels to indexes

property reverse_label_mapping: dict

Retrieve the index-to-label encoding.

Returns: dictionary mapping indexes to labels

property labels: list[str]

Retrieve the labels.

Returns: list of labels

property preprocessor: data_processing.BaseDataPreprocessor | None

Retrieve the data preprocessor.

Returns: the preprocessor instance

property model: BaseModel

Retrieve the data labeler model.

Returns: the model instance

property postprocessor: data_processing.BaseDataPostprocessor | None

Retrieve the data postprocessor.

Returns: the postprocessor instance

set_params(params: dict) → None

Allow user to set parameters of pipeline components.

Done in the following format:

params = dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

where the key-value pairs for each pipeline component must match parameters that exist in that component.

Parameters:

params (dict) –

dictionary containing a key for a given pipeline component and its associated value of parameters as such:

dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

Returns:

None
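A minimal sketch of the nested format described above. The parameter names (max_length, default_label) and the validate_param_keys helper are hypothetical and exist only for illustration; help() on a real labeler reports which parameters its components actually accept.

```python
# Hypothetical parameter names, chosen only to show the nesting.
params = dict(
    preprocessor=dict(max_length=3400),
    model=dict(max_length=3400),
    postprocessor=dict(default_label="UNKNOWN"),
)

ALLOWED_COMPONENTS = {"preprocessor", "model", "postprocessor"}


def validate_param_keys(params: dict) -> None:
    """Check that params follows the component -> dict-of-parameters shape."""
    unknown = set(params) - ALLOWED_COMPONENTS
    if unknown:
        raise ValueError(f"unknown pipeline components: {sorted(unknown)}")
    for component, values in params.items():
        if not isinstance(values, dict):
            raise TypeError(f"parameters for {component!r} must be a dict")


validate_param_keys(params)
```

A dictionary of this shape is what would then be passed to set_params(params).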

add_label(label: str, same_as: str | None = None) → None

Add a label to the data labeler.

Parameters:

label (str) – new label being added to the data labeler
same_as (str) – existing label whose encoding index the new label should share, used to map multiple labels to a single encoding index

Returns:

None
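The effect of same_as can be sketched on a plain dictionary. This is an illustrative model under stated assumptions, not the library's code: the function name add_label_sketch, the "next unused index" rule, and the error handling are all assumptions.

```python
from typing import Dict, Optional


def add_label_sketch(label_mapping: Dict[str, int], label: str,
                     same_as: Optional[str] = None) -> Dict[str, int]:
    """Return a copy of label_mapping with `label` added (illustrative only)."""
    if label in label_mapping:
        raise ValueError(f"label {label!r} already exists")
    updated = dict(label_mapping)
    if same_as is None:
        # Assumption: a new label receives the next unused index.
        updated[label] = max(updated.values(), default=-1) + 1
    else:
        if same_as not in updated:
            raise ValueError(f"same_as label {same_as!r} does not exist")
        # The new label shares the encoding index of the existing label.
        updated[label] = updated[same_as]
    return updated


mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2}
mapping = add_label_sketch(mapping, "PHONE")
mapping = add_label_sketch(mapping, "STREET", same_as="ADDRESS")
print(mapping)
```

Here "STREET" and "ADDRESS" end up with the same index, which is the multi-label to single-index case the same_as parameter describes.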

set_labels(labels: list | dict) → None

Set the labels for the data labeler.

Parameters: labels (list or dict) – new labels, given as either an encoding list or an encoding dict
Returns: None
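One way to picture the two accepted forms is a helper that normalizes both to a label-to-index dictionary. The helper to_label_mapping is hypothetical, and the assumption that a list is encoded by position is mine, not something this reference states.

```python
from typing import Dict, List, Union


def to_label_mapping(labels: Union[List[str], Dict[str, int]]) -> Dict[str, int]:
    """Normalize a list or dict of labels to a label -> index dict (illustrative)."""
    if isinstance(labels, dict):
        # A dict already carries explicit indexes; copy it unchanged.
        return dict(labels)
    # Assumption: a list is encoded by position, first label -> 0.
    return {label: index for index, label in enumerate(labels)}


from_list = to_label_mapping(["PAD", "UNKNOWN", "ADDRESS"])
from_dict = to_label_mapping({"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2})
print(from_list == from_dict)
```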

predict(data: DataFrame | Series | ndarray, batch_size: int = 32, predict_options: dict[str, bool] | None = None, error_on_mismatch: bool = False, verbose: bool = True) → dict

Predict labels of the input data using the data labeler model.

Parameters:

data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to be predicted upon
batch_size (int) – batch size of prediction
predict_options (Dict[str, bool]) – optional parameters for predict, given as a dict, e.g. dict(show_confidences=True)
error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline
verbose (bool) – flag to determine whether to print status or not

Returns:

predictions

Return type:

Dict
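To make the batch_size parameter concrete, the sketch below splits a sequence of samples into consecutive batches. It only illustrates the idea of batching; how predict batches data internally is not described here, and iter_batches is a hypothetical helper.

```python
def iter_batches(data, batch_size=32):
    """Yield consecutive slices of `data`, each holding at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


samples = [f"sample {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(samples, batch_size=32)]
print(batch_sizes)
```

With 70 samples and the default batch size of 32, the last batch is smaller than the others.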

set_preprocessor(data_processor: BaseDataPreprocessor) → None

Set the data preprocessor for the data labeler.

Parameters: data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
Returns: None

set_model(model: BaseModel) → None

Set the model for the data labeler.

Parameters: model (base_model.BaseModel) – model to use within the data labeler
Returns: None

set_postprocessor(data_processor: BaseDataPostprocessor) → None

Set the data postprocessor for the data labeler.

Parameters: data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor
Returns: None

check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) → None

Check whether the processors and models connect together without error.

Parameters:

skip_postprocessor (bool) – skip checking that the postprocessor is valid in the pipeline
error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

Returns:

None
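The warn-versus-raise behavior that error_on_mismatch selects can be sketched with the standard warnings module. The helper report_mismatch, the exception and warning classes, and the max_length values are assumptions made for illustration; the library's actual checks and exception types may differ.

```python
import warnings


def report_mismatch(message: str, error_on_mismatch: bool = False) -> None:
    """Raise on a mismatch when error_on_mismatch is True, otherwise warn."""
    if error_on_mismatch:
        raise RuntimeError(message)
    warnings.warn(message, RuntimeWarning)


# Hypothetical parameters that disagree between two pipeline components.
preprocessor_params = {"max_length": 3400}
model_params = {"max_length": 1000}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    if preprocessor_params["max_length"] != model_params["max_length"]:
        report_mismatch("max_length differs between preprocessor and model")

print(len(caught))
```

Passing error_on_mismatch=True to the same helper turns the warning into an exception, which is useful when a mismatch should stop a run rather than be logged.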

classmethod load_from_library(name: str) → BaseDataLabeler

Load the data labeler from the data labeler zoo in the library.

Parameters: name (str) – name of the data labeler.
Returns: DataLabeler class
Return type: BaseDataLabeler

classmethod load_from_disk(dirpath: str, load_options: dict | None = None) → BaseDataLabeler

Load the data labeler from a saved location on disk.

Parameters:

dirpath (str) – path to data labeler files.
load_options (dict) – optional arguments to use when loading, e.g. the class to use for the model or processors

Returns:

DataLabeler class

Return type:

BaseDataLabeler

classmethod load_with_components(preprocessor: BaseDataPreprocessor, model: BaseModel, postprocessor: BaseDataPostprocessor) → BaseDataLabeler

Load the data labeler from its set of components.

Parameters:

preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
model (base_model.BaseModel) – model to use within the data labeler
postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns:

loaded BaseDataLabeler

Return type:

BaseDataLabeler

save_to_disk(dirpath: str) → None

Save the data labeler to the specified location.

Parameters: dirpath (str) – location to save the data labeler.
Returns: None

class dataprofiler.labelers.base_data_labeler.TrainableDataLabeler(dirpath: str | None = None, load_options: dict | None = None)

Bases: BaseDataLabeler

Subclass of BaseDataLabeler that can be trained.

Initialize DataLabeler class.

Parameters:

dirpath – path to data labeler
load_options – optional arguments to use when loading, e.g. the class to use for the model or processors

fit(x: DataArray, y: DataArray, validation_split: float = 0.2, labels: list | dict | None = None, reset_weights: bool = False, batch_size: int = 32, epochs: int = 1, error_on_mismatch: bool = False) → list

Fit the data labeler model to the dataset.

Parameters:

x (Union[pd.DataFrame, pd.Series, np.ndarray]) – samples to fit the model
y (Union[pd.DataFrame, pd.Series, np.ndarray]) – labels associated with the samples to fit the model
validation_split (float) – fraction of the data to hold out as validation data
labels (Union[list, dict]) – encoding or number of labels to use if the model needs to be refit to new labels
reset_weights (bool) – flag to determine whether or not to reset the weights
batch_size (int) – size of each batch sent to the data labeler model
epochs (int) – number of epochs to iterate over the dataset and send to the model
error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

Returns:

model output
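The validation_split parameter can be pictured as holding out a trailing fraction of the samples. The sketch below assumes a simple unshuffled split; whether fit shuffles or otherwise selects the held-out samples is not stated here, and split_for_validation is a hypothetical helper.

```python
def split_for_validation(x, y, validation_split=0.2):
    """Split paired samples and labels into (train, validation) parts."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not 0.0 <= validation_split < 1.0:
        raise ValueError("validation_split must be in [0, 1)")
    n_validation = int(len(x) * validation_split)
    split_at = len(x) - n_validation
    train = (x[:split_at], y[:split_at])
    validation = (x[split_at:], y[split_at:])
    return train, validation


x = [f"text {i}" for i in range(10)]
y = [i % 2 for i in range(10)]
(train_x, train_y), (val_x, val_y) = split_for_validation(x, y, validation_split=0.2)
print(len(train_x), len(val_x))
```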

set_model(model: BaseModel) → None

Set the model for a trainable data labeler.

Model must have a train function to be able to be set.

Parameters: model (base_model.BaseModel) – model to use within the data labeler
Returns: None
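The requirement that the model have a train function can be sketched as a duck-typing check. This is an assumption about the kind of check involved, not the library's implementation; require_trainable, the two stand-in classes, and the TypeError are all hypothetical.

```python
def require_trainable(model) -> None:
    """Raise if `model` does not expose a callable train attribute."""
    if not callable(getattr(model, "train", None)):
        raise TypeError(
            f"{type(model).__name__} has no train function and cannot be "
            "used in a trainable data labeler"
        )


class PredictOnlyModel:
    """Stand-in for a model that can only predict."""

    def predict(self, data):
        return {}


class TrainableModel(PredictOnlyModel):
    """Stand-in for a model that can also be trained."""

    def train(self, train_data, val_data=None):
        return []


require_trainable(TrainableModel())  # passes silently

try:
    require_trainable(PredictOnlyModel())
    accepted = True
except TypeError:
    accepted = False
print(accepted)
```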

classmethod load_with_components(preprocessor: BaseDataPreprocessor, model: BaseModel, postprocessor: BaseDataPostprocessor) → TrainableDataLabeler

Load the data labeler from its set of components.

Parameters:

preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
model (base_model.BaseModel) – model to use within the data labeler
postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns:

loaded TrainableDataLabeler

Return type:

TrainableDataLabeler

add_label(label: str, same_as: str | None = None) → None

Add a label to the data labeler.

Parameters:

label (str) – new label being added to the data labeler
same_as (str) – existing label whose encoding index the new label should share, used to map multiple labels to a single encoding index

Returns:

None

check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) → None

Check whether the processors and models connect together without error.

Parameters:

skip_postprocessor (bool) – skip checking that the postprocessor is valid in the pipeline
error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

Returns:

None

help() → None

Describe alterable parameters.

Lists the input data formats for preprocessors and the output data formats for postprocessors.

Returns: None

property label_mapping: dict

Retrieve the label encodings.

Returns: dictionary mapping labels to indexes

property labels: list[str]

Retrieve the labels.

Returns: list of labels

classmethod load_from_disk(dirpath: str, load_options: dict | None = None) → BaseDataLabeler

Load the data labeler from a saved location on disk.

Parameters:

dirpath (str) – path to data labeler files.
load_options (dict) – optional arguments to use when loading, e.g. the class to use for the model or processors

Returns:

DataLabeler class

Return type:

BaseDataLabeler

classmethod load_from_library(name: str) → BaseDataLabeler

Load the data labeler from the data labeler zoo in the library.

Parameters: name (str) – name of the data labeler.
Returns: DataLabeler class
Return type: BaseDataLabeler

property model: BaseModel

Retrieve the data labeler model.

Returns: the model instance

property postprocessor: data_processing.BaseDataPostprocessor | None

Retrieve the data postprocessor.

Returns: the postprocessor instance

predict(data: DataFrame | Series | ndarray, batch_size: int = 32, predict_options: dict[str, bool] | None = None, error_on_mismatch: bool = False, verbose: bool = True) → dict

Predict labels of the input data using the data labeler model.

Parameters:

data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to be predicted upon
batch_size (int) – batch size of prediction
predict_options (Dict[str, bool]) – optional parameters for predict, given as a dict, e.g. dict(show_confidences=True)
error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline
verbose (bool) – flag to determine whether to print status or not

Returns:

predictions

Return type:

Dict

property preprocessor: data_processing.BaseDataPreprocessor | None

Retrieve the data preprocessor.

Returns: the preprocessor instance

property reverse_label_mapping: dict

Retrieve the index-to-label encoding.

Returns: dictionary mapping indexes to labels

save_to_disk(dirpath: str) → None

Save the data labeler to the specified location.

Parameters: dirpath (str) – location to save the data labeler.
Returns: None

set_labels(labels: list | dict) → None

Set the labels for the data labeler.

Parameters: labels (list or dict) – new labels, given as either an encoding list or an encoding dict
Returns: None

set_params(params: dict) → None

Allow user to set parameters of pipeline components.

Done in the following format:

params = dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

where the key-value pairs for each pipeline component must match parameters that exist in that component.

Parameters:

params (dict) –

dictionary containing a key for a given pipeline component and its associated value of parameters as such:

dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

Returns:

None

set_postprocessor(data_processor: BaseDataPostprocessor) → None

Set the data postprocessor for the data labeler.

Parameters: data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor
Returns: None

set_preprocessor(data_processor: BaseDataPreprocessor) → None

Set the data preprocessor for the data labeler.

Parameters: data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
Returns: None