Base Data Labeler

Contains abstract classes from which labeler classes will inherit.

class dataprofiler.labelers.base_data_labeler.BaseDataLabeler(dirpath: Optional[str] = None, load_options: Optional[dict] = None)

Bases: object

Parent class for data labeler objects.

Initialize DataLabeler class.

Parameters
  • dirpath – path to data labeler

  • load_options – optional arguments to include when loading, e.g. the class to use for the model or processors

help() None

Describe the alterable parameters, the input data formats for preprocessors, and the output data formats for postprocessors.

Returns

None

property label_mapping: dict

Retrieve the label encodings.

Returns

dictionary for associating labels to indexes

property reverse_label_mapping: dict

Retrieve the index-to-label encoding.

Returns

dictionary for associating indexes to labels

property labels: list

Retrieve the labels.

Returns

list of labels
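
The relationship between label_mapping, reverse_label_mapping, and labels can be sketched with plain dictionaries. This is only an illustration: the label names below are hypothetical, and the sketch assumes every label has its own index.

```python
# Illustrative label encoding; the label names are hypothetical.
label_mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2}

# Inverting the dictionary gives an index-to-label lookup.
reverse_label_mapping = {index: label for label, index in label_mapping.items()}

# Ordering the labels by their index gives a list of labels.
labels = sorted(label_mapping, key=label_mapping.get)
```

When several labels share one index, a simple inversion like this keeps only one label per index, so the reverse mapping is then no longer a full inverse.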

property preprocessor: data_processing.BaseDataPreprocessor | None

Retrieve the data preprocessor.

Returns

the preprocessor instance, or None if no preprocessor is set

property model: dataprofiler.labelers.base_model.BaseModel

Retrieve the data labeler model.

Returns

the model instance

property postprocessor: data_processing.BaseDataPostprocessor | None

Retrieve the data postprocessor.

Returns

the postprocessor instance, or None if no postprocessor is set

set_params(params: dict) None

Allow user to set parameters of pipeline components.

Parameters are given in the following format:

params = dict(
    preprocessor=dict(…),
    model=dict(…),
    postprocessor=dict(…)
)

where the key-value pairs for each pipeline component must match parameters that exist in that component.

Parameters

params (dict) – dictionary with a key for each pipeline component to configure, each mapped to a dictionary of that component's parameters, for example:

dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

Returns

None
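
The nested format can be illustrated with a small, self-contained sketch. The component keys come from the format above; the inner parameter names (max_length, dropout, default_label) and the validate_params helper are hypothetical and only illustrate the rule that each key must match a parameter that exists in its component.

```python
# Parameters each component is assumed to expose (hypothetical names).
EXISTING_PARAMS = {
    "preprocessor": {"max_length"},
    "model": {"dropout"},
    "postprocessor": {"default_label"},
}


def validate_params(params: dict) -> None:
    """Raise ValueError if a component or parameter name is unknown."""
    for component, component_params in params.items():
        if component not in EXISTING_PARAMS:
            raise ValueError(f"unknown pipeline component: {component}")
        unknown = set(component_params) - EXISTING_PARAMS[component]
        if unknown:
            raise ValueError(f"unknown {component} parameters: {sorted(unknown)}")


params = dict(
    preprocessor=dict(max_length=100),
    model=dict(dropout=0.1),
    postprocessor=dict(default_label="UNKNOWN"),
)
validate_params(params)  # passes: every key matches an existing parameter
```

In practice, help() is the documented way to find out which parameters a component actually accepts.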

add_label(label: str, same_as: Optional[str] = None) None

Add a label to the data labeler.

Parameters
  • label (str) – new label being added to the data labeler

  • same_as (str) – existing label whose encoding index the new label should share, so that multiple labels map to a single encoding index

Returns

None
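
The effect of same_as can be sketched on a plain label-to-index dictionary. This is an illustration under assumptions, not the library's implementation: the standalone add_label function, the choice of the next unused index for a new label, and the label names are all hypothetical.

```python
from typing import Dict, Optional


def add_label(mapping: Dict[str, int], label: str, same_as: Optional[str] = None) -> None:
    """Add `label` to `mapping`, optionally sharing the index of `same_as`."""
    if label in mapping:
        return  # label already present; nothing to do
    if same_as is not None:
        if same_as not in mapping:
            raise ValueError(f"`same_as` label {same_as!r} is not in the mapping")
        # Share the existing index so both labels encode to the same value.
        mapping[label] = mapping[same_as]
    else:
        # Otherwise take the next unused index (an assumption of this sketch).
        mapping[label] = max(mapping.values(), default=-1) + 1


mapping = {"UNKNOWN": 0, "ADDRESS": 1}
add_label(mapping, "PHONE")                      # gets a new index
add_label(mapping, "STREET", same_as="ADDRESS")  # shares the index of ADDRESS
```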

set_labels(labels: list | dict) None

Set the labels for the data labeler.

Parameters

labels (list or dict) – new labels, given either as a list of labels or as a dict of label encodings

Returns

None
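
The two accepted forms can be related with a short sketch. It assumes that a list is encoded by position starting at index 0, which this reference does not state; the to_label_mapping helper and the label names are hypothetical.

```python
from typing import Dict, List, Union


def to_label_mapping(labels: Union[List[str], Dict[str, int]]) -> Dict[str, int]:
    """Normalize labels given as a list or a dict into a label-to-index dict."""
    if isinstance(labels, dict):
        return dict(labels)  # already an encoding; return a copy
    # Assumption: a list is encoded by position, first label at index 0.
    return {label: index for index, label in enumerate(labels)}


from_list = to_label_mapping(["UNKNOWN", "ADDRESS", "PHONE"])
from_dict = to_label_mapping({"UNKNOWN": 0, "ADDRESS": 1, "PHONE": 2})
```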

predict(data: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], batch_size: int = 32, predict_options: Optional[dict] = None, error_on_mismatch: bool = False, verbose: bool = True) dict

Predict labels of input data with the data labeler model.

Parameters
  • data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to predict on

  • batch_size (int) – batch size used for prediction

  • predict_options (Dict[str, bool]) – optional prediction parameters as a dict, e.g. dict(show_confidences=True)

  • error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

  • verbose (bool) – whether to print status

Returns

predictions

Return type

Dict
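
What batch_size controls can be illustrated with a generic batching sketch. How the library batches data internally is not described here, so the iter_batches helper below is hypothetical and only shows how a dataset divides into batches of the default size of 32.

```python
from typing import Iterator, List, Sequence


def iter_batches(data: Sequence, batch_size: int = 32) -> Iterator[List]:
    """Yield consecutive slices of `data` containing at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(data), batch_size):
        yield list(data[start:start + batch_size])


samples = [f"sample {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(samples, batch_size=32)]
# 70 samples are split into batches of 32, 32 and 6 items.
```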

set_preprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPreprocessor) None

Set the data preprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

Returns

None

set_model(model: dataprofiler.labelers.base_model.BaseModel) None

Set the model for the data labeler.

Parameters

model (base_model.BaseModel) – model to use within the data labeler

Returns

None

set_postprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) None

Set the data postprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns

None

check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) None

Check whether the processors and models connect together without error.

Parameters
  • skip_postprocessor (bool) – skip checking that the postprocessor is valid in the pipeline

  • error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

Returns

None
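
The warn-versus-error behaviour selected by error_on_mismatch can be sketched as follows. The helpers, the exception and warning types, and the max_length parameter are assumptions made for illustration; the library's actual checks and messages may differ.

```python
import warnings


def report_mismatch(message: str, error_on_mismatch: bool = False) -> None:
    """Raise on a mismatch if requested, otherwise emit a warning."""
    if error_on_mismatch:
        raise RuntimeError(message)
    warnings.warn(message, RuntimeWarning)


def check_shared_params(upstream: dict, downstream: dict,
                        error_on_mismatch: bool = False) -> None:
    """Compare the parameters that two adjacent components have in common."""
    for name in sorted(upstream.keys() & downstream.keys()):
        if upstream[name] != downstream[name]:
            report_mismatch(
                f"parameter {name!r} differs: "
                f"{upstream[name]!r} != {downstream[name]!r}",
                error_on_mismatch,
            )


# Hypothetical parameters of a preprocessor and a model that disagree.
preprocessor_params = {"max_length": 100}
model_params = {"max_length": 200}
```

Calling check_shared_params(preprocessor_params, model_params) emits a warning, while passing error_on_mismatch=True raises instead.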

classmethod load_from_library(name: str) dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from the data labeler zoo in the library.

Parameters

name (str) – name of the data labeler.

Returns

the loaded data labeler

Return type

BaseDataLabeler

classmethod load_from_disk(dirpath: str, load_options: Optional[dict] = None) dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from a saved location on disk.

Parameters
  • dirpath (str) – path to data labeler files.

  • load_options (dict) – optional arguments to include when loading, e.g. the class to use for the model or processors

Returns

the loaded data labeler

Return type

BaseDataLabeler

classmethod load_with_components(preprocessor: dataprofiler.labelers.data_processing.BaseDataPreprocessor, model: dataprofiler.labelers.base_model.BaseModel, postprocessor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) dataprofiler.labelers.base_data_labeler.BaseDataLabeler

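Load the data labeler from its set of components.

Parameters
  • preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

  • model (base_model.BaseModel) – model to use within the data labeler

  • postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns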

loaded BaseDataLabeler

Return type

BaseDataLabeler

save_to_disk(dirpath: str) None

Save the data labeler to the specified location.

Parameters

dirpath (str) – location to save the data labeler.

Returns

None

class dataprofiler.labelers.base_data_labeler.TrainableDataLabeler(dirpath: Optional[str] = None, load_options: Optional[dict] = None)

Bases: dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Subclass of BaseDataLabeler that can be trained.

Initialize DataLabeler class.

Parameters
  • dirpath – path to data labeler

  • load_options – optional arguments to include when loading, e.g. the class to use for the model or processors

fit(x: DataArray, y: DataArray, validation_split: float = 0.2, labels: list | dict | None = None, reset_weights: bool = False, batch_size: int = 32, epochs: int = 1, error_on_mismatch: bool = False) list

Fit the data labeler model for the dataset.

Parameters
  • x (Union[pd.DataFrame, pd.Series, np.ndarray]) – samples to fit model

  • y (Union[pd.DataFrame, pd.Series, np.ndarray]) – labels associated with the samples to fit model

  • validation_split (float) – fraction of the data to hold out as validation data

  • labels (Union[list, dict]) – encoding or number of labels, used if the model must be refit to new labels

  • reset_weights (bool) – whether to reset the weights

  • batch_size (int) – size of each batch sent to the data labeler model

  • epochs (int) – number of epochs to iterate over the dataset and send to the model

  • error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

Returns

model output
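
The meaning of validation_split can be illustrated with a simple hold-out sketch. It assumes the last fraction of the samples is held out without shuffling; whether the library shuffles or splits differently is not stated here, and the split_for_validation helper is hypothetical.

```python
from typing import List, Sequence, Tuple


def split_for_validation(
    x: Sequence, y: Sequence, validation_split: float = 0.2
) -> Tuple[List, List, List, List]:
    """Return (x_train, y_train, x_val, y_val) using a tail hold-out."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of samples")
    if not 0.0 <= validation_split < 1.0:
        raise ValueError("validation_split must be in the range [0, 1)")
    n_val = int(len(x) * validation_split)
    n_train = len(x) - n_val
    return list(x[:n_train]), list(y[:n_train]), list(x[n_train:]), list(y[n_train:])


x = [f"sample {i}" for i in range(10)]
y = ["UNKNOWN"] * 10
x_train, y_train, x_val, y_val = split_for_validation(x, y, validation_split=0.2)
# With 10 samples and a split of 0.2, 8 are used for fitting and 2 for validation.
```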

set_model(model: dataprofiler.labelers.base_model.BaseModel) None

Set the model for a trainable data labeler.

The model must have a train function in order to be set.

Parameters

model (base_model.BaseModel) – model to use within the data labeler

Returns

None
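
One way such a requirement could be enforced is a duck-typing check like the one sketched below. The attribute name train follows the sentence above; the require_trainable helper and both model classes are hypothetical, and the library's actual check may differ.

```python
def require_trainable(model: object) -> None:
    """Raise TypeError if `model` does not expose a callable `train` attribute."""
    if not callable(getattr(model, "train", None)):
        raise TypeError("model must have a train function to be set")


class FixedModel:
    """Hypothetical model that can predict but cannot be trained."""

    def predict(self, data):
        return data


class TrainableModel(FixedModel):
    """Hypothetical model that also provides a train function."""

    def train(self, x, y):
        return None


require_trainable(TrainableModel())  # accepted: a train function is present
```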

classmethod load_with_components(preprocessor: dataprofiler.labelers.data_processing.BaseDataPreprocessor, model: dataprofiler.labelers.base_model.BaseModel, postprocessor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) dataprofiler.labelers.base_data_labeler.TrainableDataLabeler

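Load the data labeler from its set of components.

Parameters
  • preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

  • model (base_model.BaseModel) – model to use within the data labeler

  • postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns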

loaded TrainableDataLabeler

Return type

TrainableDataLabeler

add_label(label: str, same_as: Optional[str] = None) None

Add a label to the data labeler.

Parameters
  • label (str) – new label being added to the data labeler

  • same_as (str) – existing label whose encoding index the new label should share, so that multiple labels map to a single encoding index

Returns

None

check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) None

Check whether the processors and models connect together without error.

Parameters
  • skip_postprocessor (bool) – skip checking that the postprocessor is valid in the pipeline

  • error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

Returns

None

help() None

Describe the alterable parameters, the input data formats for preprocessors, and the output data formats for postprocessors.

Returns

None

property label_mapping: dict

Retrieve the label encodings.

Returns

dictionary for associating labels to indexes

property labels: list

Retrieve the labels.

Returns

list of labels

classmethod load_from_disk(dirpath: str, load_options: Optional[dict] = None) dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from a saved location on disk.

Parameters
  • dirpath (str) – path to data labeler files.

  • load_options (dict) – optional arguments to include when loading, e.g. the class to use for the model or processors

Returns

the loaded data labeler

Return type

BaseDataLabeler

classmethod load_from_library(name: str) dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from the data labeler zoo in the library.

Parameters

name (str) – name of the data labeler.

Returns

the loaded data labeler

Return type

BaseDataLabeler

property model: dataprofiler.labelers.base_model.BaseModel

Retrieve the data labeler model.

Returns

the model instance

property postprocessor: data_processing.BaseDataPostprocessor | None

Retrieve the data postprocessor.

Returns

the postprocessor instance, or None if no postprocessor is set

predict(data: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], batch_size: int = 32, predict_options: Optional[dict] = None, error_on_mismatch: bool = False, verbose: bool = True) dict

Predict labels of input data with the data labeler model.

Parameters
  • data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to predict on

  • batch_size (int) – batch size used for prediction

  • predict_options (Dict[str, bool]) – optional prediction parameters as a dict, e.g. dict(show_confidences=True)

  • error_on_mismatch (bool) – if True, raise an error instead of a warning on parameter mismatches in the pipeline

  • verbose (bool) – whether to print status

Returns

predictions

Return type

Dict

property preprocessor: data_processing.BaseDataPreprocessor | None

Retrieve the data preprocessor.

Returns

the preprocessor instance, or None if no preprocessor is set

property reverse_label_mapping: dict

Retrieve the index-to-label encoding.

Returns

dictionary for associating indexes to labels

save_to_disk(dirpath: str) None

Save the data labeler to the specified location.

Parameters

dirpath (str) – location to save the data labeler.

Returns

None

set_labels(labels: list | dict) None

Set the labels for the data labeler.

Parameters

labels (list or dict) – new labels, given either as a list of labels or as a dict of label encodings

Returns

None

set_params(params: dict) None

Allow user to set parameters of pipeline components.

Parameters are given in the following format:

params = dict(
    preprocessor=dict(…),
    model=dict(…),
    postprocessor=dict(…)
)

where the key-value pairs for each pipeline component must match parameters that exist in that component.

Parameters

params (dict) – dictionary with a key for each pipeline component to configure, each mapped to a dictionary of that component's parameters, for example:

dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

Returns

None

set_postprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) None

Set the data postprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns

None

set_preprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPreprocessor) None

Set the data preprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

Returns

None