Data Labelers

Module to train and choose between structured and unstructured data labelers.

dataprofiler.labelers.data_labelers.train_structured_labeler(data: Union[None, pandas.core.frame.DataFrame], default_label: Optional[int] = None, save_dirpath: Optional[str] = None, epochs: int = 2) → dataprofiler.labelers.base_data_labeler.TrainableDataLabeler

Use provided data to create and save a structured data labeler.

Parameters
  • data (Union[None, pd.DataFrame]) – data to be trained upon

  • save_dirpath (Union[None, str]) – path to save data labeler

  • epochs (int) – number of training epochs to run

Returns

structured data labeler

Return type

TrainableDataLabeler

class dataprofiler.labelers.data_labelers.UnstructuredDataLabeler(dirpath: Optional[str] = None, load_options: Optional[Dict] = None)

Bases: dataprofiler.labelers.base_data_labeler.BaseDataLabeler

BaseDataLabeler subclass marked as unstructured through an internal variable.

Initialize DataLabeler class.

Parameters
  • dirpath – path to data labeler

  • load_options – optional arguments to pass when loading, e.g. the class to use for the model or processors

add_label(label: str, same_as: Optional[str] = None) → None

Add a label to the data labeler.

Parameters
  • label (str) – new label being added to the data labeler

  • same_as (str) – existing label whose encoding index the new label should share, so that multiple labels map to a single encoding index

Returns

None
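The effect of same_as can be pictured with a plain dictionary. The following is a minimal sketch of one plausible reading of the description above, not the library's implementation: the helper add_label_to_mapping and the label names are hypothetical, and the real method may assign indexes differently.

```python
from typing import Dict, Optional


def add_label_to_mapping(
    mapping: Dict[str, int], label: str, same_as: Optional[str] = None
) -> Dict[str, int]:
    """Return a copy of `mapping` with `label` added (hypothetical helper)."""
    if label in mapping:
        raise ValueError(f"label {label!r} already exists")
    updated = dict(mapping)
    if same_as is None:
        # Assumption: a brand-new label takes the next unused index.
        updated[label] = max(mapping.values(), default=-1) + 1
    else:
        if same_as not in mapping:
            raise ValueError(f"same_as label {same_as!r} does not exist")
        # The new label shares the encoding index of the existing label.
        updated[label] = mapping[same_as]
    return updated


# Illustrative label names only.
mapping = {"UNKNOWN": 0, "ADDRESS": 1}
mapping = add_label_to_mapping(mapping, "NAME")
mapping = add_label_to_mapping(mapping, "STREET", same_as="ADDRESS")
print(mapping)
```

Here "STREET" and "ADDRESS" end up with the same index, which is the multi-label to single-index case that same_as describes.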

check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) → None

Check whether the processors and models connect together without error.

Parameters
  • skip_postprocessor (bool) – if true, skip checking that the postprocessor is valid in the pipeline

  • error_on_mismatch (bool) – if true, raise an error instead of a warning on parameter mismatches in the pipeline

Returns

None

help() → None

Describe the alterable parameters, the input data formats for preprocessors, and the output data formats for postprocessors.

Returns

None

property label_mapping: Dict

Retrieve the label encodings.

Returns

dictionary for associating labels to indexes

property labels: List[str]

Retrieve the labels.

Returns

list of labels

classmethod load_from_disk(dirpath: str, load_options: Optional[Dict] = None) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from a saved location on disk.

Parameters
  • dirpath (str) – path to data labeler files.

  • load_options (dict) – optional arguments to pass when loading, e.g. the class to use for the model or processors

Returns

DataLabeler class

Return type

BaseDataLabeler

classmethod load_from_library(name: str) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from the data labeler zoo in the library.

Parameters

name (str) – name of the data labeler.

Returns

DataLabeler class

Return type

BaseDataLabeler

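classmethod load_with_components(preprocessor: dataprofiler.labelers.data_processing.BaseDataPreprocessor, model: dataprofiler.labelers.base_model.BaseModel, postprocessor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from its set of components.

Parameters
  • preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

  • model (base_model.BaseModel) – model to use within the data labeler

  • postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns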

loaded BaseDataLabeler

Return type

BaseDataLabeler

property model: dataprofiler.labelers.base_model.BaseModel

Retrieve the data labeler model.

Returns

the model instance

property postprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPostprocessor]

Retrieve the data postprocessor.

Returns

the postprocessor instance

predict(data: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], batch_size: int = 32, predict_options: Optional[Dict[str, bool]] = None, error_on_mismatch: bool = False, verbose: bool = True) → Dict

Predict labels of the input data with the data labeler model.

Parameters
  • data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to be predicted upon

  • batch_size (int) – batch size of prediction

  • predict_options (Dict[str, bool]) – optional prediction parameters as a dict, e.g. dict(show_confidences=True)

  • error_on_mismatch (bool) – if true, raise an error instead of a warning on parameter mismatches in the pipeline

  • verbose (bool) – whether to print status messages

Returns

predictions

Return type

Dict
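A call to predict might be wrapped as in the sketch below. The wrapper names predict_with_confidences and load_unstructured_labeler are hypothetical; only the keyword arguments and the load_from_disk classmethod come from the signatures documented in this section. The library import is done lazily inside the loader, so the wrapper itself can be exercised with any object that exposes a compatible predict method.

```python
from typing import Any, Dict


def predict_with_confidences(labeler: Any, data: Any, batch_size: int = 32) -> Dict:
    """Call predict() with the keyword arguments documented above."""
    return labeler.predict(
        data,
        batch_size=batch_size,
        # Example option taken from the parameter description above.
        predict_options=dict(show_confidences=True),
        error_on_mismatch=False,
        verbose=False,
    )


def load_unstructured_labeler(dirpath: str) -> Any:
    """Load a saved unstructured labeler; requires the dataprofiler package."""
    # Imported lazily so that the wrapper above works without the library.
    from dataprofiler.labelers.data_labelers import UnstructuredDataLabeler

    return UnstructuredDataLabeler.load_from_disk(dirpath)
```

The structure of the returned dict is not specified in this section, so the sketch passes it through unchanged.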

property preprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPreprocessor]

Retrieve the data preprocessor.

Returns

the preprocessor instance

property reverse_label_mapping: Dict

Retrieve the index-to-label encoding.

Returns

dictionary for associating indexes to labels
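The relationship between label_mapping and reverse_label_mapping can be illustrated by inverting a dictionary. This is a sketch with made-up labels, not the library's code.

```python
# Illustrative label-to-index mapping (label names are made up).
label_mapping = {"UNKNOWN": 0, "ADDRESS": 1, "NAME": 2}

# Swap keys and values to obtain an index-to-label mapping.
reverse_label_mapping = {index: label for label, index in label_mapping.items()}

print(reverse_label_mapping[1])
```

If two labels share an index, a plain inversion like this keeps only one of them per index. How the library chooses the label in that case is not stated here.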

save_to_disk(dirpath: str) → None

Save the data labeler to the specified location.

Parameters

dirpath (str) – location to save the data labeler.

Returns

None

set_labels(labels: Union[List, Dict]) → None

Set the labels for the data labeler.

Parameters

labels (list or dict) – new labels, given as either a list or an encoding dict

Returns

None

set_model(model: dataprofiler.labelers.base_model.BaseModel) → None

Set the model for the data labeler.

Parameters

model (base_model.BaseModel) – model to use within the data labeler

Returns

None

set_params(params: Dict) → None

Allow user to set parameters of pipeline components.

Done in the following format:

    params = dict(
        preprocessor=dict(…), model=dict(…), postprocessor=dict(…)
    )

where the key-value pairs for each pipeline component must match parameters that exist in that component.

Parameters

params (dict) – dictionary containing a key for each pipeline component to configure, with a dict of parameters as its value:

    dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

Returns

None
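The nested dictionary expected by set_params can be assembled and sanity-checked before it is passed in. The helper build_params below is hypothetical, and the option names are placeholders, since the parameters each component accepts are not listed in this section.

```python
from typing import Any, Dict

# Component keys taken from the format shown above.
PIPELINE_COMPONENTS = ("preprocessor", "model", "postprocessor")


def build_params(**components: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Build a params dict keyed by pipeline component (hypothetical helper)."""
    unknown = sorted(set(components) - set(PIPELINE_COMPONENTS))
    if unknown:
        raise ValueError(f"unknown pipeline components: {unknown}")
    for name, options in components.items():
        if not isinstance(options, dict):
            raise TypeError(f"parameters for {name!r} must be a dict")
    return dict(components)


# Placeholder option names; real names depend on the components in use.
params = build_params(
    model=dict(example_model_option=1),
    postprocessor=dict(example_postprocessor_option="value"),
)
print(params)
# A labeler would then receive it as: labeler.set_params(params)
```

Checking the top-level keys early catches misspelled component names before the labeler sees them. Whether the individual options exist is still up to each component.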

set_postprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → None

Set the data postprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns

None

set_preprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPreprocessor) → None

Set the data preprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

Returns

None

class dataprofiler.labelers.data_labelers.StructuredDataLabeler(dirpath: Optional[str] = None, load_options: Optional[Dict] = None)

Bases: dataprofiler.labelers.base_data_labeler.BaseDataLabeler

BaseDataLabeler subclass marked as structured through an internal variable.

Initialize DataLabeler class.

Parameters
  • dirpath – path to data labeler

  • load_options – optional arguments to pass when loading, e.g. the class to use for the model or processors

add_label(label: str, same_as: Optional[str] = None) → None

Add a label to the data labeler.

Parameters
  • label (str) – new label being added to the data labeler

  • same_as (str) – existing label whose encoding index the new label should share, so that multiple labels map to a single encoding index

Returns

None

check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) → None

Check whether the processors and models connect together without error.

Parameters
  • skip_postprocessor (bool) – if true, skip checking that the postprocessor is valid in the pipeline

  • error_on_mismatch (bool) – if true, raise an error instead of a warning on parameter mismatches in the pipeline

Returns

None

help() → None

Describe the alterable parameters, the input data formats for preprocessors, and the output data formats for postprocessors.

Returns

None

property label_mapping: Dict

Retrieve the label encodings.

Returns

dictionary for associating labels to indexes

property labels: List[str]

Retrieve the labels.

Returns

list of labels

classmethod load_from_disk(dirpath: str, load_options: Optional[Dict] = None) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from a saved location on disk.

Parameters
  • dirpath (str) – path to data labeler files.

  • load_options (dict) – optional arguments to pass when loading, e.g. the class to use for the model or processors

Returns

DataLabeler class

Return type

BaseDataLabeler

classmethod load_from_library(name: str) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from the data labeler zoo in the library.

Parameters

name (str) – name of the data labeler.

Returns

DataLabeler class

Return type

BaseDataLabeler

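classmethod load_with_components(preprocessor: dataprofiler.labelers.data_processing.BaseDataPreprocessor, model: dataprofiler.labelers.base_model.BaseModel, postprocessor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from its set of components.

Parameters
  • preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

  • model (base_model.BaseModel) – model to use within the data labeler

  • postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns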

loaded BaseDataLabeler

Return type

BaseDataLabeler

property model: dataprofiler.labelers.base_model.BaseModel

Retrieve the data labeler model.

Returns

the model instance

property postprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPostprocessor]

Retrieve the data postprocessor.

Returns

the postprocessor instance

predict(data: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], batch_size: int = 32, predict_options: Optional[Dict[str, bool]] = None, error_on_mismatch: bool = False, verbose: bool = True) → Dict

Predict labels of the input data with the data labeler model.

Parameters
  • data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to be predicted upon

  • batch_size (int) – batch size of prediction

  • predict_options (Dict[str, bool]) – optional prediction parameters as a dict, e.g. dict(show_confidences=True)

  • error_on_mismatch (bool) – if true, raise an error instead of a warning on parameter mismatches in the pipeline

  • verbose (bool) – whether to print status messages

Returns

predictions

Return type

Dict

property preprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPreprocessor]

Retrieve the data preprocessor.

Returns

the preprocessor instance

property reverse_label_mapping: Dict

Retrieve the index-to-label encoding.

Returns

dictionary for associating indexes to labels

save_to_disk(dirpath: str) → None

Save the data labeler to the specified location.

Parameters

dirpath (str) – location to save the data labeler.

Returns

None

set_labels(labels: Union[List, Dict]) → None

Set the labels for the data labeler.

Parameters

labels (list or dict) – new labels, given as either a list or an encoding dict

Returns

None

set_model(model: dataprofiler.labelers.base_model.BaseModel) → None

Set the model for the data labeler.

Parameters

model (base_model.BaseModel) – model to use within the data labeler

Returns

None

set_params(params: Dict) → None

Allow user to set parameters of pipeline components.

Done in the following format:

    params = dict(
        preprocessor=dict(…), model=dict(…), postprocessor=dict(…)
    )

where the key-value pairs for each pipeline component must match parameters that exist in that component.

Parameters

params (dict) – dictionary containing a key for each pipeline component to configure, with a dict of parameters as its value:

    dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))

Returns

None

set_postprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → None

Set the data postprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

Returns

None

set_preprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPreprocessor) → None

Set the data preprocessor for the data labeler.

Parameters

data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

Returns

None

class dataprofiler.labelers.data_labelers.DataLabeler(labeler_type: str, dirpath: Optional[str] = None, load_options: Optional[Dict] = None, trainable: bool = False)

Bases: object

Wrapper class for choosing between structured and unstructured labeler.

Create structured and unstructured data labeler objects.

Parameters
  • labeler_type (str) – type of data labeler to create, 'structured' or 'unstructured'

  • dirpath (str) – Path to load data labeler

  • load_options (Dict) – Optional arguments to include for load.

  • trainable (bool) – whether to create a trainable data labeler

Returns

structured or unstructured data labeler

labeler_classes = {'structured': <class 'dataprofiler.labelers.data_labelers.StructuredDataLabeler'>, 'unstructured': <class 'dataprofiler.labelers.data_labelers.UnstructuredDataLabeler'>}
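
Choosing a class from a string key, as the labeler_classes attribute suggests, is a dictionary dispatch. The sketch below uses stand-in classes and a hypothetical make_labeler function to show the pattern. It is not the library's implementation, which may construct the object differently.

```python
class StructuredStub:
    """Stand-in for a structured labeler class (hypothetical)."""

    kind = "structured"


class UnstructuredStub:
    """Stand-in for an unstructured labeler class (hypothetical)."""

    kind = "unstructured"


# Mirrors the shape of the labeler_classes attribute documented above.
LABELER_CLASSES = {"structured": StructuredStub, "unstructured": UnstructuredStub}


def make_labeler(labeler_type: str):
    """Instantiate the class registered under `labeler_type`."""
    try:
        labeler_class = LABELER_CLASSES[labeler_type]
    except KeyError:
        raise ValueError(
            f"labeler_type must be one of {sorted(LABELER_CLASSES)}, "
            f"got {labeler_type!r}"
        ) from None
    return labeler_class()


labeler = make_labeler("structured")
print(labeler.kind)
```

Keeping the mapping in one place means a new labeler type only needs a new dictionary entry.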

classmethod load_from_library(name: str, trainable: bool = False) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from the data labeler zoo in the library.

Parameters
  • name (str) – name of the data labeler.

  • trainable (bool) – whether to load a trainable data labeler

Returns

DataLabeler class

classmethod load_from_disk(dirpath: str, load_options: Optional[Dict] = None, trainable: bool = False) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from a saved location on disk.

Parameters
  • dirpath (str) – path to data labeler files.

  • load_options (dict) – optional arguments to pass when loading, e.g. the class to use for the model or processors

  • trainable (bool) – whether to load a trainable data labeler

Returns

DataLabeler class

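classmethod load_with_components(preprocessor: dataprofiler.labelers.data_processing.BaseDataPreprocessor, model: dataprofiler.labelers.base_model.BaseModel, postprocessor: dataprofiler.labelers.data_processing.BaseDataPostprocessor, trainable: bool = False) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler

Load the data labeler from its set of components.

Parameters
  • preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor

  • model (base_model.BaseModel) – model to use within the data labeler

  • postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor

  • trainable (bool) – whether to load a trainable data labeler

Returns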