Base Data Labeler
Contains abstract classes from which labeler classes will inherit.
- class dataprofiler.labelers.base_data_labeler.BaseDataLabeler(dirpath: Optional[str] = None, load_options: Optional[Dict] = None)
Bases: object
Parent class for data labeler objects.
Initialize DataLabeler class.
- Parameters
dirpath – path to data labeler
load_options – optional arguments to include when loading, e.g. the class to use for the model or processors
- help() → None
Describe the alterable parameters, the input data formats for preprocessors, and the output data formats for postprocessors.
- Returns
None
- property label_mapping: Dict
Retrieve the label encodings.
- Returns
dictionary for associating labels to indexes
- property reverse_label_mapping: Dict
Retrieve the index-to-label encoding.
- Returns
dictionary for associating indexes to labels
- property labels: List[str]
Retrieve the labels.
- Returns
list of labels
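As a rough illustration of how the three label properties above relate, the sketch below builds the same views by hand from plain dictionaries. The label names and indexes are hypothetical, and this is not the library's implementation, only the relationship the descriptions imply.

```python
# Hypothetical label encoding; real labelers define their own labels.
label_mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2}

# Index -> label: the inverse view described for reverse_label_mapping.
reverse_label_mapping = {index: label for label, index in label_mapping.items()}

# The plain list of label names.
labels = list(label_mapping)

print(reverse_label_mapping[2])
print(labels)
```

On an actual labeler these values would be read from `labeler.label_mapping`, `labeler.reverse_label_mapping`, and `labeler.labels`.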
- property preprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPreprocessor]
Retrieve the data preprocessor.
- Returns
returns the preprocessor instance
- property model: Optional[dataprofiler.labelers.base_model.BaseModel]
Retrieve the data labeler model.
- Returns
returns the model instance
- property postprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPostprocessor]
Retrieve the data postprocessor.
- Returns
returns the postprocessor instance
- set_params(params: Dict) → None
Allow the user to set parameters of pipeline components.
Parameters are passed in the following format:
params = dict(
    preprocessor=dict(…), model=dict(…), postprocessor=dict(…)
)
where the key-value pairs for each pipeline component must match parameters that exist in that component.
- Parameters
params (dict) – dictionary containing a key for each pipeline component to configure, with a dict of that component's parameters as its value, e.g. dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))
- Returns
None
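To make the nested format concrete, here is a minimal sketch of a `params` dictionary together with a small hypothetical helper (`unknown_components`, not part of the library) that flags top-level keys that are not pipeline components. The parameter names inside each inner dict are illustrative assumptions; the valid names depend on the concrete components in use.

```python
# Inner parameter names are illustrative assumptions; valid keys depend on
# the concrete preprocessor, model, and postprocessor in use.
params = dict(
    preprocessor=dict(max_length=100),
    model=dict(default_label="UNKNOWN"),
    postprocessor=dict(word_level_min_percent=0.75),
)

# The three top-level keys named in the format above.
ALLOWED_COMPONENTS = {"preprocessor", "model", "postprocessor"}


def unknown_components(params):
    """Return top-level keys that are not pipeline components (hypothetical helper)."""
    return sorted(set(params) - ALLOWED_COMPONENTS)


print(unknown_components(params))
print(unknown_components({"optimizer": {}}))
```

A dictionary built this way would then be handed to a labeler with `labeler.set_params(params)`.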
- add_label(label: str, same_as: Optional[str] = None) → None
Add a label to the data labeler.
- Parameters
label (str) – new label being added to the data labeler
same_as (str) – existing label whose encoding index the new label should share, so that multiple labels map to a single encoding index
- Returns
None
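The effect of `same_as` can be sketched on a plain dictionary. The helper below is hypothetical and rests on two assumptions that the text above does not confirm: a label added without `same_as` receives the next unused index, and the label being added is new.

```python
def sketch_add_label(label_mapping, label, same_as=None):
    """Return a copy of label_mapping with `label` added (illustrative only).

    Assumed semantics: without same_as the label gets the next unused
    index; with same_as it reuses that existing label's index.
    """
    updated = dict(label_mapping)
    if same_as is None:
        updated[label] = max(updated.values(), default=-1) + 1
    else:
        if same_as not in updated:
            raise ValueError(f"same_as label {same_as!r} does not exist")
        updated[label] = updated[same_as]
    return updated


mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2}
mapping = sketch_add_label(mapping, "PHONE")                      # new index
mapping = sketch_add_label(mapping, "STREET", same_as="ADDRESS")  # shared index
print(mapping)
```

Here `STREET` and `ADDRESS` end up on the same encoding index, which is the multi-label to single-index case the `same_as` parameter describes.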
- set_labels(labels: Union[List, Dict]) → None
Set the labels for the data labeler.
- Parameters
labels (list or dict) – new labels, given as either an encoding list or an encoding dict
- Returns
None
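The two accepted shapes for `labels` can be compared with a small sketch. The helper `to_label_mapping` is hypothetical, and it assumes that a list is encoded by position while a dict is taken as given; the library may apply additional rules.

```python
def to_label_mapping(labels):
    """Normalize a list or dict of labels into a label -> index dict.

    Assumption: a list is encoded by position; a dict is used as given.
    """
    if isinstance(labels, dict):
        return dict(labels)
    if isinstance(labels, list):
        return {label: index for index, label in enumerate(labels)}
    raise TypeError("labels must be a list or dict")


from_list = to_label_mapping(["PAD", "UNKNOWN", "ADDRESS"])
from_dict = to_label_mapping({"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2})
print(from_list == from_dict)
```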
- predict(data: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], batch_size: int = 32, predict_options: Optional[Dict[str, bool]] = None, error_on_mismatch: bool = False, verbose: bool = True) → Dict
Predict labels of the input data with the data labeler model.
- Parameters
data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to be predicted upon
batch_size (int) – batch size of prediction
predict_options (Dict[str, bool]) – optional parameters for predict, passed as a dict, e.g. dict(show_confidences=True)
error_on_mismatch (bool) – if True, raises an error instead of warning on parameter mismatches in the pipeline
verbose (bool) – flag to determine whether to print status or not
- Returns
predictions
- Return type
Dict
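The call pattern for `predict` can be sketched with a stand-in object that mirrors the documented signature. `StubLabeler` is hypothetical, it accepts a plain list rather than the pandas or NumPy inputs the real method documents, and the keys of the returned dictionary are assumptions, since this section only states that a `Dict` of predictions is returned.

```python
class StubLabeler:
    """Stand-in that mirrors the documented predict signature (hypothetical)."""

    def predict(self, data, batch_size=32, predict_options=None,
                error_on_mismatch=False, verbose=True):
        # Assumed output shape: the real return keys are not specified here.
        result = {"pred": ["UNKNOWN" for _ in data]}
        if predict_options and predict_options.get("show_confidences"):
            result["conf"] = [1.0 for _ in data]
        return result


labeler = StubLabeler()
out = labeler.predict(
    ["123 Main St", "hello"],
    batch_size=2,
    predict_options=dict(show_confidences=True),
    verbose=False,
)
print(sorted(out))
```

With a real labeler the same keyword arguments apply; only the input type and the content of the result differ.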
- set_preprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPreprocessor) → None
Set the data preprocessor for the data labeler.
- Parameters
data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
- Returns
None
- set_model(model: dataprofiler.labelers.base_model.BaseModel) → None
Set the model for the data labeler.
- Parameters
model (base_model.BaseModel) – model to use within the data labeler
- Returns
None
- set_postprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → None
Set the data postprocessor for the data labeler.
- Parameters
data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor
- Returns
None
- check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) → None
Check whether the processors and model connect together without error.
- Parameters
skip_postprocessor (bool) – skip checking that the postprocessor is valid in the pipeline
error_on_mismatch (bool) – if True, raises an error instead of warning on parameter mismatches in the pipeline
- Returns
None
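The warn-versus-raise behaviour that `error_on_mismatch` selects can be sketched with the standard `warnings` module. The helper `report_mismatch`, its message, and the exception and warning classes are assumptions for illustration; the library's own types may differ.

```python
import warnings


def report_mismatch(message, error_on_mismatch=False):
    """Raise on a mismatch when error_on_mismatch is set, otherwise warn."""
    if error_on_mismatch:
        raise RuntimeError(message)
    warnings.warn(message, RuntimeWarning)


# Default: the mismatch is reported as a warning and execution continues.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_mismatch("preprocessor and model disagree on a shared parameter")
print(len(caught))

# With the flag set, the same mismatch stops execution with an exception.
try:
    report_mismatch("preprocessor and model disagree on a shared parameter",
                    error_on_mismatch=True)
except RuntimeError as exc:
    print("raised:", exc)
```

Raising is the stricter choice and suits automated pipelines, while warning is more forgiving during interactive experimentation.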
- classmethod load_from_library(name: str) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler
Load the data labeler from the data labeler zoo in the library.
- Parameters
name (str) – name of the data labeler.
- Returns
DataLabeler class
- Return type
dataprofiler.labelers.base_data_labeler.BaseDataLabeler
- classmethod load_from_disk(dirpath: str, load_options: Optional[Dict] = None) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler
Load the data labeler from a saved location on disk.
- Parameters
dirpath (str) – path to data labeler files.
load_options (dict) – optional arguments to include when loading, e.g. the class to use for the model or processors
- Returns
DataLabeler class
- Return type
dataprofiler.labelers.base_data_labeler.BaseDataLabeler
- classmethod load_with_components(preprocessor: dataprofiler.labelers.data_processing.BaseDataPreprocessor, model: dataprofiler.labelers.base_model.BaseModel, postprocessor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler
Load the data labeler from its set of components.
- Parameters
preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
model (base_model.BaseModel) – model to use within the data labeler
postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor
- Returns
loaded BaseDataLabeler
- Return type
dataprofiler.labelers.base_data_labeler.BaseDataLabeler
- save_to_disk(dirpath: str) → None
Save the data labeler to the specified location.
- Parameters
dirpath (str) – location to save the data labeler.
- Returns
None
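The save and load methods above are designed to round-trip a labeler through a directory. The sketch below shows that pattern with a stand-in class; `StubLabeler` and the `label_mapping.json` file name are hypothetical, and the real on-disk layout is not described in this section.

```python
import json
import os
import tempfile


class StubLabeler:
    """Stand-in exposing save_to_disk / load_from_disk as documented (hypothetical)."""

    def __init__(self, label_mapping=None):
        self.label_mapping = label_mapping or {}

    def save_to_disk(self, dirpath):
        os.makedirs(dirpath, exist_ok=True)
        # Hypothetical file name; the real layout may differ.
        with open(os.path.join(dirpath, "label_mapping.json"), "w") as fp:
            json.dump(self.label_mapping, fp)

    @classmethod
    def load_from_disk(cls, dirpath, load_options=None):
        with open(os.path.join(dirpath, "label_mapping.json")) as fp:
            return cls(json.load(fp))


# Save into a temporary directory, then restore through the classmethod.
with tempfile.TemporaryDirectory() as tmp:
    dirpath = os.path.join(tmp, "my_labeler")
    StubLabeler({"PAD": 0, "UNKNOWN": 1}).save_to_disk(dirpath)
    restored = StubLabeler.load_from_disk(dirpath)

print(restored.label_mapping)
```

Note that saving is an instance method while loading is a classmethod, so a restored labeler is created from the class rather than from an existing instance.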
- class dataprofiler.labelers.base_data_labeler.TrainableDataLabeler(dirpath: Optional[str] = None, load_options: Optional[Dict] = None)
Bases: dataprofiler.labelers.base_data_labeler.BaseDataLabeler
Subclass of BaseDataLabeler that can be trained.
Initialize DataLabeler class.
- Parameters
dirpath – path to data labeler
load_options – optional arguments to include when loading, e.g. the class to use for the model or processors
- fit(x: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], y: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], validation_split: float = 0.2, labels: Optional[Union[List, Dict]] = None, reset_weights: bool = False, batch_size: int = 32, epochs: int = 1, error_on_mismatch: bool = False) → List
Fit the data labeler model for the dataset.
- Parameters
x (Union[pd.DataFrame, pd.Series, np.ndarray]) – samples to fit the model
y (Union[pd.DataFrame, pd.Series, np.ndarray]) – labels associated with the samples to fit the model
validation_split (float) – fraction of the data to hold out for validation
labels (Union[list, dict]) – label encoding or number of labels, used if the model needs to be refit to new labels
reset_weights (bool) – flag to determine whether or not to reset the weights
batch_size (int) – size of each batch sent to the data labeler model
epochs (int) – number of epochs to iterate over the dataset and send to the model
error_on_mismatch (bool) – if True, raises an error instead of warning on parameter mismatches in the pipeline
- Returns
model output
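To give a feel for what a `validation_split` of 0.2 means, the sketch below holds out a trailing fraction of the samples. This is one common interpretation and an assumption here: the text above does not say whether the library shuffles, how it rounds, or which portion it holds out. `split_for_validation` is a hypothetical helper.

```python
def split_for_validation(x, y, validation_split=0.2):
    """Hold out the trailing fraction of (x, y) pairs for validation (illustrative)."""
    if not 0.0 <= validation_split < 1.0:
        raise ValueError("validation_split must be in [0, 1)")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n_val = int(len(x) * validation_split)
    cut = len(x) - n_val
    return (x[:cut], y[:cut]), (x[cut:], y[cut:])


x = [f"sample {i}" for i in range(10)]
y = ["UNKNOWN"] * 10
(train_x, train_y), (val_x, val_y) = split_for_validation(x, y, 0.2)
print(len(train_x), len(val_x))
```

With ten samples and the default split, eight are used for training and two for validation under these assumptions.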
- set_model(model: dataprofiler.labelers.base_model.BaseModel) → None
Set the model for a trainable data labeler.
The model must have a train function in order to be set.
- Parameters
model (base_model.BaseModel) – model to use within the data labeler
- Returns
None
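The requirement that a model expose a train function can be sketched as a simple capability check. The helper `ensure_trainable`, the two stand-in model classes, and the exception type are assumptions for illustration, not the library's implementation.

```python
class TrainableModel:
    """Stand-in model that exposes a train function (hypothetical)."""

    def train(self, train_data, val_data):
        return []


class PredictOnlyModel:
    """Stand-in model without a train function (hypothetical)."""

    def predict(self, data):
        return {}


def ensure_trainable(model):
    """Return the model unchanged, or reject it if it has no callable train."""
    if not callable(getattr(model, "train", None)):
        raise ValueError("model must implement a train function")
    return model


accepted = ensure_trainable(TrainableModel())
try:
    ensure_trainable(PredictOnlyModel())
except ValueError as exc:
    print("rejected:", exc)
```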
- classmethod load_with_components(preprocessor: dataprofiler.labelers.data_processing.BaseDataPreprocessor, model: dataprofiler.labelers.base_model.BaseModel, postprocessor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → dataprofiler.labelers.base_data_labeler.TrainableDataLabeler
Load the data labeler from its set of components.
- Parameters
preprocessor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
model (base_model.BaseModel) – model to use within the data labeler
postprocessor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor
- Returns
loaded TrainableDataLabeler
- Return type
dataprofiler.labelers.base_data_labeler.TrainableDataLabeler
- add_label(label: str, same_as: Optional[str] = None) → None
Add a label to the data labeler.
- Parameters
label (str) – new label being added to the data labeler
same_as (str) – existing label whose encoding index the new label should share, so that multiple labels map to a single encoding index
- Returns
None
- check_pipeline(skip_postprocessor: bool = False, error_on_mismatch: bool = False) → None
Check whether the processors and model connect together without error.
- Parameters
skip_postprocessor (bool) – skip checking that the postprocessor is valid in the pipeline
error_on_mismatch (bool) – if True, raises an error instead of warning on parameter mismatches in the pipeline
- Returns
None
- help() → None
Describe the alterable parameters, the input data formats for preprocessors, and the output data formats for postprocessors.
- Returns
None
- property label_mapping: Dict
Retrieve the label encodings.
- Returns
dictionary for associating labels to indexes
- property labels: List[str]
Retrieve the labels.
- Returns
list of labels
- classmethod load_from_disk(dirpath: str, load_options: Optional[Dict] = None) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler
Load the data labeler from a saved location on disk.
- Parameters
dirpath (str) – path to data labeler files.
load_options (dict) – optional arguments to include when loading, e.g. the class to use for the model or processors
- Returns
DataLabeler class
- Return type
dataprofiler.labelers.base_data_labeler.BaseDataLabeler
- classmethod load_from_library(name: str) → dataprofiler.labelers.base_data_labeler.BaseDataLabeler
Load the data labeler from the data labeler zoo in the library.
- Parameters
name (str) – name of the data labeler.
- Returns
DataLabeler class
- Return type
dataprofiler.labelers.base_data_labeler.BaseDataLabeler
- property model: Optional[dataprofiler.labelers.base_model.BaseModel]
Retrieve the data labeler model.
- Returns
returns the model instance
- property postprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPostprocessor]
Retrieve the data postprocessor.
- Returns
returns the postprocessor instance
- predict(data: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, numpy.ndarray], batch_size: int = 32, predict_options: Optional[Dict[str, bool]] = None, error_on_mismatch: bool = False, verbose: bool = True) → Dict
Predict labels of the input data with the data labeler model.
- Parameters
data (Union[pd.DataFrame, pd.Series, np.ndarray]) – data to be predicted upon
batch_size (int) – batch size of prediction
predict_options (Dict[str, bool]) – optional parameters for predict, passed as a dict, e.g. dict(show_confidences=True)
error_on_mismatch (bool) – if True, raises an error instead of warning on parameter mismatches in the pipeline
verbose (bool) – flag to determine whether to print status or not
- Returns
predictions
- Return type
Dict
- property preprocessor: Optional[dataprofiler.labelers.data_processing.BaseDataPreprocessor]
Retrieve the data preprocessor.
- Returns
returns the preprocessor instance
- property reverse_label_mapping: Dict
Retrieve the index-to-label encoding.
- Returns
dictionary for associating indexes to labels
- save_to_disk(dirpath: str) → None
Save the data labeler to the specified location.
- Parameters
dirpath (str) – location to save the data labeler.
- Returns
None
- set_labels(labels: Union[List, Dict]) → None
Set the labels for the data labeler.
- Parameters
labels (list or dict) – new labels, given as either an encoding list or an encoding dict
- Returns
None
- set_params(params: Dict) → None
Allow the user to set parameters of pipeline components.
Parameters are passed in the following format:
params = dict(
    preprocessor=dict(…), model=dict(…), postprocessor=dict(…)
)
where the key-value pairs for each pipeline component must match parameters that exist in that component.
- Parameters
params (dict) – dictionary containing a key for each pipeline component to configure, with a dict of that component's parameters as its value, e.g. dict(preprocessor=dict(…), model=dict(…), postprocessor=dict(…))
- Returns
None
- set_postprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPostprocessor) → None
Set the data postprocessor for the data labeler.
- Parameters
data_processor (data_processing.BaseDataPostprocessor) – processor to set as the postprocessor
- Returns
None
- set_preprocessor(data_processor: dataprofiler.labelers.data_processing.BaseDataPreprocessor) → None
Set the data preprocessor for the data labeler.
- Parameters
data_processor (data_processing.BaseDataPreprocessor) – processor to set as the preprocessor
- Returns
None