dataprofiler.labelers.labeler_utils module

Contains functions for the data labeler.

dataprofiler.labelers.labeler_utils.f1_report_dict_to_str(f1_report: dict, label_names: list[str]) str

Return the report string from the f1_report dict.

Example Output:

              precision    recall  f1-score   support

     class 0       0.00      0.00      0.00         1
     class 1       1.00      0.67      0.80         3

   micro avg       0.67      0.50      0.57         4
   macro avg       0.50      0.33      0.40         4
weighted avg       0.75      0.50      0.60         4

Note: this format is generally taken from the classification_report function inside sklearn.

Parameters:
  • f1_report (dict) – f1 report dictionary from sklearn

  • label_names (list(str)) – names of labels included in the report

Returns:

string representing the f1_report printout

Return type:

str
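The f1_report dict follows the shape produced by sklearn's classification_report(..., output_dict=True). A minimal sketch of how such a dict maps to the printout above; the dict contents and the format_report helper here are illustrative stand-ins, not the library's actual implementation:

```python
# Illustrative f1_report dict in the shape produced by
# sklearn.metrics.classification_report(..., output_dict=True).
f1_report = {
    "class 0": {"precision": 0.00, "recall": 0.00, "f1-score": 0.00, "support": 1},
    "class 1": {"precision": 1.00, "recall": 0.67, "f1-score": 0.80, "support": 3},
    "weighted avg": {"precision": 0.75, "recall": 0.50, "f1-score": 0.60, "support": 4},
}

def format_report(report: dict) -> str:
    """Hypothetical stand-in mimicking f1_report_dict_to_str's layout."""
    header = f"{'':>14}{'precision':>10}{'recall':>8}{'f1-score':>10}{'support':>9}"
    rows = [header, ""]
    for label, scores in report.items():
        rows.append(
            f"{label:>14}"
            f"{scores['precision']:>10.2f}{scores['recall']:>8.2f}"
            f"{scores['f1-score']:>10.2f}{scores['support']:>9}"
        )
    return "\n".join(rows)

print(format_report(f1_report))
```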

dataprofiler.labelers.labeler_utils.evaluate_accuracy(predicted_entities_in_index: list[list[int]], true_entities_in_index: list[list[int]], num_labels: int, entity_rev_dict: dict[int, str], verbose: bool = True, omitted_labels: tuple[str, ...] = ('PAD', 'UNKNOWN'), confusion_matrix_file: str | None = None) tuple[float, dict]

Evaluate accuracy from comparing predicted labels with true labels.

Parameters:
  • predicted_entities_in_index (list(array(int))) – predicted encoded labels for input sentences

  • true_entities_in_index (list(array(int))) – true encoded labels for input sentences

  • entity_rev_dict (dict([index, entity])) – dictionary to convert indices to entities

  • verbose (boolean) – print additional information for debugging

  • omitted_labels (tuple(str)) – labels to omit from the accuracy evaluation

  • confusion_matrix_file (str) – File name (and dir) for confusion matrix

Returns:

f1-score and dict representation of the f1 report

Return type:

tuple(float, dict)
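A sketch of the comparison this performs, flattening predicted and true label indices and skipping positions whose true label is omitted. The entity_rev_dict contents and the plain accuracy computed here are illustrative; the library reports a full f1 score, not this simplified match rate:

```python
# Hypothetical reverse dictionary: index -> entity name.
entity_rev_dict = {0: "PAD", 1: "UNKNOWN", 2: "NAME", 3: "DATE"}
omitted_labels = ("PAD", "UNKNOWN")

predicted = [[2, 3, 0], [3, 2, 1]]
true = [[2, 2, 0], [3, 2, 1]]

# Flatten sentence-wise labels and drop positions whose true label is omitted.
pairs = [
    (p, t)
    for pred_row, true_row in zip(predicted, true)
    for p, t in zip(pred_row, true_row)
    if entity_rev_dict[t] not in omitted_labels
]
accuracy = sum(p == t for p, t in pairs) / len(pairs)
print(accuracy)  # 3 of the 4 non-omitted positions match -> 0.75
```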

dataprofiler.labelers.labeler_utils.get_tf_layer_index_from_name(model: tf.keras.Model, layer_name: str) int | None

Return the index of the layer given the layer name within a tf model.

Parameters:
  • model – tf keras model to search

  • layer_name – name of the layer to find

Returns:

layer index if it exists or None
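A minimal sketch of this lookup: iterate over the model's layers and match on name. The Layer and Model classes below are simplified placeholders so the example runs without TensorFlow; a real tf.keras.Model exposes the same .layers list and .name attributes:

```python
from dataclasses import dataclass, field

@dataclass
class Layer:              # placeholder for tf.keras.layers.Layer
    name: str

@dataclass
class Model:              # placeholder for tf.keras.Model
    layers: list = field(default_factory=list)

def get_layer_index(model, layer_name):
    """Return the index of the named layer, or None if absent."""
    for idx, layer in enumerate(model.layers):
        if layer.name == layer_name:
            return idx
    return None

model = Model(layers=[Layer("embedding"), Layer("dense_1"), Layer("softmax")])
print(get_layer_index(model, "dense_1"))  # 1
print(get_layer_index(model, "missing"))  # None
```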

dataprofiler.labelers.labeler_utils.hide_tf_logger_warnings() None

Filter out a set of warnings from the tf logger.

dataprofiler.labelers.labeler_utils.protected_register_keras_serializable(package: str = 'Custom', name: str | None = None) Callable

Protect against already registered keras serializable layers.

Ensures that if it was already registered, it will not try to register it again.

class dataprofiler.labelers.labeler_utils.FBetaScore(num_classes: int, average: str | None = None, beta: float = 1.0, threshold: float | None = None, name: str = 'fbeta_score', dtype: str | None = None, **kwargs: Any)

Bases: Metric

Computes F-Beta score.

Adapted and slightly modified from https://github.com/tensorflow/addons/blob/v0.12.0/tensorflow_addons/metrics/f_scores.py#L211-L283

# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#     https://github.com/tensorflow/addons/blob/v0.12.0/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

It is the weighted harmonic mean of precision and recall. Output range is [0, 1]. Works for both multi-class and multi-label classification.

$$ F_{\beta} = (1 + \beta^2) \cdot \frac{\textrm{precision} \cdot \textrm{recall}}{(\beta^2 \cdot \textrm{precision}) + \textrm{recall}} $$

Parameters:
  • num_classes – Number of unique classes in the dataset.

  • average – Type of averaging to be performed on data. Acceptable values are None, micro, macro and weighted. Default value is None.
  • beta – Determines the relative weight given to precision and recall in the harmonic mean. Default value is 1.

  • threshold – Elements of y_pred greater than threshold are converted to be 1, and the rest 0. If threshold is None, the argmax is converted to 1, and the rest 0.

  • name – (Optional) String name of the metric instance.

  • dtype – (Optional) Data type of the metric result.

Returns:

F-Beta score

Return type:

float

Initialize FBetaScore class.
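Plugging numbers into the F-beta formula above makes the role of beta concrete. This is pure arithmetic, independent of the metric class; the precision and recall values are made up for illustration:

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# With beta=1 this reduces to the ordinary F1 (harmonic mean of precision and recall).
print(round(fbeta(0.75, 0.50, beta=1.0), 2))   # 0.6
# beta=2 weights recall more heavily, pulling the score toward recall (0.50).
print(round(fbeta(0.75, 0.50, beta=2.0), 2))
```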

update_state(y_true: tf.Tensor, y_pred: tf.Tensor, sample_weight: tf.Tensor | None = None) None

Update state.

result() Tensor

Return the F-Beta score.

get_config() dict

Return the serializable config of the metric.

add_variable(shape, initializer, dtype=None, aggregation='sum', name=None)
add_weight(shape=(), initializer=None, dtype=None, name=None)
property dtype
classmethod from_config(config)
reset_state()

Reset all of the metric state variables.

This function is called between epochs/steps, when a metric is evaluated during training.

stateless_reset_state()
stateless_result(metric_variables)
stateless_update_state(metric_variables, *args, **kwargs)
property variables
class dataprofiler.labelers.labeler_utils.F1Score(num_classes: int, average: str | None = None, threshold: float | None = None, name: str = 'f1_score', dtype: str | None = None)

Bases: FBetaScore

Computes F-1 Score.

# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#     https://github.com/tensorflow/addons/blob/v0.12.0/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

It is the harmonic mean of precision and recall. Output range is [0, 1]. Works for both multi-class and multi-label classification.

$$ F_1 = 2 \cdot \frac{\textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}} $$

Parameters:
  • num_classes – Number of unique classes in the dataset.

  • average – Type of averaging to be performed on data. Acceptable values are None, micro, macro and weighted. Default value is None.
  • threshold – Elements of y_pred above threshold are considered to be 1, and the rest 0. If threshold is None, the argmax is converted to 1, and the rest 0.

  • name – (Optional) String name of the metric instance.

  • dtype – (Optional) Data type of the metric result.

Returns:

F-1 score

Return type:

float

Initialize F1Score object.
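The threshold rule described in the parameters above can be sketched on plain lists. This mirrors the documented behavior only; the metric itself operates on tensors:

```python
def binarize(y_pred: list, threshold=None) -> list:
    """Convert scores to 0/1: by threshold if given, else argmax-to-1."""
    if threshold is not None:
        # Elements above the threshold become 1, the rest 0.
        return [int(score > threshold) for score in y_pred]
    # No threshold: the argmax position becomes 1, everything else 0.
    top = y_pred.index(max(y_pred))
    return [int(i == top) for i in range(len(y_pred))]

scores = [0.1, 0.7, 0.2]
print(binarize(scores))                  # [0, 1, 0] -- argmax is index 1
print(binarize(scores, threshold=0.15))  # [0, 1, 1] -- two scores clear 0.15
```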

add_variable(shape, initializer, dtype=None, aggregation='sum', name=None)
add_weight(shape=(), initializer=None, dtype=None, name=None)
property dtype
classmethod from_config(config)
reset_state()

Reset all of the metric state variables.

This function is called between epochs/steps, when a metric is evaluated during training.

result() Tensor

Return f1 score.

stateless_reset_state()
stateless_result(metric_variables)
stateless_update_state(metric_variables, *args, **kwargs)
update_state(y_true: tf.Tensor, y_pred: tf.Tensor, sample_weight: tf.Tensor | None = None) None

Update state.

property variables
get_config() dict

Get configuration.