dataprofiler.labelers.character_level_cnn_model module

Contains classes for char data labeling.

dataprofiler.labelers.character_level_cnn_model.build_embd_dictionary(filename: str) → dict[str, ndarray]

Return a numpy embedding dictionary from embed file with GloVe-like format.

Parameters:

filename (str) – Path to the embed file for loading
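As a rough illustration of what parsing a GloVe-like file might involve, the sketch below assumes each line holds a token followed by space-separated floats. The helper name `load_glove_like` and the sample data are hypothetical; this is not the library's implementation.

```python
import os
import tempfile

import numpy as np


def load_glove_like(filename: str) -> dict[str, np.ndarray]:
    """Parse lines of the form '<token> <v1> <v2> ...' into a dict of vectors."""
    embeddings = {}
    with open(filename, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            token, values = parts[0], parts[1:]
            embeddings[token] = np.asarray(values, dtype="float32")
    return embeddings


# Write a tiny two-token file (hypothetical data) and read it back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a 0.1 0.2 0.3\nb 0.4 0.5 0.6\n")
    path = tmp.name

embd = load_glove_like(path)
os.remove(path)
print(sorted(embd), embd["a"].shape)
```

A character-level file may contain the space character itself as a token, which this simple split would not handle; a real loader would need to account for that.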

dataprofiler.labelers.character_level_cnn_model.create_glove_char(n_dims: int, source_file: str | None = None) → None

Reduce GloVe character embeddings from the source file to n_dims principal components.

Write the reduced embeddings to a new file.

Parameters:
  • n_dims (int) – Final number of principal component dims of the embeddings

  • source_file (str) – Location of original embeddings to factor down
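The reduction to principal components can be sketched as follows. The source does not say which PCA routine is used, so this sketch uses an SVD in NumPy; the function name `reduce_with_pca` and the random input are hypothetical.

```python
import numpy as np


def reduce_with_pca(vectors: np.ndarray, n_dims: int) -> np.ndarray:
    """Project row vectors onto their first n_dims principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T


rng = np.random.default_rng(0)
original = rng.normal(size=(10, 6))  # 10 hypothetical characters, 6 dims each
reduced = reduce_with_pca(original, n_dims=2)
print(reduced.shape)
```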

class dataprofiler.labelers.character_level_cnn_model.ThreshArgMaxLayer(*args, **kwargs)

Bases: Layer

Keras layer applying a thresholded argmax.

Apply a minimum threshold to the argmax value.

When below this threshold the index will be the default.

Parameters:
  • num_labels (int) – number of entities

  • threshold (float) – default set to 0 so all confidences pass.

  • default_ind (int) – default index

Returns:

final argmax threshold layer for the model; when called, a tensor containing argmax thresholded integers (labels out)

Return type:

tf.Tensor
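The thresholding rule described above can be illustrated outside of Keras. This is a minimal NumPy sketch, assuming the confidences are on the last axis and that a confidence equal to the threshold passes; `thresholded_argmax` is a hypothetical helper, not the layer's actual code.

```python
import numpy as np


def thresholded_argmax(
    confidences: np.ndarray, default_ind: int, threshold: float = 0.0
) -> np.ndarray:
    """Argmax over the last axis, replaced by default_ind below the threshold."""
    best = confidences.argmax(axis=-1)
    best_conf = confidences.max(axis=-1)
    return np.where(best_conf >= threshold, best, default_ind)


conf = np.array(
    [
        [0.10, 0.20, 0.70],  # top confidence 0.70 passes a 0.5 threshold
        [0.40, 0.35, 0.25],  # top confidence 0.40 does not
    ]
)
print(thresholded_argmax(conf, default_ind=1, threshold=0.5))
```

With the default threshold of 0 every confidence passes, so the result is the plain argmax.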

get_config()

Return a serializable config for saving the layer.

call(argmax_layer: Tensor, confidence_layer: Tensor) → Tensor

Apply the threshold argmax to the input tensor.

add_loss(loss)

Can be called inside of the call() method to add a scalar loss.

Example:

```python
from keras import ops
from keras.layers import Layer


class MyLayer(Layer):
    ...
    def call(self, x):
        self.add_loss(ops.sum(x))
        return x
```

add_metric()
add_variable(shape, initializer, dtype=None, trainable=True, autocast=True, regularizer=None, constraint=None, name=None)

Add a weight variable to the layer.

Alias of add_weight().

add_weight(shape=None, initializer=None, dtype=None, trainable=True, autocast=True, regularizer=None, constraint=None, aggregation='mean', name=None)

Add a weight variable to the layer.

Parameters:
  • shape – Shape tuple for the variable. Must be fully-defined (no None entries). Defaults to () (scalar) if unspecified.

  • initializer – Initializer object to use to populate the initial variable value, or string name of a built-in initializer (e.g. “random_normal”). If unspecified, defaults to “glorot_uniform” for floating-point variables and to “zeros” for all other types (e.g. int, bool).

  • dtype – Dtype of the variable to create, e.g. “float32”. If unspecified, defaults to the layer’s variable dtype (which itself defaults to “float32” if unspecified).

  • trainable – Boolean, whether the variable should be trainable via backprop or whether its updates are managed manually. Defaults to True.

  • autocast – Boolean, whether to autocast the layer’s variables when accessing them. Defaults to True.

  • regularizer – Regularizer object to call to apply penalty on the weight. These penalties are summed into the loss function during optimization. Defaults to None.

  • constraint – Constraint object to call on the variable after any optimizer update, or string name of a built-in constraint. Defaults to None.

  • aggregation – String, one of ‘mean’, ‘sum’, ‘only_first_replica’. Annotates the variable with the type of multi-replica aggregation to be used for this variable when writing custom data parallel training loops.

  • name – String name of the variable. Useful for debugging purposes.

build(input_shape)
build_from_config(config)

Builds the layer’s states with the supplied config dict.

By default, this method calls the build(config["input_shape"]) method, which creates weights based on the layer’s input shape in the supplied config. If your config contains other information needed to load the layer’s state, you should override this method.

Parameters:

config – Dict containing the input shape associated with this layer.

property compute_dtype

The dtype of the computations performed by the layer.

compute_mask(inputs, previous_mask)
compute_output_shape(*args, **kwargs)
compute_output_spec(*args, **kwargs)
count_params()

Count the total number of scalars composing the weights.

Returns:

An integer count.

property dtype

Alias of layer.variable_dtype.

property dtype_policy
classmethod from_config(config)

Creates an operation from its config.

This method is the reverse of get_config, capable of instantiating the same operation from the config dictionary.

Note: If you override this method, you might receive a serialized dtype config, which is a dict. You can deserialize it as follows:

```python
if "dtype" in config and isinstance(config["dtype"], dict):
    policy = dtype_policies.deserialize(config["dtype"])
```

Parameters:

config – A Python dictionary, typically the output of get_config.

Returns:

An operation instance.

get_build_config()

Returns a dictionary with the layer’s input shape.

This method returns a config dict that can be used by build_from_config(config) to create all states (e.g. Variables and Lookup tables) needed by the layer.

By default, the config only contains the input shape that the layer was built with. If you’re writing a custom layer that creates state in an unusual way, you should override this method to make sure this state is already created when Keras attempts to load its value upon model loading.

Returns:

A dict containing the input shape associated with the layer.

get_weights()

Return the values of layer.weights as a list of NumPy arrays.

property input

Retrieves the input tensor(s) of a symbolic operation.

Only returns the tensor(s) corresponding to the first time the operation was called.

Returns:

Input tensor or list of input tensors.

property input_dtype

The dtype layer inputs should be converted to.

property input_spec
load_own_variables(store)

Loads the state of the layer.

You can override this method to take full control of how the state of the layer is loaded upon calling keras.models.load_model().

Parameters:

store – Dict from which the state of the model will be loaded.

property losses

List of scalar losses from add_loss, regularizers and sublayers.

property metrics

List of all metrics.

property metrics_variables

List of all metric variables.

property non_trainable_variables

List of all non-trainable layer state.

This extends layer.non_trainable_weights to include all state used by the layer including state for metrics and `SeedGenerator`s.

property non_trainable_weights

List of all non-trainable weight variables of the layer.

These are the weights that should not be updated by the optimizer during training. Unlike layer.non_trainable_variables, this excludes metric state and random seeds.

property output

Retrieves the output tensor(s) of a layer.

Only returns the tensor(s) corresponding to the first time the operation was called.

Returns:

Output tensor or list of output tensors.

property path

The path of the layer.

If the layer has not been built yet, it will be None.

property quantization_mode

The quantization mode of this layer, None if not quantized.

quantize(mode)
quantized_call(*args, **kwargs)
save_own_variables(store)

Saves the state of the layer.

You can override this method to take full control of how the state of the layer is saved upon calling model.save().

Parameters:

store – Dict where the state of the model will be saved.

set_weights(weights)

Sets the values of layer.weights from a list of NumPy arrays.

stateless_call(trainable_variables, non_trainable_variables, *args, return_losses=False, **kwargs)

Call the layer without any side effects.

Parameters:
  • trainable_variables – List of trainable variables of the model.

  • non_trainable_variables – List of non-trainable variables of the model.

  • *args – Positional arguments to be passed to call().

  • return_losses – If True, stateless_call() will return the list of losses created during call() as part of its return values.

  • **kwargs – Keyword arguments to be passed to call().

Returns:

A tuple. By default, returns (outputs, non_trainable_variables).

If return_losses = True, then returns (outputs, non_trainable_variables, losses).

Note: non_trainable_variables include not only non-trainable weights such as BatchNormalization statistics, but also RNG seed state (if there are any random operations part of the layer, such as dropout), and Metric state (if there are any metrics attached to the layer). These are all elements of state of the layer.

Example:

```python
model = ...
data = ...
trainable_variables = model.trainable_variables
non_trainable_variables = model.non_trainable_variables
# Call the model with zero side effects
outputs, non_trainable_variables = model.stateless_call(
    trainable_variables,
    non_trainable_variables,
    data,
)
# Attach the updated state to the model
# (until you do this, the model is still in its pre-call state).
for ref_var, value in zip(
    model.non_trainable_variables, non_trainable_variables
):
    ref_var.assign(value)
```

property supports_masking

Whether this layer supports computing a mask using compute_mask.

symbolic_call(*args, **kwargs)
property trainable

Settable boolean, whether this layer should be trainable or not.

property trainable_variables

List of all trainable layer state.

This is equivalent to layer.trainable_weights.

property trainable_weights

List of all trainable weight variables of the layer.

These are the weights that get updated by the optimizer during training.

property variable_dtype

The dtype of the state (weights) of the layer.

property variables

List of all layer state, including random seeds.

This extends layer.weights to include all state used by the layer including `SeedGenerator`s.

Note that metrics variables are not included here, use metrics_variables to visit all the metric variables.

property weights

List of all weight variables of the layer.

Unlike layer.variables, this excludes metric state and random seeds.

class dataprofiler.labelers.character_level_cnn_model.EncodingLayer(*args, **kwargs)

Bases: Layer

Encodes strings to integers.

Encode characters for the list of sentences.

Parameters:
  • max_char_encoding_id (int) – Maximum integer value for encoding the input

  • max_len (int) – Maximum char length in a sample

get_config()

Return a serializable config for saving the layer.

call(input_str_tensor: Tensor) → Tensor

Encode characters for the list of sentences.

Parameters:

input_str_tensor (tf.Tensor) – input list of sentences converted to tensor

Returns:

tensor containing encoded list of input sentences

Return type:

tf.Tensor
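To make the roles of max_char_encoding_id and max_len concrete, here is a plain-Python sketch of one possible character encoding. The ID scheme (code point plus one), the clipping, and the zero padding are all assumptions made for illustration; the layer itself operates on TensorFlow tensors and may differ in these details.

```python
def encode_sentences(
    sentences: list[str], max_char_encoding_id: int, max_len: int
) -> list[list[int]]:
    """Map characters to integer IDs, clip them, and pad or truncate to max_len."""
    encoded = []
    for sentence in sentences:
        # Assumption: IDs are code point + 1 so that 0 can serve as padding.
        ids = [
            min(ord(ch) + 1, max_char_encoding_id)
            for ch in sentence[:max_len]
        ]
        ids += [0] * (max_len - len(ids))  # pad short sentences with zeros
        encoded.append(ids)
    return encoded


batch = encode_sentences(["hi", "hello!"], max_char_encoding_id=127, max_len=4)
print(batch)
```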

add_loss(loss)

Can be called inside of the call() method to add a scalar loss.

Example:

```python
from keras import ops
from keras.layers import Layer


class MyLayer(Layer):
    ...
    def call(self, x):
        self.add_loss(ops.sum(x))
        return x
```

add_metric()
add_variable(shape, initializer, dtype=None, trainable=True, autocast=True, regularizer=None, constraint=None, name=None)

Add a weight variable to the layer.

Alias of add_weight().

add_weight(shape=None, initializer=None, dtype=None, trainable=True, autocast=True, regularizer=None, constraint=None, aggregation='mean', name=None)

Add a weight variable to the layer.

Parameters:
  • shape – Shape tuple for the variable. Must be fully-defined (no None entries). Defaults to () (scalar) if unspecified.

  • initializer – Initializer object to use to populate the initial variable value, or string name of a built-in initializer (e.g. “random_normal”). If unspecified, defaults to “glorot_uniform” for floating-point variables and to “zeros” for all other types (e.g. int, bool).

  • dtype – Dtype of the variable to create, e.g. “float32”. If unspecified, defaults to the layer’s variable dtype (which itself defaults to “float32” if unspecified).

  • trainable – Boolean, whether the variable should be trainable via backprop or whether its updates are managed manually. Defaults to True.

  • autocast – Boolean, whether to autocast the layer’s variables when accessing them. Defaults to True.

  • regularizer – Regularizer object to call to apply penalty on the weight. These penalties are summed into the loss function during optimization. Defaults to None.

  • constraint – Constraint object to call on the variable after any optimizer update, or string name of a built-in constraint. Defaults to None.

  • aggregation – String, one of ‘mean’, ‘sum’, ‘only_first_replica’. Annotates the variable with the type of multi-replica aggregation to be used for this variable when writing custom data parallel training loops.

  • name – String name of the variable. Useful for debugging purposes.

build(input_shape)
build_from_config(config)

Builds the layer’s states with the supplied config dict.

By default, this method calls the build(config["input_shape"]) method, which creates weights based on the layer’s input shape in the supplied config. If your config contains other information needed to load the layer’s state, you should override this method.

Parameters:

config – Dict containing the input shape associated with this layer.

property compute_dtype

The dtype of the computations performed by the layer.

compute_mask(inputs, previous_mask)
compute_output_shape(*args, **kwargs)
compute_output_spec(*args, **kwargs)
count_params()

Count the total number of scalars composing the weights.

Returns:

An integer count.

property dtype

Alias of layer.variable_dtype.

property dtype_policy
classmethod from_config(config)

Creates an operation from its config.

This method is the reverse of get_config, capable of instantiating the same operation from the config dictionary.

Note: If you override this method, you might receive a serialized dtype config, which is a dict. You can deserialize it as follows:

```python
if "dtype" in config and isinstance(config["dtype"], dict):
    policy = dtype_policies.deserialize(config["dtype"])
```

Parameters:

config – A Python dictionary, typically the output of get_config.

Returns:

An operation instance.

get_build_config()

Returns a dictionary with the layer’s input shape.

This method returns a config dict that can be used by build_from_config(config) to create all states (e.g. Variables and Lookup tables) needed by the layer.

By default, the config only contains the input shape that the layer was built with. If you’re writing a custom layer that creates state in an unusual way, you should override this method to make sure this state is already created when Keras attempts to load its value upon model loading.

Returns:

A dict containing the input shape associated with the layer.

get_weights()

Return the values of layer.weights as a list of NumPy arrays.

property input

Retrieves the input tensor(s) of a symbolic operation.

Only returns the tensor(s) corresponding to the first time the operation was called.

Returns:

Input tensor or list of input tensors.

property input_dtype

The dtype layer inputs should be converted to.

property input_spec
load_own_variables(store)

Loads the state of the layer.

You can override this method to take full control of how the state of the layer is loaded upon calling keras.models.load_model().

Parameters:

store – Dict from which the state of the model will be loaded.

property losses

List of scalar losses from add_loss, regularizers and sublayers.

property metrics

List of all metrics.

property metrics_variables

List of all metric variables.

property non_trainable_variables

List of all non-trainable layer state.

This extends layer.non_trainable_weights to include all state used by the layer including state for metrics and `SeedGenerator`s.

property non_trainable_weights

List of all non-trainable weight variables of the layer.

These are the weights that should not be updated by the optimizer during training. Unlike layer.non_trainable_variables, this excludes metric state and random seeds.

property output

Retrieves the output tensor(s) of a layer.

Only returns the tensor(s) corresponding to the first time the operation was called.

Returns:

Output tensor or list of output tensors.

property path

The path of the layer.

If the layer has not been built yet, it will be None.

property quantization_mode

The quantization mode of this layer, None if not quantized.

quantize(mode)
quantized_call(*args, **kwargs)
save_own_variables(store)

Saves the state of the layer.

You can override this method to take full control of how the state of the layer is saved upon calling model.save().

Parameters:

store – Dict where the state of the model will be saved.

set_weights(weights)

Sets the values of layer.weights from a list of NumPy arrays.

stateless_call(trainable_variables, non_trainable_variables, *args, return_losses=False, **kwargs)

Call the layer without any side effects.

Parameters:
  • trainable_variables – List of trainable variables of the model.

  • non_trainable_variables – List of non-trainable variables of the model.

  • *args – Positional arguments to be passed to call().

  • return_losses – If True, stateless_call() will return the list of losses created during call() as part of its return values.

  • **kwargs – Keyword arguments to be passed to call().

Returns:

A tuple. By default, returns (outputs, non_trainable_variables).

If return_losses = True, then returns (outputs, non_trainable_variables, losses).

Note: non_trainable_variables include not only non-trainable weights such as BatchNormalization statistics, but also RNG seed state (if there are any random operations part of the layer, such as dropout), and Metric state (if there are any metrics attached to the layer). These are all elements of state of the layer.

Example:

```python
model = ...
data = ...
trainable_variables = model.trainable_variables
non_trainable_variables = model.non_trainable_variables
# Call the model with zero side effects
outputs, non_trainable_variables = model.stateless_call(
    trainable_variables,
    non_trainable_variables,
    data,
)
# Attach the updated state to the model
# (until you do this, the model is still in its pre-call state).
for ref_var, value in zip(
    model.non_trainable_variables, non_trainable_variables
):
    ref_var.assign(value)
```

property supports_masking

Whether this layer supports computing a mask using compute_mask.

symbolic_call(*args, **kwargs)
property trainable

Settable boolean, whether this layer should be trainable or not.

property trainable_variables

List of all trainable layer state.

This is equivalent to layer.trainable_weights.

property trainable_weights

List of all trainable weight variables of the layer.

These are the weights that get updated by the optimizer during training.

property variable_dtype

The dtype of the state (weights) of the layer.

property variables

List of all layer state, including random seeds.

This extends layer.weights to include all state used by the layer including `SeedGenerator`s.

Note that metrics variables are not included here, use metrics_variables to visit all the metric variables.

property weights

List of all weight variables of the layer.

Unlike layer.variables, this excludes metric state and random seeds.

class dataprofiler.labelers.character_level_cnn_model.CharacterLevelCnnModel(label_mapping: dict[str, int], parameters: dict | None = None)

Bases: BaseTrainableModel

Class for training char data labeler.

Initialize CNN Model.

Initialize epoch_id.

Parameters:
  • label_mapping (dict) – maps labels to their encoded integers

  • parameters (dict) –

    Contains all the appropriate parameters for the model. Must contain num_labels. Other possible parameters are:

    max_length, max_char_encoding_id, dim_embed, size_fc, dropout, size_conv, num_fil, optimizer, default_label

Returns:

None

requires_zero_mapping: bool = True
set_label_mapping(label_mapping: list[str] | dict[str, int]) → None

Set the labels for the model.

Parameters:

label_mapping (dict) – label mapping of the model

Returns:

None
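Since the signature accepts either a list of labels or a dict, a conversion step along these lines is presumably involved. The sketch assumes that list entries are numbered by position; `normalize_label_mapping` and the label names are hypothetical.

```python
def normalize_label_mapping(label_mapping) -> dict:
    """Return a label-to-index dict from either a list of labels or a dict."""
    if isinstance(label_mapping, dict):
        return dict(label_mapping)
    # Assumption: list entries are numbered by their position in the list.
    return {label: index for index, label in enumerate(label_mapping)}


from_list = normalize_label_mapping(["PAD", "UNKNOWN", "ADDRESS"])
from_dict = normalize_label_mapping({"PAD": 0, "UNKNOWN": 1})
print(from_list, from_dict)
```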

save_to_disk(dirpath: str) → None

Save whole model to disk with weights.

Parameters:

dirpath (str) – directory path where you want to save the model to

Returns:

None

classmethod load_from_disk(dirpath: str) → CharacterLevelCnnModel

Load whole model from disk with weights.

Parameters:

dirpath (str) – directory path where you want to load the model from

Returns:

the loaded CharacterLevelCnnModel instance

reset_weights() → None

Reset the weights of the model.

Returns:

None

fit(train_data: DataArray, val_data: DataArray | None = None, batch_size: int = None, epochs: int = None, label_mapping: dict[str, int] = None, reset_weights: bool = False, verbose: bool = True) → tuple[dict, float | None, dict]

Train the current model with the training data and validation data.

Parameters:
  • train_data (Union[list, np.ndarray]) – Training data used to train model

  • val_data (Union[list, np.ndarray]) – Validation data used to validate the training

  • batch_size (int) – Used to determine number of samples in each batch

  • label_mapping (Union[dict, None]) – maps labels to their encoded integers

  • reset_weights (bool) – Flag to determine whether to reset the weights or not

  • verbose (bool) – Flag to determine whether to print status or not

Returns:

history, f1, f1_report

Return type:

Tuple[dict, float, dict]

predict(data: DataFrame | Series | ndarray, batch_size: int = 32, show_confidences: bool = False, verbose: bool = True) → dict

Run model and get predictions.

Parameters:
  • data (Union[list, numpy.ndarray]) – text input

  • batch_size (int) – number of samples in the batch of data

  • show_confidences – whether user wants prediction confidences

  • verbose (bool) – Flag to determine whether to print status or not

Returns:

char level predictions and confidences

Return type:

dict

details() → None

Print the relevant details of the model.

Details include summary, parameters, and label mapping.

add_label(label: str, same_as: str | None = None) → None

Add a label to the data labeler.

Parameters:
  • label (str) – new label being added to the data labeler

  • same_as (str) – existing label whose encoding index the new label should share, used to map multiple labels to a single encoding index.

Returns:

None
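The effect of same_as on a label mapping can be sketched with a plain dict. This is an illustration only: the free function `add_label` below is hypothetical, and the rule that a new label takes the next unused index is an assumption.

```python
from typing import Optional


def add_label(
    label_mapping: dict[str, int], label: str, same_as: Optional[str] = None
) -> None:
    """Add label in place, optionally sharing the index of an existing label."""
    if same_as is not None:
        # Reuse the index of the existing label (KeyError if it is unknown).
        label_mapping[label] = label_mapping[same_as]
    else:
        # Assumption: a brand-new label takes the next unused index.
        label_mapping[label] = max(label_mapping.values(), default=-1) + 1


mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2}
add_label(mapping, "CITY")                       # receives a new index
add_label(mapping, "STREET", same_as="ADDRESS")  # shares the index of ADDRESS
print(mapping)
```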

classmethod get_class(class_name: str) → type[BaseModel] | None

Get subclasses.

get_parameters(param_list: list[str] | None = None) → dict

Return a dict of parameters from the model given a list.

Parameters:

param_list (List[str]) – list of parameters to retrieve from the model.

Returns:

dict of parameters
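Conceptually, retrieving parameters by name amounts to selecting entries from a dict. The sketch below is not the method's implementation; `select_parameters`, the sample values, and the behavior when param_list is None (return everything) are assumptions.

```python
from typing import Optional


def select_parameters(parameters: dict, param_list: Optional[list] = None) -> dict:
    """Return only the requested entries of parameters."""
    if param_list is None:
        # Assumption: no list means "return every parameter".
        return dict(parameters)
    return {name: parameters[name] for name in param_list}


params = {"max_length": 3400, "dim_embed": 64, "num_fil": [48, 48]}
subset = select_parameters(params, ["max_length", "dim_embed"])
print(subset)
```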

classmethod help() → None

Help describe alterable parameters.

Returns:

None

property label_mapping: dict[str, int]

Return mapping of labels to their encoded values.

property labels: list[str]

Retrieve the labels.

Returns:

list of labels

property num_labels: int

Return max label mapping.

property reverse_label_mapping: dict[int, str]

Return the reverse of the current label mapping, from encoded values to labels.

Useful for extracting labels via indices.
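A reverse mapping of this kind can be built by swapping keys and values. The sketch below uses hypothetical labels and a plain dict rather than a model instance.

```python
label_mapping = {"PAD": 0, "UNKNOWN": 1, "ADDRESS": 2}

# Swap keys and values to look labels up by their encoded index.
reverse_label_mapping = {index: label for label, index in label_mapping.items()}

predicted_indices = [2, 1, 0, 2]
predicted_labels = [reverse_label_mapping[i] for i in predicted_indices]
print(predicted_labels)
```

If several labels share one index, only one of them can survive in the reversed dict; in this sketch the label inserted last wins.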

set_params(**kwargs: Any) → None

Set the parameters if they exist given kwargs.