Representing model metadata with a schema

A rubicon-ml schema is a YAML file defining how attributes of a Python object, generally representing a model, will be logged to a rubicon-ml experiment. Schema can be used to automatically instrument and standardize the rubicon-ml logging of commonly used model objects.

Schema are used to log experiments to an existing rubicon-ml project. Experiments consist of features, parameters, metrics, artifacts, and dataframes. More info on each of these can be found in rubicon-ml’s glossary.

A simple schema

Consider the following objects from a module called my_model:

import pandas as pd

class Optimizer:
    def optimize(self, X, y, target):
        self.optimized_ = True

        return "optimized"

class Model:
    def __init__(self, alpha=1e-3, gamma=1e-3):
        self.alpha = alpha
        self.gamma = gamma

    def fit(self, X, y):
        self.optimizer = Optimizer()
        self.target = "y"

        self.feature_names_in_ = X.columns
        self.feature_importances_ = [1.0 / len(X.columns)] * len(X.columns)

        self.learned_attribute_ = self.optimizer.optimize(X, y, self.target)

        return self

    def score(self, X):
        self.score_ = 1.0
        self.summary_ = pd.DataFrame(
            [[self.alpha, self.gamma, self.learned_attribute_, self.score_]],
            columns=["alpha", "gamma", "learned_attribute", "score"],
        )

        return self.score_

The following is a complete YAML representation of the Model object’s schema:

name: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html

artifacts:
  - self
  - name: optimizer
    data_object_attr: optimizer
dataframes:
  - name: summary
    df_attr: summary_
features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true
  - name_attr: target
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true
  - name: score
    value_attr: score_
  - name: env_metric
    value_env: METRIC
parameters:
  - name: alpha
    value_attr: alpha
  - name: gamma
    value_attr: gamma
  - name: env_param
    value_env: PARAMETER

Schema metadata

The first section of the schema defines metadata about the schema itself, like the name and version. A schema's name should combine the name of the library that provides the class with the name of the Python class itself, separated by a double underscore.

name: my_model__Model
version: 1.0.0
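
The naming convention can be sketched in plain Python. The helper `schema_name_for` below is hypothetical and is not part of rubicon-ml's documented API; it only illustrates how a `<library>__<ClassName>` name could be derived from an object, assuming the library name is the top-level package of the class's module.

```python
def schema_name_for(obj):
    """Build a `<library>__<ClassName>` schema name (hypothetical helper)."""
    cls = type(obj)
    # Treat the top-level package of the defining module as the library name.
    library = cls.__module__.split(".")[0]
    return f"{library}__{cls.__name__}"


class Model:
    pass


# Pretend Model was imported from the `my_model` module used in this guide.
Model.__module__ = "my_model"

print(schema_name_for(Model()))  # my_model__Model
```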

The next section defines any dependencies the model object has on external Python libraries. Generally, this will be at least the library the object is imported from. Reference documentation for the object to be logged can also be included in this section.

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html

The remaining sections define how the attributes of the object will be logged to the rubicon-ml experiment. In general, each section is a list of attributes to log to rubicon-ml with a name for the logged metadata and the name of the attribute containing the value to log.

Artifacts

Define a rubicon_ml.Artifact for logging by providing a name for the logged artifact and the attribute data_object_attr containing the object to log. The special keyword self will log the full object the schema represents as an artifact with the same name as the object’s class.

artifacts:
  - self             # logs this Model as an artifact named "Model"
  - name: optimizer  # logs Optimizer in `optimizer` attribute as an artifact named "optimizer"
    data_object_attr: optimizer
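
As a rough illustration of the two artifact forms, the sketch below walks an already-parsed `artifacts` list and collects the objects that would be logged. The function `collect_artifacts` and the stand-in classes are hypothetical; rubicon-ml's actual logging code may be organized differently.

```python
def collect_artifacts(obj, items):
    """Map artifact names to the objects one `artifacts` list describes (illustrative only)."""
    artifacts = {}
    for item in items:
        if item == "self":
            # The `self` keyword logs the whole object under its class name.
            artifacts[type(obj).__name__] = obj
        else:
            artifacts[item["name"]] = getattr(obj, item["data_object_attr"])
    return artifacts


class Optimizer:
    pass


class Model:
    def __init__(self):
        self.optimizer = Optimizer()


model = Model()
artifacts = collect_artifacts(
    model,
    ["self", {"name": "optimizer", "data_object_attr": "optimizer"}],
)
print(sorted(artifacts))  # ['Model', 'optimizer']
```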

Dataframes

Define a rubicon_ml.Dataframe for logging by providing a name for the logged dataframe and the attribute df_attr containing the DataFrame to log.

dataframes:
  - name: summary  # logs DataFrame in `summary_` attribute as a dataframe named "summary"
    df_attr: summary_

Features

Define a single rubicon_ml.Feature for logging by providing the attribute name_attr containing the name of the feature to log and optionally the attribute importance_attr containing the feature’s importance.

Lists of features can be defined for logging with the attributes names_attr containing a list of feature names to log and optionally importances_attr containing the corresponding importances.

features:
  - names_attr: feature_names_in_  # for each value in the `feature_names_in_` attribute, logs a feature named that
                                   # value with the corresponding importance in the `feature_importances_` attribute
    importances_attr: feature_importances_
    optional: true
  - name_attr: target              # logs a feature named the value of the `target` attribute
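
The pairing of names and importances can be pictured with a small sketch. `collect_features` and `FittedModel` are hypothetical names used only for illustration, and giving a feature an importance of `None` when no importance attribute is configured is an assumption of this sketch.

```python
def collect_features(obj, item):
    """Return (name, importance) pairs for one `features` item (illustrative only)."""
    if "names_attr" in item:
        names = list(getattr(obj, item["names_attr"]))
        if "importances_attr" in item:
            importances = list(getattr(obj, item["importances_attr"]))
        else:
            importances = [None] * len(names)
        # Pair each feature name with the importance at the same position.
        return list(zip(names, importances))
    name = getattr(obj, item["name_attr"])
    importance = getattr(obj, item["importance_attr"]) if "importance_attr" in item else None
    return [(name, importance)]


class FittedModel:
    feature_names_in_ = ["age", "height"]
    feature_importances_ = [0.5, 0.5]
    target = "y"


features = []
for item in [
    {"names_attr": "feature_names_in_", "importances_attr": "feature_importances_"},
    {"name_attr": "target"},
]:
    features.extend(collect_features(FittedModel(), item))

print(features)  # [('age', 0.5), ('height', 0.5), ('y', None)]
```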

Metrics

Define a rubicon_ml.Metric for logging by providing a name for the logged metric and the attribute value_attr containing the metric value to log.

Metric values can also be extracted from the runtime environment. Replace value_attr with value_env to leverage os.environ to read the metric value from the available environment variables.

metrics:
  - name: learned_attribute  # logs value in `learned_attribute_` attribute as a metric named "learned_attribute"
    value_attr: learned_attribute_
    optional: true
  - name: score              # logs value in `score_` attribute as a metric named "score"
    value_attr: score_
  - name: env_metric         # logs value in `METRIC` environment variable as a metric named "env_metric"
    value_env: METRIC
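
The difference between `value_attr` and `value_env` comes down to where the value is read from. The sketch below uses a hypothetical `resolve_value` helper to show the two lookups. Note that values read from `os.environ` are always strings; whether rubicon-ml converts them before logging is not covered here.

```python
import os


def resolve_value(obj, item):
    """Return the value one metric or parameter item points at (illustrative only)."""
    if "value_env" in item:
        # Environment variables are always read back as strings.
        return os.environ[item["value_env"]]
    return getattr(obj, item["value_attr"])


class Model:
    score_ = 1.0


os.environ["METRIC"] = "0.95"  # set here only so the example is self-contained

metrics = [
    {"name": "score", "value_attr": "score_"},
    {"name": "env_metric", "value_env": "METRIC"},
]
resolved = {item["name"]: resolve_value(Model(), item) for item in metrics}
print(resolved)  # {'score': 1.0, 'env_metric': '0.95'}
```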

Parameters

Define a rubicon_ml.Parameter for logging by providing a name for the logged parameter and the attribute value_attr containing the parameter value to log.

Parameter values can also be extracted from the runtime environment. Replace value_attr with value_env to leverage os.environ to read the parameter value from the available environment variables.

parameters:
  - name: alpha      # logs value in `alpha` attribute as a parameter named "alpha"
    value_attr: alpha
  - name: gamma      # logs value in `gamma` attribute as a parameter named "gamma"
    value_attr: gamma
  - name: env_param  # logs value in `PARAMETER` environment variable as a parameter named "env_param"
    value_env: PARAMETER

Optional attributes

In some cases, the attribute containing the value to log may not always be set on the underlying object. A model may have been trained on a dataset with no feature names, or perhaps some learned attributes are only learned if certain parameters have certain values while fitting.

By default, schema logging will raise an exception if the attribute to be logged is not set. To suppress the errors and simply move on, items in the artifacts, dataframes, features, metrics, parameters and schema lists may optionally contain a key optional with a true value.

The feature_names_in_ and learned_attribute_ attributes are both marked optional in the example schema above to handle cases where no feature names were present in the training data and learned_attribute_ was not learned:

features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true     # will not error if `feature_names_in_` attribute is not set
  - name_attr: target  # **will** error if `target` attribute is not set
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true     # will not error if `learned_attribute_` attribute is not set

Note: Optional items in artifacts, dataframes, features, and schema will omit the associated entity from logging entirely if an optional attribute is not set. Optional items in metrics and parameters will log the associated entity with the given name and a value of None if an optional attribute is not set.
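
The rules for optional attributes can be summarized in a short sketch. The helper `read_attr` is hypothetical: it returns a pair of "should this entity be logged" and the value, following the behavior described in this section rather than rubicon-ml's actual implementation. The exception type raised here is also an assumption.

```python
def read_attr(obj, section, item, attr_key):
    """Resolve one schema item, honoring `optional` as described above (illustrative only)."""
    attr_name = item[attr_key]
    if hasattr(obj, attr_name):
        return True, getattr(obj, attr_name)
    if not item.get("optional", False):
        raise AttributeError(f"{type(obj).__name__} has no attribute {attr_name!r}")
    # Optional metrics and parameters are still logged with a value of None;
    # other optional entities are omitted entirely.
    should_log = section in ("metrics", "parameters")
    return should_log, None


class UnfittedModel:
    target = "y"


model = UnfittedModel()
metric = {"name": "learned_attribute", "value_attr": "learned_attribute_", "optional": True}
feature = {"names_attr": "feature_names_in_", "optional": True}
required = {"name": "score", "value_attr": "score_"}

print(read_attr(model, "metrics", metric, "value_attr"))    # (True, None)
print(read_attr(model, "features", feature, "names_attr"))  # (False, None)

try:
    read_attr(model, "metrics", required, "value_attr")
except AttributeError as error:
    print(f"error: {error}")
```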

Nested schema

The following is a complete YAML representation of the Optimizer object’s schema:

name: my_model__Optimizer
version: 1.0.0

metrics:
  - name: optimized
    value_attr: optimized_

To apply another schema to one of the attributes of the original object, provide the schema name to be retrieved via registry.get_schema and the attribute attr containing the object to apply the schema to.

schema:
  - name: my_model__Optimizer  # logs a metric according to the above schema using the object in `optimizer`
    attr: optimizer

Note: Nested schema will add the logged entities to the original experiment created by the parent schema, not a new experiment. Entities logged by a nested schema cannot have names that conflict with the entities logged by the parent schema.
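
To picture how a nested schema feeds into the parent's experiment, the sketch below models an experiment as a plain dict of metrics. The `SCHEMAS` dict is a hypothetical stand-in for the schema registry, `log_metrics` is not a rubicon-ml function, and the sketch assumes each item in the `schema` list is a single mapping holding both `name` and `attr`.

```python
SCHEMAS = {  # hypothetical stand-in for the schema registry
    "my_model__Optimizer": {
        "name": "my_model__Optimizer",
        "metrics": [{"name": "optimized", "value_attr": "optimized_"}],
    },
    "my_model__Model": {
        "name": "my_model__Model",
        "metrics": [{"name": "score", "value_attr": "score_"}],
        "schema": [{"name": "my_model__Optimizer", "attr": "optimizer"}],
    },
}


def log_metrics(obj, schema, experiment):
    """Add the metrics of `schema`, then of its nested schema, to one experiment dict."""
    for item in schema.get("metrics", []):
        experiment[item["name"]] = getattr(obj, item["value_attr"])
    for nested in schema.get("schema", []):
        nested_obj = getattr(obj, nested["attr"])
        # The nested schema writes into the same experiment, not a new one.
        log_metrics(nested_obj, SCHEMAS[nested["name"]], experiment)
    return experiment


class Optimizer:
    optimized_ = True


class Model:
    score_ = 1.0
    optimizer = Optimizer()


experiment = log_metrics(Model(), SCHEMAS["my_model__Model"], {})
print(experiment)  # {'score': 1.0, 'optimized': True}
```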

The complete schema now looks like this and will log an additional metric optimized as defined by the Optimizer schema to the original experiment:

name: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html

artifacts:
  - self
  - name: optimizer
    data_object_attr: optimizer
dataframes:
  - name: summary
    df_attr: summary_
features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true
  - name_attr: target
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true
  - name: score
    value_attr: score_
  - name: env_metric
    value_env: METRIC
parameters:
  - name: alpha
    value_attr: alpha
  - name: gamma
    value_attr: gamma
  - name: env_param
    value_env: PARAMETER
schema:
  - name: my_model__Optimizer
    attr: optimizer

Hierarchical schema

Some objects may contain a list of other objects that are already represented by a schema, like a feature eliminator or hyperparameter optimizer that trained multiple iterations of an underlying model object.

The children key can be provided to log each of these underlying objects to a new experiment. This means that a single call to project.log_with_schema will log 1+n experiments to project where n is the number of objects in the list specified by children.

Within the children key, provide the schema name for the children objects to be retrieved via registry.get_schema and the attribute attr containing the list of child objects.

children:
  - name: my_model__Optimizer  # defines the children's schema
    attr: optimizers           # logs an experiment according to the schema for each object in `optimizers`
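
The 1+n behavior can be illustrated with a small simulation in which each experiment is a plain dict. `log_with_children` and the `schemas` dict are hypothetical and only mirror the description above; they assume each `children` item is a single mapping with both `name` and `attr`.

```python
def log_with_children(obj, schema, schemas):
    """Return one experiment dict for `obj` plus one per child object (illustrative only)."""
    parent = {"schema": schema["name"], "metrics": {}}
    for item in schema.get("metrics", []):
        parent["metrics"][item["name"]] = getattr(obj, item["value_attr"])
    experiments = [parent]
    for children in schema.get("children", []):
        child_schema = schemas[children["name"]]
        # Every child object gets its own, separate experiment.
        for child in getattr(obj, children["attr"]):
            experiments.extend(log_with_children(child, child_schema, schemas))
    return experiments


class Optimizer:
    optimized_ = True


class Model:
    score_ = 1.0

    def __init__(self, n_optimizers):
        self.optimizers = [Optimizer() for _ in range(n_optimizers)]


schemas = {
    "my_model__Optimizer": {
        "name": "my_model__Optimizer",
        "metrics": [{"name": "optimized", "value_attr": "optimized_"}],
    },
    "my_model__Model": {
        "name": "my_model__Model",
        "metrics": [{"name": "score", "value_attr": "score_"}],
        "children": [{"name": "my_model__Optimizer", "attr": "optimizers"}],
    },
}

experiments = log_with_children(Model(3), schemas["my_model__Model"], schemas)
print(len(experiments))  # 4, one for the Model and one per Optimizer
```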

If we replace the nested schema from the previous example with a list of children that adhere to the same Optimizer schema, the complete schema now looks like this. It will log a single experiment for Model containing all the information in the original Model schema, as well as an additional experiment as defined by the Optimizer schema for each of the objects in Model’s optimizers list.

name: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/Model.html

artifacts:
  - self
  - name: optimizer
    data_object_attr: optimizer
children:
  - name: my_model__Optimizer
    attr: optimizers
dataframes:
  - name: summary
    df_attr: summary_
features:
  - names_attr: feature_names_in_
    importances_attr: feature_importances_
    optional: true
  - name_attr: target
metrics:
  - name: learned_attribute
    value_attr: learned_attribute_
    optional: true
  - name: score
    value_attr: score_
  - name: env_metric
    value_env: METRIC
parameters:
  - name: alpha
    value_attr: alpha
  - name: gamma
    value_attr: gamma
  - name: env_param
    value_env: PARAMETER

Extending a schema

Consider an extension of Model named NewModel:

class NewModel(Model):
    def __init__(self, alpha=1e-3, gamma=1e-3, delta=1e-3):
        super().__init__(alpha=alpha, gamma=gamma)

        self.delta = delta

    def fit(self, X, y):
        super().fit(X, y)

        # `learned_attribute_` is the string returned by `Optimizer.optimize`, so scale its length
        self.other_learned_attribute_ = self.delta * len(self.learned_attribute_)

        return self

To extend an existing schema, provide the name of the schema to extend as the extends key’s value after the new schema’s name. This new schema will log everything in the schema represented by extends plus any additional values.

name: my_model__NewModel
extends: my_model__Model
version: 1.0.0

The following is a complete YAML representation of the NewModel object’s schema. This schema will log everything that the Model schema would with the addition of the other_learned_attribute metric and delta parameter from NewModel.

name: my_model__NewModel
extends: my_model__Model
version: 1.0.0

compatibility:
  pandas:
    max_version:
    min_version: 1.0.5
docs_url: https://my-docs.com/my-model/NewModel.html

metrics:
  - name: other_learned_attribute
    value_attr: other_learned_attribute_
parameters:
  - name: delta
    value_attr: delta
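
One way to picture the combined result of `extends` is as a merge of two parsed schemas. The sketch below is an assumption made for illustration: `resolve_extends` is a hypothetical helper, and rubicon-ml may instead apply the base schema and the extension one after the other when logging. Only the end result, everything from the base plus the additions, comes from this guide.

```python
def resolve_extends(schema, schemas):
    """Combine a schema with the schema it extends (illustrative only)."""
    if "extends" not in schema:
        return schema
    base = resolve_extends(schemas[schema["extends"]], schemas)
    merged = dict(base)
    for key, value in schema.items():
        if key == "extends":
            continue
        if isinstance(value, list):
            # List sections keep the base items and append the new ones.
            merged[key] = base.get(key, []) + value
        else:
            merged[key] = value
    return merged


schemas = {
    "my_model__Model": {
        "name": "my_model__Model",
        "version": "1.0.0",
        "metrics": [{"name": "score", "value_attr": "score_"}],
        "parameters": [
            {"name": "alpha", "value_attr": "alpha"},
            {"name": "gamma", "value_attr": "gamma"},
        ],
    },
    "my_model__NewModel": {
        "name": "my_model__NewModel",
        "extends": "my_model__Model",
        "version": "1.0.0",
        "metrics": [
            {"name": "other_learned_attribute", "value_attr": "other_learned_attribute_"},
        ],
        "parameters": [{"name": "delta", "value_attr": "delta"}],
    },
}

combined = resolve_extends(schemas["my_model__NewModel"], schemas)
print([metric["name"] for metric in combined["metrics"]])
print([parameter["name"] for parameter in combined["parameters"]])
```

In this sketch the combined schema lists the `score` and `other_learned_attribute` metrics and the `alpha`, `gamma`, and `delta` parameters, while the base schema's own lists are left unchanged.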

To see an extended schema in action, check out the “Register a custom schema” section.